SQL Pivot: Converting Rows to Columns
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperature recordings in Seattle, we'll show how we can use this common SQL Pivot feature to achieve complex data transformations.

Here is a sample of the table of daily low temperatures:

Date | Temp (°F) |
---|---|
… | … |
08-01-2018 | 59 |
08-02-2018 | 58 |
08-03-2018 | 59 |
08-04-2018 | 58 |
08-05-2018 | 59 |
08-06-2018 | 59 |
… | … |

To combine this table with the previous table of daily high temperatures, we could join these two tables on the "Date" column. However, since we are going to use pivot, which performs grouping on the dates, we can simply concatenate the two tables using UNION ALL. And as you'll see later, this approach also provides us with more flexibility:

```sql
SELECT date, temp, 'H' as flag
FROM high_temps
UNION ALL
SELECT date, temp, 'L' as flag
FROM low_temps
```

Now let's try our pivot query with the new combined table:

```sql
SELECT * FROM (
  SELECT year(date) year, month(date) month, temp, flag `H/L`
  FROM (
    SELECT date, temp, 'H' as flag
    FROM high_temps
    UNION ALL
    SELECT date, temp, 'L' as flag
    FROM low_temps
  )
  WHERE date BETWEEN DATE '2015-01-01' AND DATE '2018-08-31'
)
PIVOT (
  CAST(avg(temp) AS DECIMAL(4, 1))
  FOR month in (6 JUN, 7 JUL, 8 AUG, 9 SEP)
)
ORDER BY year DESC, `H/L` ASC
```

As a result, we get the average high and average low for each month of the past 4 years in one table. Note that we need to include the column `flag` in the pivot query; otherwise the expression avg(temp) would be based on a mix of high and low temperatures.
year | H/L | JUN | JUL | AUG | SEP |
---|---|---|---|---|---|
2018 | H | 71.9 | 82.8 | 79.1 | NULL |
2018 | L | 53.4 | 58.5 | 58.5 | NULL |
2017 | H | 72.1 | 78.3 | 81.5 | 73.8 |
2017 | L | 53.7 | 56.3 | 59.0 | 55.6 |
2016 | H | 73.1 | 76.0 | 79.5 | 69.9 |
2016 | L | 53.9 | 57.6 | 59.9 | 52.9 |
2015 | H | 78.9 | 82.6 | 79.0 | 68.5 |
2015 | L | 56.4 | 59.9 | 58.5 | 52.5 |
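The semantics of this query can be sketched in plain Python: concatenate the two tables (UNION ALL), then group on the columns left out of the pivot (`year`, `flag`) and average `temp` for each pivoted `month` value. The sample readings below are hypothetical, not the real Seattle data:

```python
from collections import defaultdict

# Hypothetical sample readings (date, temp); the real tables span 2015-2018.
high_temps = [("2018-07-01", 80), ("2018-07-02", 82), ("2018-08-01", 79)]
low_temps = [("2018-07-01", 58), ("2018-07-02", 59), ("2018-08-01", 57)]

# UNION ALL: concatenate both tables, tagging each row with a flag.
rows = [(d, t, "H") for d, t in high_temps] + [(d, t, "L") for d, t in low_temps]

# PIVOT: (year, flag) are the implicit grouping columns; month is the
# pivot column; avg(temp) is the aggregate.
months = {6: "JUN", 7: "JUL", 8: "AUG", 9: "SEP"}
groups = defaultdict(list)
for date, temp, flag in rows:
    year, month = int(date[:4]), int(date[5:7])
    if month in months:
        groups[(year, flag, month)].append(temp)

pivoted = {}
for (year, flag, month), temps in groups.items():
    pivoted.setdefault((year, flag), {})[months[month]] = round(sum(temps) / len(temps), 1)

print(pivoted)
# e.g. {(2018, 'H'): {'JUL': 81.0, 'AUG': 79.0}, (2018, 'L'): {'JUL': 58.5, 'AUG': 57.0}}
```

Each `(year, flag)` pair produces one output row, which is why the table above shows two rows per year.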
You might have noticed that now we have two rows for each year, one for the high temperatures and the other for low temperatures. That's because we have included one more column, `flag`, in the pivot input, which in turn becomes another implicit grouping column in addition to the original column `year`.
Alternatively, instead of being a grouping column, `flag` can also serve as a pivot column. So now we have two pivot columns, `month` and `flag`:
```sql
SELECT * FROM (
  SELECT year(date) year, month(date) month, temp, flag
  FROM (
    SELECT date, temp, 'H' as flag
    FROM high_temps
    UNION ALL
    SELECT date, temp, 'L' as flag
    FROM low_temps
  )
  WHERE date BETWEEN DATE '2015-01-01' AND DATE '2018-08-31'
)
PIVOT (
  CAST(avg(temp) AS DECIMAL(4, 1))
  FOR (month, flag) in (
    (6, 'H') JUN_hi, (6, 'L') JUN_lo,
    (7, 'H') JUL_hi, (7, 'L') JUL_lo,
    (8, 'H') AUG_hi, (8, 'L') AUG_lo,
    (9, 'H') SEP_hi, (9, 'L') SEP_lo
  )
)
ORDER BY year DESC
```
This query presents us with a different layout of the same data, with one row for each year, but two columns for each month.
year | JUN_hi | JUN_lo | JUL_hi | JUL_lo | AUG_hi | AUG_lo | SEP_hi | SEP_lo |
---|---|---|---|---|---|---|---|---|
2018 | 71.9 | 53.4 | 82.8 | 58.5 | 79.1 | 58.5 | NULL | NULL |
2017 | 72.1 | 53.7 | 78.3 | 56.3 | 81.5 | 59.0 | 73.8 | 55.6 |
2016 | 73.1 | 53.9 | 76.0 | 57.6 | 79.5 | 59.9 | 69.9 | 52.9 |
2015 | 78.9 | 56.4 | 82.6 | 59.9 | 79.0 | 58.5 | 68.5 | 52.5 |
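Sketched in plain Python, the two-pivot-column variant differs only in the shape of the output: `year` is now the sole grouping column, and each `(month, flag)` pair becomes its own output column. The sample readings below are hypothetical, not the real Seattle data:

```python
from collections import defaultdict

# Hypothetical sample readings (date, temp); the real tables span 2015-2018.
high_temps = [("2018-07-01", 80), ("2018-07-02", 82), ("2018-08-01", 79)]
low_temps = [("2018-07-01", 58), ("2018-07-02", 59), ("2018-08-01", 57)]

rows = [(d, t, "H") for d, t in high_temps] + [(d, t, "L") for d, t in low_temps]

# PIVOT on two columns: each (month, flag) pair becomes one output column,
# leaving year as the only grouping column.
months = {6: "JUN", 7: "JUL", 8: "AUG", 9: "SEP"}
groups = defaultdict(list)
for date, temp, flag in rows:
    year, month = int(date[:4]), int(date[5:7])
    if month in months:
        col = f"{months[month]}_{'hi' if flag == 'H' else 'lo'}"
        groups[(year, col)].append(temp)

pivoted = defaultdict(dict)
for (year, col), temps in groups.items():
    pivoted[year][col] = round(sum(temps) / len(temps), 1)

print(dict(pivoted))
# e.g. {2018: {'JUL_hi': 81.0, 'AUG_hi': 79.0, 'JUL_lo': 58.5, 'AUG_lo': 57.0}}
```

The averages are unchanged; only the row key shrank from `(year, flag)` to `year`, giving one row per year with twice as many columns.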
What’s Next
To run the query examples used in this blog, please check out the accompanying pivot SQL examples.
Thanks to the Apache Spark community contributors for their contributions!
© Databricks 2018. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.