SQL Pivot: Converting Rows to Columns
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperature recordings in Seattle, we'll show how we can use this common SQL Pivot feature to achieve complex data transformations.

Here is a sample of the table of daily low temperatures:

Date | Temp (°F) |
---|---|
… | … |
08-01-2018 | 59 |
08-02-2018 | 58 |
08-03-2018 | 59 |
08-04-2018 | 58 |
08-05-2018 | 59 |
08-06-2018 | 59 |
… | … |

To combine this table with the previous table of daily high temperatures, we could join these two tables on the "Date" column. However, since we are going to use pivot, which performs grouping on the dates, we can simply concatenate the two tables using UNION ALL. And as you'll see later, this approach also provides us with more flexibility:

```sql
SELECT date, temp, 'H' as flag
FROM high_temps
UNION ALL
SELECT date, temp, 'L' as flag
FROM low_temps
```

Now let's try our pivot query with the new combined table:

```sql
SELECT * FROM (
  SELECT year(date) year, month(date) month, temp, flag `H/L`
  FROM (
    SELECT date, temp, 'H' as flag
    FROM high_temps
    UNION ALL
    SELECT date, temp, 'L' as flag
    FROM low_temps
  )
  WHERE date BETWEEN DATE '2015-01-01' AND DATE '2018-08-31'
)
PIVOT (
  CAST(avg(temp) AS DECIMAL(4, 1))
  FOR month in (6 JUN, 7 JUL, 8 AUG, 9 SEP)
)
ORDER BY year DESC, `H/L` ASC
```

As a result, we get the average high and average low for each month of the past 4 years in one table. Note that we need to include the column `flag` in the pivot query; otherwise the expression avg(temp) would be based on a mix of high and low temperatures.
year | H/L | JUN | JUL | AUG | SEP |
---|---|---|---|---|---|
2018 | H | 71.9 | 82.8 | 79.1 | NULL |
2018 | L | 53.4 | 58.5 | 58.5 | NULL |
2017 | H | 72.1 | 78.3 | 81.5 | 73.8 |
2017 | L | 53.7 | 56.3 | 59.0 | 55.6 |
2016 | H | 73.1 | 76.0 | 79.5 | 69.9 |
2016 | L | 53.9 | 57.6 | 59.9 | 52.9 |
2015 | H | 78.9 | 82.6 | 79.0 | 68.5 |
2015 | L | 56.4 | 59.9 | 58.5 | 52.5 |
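The semantics of this query can be sketched in plain Python: concatenate the two tables (UNION ALL), then group on the columns left out of the pivot (`year`, `flag`) and average `temp` for each pivoted `month` value. The sample readings below are hypothetical, not the real Seattle data:

```python
from collections import defaultdict

# Hypothetical sample readings (date, temp); the real tables span 2015-2018.
high_temps = [("2018-07-01", 80), ("2018-07-02", 82), ("2018-08-01", 79)]
low_temps = [("2018-07-01", 58), ("2018-07-02", 59), ("2018-08-01", 57)]

# UNION ALL: concatenate both tables, tagging each row with a flag.
rows = [(d, t, "H") for d, t in high_temps] + [(d, t, "L") for d, t in low_temps]

# PIVOT: (year, flag) are the implicit grouping columns; month is the
# pivot column; avg(temp) is the aggregate.
months = {6: "JUN", 7: "JUL", 8: "AUG", 9: "SEP"}
groups = defaultdict(list)
for date, temp, flag in rows:
    year, month = int(date[:4]), int(date[5:7])
    if month in months:
        groups[(year, flag, month)].append(temp)

pivoted = {}
for (year, flag, month), temps in groups.items():
    pivoted.setdefault((year, flag), {})[months[month]] = round(sum(temps) / len(temps), 1)

print(pivoted)
# e.g. {(2018, 'H'): {'JUL': 81.0, 'AUG': 79.0}, (2018, 'L'): {'JUL': 58.5, 'AUG': 57.0}}
```

Each `(year, flag)` pair produces one output row, which is why the table above shows two rows per year.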
You might have noticed that now we have two rows for each year, one for the high temperatures and the other for low temperatures. That's because we have included one more column, `flag`, in the pivot input, which in turn becomes another implicit grouping column in addition to the original column `year`.
Alternatively, instead of being a grouping column, `flag` can also serve as a pivot column. So now we have two pivot columns, `month` and `flag`:
```sql
SELECT * FROM (
  SELECT year(date) year, month(date) month, temp, flag
  FROM (
    SELECT date, temp, 'H' as flag
    FROM high_temps
    UNION ALL
    SELECT date, temp, 'L' as flag
    FROM low_temps
  )
  WHERE date BETWEEN DATE '2015-01-01' AND DATE '2018-08-31'
)
PIVOT (
  CAST(avg(temp) AS DECIMAL(4, 1))
  FOR (month, flag) in (
    (6, 'H') JUN_hi, (6, 'L') JUN_lo,
    (7, 'H') JUL_hi, (7, 'L') JUL_lo,
    (8, 'H') AUG_hi, (8, 'L') AUG_lo,
    (9, 'H') SEP_hi, (9, 'L') SEP_lo
  )
)
ORDER BY year DESC
```
This query presents us with a different layout of the same data, with one row for each year, but two columns for each month.
year | JUN_hi | JUN_lo | JUL_hi | JUL_lo | AUG_hi | AUG_lo | SEP_hi | SEP_lo |
---|---|---|---|---|---|---|---|---|
2018 | 71.9 | 53.4 | 82.8 | 58.5 | 79.1 | 58.5 | NULL | NULL |
2017 | 72.1 | 53.7 | 78.3 | 56.3 | 81.5 | 59.0 | 73.8 | 55.6 |
2016 | 73.1 | 53.9 | 76.0 | 57.6 | 79.5 | 59.9 | 69.9 | 52.9 |
2015 | 78.9 | 56.4 | 82.6 | 59.9 | 79.0 | 58.5 | 68.5 | 52.5 |
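Sketched in plain Python, the two-pivot-column variant differs only in the shape of the output: `year` is now the sole grouping column, and each `(month, flag)` pair becomes its own output column. The sample readings below are hypothetical, not the real Seattle data:

```python
from collections import defaultdict

# Hypothetical sample readings (date, temp); the real tables span 2015-2018.
high_temps = [("2018-07-01", 80), ("2018-07-02", 82), ("2018-08-01", 79)]
low_temps = [("2018-07-01", 58), ("2018-07-02", 59), ("2018-08-01", 57)]

rows = [(d, t, "H") for d, t in high_temps] + [(d, t, "L") for d, t in low_temps]

# PIVOT on two columns: each (month, flag) pair becomes one output column,
# leaving year as the only grouping column.
months = {6: "JUN", 7: "JUL", 8: "AUG", 9: "SEP"}
groups = defaultdict(list)
for date, temp, flag in rows:
    year, month = int(date[:4]), int(date[5:7])
    if month in months:
        col = f"{months[month]}_{'hi' if flag == 'H' else 'lo'}"
        groups[(year, col)].append(temp)

pivoted = defaultdict(dict)
for (year, col), temps in groups.items():
    pivoted[year][col] = round(sum(temps) / len(temps), 1)

print(dict(pivoted))
# e.g. {2018: {'JUL_hi': 81.0, 'AUG_hi': 79.0, 'JUL_lo': 58.5, 'AUG_lo': 57.0}}
```

The averages are unchanged; only the row key shrank from `(year, flag)` to `year`, giving one row per year with twice as many columns.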
What’s Next
To run the query examples used in this blog, please check out the accompanying pivot SQL examples.
Thanks to the Apache Spark community contributors for their contributions!
© Databricks 2018. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.