SQL window function with “ORDER BY” and without “ORDER BY”
2024
- Between
SUM()
withORDER BY
andSUM()
withoutORDER BY
in a window function
Dataset Example:
Assume we have the following sales data for different employees across several months:
emp_id | sales | month |
---|---|---|
1 | 100 | Jan |
2 | 200 | Jan |
3 | 300 | Jan |
4 | 150 | Feb |
5 | 250 | Feb |
6 | 350 | Mar |
7 | 400 | Mar |
Now, let’s compare two scenarios:
a) Using SUM()
with ORDER BY
(Cumulative or Running Total):
In this case, SUM()
with an ORDER BY
clause inside the OVER
statement will give you a cumulative sum (or running total). It calculates the sum for each row by adding up the sales from the previous rows within each partition.
Query:
SELECT emp_id, sales, month,
SUM(sales) OVER (PARTITION BY month ORDER BY sales) AS running_total
FROM sales_data;
Breakdown:
PARTITION BY month
: Groups the rows by each month.ORDER BY sales
: Orders the rows within each month bysales
, and calculates a running total.
Result (Running Total Example):
emp_id | sales | month | running_total |
---|---|---|---|
1 | 100 | Jan | 100 |
2 | 200 | Jan | 300 |
3 | 300 | Jan | 600 |
4 | 150 | Feb | 150 |
5 | 250 | Feb | 400 |
6 | 350 | Mar | 350 |
7 | 400 | Mar | 750 |
Key Points:
- The
running_total
column shows a cumulative sum of sales within each month. - For example, in January, the first sale is 100, the second sale adds to 300, and the third sale results in 600.
- The rows are ordered by
sales
, so it calculates the sum row-by-row in ascending order of sales.
b) Using SUM()
without ORDER BY
(Total Sum Per Partition):
If you remove the ORDER BY
inside the OVER
clause, SUM()
will calculate the total sum of sales for each month without accumulating as it moves row by row.
Query:
SELECT emp_id, sales, month,
SUM(sales) OVER (PARTITION BY month) AS total_sales_per_month
FROM sales_data;
Breakdown:
PARTITION BY month
: Groups the rows by each month.- No
ORDER BY
: It simply returns the total sum for each month for every row.
Result (Total Sum Per Partition Example):
emp_id | sales | month | total_sales_per_month |
---|---|---|---|
1 | 100 | Jan | 600 |
2 | 200 | Jan | 600 |
3 | 300 | Jan | 600 |
4 | 150 | Feb | 400 |
5 | 250 | Feb | 400 |
6 | 350 | Mar | 750 |
7 | 400 | Mar | 750 |
Key Points:
- The
total_sales_per_month
column shows the same total for each row in that month. - For January, the total sum is 600, and it repeats that for every row in January.
- There’s no row-by-row accumulation; it directly calculates the total sum for each partition (month).
Summary of Differences:
- With
ORDER BY
:- Calculates a cumulative sum (running total).
- The sum grows row-by-row as it moves through the partition.
- Without
ORDER BY
:- Calculates the same total for all rows in the partition.
- No row-by-row accumulation, just the total for the partition.
Visual Example:
i. With ORDER BY
(Running Total):
For January:
- 100 → Running total: 100
- 200 → Running total: 100 + 200 = 300
- 300 → Running total: 100 + 200 + 300 = 600
For February:
- 150 → Running total: 150
- 250 → Running total: 150 + 250 = 400
ii. Without ORDER BY
(Total Sum):
For January:
- Each row has the same total: 100 + 200 + 300 = 600.
For February:
- Each row has the same total: 150 + 250 = 400.
I hope this clears up the difference between using ORDER BY
and not using it within window functions! Let me know if you’d like further clarification.
2) Between AVG()
with ORDER BY
and AVG()
without ORDER BY
The difference between AVG()
with ORDER BY
and AVG()
without ORDER BY
in a window function is primarily related to how the average is calculated and what result you get for each row.
Let’s break it down with examples.
a). AVG()
Without ORDER BY
(Simple Partitioned Average):
When you use AVG()
in a window function without ORDER BY
, it calculates the average of all rows in the partition (group), and the same average value is assigned to each row in that partition.
Query Example:
SELECT emp_id, sales, month,
AVG(sales) OVER (PARTITION BY month) AS avg_sales_per_month
FROM employee;
Explanation:
PARTITION BY month
: This groups the rows bymonth
.- No
ORDER BY
: It calculates the average sales for the entire partition and assigns the same average to each row.
Result:
emp_id | sales | month | avg_sales_per_month |
---|---|---|---|
1 | 100 | Jan | 200 |
2 | 200 | Jan | 200 |
3 | 300 | Jan | 200 |
4 | 150 | Feb | 200 |
5 | 250 | Feb | 200 |
6 | 350 | Mar | 375 |
7 | 400 | Mar | 375 |
- In the case of January, the average of 100, 200, and 300 is 200, and that same value is repeated for every row in January.
b) AVG()
With ORDER BY
(Cumulative or Running Average):
When you use AVG()
in a window function with ORDER BY
, it calculates a cumulative (running) average for each row, which means the average is recalculated row by row based on the current and preceding rows.
Query Example:
SELECT emp_id, sales, month,
AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;
Explanation:
PARTITION BY month
: This groups the rows bymonth
.ORDER BY sales
: It orders the rows by sales and calculates the average up to the current row, so each row shows a running or cumulative average.
Result:
emp_id | sales | month | running_avg_sales |
---|---|---|---|
1 | 100 | Jan | 100 |
2 | 200 | Jan | 150 |
3 | 300 | Jan | 200 |
4 | 150 | Feb | 150 |
5 | 250 | Feb | 200 |
6 | 350 | Mar | 350 |
7 | 400 | Mar | 375 |
- In the case of January:
- The first row has an average of just the first sale (100).
- The second row recalculates the average as
(100 + 200) / 2 = 150
. - The third row includes all three sales and calculates
(100 + 200 + 300) / 3 = 200
.
Key Differences:
i. AVG()
Without ORDER BY
:
- Computes the same average for all rows in the partition (e.g., all rows in a month).
- It does not depend on the order of rows.
- Each row gets the final average value for the entire group/partition.
ii. AVG()
With ORDER BY
:
- Computes a running (cumulative) average, updating the average value as it moves through the ordered rows.
- Each row shows the average of the current and preceding rows.
- The order of rows affects how the average is calculated.
Example Comparison:
Using the January data:
emp_id | sales | month |
---|---|---|
1 | 100 | Jan |
2 | 200 | Jan |
3 | 300 | Jan |
AVG()
Without ORDER BY
:
SELECT emp_id, sales, month,
AVG(sales) OVER (PARTITION BY month) AS avg_sales
FROM employee;
- Output for January:
- The average for all sales in January: (100 + 200 + 300) / 3 = 200.
- Every row for January will have
avg_sales
= 200.
AVG()
With ORDER BY
:
SELECT emp_id, sales, month,
AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;
- Output for January:
- First row:
100
→ Running average = 100. - Second row:
100, 200
→ Running average = (100 + 200) / 2 = 150. - Third row:
100, 200, 300
→ Running average = (100 + 200 + 300) / 3 = 200.
- First row:
Summary:
- Without
ORDER BY
: Provides the same average for all rows in the partition, treating them as a single group. - With
ORDER BY
: Provides a running/cumulative average, recalculating for each row based on the preceding data.
Note: Query with ROW_NUMBER()
and AVG()
:
SELECT emp_id,
sales,
month,
AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales,
ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales) AS row_num
FROM employee;
xplanation:
AVG(sales) OVER (PARTITION BY month ORDER BY sales)
: This calculates the running average ofsales
for eachmonth
, ordered bysales
.ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales)
: This assigns a unique row number to each row within eachmonth
, ordered bysales
. The row numbers will start from 1 and increase for each subsequent row within the same month.
Result Example:
Assume the following data:
emp_id | sales | month |
---|---|---|
1 | 100 | Jan |
2 | 200 | Jan |
3 | 300 | Jan |
4 | 150 | Feb |
5 | 250 | Feb |
6 | 350 | Mar |
7 | 400 | Mar |
The result would be:
emp_id | sales | month | running_avg_sales | row_num |
---|---|---|---|---|
1 | 100 | Jan | 100 | 1 |
2 | 200 | Jan | 150 | 2 |
3 | 300 | Jan | 200 | 3 |
4 | 150 | Feb | 150 | 1 |
5 | 250 | Feb | 200 | 2 |
6 | 350 | Mar | 350 | 1 |
7 | 400 | Mar | 375 | 2 |
Breakdown:
- For January, the running average starts with the first row (
100
), then updates after the second ((100 + 200) / 2 = 150
), and so on. TheROW_NUMBER()
is incremented sequentially (1, 2, 3). - For February, the running average is recalculated similarly, and the
ROW_NUMBER()
restarts at 1 for each newmonth
. - For March, the same process applies, with both the running average and row number starting over for the new month.
Key Points:
- The
AVG()
calculates the cumulative average, changing row by row. - The
ROW_NUMBER()
gives a unique number to each row within the partition, based on the ordering.