SQL window function with “ORDER BY” and without “ORDER BY”

Between SUM() with ORDER BY and SUM() without ORDER BY in a window function

Dataset Example:

Assume we have the following sales data for different employees across several months:

emp_id	sales	month
1	100	Jan
2	200	Jan
3	300	Jan
4	150	Feb
5	250	Feb
6	350	Mar
7	400	Mar

Now, let’s compare two scenarios:

a) Using `SUM()` with `ORDER BY` (Cumulative or Running Total):

In this case, SUM() with an ORDER BY clause inside the OVER statement will give you a cumulative sum (or running total). It calculates the sum for each row by adding up the sales from the previous rows within each partition.

Query:

SELECT emp_id, sales, month, 
       SUM(sales) OVER (PARTITION BY month ORDER BY sales) AS running_total
FROM sales_data;

Breakdown:

PARTITION BY month: Groups the rows by each month.
ORDER BY sales: Orders the rows within each month by sales, and calculates a running total.

Result (Running Total Example):

emp_id	sales	month	running_total
1	100	Jan	100
2	200	Jan	300
3	300	Jan	600
4	150	Feb	150
5	250	Feb	400
6	350	Mar	350
7	400	Mar	750

Key Points:

The running_total column shows a cumulative sum of sales within each month.
For example, in January, the first sale is 100, the second sale adds to 300, and the third sale results in 600.
The rows are ordered by sales, so it calculates the sum row-by-row in ascending order of sales.

b) Using `SUM()` without `ORDER BY` (Total Sum Per Partition):

If you remove the ORDER BY inside the OVER clause, SUM() will calculate the total sum of sales for each month without accumulating as it moves row by row.

Query:

SELECT emp_id, sales, month, 
       SUM(sales) OVER (PARTITION BY month) AS total_sales_per_month
FROM sales_data;

Breakdown:

PARTITION BY month: Groups the rows by each month.
No ORDER BY: It simply returns the total sum for each month for every row.

Result (Total Sum Per Partition Example):

emp_id	sales	month	total_sales_per_month
1	100	Jan	600
2	200	Jan	600
3	300	Jan	600
4	150	Feb	400
5	250	Feb	400
6	350	Mar	750
7	400	Mar	750

Key Points:

The total_sales_per_month column shows the same total for each row in that month.
For January, the total sum is 600, and it repeats that for every row in January.
There’s no row-by-row accumulation; it directly calculates the total sum for each partition (month).

Summary of Differences:

With ORDER BY:
- Calculates a cumulative sum (running total).
- The sum grows row-by-row as it moves through the partition.
Without ORDER BY:
- Calculates the same total for all rows in the partition.
- No row-by-row accumulation, just the total for the partition.

Visual Example:

i. With `ORDER BY` (Running Total):

For January:

100 → Running total: 100
200 → Running total: 100 + 200 = 300
300 → Running total: 100 + 200 + 300 = 600

For February:

150 → Running total: 150
250 → Running total: 150 + 250 = 400

ii. Without `ORDER BY` (Total Sum):

For January:

Each row has the same total: 100 + 200 + 300 = 600.

For February:

Each row has the same total: 150 + 250 = 400.

I hope this clears up the difference between using ORDER BY and not using it within window functions! Let me know if you’d like further clarification.

2) Between AVG() with ORDER BY and AVG() without ORDER BY

The difference between AVG() with ORDER BY and AVG() without ORDER BY in a window function is primarily related to how the average is calculated and what result you get for each row.

Let’s break it down with examples.

a). `AVG()` Without `ORDER BY` (Simple Partitioned Average):

When you use AVG() in a window function without ORDER BY, it calculates the average of all rows in the partition (group), and the same average value is assigned to each row in that partition.

Query Example:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month) AS avg_sales_per_month
FROM employee;

Explanation:

PARTITION BY month: This groups the rows by month.
No ORDER BY: It calculates the average sales for the entire partition and assigns the same average to each row.

Result:

emp_id	sales	month	avg_sales_per_month
1	100	Jan	200
2	200	Jan	200
3	300	Jan	200
4	150	Feb	200
5	250	Feb	200
6	350	Mar	375
7	400	Mar	375

In the case of January, the average of 100, 200, and 300 is 200, and that same value is repeated for every row in January.

b) `AVG()` With `ORDER BY` (Cumulative or Running Average):

When you use AVG() in a window function with ORDER BY, it calculates a cumulative (running) average for each row, which means the average is recalculated row by row based on the current and preceding rows.

Query Example:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;

Explanation:

PARTITION BY month: This groups the rows by month.
ORDER BY sales: It orders the rows by sales and calculates the average up to the current row, so each row shows a running or cumulative average.

Result:

emp_id	sales	month	running_avg_sales
1	100	Jan	100
2	200	Jan	150
3	300	Jan	200
4	150	Feb	150
5	250	Feb	200
6	350	Mar	350
7	400	Mar	375

In the case of January:
- The first row has an average of just the first sale (100).
- The second row recalculates the average as (100 + 200) / 2 = 150.
- The third row includes all three sales and calculates (100 + 200 + 300) / 3 = 200.

Key Differences:

i. `AVG()` Without `ORDER BY`:

Computes the same average for all rows in the partition (e.g., all rows in a month).
It does not depend on the order of rows.
Each row gets the final average value for the entire group/partition.

ii. `AVG()` With `ORDER BY`:

Computes a running (cumulative) average, updating the average value as it moves through the ordered rows.
Each row shows the average of the current and preceding rows.
The order of rows affects how the average is calculated.

Example Comparison:

Using the January data:

emp_id	sales	month
1	100	Jan
2	200	Jan
3	300	Jan

`AVG()` Without `ORDER BY`:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month) AS avg_sales
FROM employee;

Output for January:
- The average for all sales in January: (100 + 200 + 300) / 3 = 200.
- Every row for January will have avg_sales = 200.

`AVG()` With `ORDER BY`:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;

Output for January:
- First row: 100 → Running average = 100.
- Second row: 100, 200 → Running average = (100 + 200) / 2 = 150.
- Third row: 100, 200, 300 → Running average = (100 + 200 + 300) / 3 = 200.

Summary:

Without ORDER BY: Provides the same average for all rows in the partition, treating them as a single group.
With ORDER BY: Provides a running/cumulative average, recalculating for each row based on the preceding data.

Note: Query with ROW_NUMBER() and AVG():

SELECT emp_id, 
       sales, 
       month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales,
       ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales) AS row_num
FROM employee;

xplanation:

AVG(sales) OVER (PARTITION BY month ORDER BY sales): This calculates the running average of sales for each month, ordered by sales.
ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales): This assigns a unique row number to each row within each month, ordered by sales. The row numbers will start from 1 and increase for each subsequent row within the same month.

Result Example:

Assume the following data:

emp_id	sales	month
1	100	Jan
2	200	Jan
3	300	Jan
4	150	Feb
5	250	Feb
6	350	Mar
7	400	Mar

The result would be:

emp_id	sales	month	running_avg_sales	row_num
1	100	Jan	100	1
2	200	Jan	150	2
3	300	Jan	200	3
4	150	Feb	150	1
5	250	Feb	200	2
6	350	Mar	350	1
7	400	Mar	375	2

Breakdown:

For January, the running average starts with the first row (100), then updates after the second ((100 + 200) / 2 = 150), and so on. The ROW_NUMBER() is incremented sequentially (1, 2, 3).
For February, the running average is recalculated similarly, and the ROW_NUMBER() restarts at 1 for each new month.
For March, the same process applies, with both the running average and row number starting over for the new month.

Key Points:

The AVG() calculates the cumulative average, changing row by row.
The ROW_NUMBER() gives a unique number to each row within the partition, based on the ordering.

emp_id	sales	month	running_total
1	100	Jan	100
2	200	Jan	300
3	300	Jan	600
4	150	Feb	150
5	250	Feb	400
6	350	Mar	350
7	400	Mar	750

emp_id	sales	month	total_sales_per_month
1	100	Jan	600
2	200	Jan	600
3	300	Jan	600
4	150	Feb	400
5	250	Feb	400
6	350	Mar	750
7	400	Mar	750

emp_id	sales	month	avg_sales_per_month
1	100	Jan	200
2	200	Jan	200
3	300	Jan	200
4	150	Feb	200
5	250	Feb	200
6	350	Mar	375
7	400	Mar	375

emp_id	sales	month	running_avg_sales
1	100	Jan	100
2	200	Jan	150
3	300	Jan	200
4	150	Feb	150
5	250	Feb	200
6	350	Mar	350
7	400	Mar	375

emp_id	sales	month	running_avg_sales	row_num
1	100	Jan	100	1
2	200	Jan	150	2
3	300	Jan	200	3
4	150	Feb	150	1
5	250	Feb	200	2
6	350	Mar	350	1
7	400	Mar	375	2

emp_id	sales	month	running_total
1	100	Jan	100
2	200	Jan	300
3	300	Jan	600
4	150	Feb	150
5	250	Feb	400
6	350	Mar	350
7	400	Mar	750

emp_id	sales	month	total_sales_per_month
1	100	Jan	600
2	200	Jan	600
3	300	Jan	600
4	150	Feb	400
5	250	Feb	400
6	350	Mar	750
7	400	Mar	750

emp_id	sales	month	avg_sales_per_month
1	100	Jan	200
2	200	Jan	200
3	300	Jan	200
4	150	Feb	200
5	250	Feb	200
6	350	Mar	375
7	400	Mar	375

emp_id	sales	month	running_avg_sales
1	100	Jan	100
2	200	Jan	150
3	300	Jan	200
4	150	Feb	150
5	250	Feb	200
6	350	Mar	350
7	400	Mar	375

emp_id	sales	month	running_avg_sales	row_num
1	100	Jan	100	1
2	200	Jan	150	2
3	300	Jan	200	3
4	150	Feb	150	1
5	250	Feb	200	2
6	350	Mar	350	1
7	400	Mar	375	2

Data Engineer Labs

SQL window function with “ORDER BY” and without “ORDER BY”

Dataset Example:

a) Using SUM() with ORDER BY (Cumulative or Running Total):

Query:

Breakdown:

Result (Running Total Example):

Key Points:

b) Using SUM() without ORDER BY (Total Sum Per Partition):

Query:

Breakdown:

Result (Total Sum Per Partition Example):

Key Points:

Summary of Differences:

Visual Example:

i. With ORDER BY (Running Total):

ii. Without ORDER BY (Total Sum):

a). AVG() Without ORDER BY (Simple Partitioned Average):

Query Example:

Explanation:

Result:

b) AVG() With ORDER BY (Cumulative or Running Average):

Query Example:

Explanation:

Result:

Key Differences:

i. AVG() Without ORDER BY:

ii. AVG() With ORDER BY:

Example Comparison:

AVG() Without ORDER BY:

AVG() With ORDER BY:

Summary:

xplanation:

Result Example:

Breakdown:

Key Points:

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

a) Using `SUM()` with `ORDER BY` (Cumulative or Running Total):

b) Using `SUM()` without `ORDER BY` (Total Sum Per Partition):

i. With `ORDER BY` (Running Total):

ii. Without `ORDER BY` (Total Sum):

a). `AVG()` Without `ORDER BY` (Simple Partitioned Average):

b) `AVG()` With `ORDER BY` (Cumulative or Running Average):

i. `AVG()` Without `ORDER BY`:

ii. `AVG()` With `ORDER BY`:

`AVG()` Without `ORDER BY`:

`AVG()` With `ORDER BY`:

emp_id	sales	month	running_total
1	100	Jan	100
2	200	Jan	300
3	300	Jan	600
4	150	Feb	150
5	250	Feb	400
6	350	Mar	350
7	400	Mar	750

emp_id	sales	month	total_sales_per_month
1	100	Jan	600
2	200	Jan	600
3	300	Jan	600
4	150	Feb	400
5	250	Feb	400
6	350	Mar	750
7	400	Mar	750

emp_id	sales	month	avg_sales_per_month
1	100	Jan	200
2	200	Jan	200
3	300	Jan	200
4	150	Feb	200
5	250	Feb	200
6	350	Mar	375
7	400	Mar	375

emp_id	sales	month	running_avg_sales
1	100	Jan	100
2	200	Jan	150
3	300	Jan	200
4	150	Feb	150
5	250	Feb	200
6	350	Mar	350
7	400	Mar	375

emp_id	sales	month	running_avg_sales	row_num
1	100	Jan	100	1
2	200	Jan	150	2
3	300	Jan	200	3
4	150	Feb	150	1
5	250	Feb	200	2
6	350	Mar	350	1
7	400	Mar	375	2