SQL window function with “ORDER BY” and without “ORDER BY”

  1. Between SUM() with ORDER BY and SUM() without ORDER BY in a window function

Dataset Example:

Assume we have the following sales data for different employees across several months:

emp_idsalesmonth
1100Jan
2200Jan
3300Jan
4150Feb
5250Feb
6350Mar
7400Mar

Now, let’s compare two scenarios:

a) Using SUM() with ORDER BY (Cumulative or Running Total):

In this case, SUM() with an ORDER BY clause inside the OVER statement will give you a cumulative sum (or running total). It calculates the sum for each row by adding up the sales from the previous rows within each partition.

Query:

SELECT emp_id, sales, month, 
       SUM(sales) OVER (PARTITION BY month ORDER BY sales) AS running_total
FROM sales_data;

Breakdown:

  • PARTITION BY month: Groups the rows by each month.
  • ORDER BY sales: Orders the rows within each month by sales, and calculates a running total.

Result (Running Total Example):

emp_idsalesmonthrunning_total
1100Jan100
2200Jan300
3300Jan600
4150Feb150
5250Feb400
6350Mar350
7400Mar750

Key Points:

  • The running_total column shows a cumulative sum of sales within each month.
  • For example, in January, the first sale is 100, the second sale adds to 300, and the third sale results in 600.
  • The rows are ordered by sales, so it calculates the sum row-by-row in ascending order of sales.

b) Using SUM() without ORDER BY (Total Sum Per Partition):

If you remove the ORDER BY inside the OVER clause, SUM() will calculate the total sum of sales for each month without accumulating as it moves row by row.

Query:

SELECT emp_id, sales, month, 
       SUM(sales) OVER (PARTITION BY month) AS total_sales_per_month
FROM sales_data;

Breakdown:

  • PARTITION BY month: Groups the rows by each month.
  • No ORDER BY: It simply returns the total sum for each month for every row.

Result (Total Sum Per Partition Example):

emp_idsalesmonthtotal_sales_per_month
1100Jan600
2200Jan600
3300Jan600
4150Feb400
5250Feb400
6350Mar750
7400Mar750

Key Points:

  • The total_sales_per_month column shows the same total for each row in that month.
  • For January, the total sum is 600, and it repeats that for every row in January.
  • There’s no row-by-row accumulation; it directly calculates the total sum for each partition (month).

Summary of Differences:

  • With ORDER BY:
    • Calculates a cumulative sum (running total).
    • The sum grows row-by-row as it moves through the partition.
  • Without ORDER BY:
    • Calculates the same total for all rows in the partition.
    • No row-by-row accumulation, just the total for the partition.

Visual Example:

i. With ORDER BY (Running Total):

For January:

  • 100 → Running total: 100
  • 200 → Running total: 100 + 200 = 300
  • 300 → Running total: 100 + 200 + 300 = 600

For February:

  • 150 → Running total: 150
  • 250 → Running total: 150 + 250 = 400

ii. Without ORDER BY (Total Sum):

For January:

  • Each row has the same total: 100 + 200 + 300 = 600.

For February:

  • Each row has the same total: 150 + 250 = 400.

I hope this clears up the difference between using ORDER BY and not using it within window functions! Let me know if you’d like further clarification.

2) Between AVG() with ORDER BY and AVG() without ORDER BY

The difference between AVG() with ORDER BY and AVG() without ORDER BY in a window function is primarily related to how the average is calculated and what result you get for each row.

Let’s break it down with examples.


a). AVG() Without ORDER BY (Simple Partitioned Average):

When you use AVG() in a window function without ORDER BY, it calculates the average of all rows in the partition (group), and the same average value is assigned to each row in that partition.

Query Example:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month) AS avg_sales_per_month
FROM employee;

Explanation:

  • PARTITION BY month: This groups the rows by month.
  • No ORDER BY: It calculates the average sales for the entire partition and assigns the same average to each row.

Result:

emp_idsalesmonthavg_sales_per_month
1100Jan200
2200Jan200
3300Jan200
4150Feb200
5250Feb200
6350Mar375
7400Mar375
  • In the case of January, the average of 100, 200, and 300 is 200, and that same value is repeated for every row in January.

b) AVG() With ORDER BY (Cumulative or Running Average):

When you use AVG() in a window function with ORDER BY, it calculates a cumulative (running) average for each row, which means the average is recalculated row by row based on the current and preceding rows.

Query Example:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;

Explanation:

  • PARTITION BY month: This groups the rows by month.
  • ORDER BY sales: It orders the rows by sales and calculates the average up to the current row, so each row shows a running or cumulative average.

Result:

emp_idsalesmonthrunning_avg_sales
1100Jan100
2200Jan150
3300Jan200
4150Feb150
5250Feb200
6350Mar350
7400Mar375
  • In the case of January:
    • The first row has an average of just the first sale (100).
    • The second row recalculates the average as (100 + 200) / 2 = 150.
    • The third row includes all three sales and calculates (100 + 200 + 300) / 3 = 200.

Key Differences:

i. AVG() Without ORDER BY:

  • Computes the same average for all rows in the partition (e.g., all rows in a month).
  • It does not depend on the order of rows.
  • Each row gets the final average value for the entire group/partition.

ii. AVG() With ORDER BY:

  • Computes a running (cumulative) average, updating the average value as it moves through the ordered rows.
  • Each row shows the average of the current and preceding rows.
  • The order of rows affects how the average is calculated.

Example Comparison:

Using the January data:

emp_idsalesmonth
1100Jan
2200Jan
3300Jan

AVG() Without ORDER BY:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month) AS avg_sales
FROM employee;
  • Output for January:
    • The average for all sales in January: (100 + 200 + 300) / 3 = 200.
    • Every row for January will have avg_sales = 200.

AVG() With ORDER BY:

SELECT emp_id, sales, month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales
FROM employee;
  • Output for January:
    • First row: 100 → Running average = 100.
    • Second row: 100, 200 → Running average = (100 + 200) / 2 = 150.
    • Third row: 100, 200, 300 → Running average = (100 + 200 + 300) / 3 = 200.

Summary:

  • Without ORDER BY: Provides the same average for all rows in the partition, treating them as a single group.
  • With ORDER BY: Provides a running/cumulative average, recalculating for each row based on the preceding data.

Note: Query with ROW_NUMBER() and AVG():

SELECT emp_id, 
       sales, 
       month,
       AVG(sales) OVER (PARTITION BY month ORDER BY sales) AS running_avg_sales,
       ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales) AS row_num
FROM employee;

xplanation:

  • AVG(sales) OVER (PARTITION BY month ORDER BY sales): This calculates the running average of sales for each month, ordered by sales.
  • ROW_NUMBER() OVER (PARTITION BY month ORDER BY sales): This assigns a unique row number to each row within each month, ordered by sales. The row numbers will start from 1 and increase for each subsequent row within the same month.

Result Example:

Assume the following data:

emp_idsalesmonth
1100Jan
2200Jan
3300Jan
4150Feb
5250Feb
6350Mar
7400Mar

The result would be:

emp_idsalesmonthrunning_avg_salesrow_num
1100Jan1001
2200Jan1502
3300Jan2003
4150Feb1501
5250Feb2002
6350Mar3501
7400Mar3752

Breakdown:

  • For January, the running average starts with the first row (100), then updates after the second ((100 + 200) / 2 = 150), and so on. The ROW_NUMBER() is incremented sequentially (1, 2, 3).
  • For February, the running average is recalculated similarly, and the ROW_NUMBER() restarts at 1 for each new month.
  • For March, the same process applies, with both the running average and row number starting over for the new month.

Key Points:

  • The AVG() calculates the cumulative average, changing row by row.
  • The ROW_NUMBER() gives a unique number to each row within the partition, based on the ordering.

Leave a Reply

Your email address will not be published. Required fields are marked *

Deprecated: htmlspecialchars(): Passing null to parameter #1 ($string) of type string is deprecated in /var/www/html/wp-includes/formatting.php on line 4720