SQL question -time interval (seven-day streak of visiting the same URL)

Given a table with event logs, find the percentage of users that had at least one seven-day streak of visiting the same URL.

Note: Round the results to 2 decimal places. For example, if the result is 35.67% return 0.35.

Example:

Input:

events table

Column	Type
`user_id`	INTEGER
`created_at`	DATETIME
`url`	VARCHAR

Output

Column	Type
`percent_of_users`	FLOAT

https://www.interviewquery.com/questions/seven-day-streak?via=venkata&ref=VENKATA

WITH ranked_events AS (
    SELECT 
        user_id, 
        url, 
        created_at, 
        ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS row_num,
        -- Generate group_id based on day difference and row number
        (created_at / 86400) - ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS group_id
    FROM 
        events
), streaks AS (
    -- Identify streaks by counting consecutive days in each group
    SELECT 
        user_id, 
        url, 
        COUNT(DISTINCT created_at) AS streak_length
    FROM 
        ranked_events
    GROUP BY 
        user_id, url, group_id
    HAVING 
        COUNT(DISTINCT created_at) >= 7  -- Streaks of at least 7 days
), users_with_streak AS (
    -- Get distinct users who had at least one streak
    SELECT DISTINCT 
        user_id
    FROM 
        streaks
)
-- Calculate the percentage of users with a 7-day streak
SELECT 
    ROUND(COUNT(DISTINCT s.user_id) * 1.0 / COUNT(DISTINCT e.user_id), 2) AS percent_of_users
FROM 
    events e
LEFT JOIN 
    users_with_streak s ON e.user_id = s.user_id;

Summary of Steps:

Convert created_at to days (if it’s a Unix timestamp).
Subtract ROW_NUMBER() from the day-level value to create group_id.
Group by group_id to find streaks of consecutive days.
Filter users with at least one streak of 7 days.
Calculate the percentage of such use

Problem Overview

We need to identify when a user visits the same URL on consecutive days for at least 7 days in a row. To do that, we need to:

Group visits to the same URL by the user.
Identify streaks of consecutive days for that URL.

The tricky part here is identifying consecutive days. We can’t just look at the created_at field (which might be a timestamp) alone. Instead, we use a combination of ROW_NUMBER() and created_at to assign a group_id to consecutive visits.

Key Idea Behind `group_id`

We use group_id to group consecutive days together. For example, if a user visited a URL on 2024-09-01, 2024-09-02, and 2024-09-03, they should be part of the same group (same group_id). However, if they skipped a day and visited again on 2024-09-05, the next visit should be in a new group (different group_id).

We can achieve this by comparing:

The actual day of the visit (created_at, converted to days).
The sequential order of the visit (ROW_NUMBER()).

How `group_id` Works

Row Number (ROW_NUMBER()): This assigns a sequential number to each visit by user_id and url, ordered by created_at. It helps track the position of each visit within the series of visits.
Convert created_at to Days: We convert created_at to a day-level value (by dividing it by 86400 if it’s a Unix timestamp) to compare dates. Visits on consecutive days will have consecutive values here.
Group Consecutive Days: By subtracting the ROW_NUMBER() from the day-level created_at, we generate a group_id. This works because the difference between the day and the row number will be constant for consecutive days.

Example with Detailed Explanation

Let’s walk through an example. Assume the following visit data for user_id = 1:

user_id	created_at	Date	ROW_NUMBER	Day (created_at/86400)	group_id (`Day - ROW_NUMBER`)
1	1695801600 (Sep 27)	2024-09-27	1	19622	19621
1	1695888000 (Sep 28)	2024-09-28	2	19623	19621
1	1695974400 (Sep 29)	2024-09-29	3	19624	19621
1	1696060800 (Sep 30)	2024-09-30	4	19625	19621
1	1696233600 (Oct 2)	2024-10-02	5	19627	19622
1	1696320000 (Oct 3)	2024-10-03	6	19628	19622

Explanation of Columns:

user_id: Identifies the user.
created_at: The timestamp of the visit.
Date: The actual calendar date.
ROW_NUMBER: The sequential number for each visit by the user to the same URL.
Day: The day-level value of created_at (converted by dividing the timestamp by 86400).
group_id: The difference between Day and ROW_NUMBER().

Why `group_id` Groups Consecutive Days:

For the first 4 visits (Sep 27 to Sep 30):
- The difference between Day and ROW_NUMBER() is the same (19621). This means these 4 visits are part of the same group and represent a streak.
For the next visit (Oct 2):
- The difference changes to 19622, indicating that this visit is not part of the previous streak. It starts a new group (new group_id).
For the visit on Oct 3:
- It has the same group_id as Oct 2, indicating that these two visits form a streak.

Why This Works:

For consecutive days, the day-level value (created_at / 86400) increases by 1, and the ROW_NUMBER() also increases by 1. Subtracting ROW_NUMBER() from the day-level value results in a constant difference (same group_id).
When there’s a gap between days, the day-level value increases more than the row number. This changes the group_id, breaking the streak.

Data Engineer Labs

SQL question -time interval (seven-day streak of visiting the same URL)

Summary of Steps:

Problem Overview

Key Idea Behind `group_id`

How `group_id` Works

Example with Detailed Explanation

Explanation of Columns:

Why `group_id` Groups Consecutive Days:

Why This Works:

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Data Engineer Labs

SQL question -time interval (seven-day streak of visiting the same URL)

Summary of Steps:

Problem Overview

Key Idea Behind group_id

How group_id Works

Example with Detailed Explanation

Explanation of Columns:

Why group_id Groups Consecutive Days:

Why This Works:

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Key Idea Behind `group_id`

How `group_id` Works

Why `group_id` Groups Consecutive Days: