SQL question -time interval (seven-day streak of visiting the same URL)
2024
Given a table with event logs, find the percentage of users that had at least one seven-day streak of visiting the same URL.
Note: Round the results to 2 decimal places. For example, if the result is 35.67% return 0.35.
Example:
Input:
events
table
Column | Type |
---|---|
user_id | INTEGER |
created_at | DATETIME |
url | VARCHAR |
Output
Column | Type |
---|---|
percent_of_users | FLOAT |
https://www.interviewquery.com/questions/seven-day-streak?via=venkata&ref=VENKATA
WITH ranked_events AS (
SELECT
user_id,
url,
created_at,
ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS row_num,
-- Generate group_id based on day difference and row number
(created_at / 86400) - ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS group_id
FROM
events
), streaks AS (
-- Identify streaks by counting consecutive days in each group
SELECT
user_id,
url,
COUNT(DISTINCT created_at) AS streak_length
FROM
ranked_events
GROUP BY
user_id, url, group_id
HAVING
COUNT(DISTINCT created_at) >= 7 -- Streaks of at least 7 days
), users_with_streak AS (
-- Get distinct users who had at least one streak
SELECT DISTINCT
user_id
FROM
streaks
)
-- Calculate the percentage of users with a 7-day streak
SELECT
ROUND(COUNT(DISTINCT s.user_id) * 1.0 / COUNT(DISTINCT e.user_id), 2) AS percent_of_users
FROM
events e
LEFT JOIN
users_with_streak s ON e.user_id = s.user_id;
Summary of Steps:
- Convert
created_at
to days (if it’s a Unix timestamp). - Subtract
ROW_NUMBER()
from the day-level value to creategroup_id
. - Group by
group_id
to find streaks of consecutive days. - Filter users with at least one streak of 7 days.
- Calculate the percentage of such use
Problem Overview
We need to identify when a user visits the same URL on consecutive days for at least 7 days in a row. To do that, we need to:
- Group visits to the same URL by the user.
- Identify streaks of consecutive days for that URL.
The tricky part here is identifying consecutive days. We can’t just look at the created_at
field (which might be a timestamp) alone. Instead, we use a combination of ROW_NUMBER()
and created_at
to assign a group_id
to consecutive visits.
Key Idea Behind group_id
We use group_id
to group consecutive days together. For example, if a user visited a URL on 2024-09-01, 2024-09-02, and 2024-09-03, they should be part of the same group (same group_id
). However, if they skipped a day and visited again on 2024-09-05, the next visit should be in a new group (different group_id
).
We can achieve this by comparing:
- The actual day of the visit (
created_at
, converted to days). - The sequential order of the visit (
ROW_NUMBER()
).
How group_id
Works
- Row Number (
ROW_NUMBER()
): This assigns a sequential number to each visit byuser_id
andurl
, ordered bycreated_at
. It helps track the position of each visit within the series of visits. - Convert
created_at
to Days: We convertcreated_at
to a day-level value (by dividing it by86400
if it’s a Unix timestamp) to compare dates. Visits on consecutive days will have consecutive values here. - Group Consecutive Days: By subtracting the
ROW_NUMBER()
from the day-levelcreated_at
, we generate agroup_id
. This works because the difference between the day and the row number will be constant for consecutive days.
Example with Detailed Explanation
Let’s walk through an example. Assume the following visit data for user_id = 1
:
user_id | created_at | Date | ROW_NUMBER | Day (created_at/86400) | group_id (Day - ROW_NUMBER ) |
---|---|---|---|---|---|
1 | 1695801600 (Sep 27) | 2024-09-27 | 1 | 19622 | 19621 |
1 | 1695888000 (Sep 28) | 2024-09-28 | 2 | 19623 | 19621 |
1 | 1695974400 (Sep 29) | 2024-09-29 | 3 | 19624 | 19621 |
1 | 1696060800 (Sep 30) | 2024-09-30 | 4 | 19625 | 19621 |
1 | 1696233600 (Oct 2) | 2024-10-02 | 5 | 19627 | 19622 |
1 | 1696320000 (Oct 3) | 2024-10-03 | 6 | 19628 | 19622 |
Explanation of Columns:
user_id
: Identifies the user.created_at
: The timestamp of the visit.Date
: The actual calendar date.ROW_NUMBER
: The sequential number for each visit by the user to the same URL.Day
: The day-level value ofcreated_at
(converted by dividing the timestamp by 86400).group_id
: The difference betweenDay
andROW_NUMBER()
.
Why group_id
Groups Consecutive Days:
- For the first 4 visits (Sep 27 to Sep 30):
- The difference between
Day
andROW_NUMBER()
is the same (19621
). This means these 4 visits are part of the same group and represent a streak.
- The difference between
- For the next visit (Oct 2):
- The difference changes to
19622
, indicating that this visit is not part of the previous streak. It starts a new group (newgroup_id
).
- The difference changes to
- For the visit on Oct 3:
- It has the same
group_id
as Oct 2, indicating that these two visits form a streak.
- It has the same
Why This Works:
- For consecutive days, the day-level value (
created_at / 86400
) increases by 1, and theROW_NUMBER()
also increases by 1. SubtractingROW_NUMBER()
from the day-level value results in a constant difference (samegroup_id
). - When there’s a gap between days, the day-level value increases more than the row number. This changes the
group_id
, breaking the streak.