SQL question -time interval (seven-day streak of visiting the same URL)

Given a table with event logs, find the percentage of users that had at least one seven-day streak of visiting the same URL.

Note: Round the results to 2 decimal places. For example, if the result is 35.67% return 0.35.

Example:

Input:

events table

ColumnType
user_idINTEGER
created_atDATETIME
urlVARCHAR

Output

ColumnType
percent_of_usersFLOAT

https://www.interviewquery.com/questions/seven-day-streak?via=venkata&ref=VENKATA

WITH ranked_events AS (
    SELECT 
        user_id, 
        url, 
        created_at, 
        ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS row_num,
        -- Generate group_id based on day difference and row number
        (created_at / 86400) - ROW_NUMBER() OVER (PARTITION BY user_id, url ORDER BY created_at) AS group_id
    FROM 
        events
), streaks AS (
    -- Identify streaks by counting consecutive days in each group
    SELECT 
        user_id, 
        url, 
        COUNT(DISTINCT created_at) AS streak_length
    FROM 
        ranked_events
    GROUP BY 
        user_id, url, group_id
    HAVING 
        COUNT(DISTINCT created_at) >= 7  -- Streaks of at least 7 days
), users_with_streak AS (
    -- Get distinct users who had at least one streak
    SELECT DISTINCT 
        user_id
    FROM 
        streaks
)
-- Calculate the percentage of users with a 7-day streak
SELECT 
    ROUND(COUNT(DISTINCT s.user_id) * 1.0 / COUNT(DISTINCT e.user_id), 2) AS percent_of_users
FROM 
    events e
LEFT JOIN 
    users_with_streak s ON e.user_id = s.user_id;

Summary of Steps:

  1. Convert created_at to days (if it’s a Unix timestamp).
  2. Subtract ROW_NUMBER() from the day-level value to create group_id.
  3. Group by group_id to find streaks of consecutive days.
  4. Filter users with at least one streak of 7 days.
  5. Calculate the percentage of such use

Problem Overview

We need to identify when a user visits the same URL on consecutive days for at least 7 days in a row. To do that, we need to:

  1. Group visits to the same URL by the user.
  2. Identify streaks of consecutive days for that URL.

The tricky part here is identifying consecutive days. We can’t just look at the created_at field (which might be a timestamp) alone. Instead, we use a combination of ROW_NUMBER() and created_at to assign a group_id to consecutive visits.


Key Idea Behind group_id

We use group_id to group consecutive days together. For example, if a user visited a URL on 2024-09-01, 2024-09-02, and 2024-09-03, they should be part of the same group (same group_id). However, if they skipped a day and visited again on 2024-09-05, the next visit should be in a new group (different group_id).

We can achieve this by comparing:

  • The actual day of the visit (created_at, converted to days).
  • The sequential order of the visit (ROW_NUMBER()).

How group_id Works

  1. Row Number (ROW_NUMBER()): This assigns a sequential number to each visit by user_id and url, ordered by created_at. It helps track the position of each visit within the series of visits.
  2. Convert created_at to Days: We convert created_at to a day-level value (by dividing it by 86400 if it’s a Unix timestamp) to compare dates. Visits on consecutive days will have consecutive values here.
  3. Group Consecutive Days: By subtracting the ROW_NUMBER() from the day-level created_at, we generate a group_id. This works because the difference between the day and the row number will be constant for consecutive days.

Example with Detailed Explanation

Let’s walk through an example. Assume the following visit data for user_id = 1:

user_idcreated_atDateROW_NUMBERDay (created_at/86400)group_id (Day - ROW_NUMBER)
11695801600 (Sep 27)2024-09-2711962219621
11695888000 (Sep 28)2024-09-2821962319621
11695974400 (Sep 29)2024-09-2931962419621
11696060800 (Sep 30)2024-09-3041962519621
11696233600 (Oct 2)2024-10-0251962719622
11696320000 (Oct 3)2024-10-0361962819622

Explanation of Columns:

  • user_id: Identifies the user.
  • created_at: The timestamp of the visit.
  • Date: The actual calendar date.
  • ROW_NUMBER: The sequential number for each visit by the user to the same URL.
  • Day: The day-level value of created_at (converted by dividing the timestamp by 86400).
  • group_id: The difference between Day and ROW_NUMBER().

Why group_id Groups Consecutive Days:

  • For the first 4 visits (Sep 27 to Sep 30):
    • The difference between Day and ROW_NUMBER() is the same (19621). This means these 4 visits are part of the same group and represent a streak.
  • For the next visit (Oct 2):
    • The difference changes to 19622, indicating that this visit is not part of the previous streak. It starts a new group (new group_id).
  • For the visit on Oct 3:
    • It has the same group_id as Oct 2, indicating that these two visits form a streak.

Why This Works:

  • For consecutive days, the day-level value (created_at / 86400) increases by 1, and the ROW_NUMBER() also increases by 1. Subtracting ROW_NUMBER() from the day-level value results in a constant difference (same group_id).
  • When there’s a gap between days, the day-level value increases more than the row number. This changes the group_id, breaking the streak.

Leave a Reply

Your email address will not be published. Required fields are marked *

Deprecated: htmlspecialchars(): Passing null to parameter #1 ($string) of type string is deprecated in /var/www/html/wp-includes/formatting.php on line 4720