连续登录问题原创
# 连续登录问题
# 1. 题目
有一个用户登录表dayly_sql.user_login_log,记录用户登录ID及登录日期,求最近7天内连续登录3天以上的客户清单,表结构及示例数据如下:
User_id | Login_Date |
---|---|
000001 | 2024-03-29 |
000001 | 2024-03-31 |
000001 | 2024-04-01 |
000001 | 2024-04-02 |
000001 | 2024-04-02 |
000001 | 2024-04-04 |
000001 | 2024-04-05 |
000002 | 2024-04-04 |
000002 | 2024-04-05 |
000002 | 2024-04-06 |
000002 | 2024-04-07 |
000002 | 2024-04-01 |
000002 | 2024-04-02 |
000002 | 2024-04-04 |
# 2. 解题思路
- 对表按用户及登录去重并限制最近7天(假设今天是2024-04-07);
- 使用row_number按User_id分组,按Login_Date降序,得到对应序号RN;
- 将Login_Date减去RN,得到DATE2:dateidff(Login_Date,rn);
- 按User_id、DATE2分组,取count(1) >=3的即为最近7天连续登录3天以上用户。
第一步:限制日期及去重程:
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date;
1
2
3
4
2
3
4
得到以下数据集:
User_id | Login_Date |
---|---|
000001 | 2024-03-31 |
000001 | 2024-04-01 |
000001 | 2024-04-02 |
000001 | 2024-04-04 |
000001 | 2024-04-05 |
000002 | 2024-04-01 |
000002 | 2024-04-02 |
000002 | 2024-04-04 |
000002 | 2024-04-05 |
000002 | 2024-04-06 |
000002 | 2024-04-07 |
第二步:日期排序及算出date2重点是date_sub(t2.login_date,t2.rn) as date2
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
2
3
4
5
6
7
8
9
10
11
12
13
14
得到以下数据集:
user_id | login_date | date2 | rn |
---|---|---|---|
000001 | 2024-03-31 | 2024-03-30 | 1 |
000001 | 2024-04-01 | 2024-03-30 | 2 |
000001 | 2024-04-02 | 2024-03-30 | 3 |
000001 | 2024-04-04 | 2024-03-31 | 4 |
000001 | 2024-04-05 | 2024-03-31 | 5 |
000002 | 2024-04-01 | 2024-03-31 | 1 |
000002 | 2024-04-02 | 2024-03-31 | 2 |
000002 | 2024-04-04 | 2024-04-01 | 3 |
000002 | 2024-04-05 | 2024-04-01 | 4 |
000002 | 2024-04-06 | 2024-04-01 | 5 |
000002 | 2024-04-07 | 2024-04-01 | 6 |
第三步:按user_id,date2分组,count(*) >= 3 的即为连续登录天数大于等于3的用户跟对应的登录前1天:
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
(
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
)t3
GROUP BY t3.user_id, t3.date2
having count(*) >= 3
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
得到结果,其中date2是连续登录的前1天:
user_id | date2 | dm |
---|---|---|
000001 | 2024-03-30 | 3 |
000002 | 2024-04-01 | 4 |
# 3. 变化1:求用户最大连续登录天数
求某用户最近7天最大连续登录天数。
思路:先限制用户ID,然后去重。后面计算跟求所有用户连续登录天数一样,再取max(dm)即可
SELECT t4.user_id,max(t4.dm) as dm
FROM
(
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
(
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
AND user_id = '000001'
GROUP BY user_id,login_date
)t1
)t2
)t3
GROUP BY t3.user_id, t3.date2
)t4
GROUP BY t4.user_id
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
得到连续登录天数:
user_id | dm |
---|---|
000001 | 3 |
# 4. 变化2:求用户连续登录的日期段
第一步:按上面步骤,日期排序及算出date2
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
2
3
4
5
6
7
8
9
10
11
12
13
14
得到以下数据集:
user_id | login_date | date2 | rn |
---|---|---|---|
000001 | 2024-03-31 | 2024-03-30 | 1 |
000001 | 2024-04-01 | 2024-03-30 | 2 |
000001 | 2024-04-02 | 2024-03-30 | 3 |
000001 | 2024-04-04 | 2024-03-31 | 4 |
000001 | 2024-04-05 | 2024-03-31 | 5 |
000002 | 2024-04-01 | 2024-03-31 | 1 |
000002 | 2024-04-02 | 2024-03-31 | 2 |
000002 | 2024-04-04 | 2024-04-01 | 3 |
000002 | 2024-04-05 | 2024-04-01 | 4 |
000002 | 2024-04-06 | 2024-04-01 | 5 |
000002 | 2024-04-07 | 2024-04-01 | 6 |
第二步:算出用户连续登录天数
with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT * FROM tmp_dm;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
得到以下数据集,其中date2为连续登录的前1天,dm为连续登录天数:
user_id | date2 | dm |
---|---|---|
000001 | 2024-03-31 | 2 |
000001 | 2024-03-30 | 3 |
000002 | 2024-03-31 | 2 |
000002 | 2024-04-01 | 4 |
第三步:第一步的结果tmp_rn按user_id,date2分组,取max(rn) as day_add_num,得出需要加的天数,然后关联tmp_dm:
with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT *
FROM
(
SELECT user_id,date2,max(rn) as day_add_num FROM tmp_rn GROUP BY user_id,date2
)t1
INNER JOIN
(
SELECT * FROM tmp_dm
)t2
ON t1.user_id = t2.user_id AND t1.date2 = t2.date2
ORDER BY t1.user_id,t1.day_add_num
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
得出以下结果,其中date2为连续登录开始前1天,date_add(date2,day_add_num)为连续登录结束日期:
t1.user_id | t1.date2 | t1.day_add_num | t2.user_id | t2.date2 | dm |
---|---|---|---|---|---|
000001 | 2024-03-30 | 3 | 000001 | 2024-03-30 | 3 |
000001 | 2024-03-31 | 5 | 000001 | 2024-03-31 | 2 |
000002 | 2024-03-31 | 2 | 000002 | 2024-03-31 | 2 |
000002 | 2024-04-01 | 6 | 000002 | 2024-04-01 | 4 |
第四步:算出连续登录开始日期及结束日期
with tmp_rn as (
SELECT t2.user_id,t2.login_date,date_sub(t2.login_date,t2.rn) as date2,t2.rn
FROM
(
SELECT t1.user_id, t1.login_date
,row_number() over (partition by user_id order by t1.login_date ) as rn
FROM
(
SELECT user_id,login_date
FROM dayly_sql.user_login_log
WHERE login_date BETWEEN date_sub('2024-04-07',7) AND '2024-04-07'
GROUP BY user_id,login_date
)t1
)t2
),
tmp_dm as (
SELECT t3.user_id,t3.date2,count(*) as dm
FROM
tmp_rn t3
GROUP BY t3.user_id, t3.date2
)
SELECT t1.user_id
,date_add(t1.date2,cast(1 AS INT)) AS start_date
,date_add(t1.date2,cast(t1.day_add_num AS INT)) AS end_date
FROM
(
SELECT user_id,date2,max(rn) as day_add_num FROM tmp_rn GROUP BY user_id,date2
)t1
INNER JOIN
(
SELECT * FROM tmp_dm
)t2
ON t1.user_id = t2.user_id AND t1.date2 = t2.date2
ORDER BY t1.user_id,date_add(t1.date2,cast(1 AS INT))
;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
结果:
user_id | start_date | end_date |
---|---|---|
000001 | 2024-03-31 | 2024-04-02 |
000001 | 2024-04-01 | 2024-04-05 |
000002 | 2024-04-01 | 2024-04-02 |
000002 | 2024-04-02 | 2024-04-07 |