已知有用户登录记录表包含登录日期和登录用户ID,请查询出截止到当前日期累积登录用户数及登陆用户列表
样例数据
+-------------+----------+
| log_date | user_id |
+-------------+----------+
| 2024-01-01 | a |
| 2024-01-02 | a |
| 2024-01-02 | b |
| 2024-01-03 | b |
| 2024-01-04 | c |
| 2024-01-05 | b |
| 2024-01-05 | c |
| 2024-01-05 | d |
| 2024-01-05 | e |
+-------------+----------+
期望结果
+-------------+-----------+------------------------+
| log_date | user_cnt | user_list |
+-------------+-----------+------------------------+
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
+-------------+-----------+------------------------+
统计截止到当前行的登录用户数,考察的是聚合函数开窗函数;查询用户列表考察的数据对数据的聚合、数组去重、数组排序等操作。属于深度考察开窗函数、数组操作等知识内容。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
执行SQL
select log_date,
user_id,
count(user_id) over (order by log_date asc) as user_cnt,
collect_list(user_id) over (order by log_date asc) as user_list
from t_user_login
查询结果
+-------------+----------+-----------+----------------------------------------+
| log_date | user_id | user_cnt | user_list |
+-------------+----------+-----------+----------------------------------------+
| 2024-01-01 | a | 1 | ["a"] |
| 2024-01-02 | b | 3 | ["a","b","a"] |
| 2024-01-02 | a | 3 | ["a","b","a"] |
| 2024-01-03 | b | 4 | ["a","b","a","b"] |
| 2024-01-04 | c | 5 | ["a","b","a","b","c"] |
| 2024-01-05 | e | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | d | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | c | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | b | 9 | ["a","b","a","b","c","e","d","c","b"] |
+-------------+----------+-----------+----------------------------------------+
注意
这里我们加上去重,使用collect_set替换掉collect_list
执行SQL
select log_date,
user_id,
count(distinct user_id) over (order by log_date asc) as user_cnt,
collect_set(user_id) over (order by log_date asc) as user_list
from t_user_login
查询结果
+-------------+----------+-----------+------------------------+
| log_date | user_id | user_cnt | user_list |
+-------------+----------+-----------+------------------------+
| 2024-01-01 | a | 1 | ["a"] |
| 2024-01-02 | b | 2 | ["a","b"] |
| 2024-01-02 | a | 2 | ["a","b"] |
| 2024-01-03 | b | 2 | ["a","b"] |
| 2024-01-04 | c | 3 | ["a","b","c"] |
| 2024-01-05 | e | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | d | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | c | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | b | 5 | ["a","b","c","e","d"] |
+-------------+----------+-----------+------------------------+
上面数据观察可知相同日期的结果相同我们只要对用户列表进行排序,然后使用group by 进行去重即可得到最终结果
执行SQL
select log_date,
user_cnt,
user_list
from (select log_date,
user_id,
count(distinct user_id) over (order by log_date asc) as user_cnt,
sort_array(collect_set(user_id) over (order by log_date asc)) as user_list
from t_user_login) t
group by log_date, user_cnt, user_list
查询结果
+-------------+-----------+------------------------+
| log_date | user_cnt | user_list |
+-------------+-----------+------------------------+
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
+-------------+-----------+------------------------+
-- 建表语句
CREATE TABLE t_user_login
(
log_date STRING,
user_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 数据插入语句
INSERT INTO t_user_login
(log_date, user_id) VALUES
('2024-01-01','a'),
('2024-01-02','a'),
('2024-01-02','b'),
('2024-01-03','b'),
('2024-01-04','c'),
('2024-01-05','b'),
('2024-01-05','c'),
('2024-01-05','d'),
('2024-01-05','e');