有一张用户操作行为记录表 t_act_log_035 包含用户ID(user_id),操作编号(op_id),操作时间(op_time)
要求:
(1)统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻;
(2)统计每天用户行为序列为A-B-D的用户数;其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
+----------+--------+----------------------+
| user_id | op_id | op_time |
+----------+--------+----------------------+
| 1 | A | 2023-10-18 12:01:03 |
| 2 | A | 2023-10-18 12:01:04 |
| 3 | A | 2023-10-18 12:01:05 |
| 1 | B | 2023-10-18 12:03:03 |
| 1 | A | 2023-10-18 12:04:03 |
| 1 | C | 2023-10-18 12:06:03 |
| 1 | D | 2023-10-18 12:11:03 |
| 2 | A | 2023-10-18 12:07:04 |
| 3 | C | 2023-10-18 12:02:05 |
| 2 | C | 2023-10-18 12:09:03 |
| 2 | A | 2023-10-18 12:10:03 |
| 4 | A | 2023-10-18 12:01:03 |
| 4 | C | 2023-10-18 12:11:05 |
| 4 | D | 2023-10-18 12:15:05 |
| 1 | A | 2023-10-19 12:01:03 |
| 2 | A | 2023-10-19 12:01:04 |
| 3 | A | 2023-10-19 12:01:05 |
| 1 | B | 2023-10-19 12:03:03 |
| 1 | A | 2023-10-19 12:04:03 |
| 1 | C | 2023-10-19 12:06:03 |
| 2 | A | 2023-10-19 12:07:04 |
| 3 | B | 2023-10-19 12:08:05 |
| 3 | E | 2023-10-19 12:09:05 |
| 3 | D | 2023-10-19 12:11:05 |
| 2 | C | 2023-10-19 12:09:03 |
| 4 | E | 2023-10-19 12:05:03 |
| 4 | B | 2023-10-19 12:06:03 |
| 4 | E | 2023-10-19 12:07:03 |
| 2 | A | 2023-10-19 12:10:03 |
+----------+--------+----------------------+
本题分为两问,只看第一个问题可以看做计算相邻两行数据,lag()函数可能能解决,但是第二个问题,明显无法用lag(),更像是对字符串内容的匹配,所以该题目考察内容是按照顺序对字符串进行拼接,然后筛选出符合条件的字符串;
(1)按照顺序拼接字符串
(1)包含'A,B'
(2)包含’A%B%D'并且不能是‘A%B%C%D’
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
首先我们知道collect_list拼接字符串是无序的,即便按照顺序将原始数据排好,也不能保证结果有序。所以我们需要将时间op_time和op_id进行拼接,拼接完成之后,时间在前,对整个字符串进行排序,然后再把op_time去掉,保留op_id的拼接字符串,然后对字符串进行验证。
1)拼接op_time和op_id,然后根据用户和日期进行分组,collect_list聚合出每天用户的行为,使用sort_array保证拼接后的字符串有序。为了方便看结果,最后进加了order by
select
user_id,
dt,
concat_ws(',',sort_array(collect_list(op_str))) as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
order by user_id,dt
查询结果
2)将结果进行字符串替换,去掉op_time及添加的|,这里使用regexp_replace()函数进行替换。
select
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
order by user_id,dt
查询结果
3)使用like查询包含'A,B'的记录
with tmp as
(
select
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select
user_id,
dt,
op_sort
from tmp
where op_sort like '%A,B%'
查询结果
4)按照日期分组,计算每天符合条件的用户数量
with tmp as
(
select
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select
dt,count(1)
from tmp
where op_sort like '%A,B%'
group by dt
查询结果
5)使用like查询满足包含’A%B%D'但是不包含‘A%B%C%D’的记录
with tmp as
(
select
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select
user_id,
dt,
op_sort
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'
查询结果
6)按照日期分组,计算每天符合条件的用户数量
with tmp as
(
select
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select
dt,count(1)
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'
group by dt
查询结果
create table t_act_log_035(
user_id bigint,
op_id string,
op_time string
)row format delimited fields terminated by '\t';
insert into t_act_log_035(user_id, op_id, op_time) values
(1, 'A', '2023-10-18 12:01:03'),
(2, 'A', '2023-10-18 12:01:04'),
(3, 'A', '2023-10-18 12:01:05'),
(1, 'B', '2023-10-18 12:03:03'),
(1, 'A', '2023-10-18 12:04:03'),
(1, 'C', '2023-10-18 12:06:03'),
(1, 'D', '2023-10-18 12:11:03'),
(2, 'A', '2023-10-18 12:07:04'),
(3, 'C', '2023-10-18 12:02:05'),
(2, 'C', '2023-10-18 12:09:03'),
(2, 'A', '2023-10-18 12:10:03'),
(4, 'A', '2023-10-18 12:01:03'),
(4, 'C', '2023-10-18 12:11:05'),
(4, 'D', '2023-10-18 12:15:05'),
(1, 'A', '2023-10-19 12:01:03'),
(2, 'A', '2023-10-19 12:01:04'),
(3, 'A', '2023-10-19 12:01:05'),
(1, 'B', '2023-10-19 12:03:03'),
(1, 'A', '2023-10-19 12:04:03'),
(1, 'C', '2023-10-19 12:06:03'),
(2, 'A', '2023-10-19 12:07:04'),
(3, 'B', '2023-10-19 12:08:05'),
(3, 'E', '2023-10-19 12:09:05'),
(3, 'D', '2023-10-19 12:11:05'),
(2, 'C', '2023-10-19 12:09:03'),
(4, 'E', '2023-10-19 12:05:03'),
(4, 'B', '2023-10-19 12:06:03'),
(4, 'E', '2023-10-19 12:07:03'),
(2, 'A', '2023-10-19 12:10:03');