需求
求所有用户和活跃用户的总数及平均年龄
建表语句
create table user_age(
dt string,
user_id string,
age int
)
row format delimited fields terminated by '\t'
;
数据
insert overwrite table user_age values
("2019-02-11","test_1","23"),
("2019-02-11","test_2","19"),
("2019-02-11","test_3","39"),
("2019-02-11","test_1","23"),
("2019-02-11","test_3","39"),
("2019-02-11","test_1","23"),
("2019-02-12","test_2","19"),
("2019-02-13","test_1","23"),
("2019-02-15","test_2","19"),
("2019-02-16","test_2","19");
实现
select
0 user_total_count,
0 user_total_avg_age,
count(*) twice_count,
cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age
from
(
select
user_id,
min(age) age
from
(select
user_id,
min(age) age
from
(select
user_id,
age,
date_sub(dt,rk) flag
from
(select
dt,
user_id,
min(age) age,
rank() over(partition by user_id order by dt) rk
from
user_age
group by
dt,user_id
)t1
)t2
group by
user_id,flag
having
count(*)>=2
)t3
group by
user_id
)t4
union all
select
count(*) user_total_count,
cast(sum(age)/count(*) as decimal(10,2)),
0 twice_count,
0 twice_count_avg_age
from(
select
user_id,
min(age) as age
from
user_age
group by user_id
)t5
;
结果
OK
_u1.user_total_count _u1.user_total_avg_age _u1._c2 _u1._c3
0 0.0 2 19.00
3 27.0 0 0.00
分析
1、这里有两次去重,第一次去重是一个用户一天内多次访问,只算一次,第二次一个用户有多个连续登录,那么实际情况中,第一次去重是常规操作,第二次去重根据公司实际要求来做
2、活跃用户是指连续两天及以上,所以count(*)>=2
3、date_sub(dt,rank) flag这个是核心逻辑,语言不好描述,举例说明:(03-22,1)(03-27,2),(03-28,3),(03-29,4),通过日期相减后得到的值都是结果是03-21,03-25,只有03-25的结果是日期是连续的
4、这里需要计算所有用户的平均年龄和活跃用户的平均年龄,维度不同,在不同的列展示,而union all需要保证列数相同,所以这里需要补默认值
扩展
min(age)作用:
1、满足分组时把age从子查询带出
2、用户每天登录,年龄可能会随着日期变化,所以潜在的计算规则中是每次根据用户和日期分组时取最小年龄
知识点
1、date_sub(日期,数值)表示用日期-数值,返回日期字符串
2、cast(数据类型1 as 数据类型2)表示将数据类型1强转成数据类型2
3、decimal(10,2)是整数位长度为10,小数位长度为2,如果没有小数位,默认补0