表: t5_numbers
中保存数字的值及其频率
+----------+-------------+
| Number | Frequency |
+----------+-------------|
| 0 | 7 |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
+----------+-------------+
在此表中,数字为 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3
,所以中位数是 (0 + 0) / 2 = 0
。
+--------+
| median |
+--------|
| 0.0000 |
+--------+
请编写一个查询来查找所有数字的中位数并将结果命名为 median
。
中位数(Median)是描述一个数据集中心位置的统计量,它是将数据集从小到大排序后位于中间位置的数值。如果数据集中的元素数量是奇数,那么中位数就是正中间的那个数;如果是偶数,中位数则是中间两个数的平均值。
本题较查询中位数更加复杂的点在给出了频次,需要将频次计算在内。相应解法:1.将所有频次生成对应的行数的数值,之后就按照正常求取中位数的方法求取即可;2.根据频次计数,基数找到对应的位置即为中位数,偶数则需要找到对应的两个位置,然后分别计算出对应的值,求取平均值。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️ |
该方式属于比较“笨”的方式,或者说不够取巧,但是这属于按照计算方式直接计算。小数据量直接生成,然后计算完全没问题,但是如果频次很高会导致数据量激增。
使用lateral view、explode、space等函数原始数据炸开
执行SQL
select number, frequency
from t5_numbers
lateral view explode(split(space(frequency - 1), ' ')) t as a
执行结果
+---------+------------+
| number | frequency |
+---------+------------+
| 0 | 7 |
| 0 | 7 |
| 0 | 7 |
| 0 | 7 |
| 0 | 7 |
| 0 | 7 |
| 0 | 7 |
| 1 | 1 |
| 2 | 3 |
| 2 | 3 |
| 2 | 3 |
| 3 | 1 |
+---------+------------+
常规计算中位数方法:Hive基础知识07-求取中位数
执行SQL
select avg(number) as midian
from (select number
from (select number, row_number() over (order by number) as rn, count(1) over () as cnt
from t5_numbers
lateral view explode(split(space(frequency - 1), ' ')) t as a) t
where t.rn in (cnt / 2, (cnt + 1) / 2, (cnt + 2) / 2)) tt
SQL结果
+---------+
| midian |
+---------+
| 0.0 |
+---------+
由于给出了频次,频次相加得到总的数字个数,然后找到对应位置的数字,求取中位数即可。
使用sum()over()聚合函数开窗,分别计算出 total_cnt:数字总个数 order_pre_cnt:该数字开始位置(不含) order_cnt:该数字结束位置(含)
执行SQL
select
number,frequency,
sum(frequency)over() as total_cnt,
nvl(sum(frequency)over(order by number rows between unbounded preceding and 1 preceding),0) as order_pre_cnt,
sum(frequency)over(order by number) as order_cnt
from t5_numbers
SQL结果
+---------+------------+------------+----------------+------------+
| number | frequency | total_cnt | order_pre_cnt | order_cnt |
+---------+------------+------------+----------------+------------+
| 0 | 7 | 12 | 0 | 7 |
| 1 | 1 | 12 | 7 | 8 |
| 2 | 3 | 12 | 8 | 11 |
| 3 | 1 | 12 | 11 | 12 |
+---------+------------+------------+----------------+------------+
如果数字个数N是奇数,则中位数的位置是(N+1)/2, 如果数字个数N是偶数 则中位数是N/2和(N+2)/2位置的平均值。我们判断N是否为偶数,选取对应的位置,判断所在位置的数字是否参与计算。
执行SQL
select number,
case
when total_cnt % 2 = 0 then
--偶数个计算方式
case
when (total_cnt / 2 > order_pre_cnt and total_cnt / 2 <= order_cnt) or
((total_cnt + 2) / 2 > order_pre_cnt and (total_cnt + 2) / 2 <= order_cnt) then 1
else 0 end
else
--奇数个计算方式
case when
(total_cnt + 1) / 2 > order_pre_cnt and (total_cnt + 1) / 2 <= order_cnt then 1
else 0 end end as is_midian_row,
total_cnt,
order_pre_cnt,
order_cnt
from (select number,
frequency,
sum(frequency) over () as total_cnt,
nvl(sum(frequency) over (order by number rows between unbounded preceding and 1 preceding),
0) as order_pre_cnt,
sum(frequency) over (order by number) as order_cnt
from t5_numbers) t
SQL结果
+---------+----------------+------------+----------------+------------+
| number | is_midian_row | total_cnt | order_pre_cnt | order_cnt |
+---------+----------------+------------+----------------+------------+
| 0 | 1 | 12 | 0 | 7 |
| 1 | 0 | 12 | 7 | 8 |
| 2 | 0 | 12 | 8 | 11 |
| 3 | 0 | 12 | 11 | 12 |
+---------+----------------+------------+----------------+------------+
根据上一步结果,is_midian_row = 1 代表该数字参与中位数计算,这里可能有一行或者两行是1,限定为1然后使用avg计算得到最终结果
执行SQL
select avg(number) as midian
from (select number,
case
when total_cnt % 2 = 0 then
--偶数个计算方式
case
when (total_cnt / 2 > order_pre_cnt and total_cnt / 2 <= order_cnt) or
((total_cnt + 2) / 2 > order_pre_cnt and (total_cnt + 2) / 2 <= order_cnt) then 1
else 0 end
else
--奇数个计算方式
case
when
(total_cnt + 1) / 2 > order_pre_cnt and (total_cnt + 1) / 2 <= order_cnt then 1
else 0 end end as is_midian_row,
total_cnt,
order_pre_cnt,
order_cnt
from (select number,
frequency,
sum(frequency) over () as total_cnt,
nvl(sum(frequency) over (order by number rows between unbounded preceding and 1 preceding),
0) as order_pre_cnt,
sum(frequency) over (order by number) as order_cnt
from t5_numbers) t) tt
where tt.is_midian_row = 1
SQL结果
+---------+
| midian |
+---------+
| 0.0 |
+---------+
--建表语句
CREATE TABLE t5_numbers(
number bigint,
frequency bigint
) COMMENT '数字频次表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
;
-- 插入数据
insert into t5_numbers(number,frequency)
values
(0,7),
(1,1),
(2,3),
(3,1);