需求
查询前20%时间的订单信息
建表语句
create table business(
name string,
orderdate string,
cost int
)
row format delimited fields terminated by '\t'
;
数据
insert overwrite table business values
("jack","2017-01-01",10),
("tony","2017-01-02",15),
("jack","2017-02-03",23),
("tony","2017-01-04",29),
("jack","2017-01-05",46),
("jack","2017-04-06",42),
("tony","2017-01-07",50),
("jack","2017-01-08",55),
("mart","2017-04-08",62),
("mart","2017-04-09",68),
("neil","2017-05-10",12),
("mart","2017-04-11",75),
("neil","2017-06-12",80),
("mart","2017-04-13",94);
实现
select
name,
orderdate,
cost,
ntile(5) over(order by orderdate) n
from
business
;
结果
Total MapReduce CPU Time Spent: 3 seconds 190 msec
OK
t1.name t1.orderdate t1.cost t1.n
jack 2017-01-01 10 1
tony 2017-01-02 15 1
tony 2017-01-04 29 1
Time taken: 14.82 seconds, Fetched: 3 row(s)
分析
1、查询前20%时间的订单信息,没有特指某个用户,所以不用指定分区
2、开窗里面对时间排序
3、ntile进行分桶,按照时间顺序分成5份,因为这里说的是20%
4、先用子查询分桶,然后再进行过滤即n=1
扩展
如果不进行排序,则按读取的数据是默认有序
select
name,
orderdate,
cost,
ntile(5) over() n
from
business
;
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
name orderdate cost n
mart 2017-04-13 94 1
neil 2017-06-12 80 1
mart 2017-04-11 75 1
neil 2017-05-10 12 2
mart 2017-04-09 68 2
mart 2017-04-08 62 2
jack 2017-01-08 55 3
tony 2017-01-07 50 3
jack 2017-04-06 42 3
jack 2017-01-05 46 4
tony 2017-01-04 29 4
jack 2017-02-03 23 4
tony 2017-01-02 15 5
jack 2017-01-01 10 5
Time taken: 14.127 seconds, Fetched: 14 row(s)
知识点
ntile:把有序的数据集合平均分配到指定的数据量个桶中,将桶号分配给每一行。
如果不能平均分配,则优先分配较小编号的桶,并且各个桶中能放的行数最多相差1。