开窗函数 spark sql实现累加、累积计数、累乘

数据仓库晨曦

发布于 2025-03-04 06:22:15

20900

代码可运行

文章被收录于专栏：数据仓库技术数据仓库技术

运行总次数：0

代码可运行

前面对开窗函数做了基础知识介绍，今天我们对其应用做详细介绍。通过开窗函数实现累积求和（累加），累积计数，累乘（累积相乘）。

有一定难度，建议先思考然后再阅读

1.样例数据

+-----+-----------+---------+-------------+
| id  | group_id  | amount  |   c_date    |
+-----+-----------+---------+-------------+
| 1   | a         | 1.00    | 2025-01-01  |
| 2   | a         | 1.00    | 2025-01-02  |
| 3   | a         | 2.00    | 2025-01-03  |
| 4   | a         | 3.00    | 2025-01-03  |
| 5   | b         | -1.00   | 2025-01-01  |
| 6   | b         | 1.00    | 2025-01-02  |
| 7   | b         | 0.50    | 2025-01-03  |
| 8   | b         | 2.00    | 2025-01-03  |
| 9   | b         | 1.00    | 2025-01-03  |
+-----+-----------+---------+-------------+

建表语句及数据插入语句

CREATE TABLE t_accure
(
    id       BIGINT,
    group_id STRING,
    amount   decimal(16, 2),
    c_date   STRING
);

INSERT INTO t_accure
VALUES (1,'a',1,'2025-01-01'),
       (2,'a',1,'2025-01-02'),
       (3,'a',2,'2025-01-03'),
       (4,'a',3,'2025-01-03'),
       (5,'b',-1,'2025-01-01'),
       (6,'b',1,'2025-01-02'),
       (7,'b',0.5,'2025-01-03'),
       (8,'b',2,'2025-01-03'),
       (9,'b',1,'2025-01-03');

2. 累加

题目

按照group_id进行分组，根据c_date顺序从早到晚对amount进行累积求和。

累加有两种方式，在对于排序相同的数据进行累加的时候，由于排序相同（且随机）,所以出现两种方式：1.计算到当前行；2.计算到与当前排序值相同的最后一行。

select id,
       group_id,
       amount,
       c_date,
       sum(amount)
           over (partition by group_id order by c_date rows between unbounded preceding and current row) as accure_add1,
        sum(amount)
           over (partition by group_id order by c_date range between unbounded preceding and current row) as accure_add2,
        sum(amount)
           over (partition by group_id order by c_date) as accure_add3
from t_accure

执行结果

+-----+-----------+---------+-------------+--------------+--------------+--------------+
| id  | group_id  | amount  |   c_date    | accure_add1  | accure_add2  | accure_add3  |
+-----+-----------+---------+-------------+--------------+--------------+--------------+
| 1   | a         | 1.00    | 2025-01-01  | 1.00         | 1.00         | 1.00         |
| 2   | a         | 1.00    | 2025-01-02  | 2.00         | 2.00         | 2.00         |
| 3   | a         | 2.00    | 2025-01-03  | 4.00         | 7.00         | 7.00         |
| 4   | a         | 3.00    | 2025-01-03  | 7.00         | 7.00         | 7.00         |
| 5   | b         | -1.00   | 2025-01-01  | -1.00        | -1.00        | -1.00        |
| 6   | b         | 1.00    | 2025-01-02  | 0.00         | 0.00         | 0.00         |
| 7   | b         | 0.50    | 2025-01-03  | 0.50         | 3.50         | 3.50         |
| 8   | b         | 2.00    | 2025-01-03  | 2.50         | 3.50         | 3.50         |
| 9   | b         | 1.00    | 2025-01-03  | 3.50         | 3.50         | 3.50         |
+-----+-----------+---------+-------------+--------------+--------------+--------------+

注意：

1.注意第3行和4行的结果；
2.注意滴7，8，9行的结果；
3.注意窗口框架方位省略后的结果:accure_add3

3.累积计数

题目 1.按照group_id进行分组，根据c_date顺序从早到晚对c_date进行累积计数； 2.按照group_id进行分组，根据c_date顺序从早到晚对c_date进行累积计数，要求去重；

3.1累积计数

select id,
       group_id,
       amount,
       c_date,
       count(c_date)
             over (partition by group_id order by c_date rows between unbounded preceding and current row )  as accure_count1,
       count(c_date)
             over (partition by group_id order by c_date range between unbounded preceding and current row ) as accure_count2,
       count(c_date)
             over (partition by group_id order by c_date )                                                   as accure_count3
from t_accure

执行结果

+-----+-----------+---------+-------------+----------------+----------------+----------------+
| id  | group_id  | amount  |   c_date    | accure_count1  | accure_count2  | accure_count3  |
+-----+-----------+---------+-------------+----------------+----------------+----------------+
| 1   | a         | 1.00    | 2025-01-01  | 1              | 1              | 1              |
| 2   | a         | 1.00    | 2025-01-02  | 2              | 2              | 2              |
| 3   | a         | 2.00    | 2025-01-03  | 3              | 4              | 4              |
| 4   | a         | 3.00    | 2025-01-03  | 4              | 4              | 4              |
| 5   | b         | -1.00   | 2025-01-01  | 1              | 1              | 1              |
| 6   | b         | 1.00    | 2025-01-02  | 2              | 2              | 2              |
| 7   | b         | 0.50    | 2025-01-03  | 3              | 5              | 5              |
| 8   | b         | 2.00    | 2025-01-03  | 4              | 5              | 5              |
| 9   | b         | 1.00    | 2025-01-03  | 5              | 5              | 5              |
+-----+-----------+---------+-------------+----------------+----------------+----------------+

3.2 累积去重计数

我们首先想到的是直接使用count(disitnct amount) 的方式来完成，注意：在hive中支持count(distinct amount)over() 这种方式，但是在spark中不支持这种写法.

count(distinct amount)over()方式实现

select id,
       group_id,
       amount,
       c_date,
       count(distinct c_date)
           over (partition by group_id order by c_date rows between unbounded preceding and current row ) as accure_count
from t_accure

collect_set方式实现

select id,
       group_id,
       amount,
       c_date,
       size(c_date_set) as accure_count
from (select id,
             group_id,
             amount,
             c_date,
             collect_set(c_date)
                         over (partition by group_id order by c_date rows between unbounded preceding and current row ) as c_date_set
      from t_accure) t

执行结果

+-----+-----------+---------+-------------+---------------+
| id  | group_id  | amount  |   c_date    | accure_count  |
+-----+-----------+---------+-------------+---------------+
| 1   | a         | 1.00    | 2025-01-01  | 1             |
| 2   | a         | 1.00    | 2025-01-02  | 2             |
| 3   | a         | 2.00    | 2025-01-03  | 3             |
| 4   | a         | 3.00    | 2025-01-03  | 3             |
| 5   | b         | -1.00   | 2025-01-01  | 1             |
| 6   | b         | 1.00    | 2025-01-02  | 2             |
| 7   | b         | 0.50    | 2025-01-03  | 3             |
| 8   | b         | 2.00    | 2025-01-03  | 3             |
| 9   | b         | 1.00    | 2025-01-03  | 3             |
+-----+-----------+---------+-------------+---------------+

4.累乘（累积相乘）

题目按照group_id进行分组，根据c_date顺序从早到晚对amount进行累积相乘；

select id,
       group_id,
       amount,
       c_date,
       amount_list,
       aggregate(amount_list, cast(1 as decimal(16, 2)), (amount_list, x) -> cast(amount_list * x as decimal(16,
                 2))) as accure_mul
from (select id,
             group_id,
             amount,
             c_date,
             collect_list(amount)
                          over (partition by group_id order by c_date rows between unbounded preceding and current row ) as amount_list
      from t_accure) t

+-----+-----------+---------+-------------+------------------------------+-------------+
| id  | group_id  | amount  |   c_date    |         amount_list          | accure_mul  |
+-----+-----------+---------+-------------+------------------------------+-------------+
| 1   | a         | 1.00    | 2025-01-01  | [1.00]                       | 1.00        |
| 2   | a         | 1.00    | 2025-01-02  | [1.00,1.00]                  | 1.00        |
| 3   | a         | 2.00    | 2025-01-03  | [1.00,1.00,2.00]             | 2.00        |
| 4   | a         | 3.00    | 2025-01-03  | [1.00,1.00,2.00,3.00]        | 6.00        |
| 5   | b         | -1.00   | 2025-01-01  | [-1.00]                      | -1.00       |
| 6   | b         | 1.00    | 2025-01-02  | [-1.00,1.00]                 | -1.00       |
| 7   | b         | 0.50    | 2025-01-03  | [-1.00,1.00,0.50]            | -0.50       |
| 8   | b         | 2.00    | 2025-01-03  | [-1.00,1.00,0.50,2.00]       | -1.00       |
| 9   | b         | 1.00    | 2025-01-03  | [-1.00,1.00,0.50,2.00,1.00]  | -1.00       |
+-----+-----------+---------+-------------+------------------------------+-------------+