在数据开发过程中,经常遇到对比数据的情况,例如上游给出一张新表,我们需要进行验证是否可替换;我们切换依赖表后验证数据是否一致等场景。
表中不存在完全重复的数据,样例数据如下
样例数据
--t1表数据
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 87 |
+-----+-------+--------+
-- t1_1表数据
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 87 |
+-----+-------+--------+
--t1_2表数据
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 70 |
+-----+-------+--------+
-- t1_3表数据
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 3 | cc | 80 |
+-----+-------+--------+
t1表数据与t1_1中的数据完全一致;
t1_2表中bb的成绩与t1表的成绩不一致;
t1_3表中增加了cc的数据,但是没有bb的数据;
原始数据中每行数据均不重复,我们期望两个表完全一致,我们使用union all将两表进行合并,group by 全部字段,对行数进行统计,查询不等于2的数据,如果结果数据为空,则两表完全一致,否则数据存在差异。
对比SQL
--t1与t1_1比较
select id,
name,
score,
count(1) as row_cnt
from (select *
from t1
union all
select *
from t1_1) t
group by id, name, score
having count(1) <> 2
--执行结果
+-----+-------+--------+----------+
| id | name | score | row_cnt |
+-----+-------+--------+----------+
+-----+-------+--------+----------+
--t1与t1_2比较
select id,
name,
score,
count(1) as row_cnt
from (select *
from t1
union all
select *
from t1_2) t
group by id, name, score
having count(1) <> 2
--执行结果
+-----+-------+--------+----------+
| id | name | score | row_cnt |
+-----+-------+--------+----------+
| 2 | bb | 87 | 1 |
| 2 | bb | 70 | 1 |
+-----+-------+--------+----------+
--t1与t1_3比较
select id,
name,
score,
count(1) as row_cnt
from (select *
from t1
union all
select *
from t1_3) t
group by id, name, score
having count(1) <> 2
--执行结果
+-----+-------+--------+----------+
| id | name | score | row_cnt |
+-----+-------+--------+----------+
| 2 | bb | 87 | 1 |
| 3 | cc | 80 | 1 |
+-----+-------+--------+----------+
如果我们不想要看具体数据,只想得到结果,对接进行行数统计,行数为0代表一致,行数不为0代表存在差异。
以t1与t1_3比较为例
--t1与t1_3比较
select count(1)
from (select id,
name,
score,
count(1) as row_cnt
from (select *
from t1
union all
select *
from t1_3) t
group by id, name, score) t
where t.row_cnt <> 2
--执行结果
+-----------+
| count(1) |
+-----------+
| 2 |
+-----------+
原始数据均不重复,也可以使用full outer join 方式,on条件为所有字段,这里要求所有值不为空,否则不能保证验证结果。查询两表是否存在空行,如果存在代表两个不一致,不存在空行代表两表一致。
对比SQL
--t1与t1_1比较
select t1.id,
t1.name,
t1.score,
t1_1.id,
t1_1.name,
t1_1.score
from t1
full outer join t1_1
on t1.id = t1_1.id
and t1.name = t1_1.name
and t1.score = t1_1.score
where t1.id is null
or t1_1.id is null;
--执行结果
+-----+-------+--------+-----+-------+--------+
| id | name | score | id | name | score |
+-----+-------+--------+-----+-------+--------+
+-----+-------+--------+-----+-------+--------+
--t1与t1_2比较
select t1.id,
t1.name,
t1.score,
t1_2.id,
t1_2.name,
t1_2.score
from t1
full outer join t1_2
on t1.id = t1_2.id
and t1.name = t1_2.name
and t1.score = t1_2.score
where t1.id is null
or t1_2.id is null;
--执行结果
+-------+-------+--------+-------+-------+--------+
| id | name | score | id | name | score |
+-------+-------+--------+-------+-------+--------+
| NULL | NULL | NULL | 2 | bb | 70 |
| 2 | bb | 87 | NULL | NULL | NULL |
+-------+-------+--------+-------+-------+--------+
--t1与t1_3比较
select t1.id,
t1.name,
t1.score,
t1_3.id,
t1_3.name,
t1_3.score
from t1
full outer join t1_3
on t1.id = t1_3.id
and t1.name = t1_3.name
and t1.score = t1_3.score
where t1.id is null
or t1_3.id is null;
--执行结果
+-------+-------+--------+-------+-------+--------+
| id | name | score | id | name | score |
+-------+-------+--------+-------+-------+--------+
| 2 | bb | 87 | NULL | NULL | NULL |
| NULL | NULL | NULL | 3 | cc | 80 |
+-------+-------+--------+-------+-------+--------+
两种方式各有优劣,对比两种方式的优缺点,有以下问题,你有解决方案吗? 问题:
union all
方式想要知道差异数据来源于哪张表,该如何处理?full outer join
方式,如果有字段存在空值该如何处理?上面场景假设不存在重复记录,可是实际场景中可能存在表中存在完全重复的数据,接下来我们比较这样的场景。依旧是两种方法。
样例数据
--t6,表内 id=2是完全一致的两条数据。
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 87 |
| 2 | bb | 87 |
| 3 | cc | 80 |
+-----+-------+--------+
--t6_1,与原始数据完全一致
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 87 |
| 2 | bb | 87 |
| 3 | cc | 80 |
+-----+-------+--------+
--t6_2,id=2的数据只有一条。
+-----+-------+--------+
| id | name | score |
+-----+-------+--------+
| 1 | aa | 90 |
| 2 | bb | 87 |
| 3 | cc | 80 |
+-----+-------+--------+
想要使用union all ,我们得让数据不存在重复,才能使用数据行数=2来判断数据是否一致。为了保证数据不重复,且信息不丢失,需要对原始数据重复行进行计数,好确每个重复的行数。
对比SQl
--t6与t6_1比较
select id,
name,
score,
row_cnt,
count(1) as cnt
from (
--原始数据重复行计数
select id,
name,
score,
count(1) as row_cnt
from t6
group by id, name, score
union all
--对比表重复行计数
select id,
name,
score,
count(1) as row_cnt
from t6_1
group by id, name, score) tt1
group by id,
name,
score,
row_cnt
having count(1) <> 2
--对比结果
+-----+-------+--------+----------+------+
| id | name | score | row_cnt | cnt |
+-----+-------+--------+----------+------+
+-----+-------+--------+----------+------+
--t6与t6_2比较
select id,
name,
score,
row_cnt,
count(1) as cnt
from (
--原始数据重复行计数
select id,
name,
score,
count(1) as row_cnt
from t6
group by id, name, score
union all
--对比表重复行计数
select id,
name,
score,
count(1) as row_cnt
from t6_2
group by id, name, score) tt1
group by id,
name,
score,
row_cnt
having count(1) <> 2
--对比结果
+-----+-------+--------+----------+------+
| id | name | score | row_cnt | cnt |
+-----+-------+--------+----------+------+
| 2 | bb | 87 | 2 | 1 |
| 2 | bb | 87 | 1 | 1 |
+-----+-------+--------+----------+------+
--备注:
--这里可以看到, 存在数据差异,id=2的 一个数据行数是2行(row_cnt=2),一个数据一行(row_cnt=1),但是我们依旧无法区分数据来源。
使用full outer join
方式,也不能直接对比,也需要对原始数据进行处理,否则多对多join,出现笛卡尔积,比较不出来到底是否一致了。
对比SQl
--t6与t6_1比较
select *
from (select id,
name,
score,
row_number() over (partition by id,
name,
score order by random() asc) as rn
from t6) tt1
full outer join
(select id,
name,
score,
row_number() over (partition by id,
name,
score order by random() asc) as rn
from t6_1) tt2
on tt1.id = tt2.id
and tt1.name = tt2.name
and tt1.score = tt2.score
and tt1.rn = tt2.rn
where tt1.id is null
or tt2.id is null
--对比结果
+-----+-------+--------+-----+-----+-------+--------+-----+
| id | name | score | rn | id | name | score | rn |
+-----+-------+--------+-----+-----+-------+--------+-----+
+-----+-------+--------+-----+-----+-------+--------+-----+
--t6与t6_2比较
select *
from (select id,
name,
score,
row_number() over (partition by id,
name,
score order by random() asc) as rn
from t6) tt1
full outer join
(select id,
name,
score,
row_number() over (partition by id,
name,
score order by random() asc) as rn
from t6_2) tt2
on tt1.id = tt2.id
and tt1.name = tt2.name
and tt1.score = tt2.score
and tt1.rn = tt2.rn
where tt1.id is null
or tt2.id is null
--对比结果
+-----+-------+--------+-----+-------+-------+--------+-------+
| id | name | score | rn | id | name | score | rn |
+-----+-------+--------+-----+-------+-------+--------+-------+
| 2 | bb | 87 | 2 | NULL | NULL | NULL | NULL |
+-----+-------+--------+-----+-------+-------+--------+-------+
+-----+-------+--------+----------+------+
--备注:
--因为t6中存在一样rn=2的数据,t6_2中不存在,如此我们可以得出结论:t6中存在两条重复记录,但是t6_2中只有一条。