Mysql范围查询优化

mingjie

发布于 2022-05-12 09:43:37

2.1K00

代码可运行

文章被收录于专栏：Postgresql源码分析Postgresql源码分析

运行总次数：0

代码可运行

mysql range optimization

速查1：eq_range_index_dive_limit决定的两种评估方式

对比	Dive	Statistic	真实
idx1(t1,t2,k)	5（误差0%）	2（误差60%）	5
Idx2(t1,k)	1026（误差0%）	985（误差%）	1026（误差4%）
statistics时长	0.188ms	0.107ms

速查2：内存限制

in或者or的一个值大概需要230字节！
in (1,2,…,N) and in (1,2,…,M)算作M*N个！
and的一个值大概125字节！

1 range和in的一些知识点

Equality Range Optimization of Many-Valued Comparisons

Consider these expressions, where col_name is an indexed column:

col_name IN(val1, ..., valN)
col_name = val1 OR ... OR col_name = valN

Each expression is true if col_name is equal to any of several values. These comparisons are equality range comparisons (where the “range” is a single value). The optimizer estimates the cost of reading qualifying rows for equality range comparisons as follows:

If there is a unique index on col_name, the row estimate for each range is 1 because at most one row can have the given value.
如果有唯一索引，行估计值就很简单了，每种值之可能有一行
Otherwise, any index on col_name is nonunique and the optimizer can estimate the row count for each range using dives into the index or index statistics.
如果没有唯一索引，优化器需要用索引或者索引统计信息中的信息来评估每种值对应的行数

With index dives, the optimizer makes a dive at each end of a range and uses the number of rows in the range as the estimate. For example, the expression col_name IN (10, 20, 30) has three equality ranges and the optimizer makes two dives per range to generate a row estimate. Each pair of dives yields an estimate of the number of rows that have the given value.

例如IN (10, 20, 30)每个值做两次dive获得三个dive对，可以精确估计行数

Index dives provide accurate row estimates, but as the number of comparison values in the expression increases, the optimizer takes longer to generate a row estimate. Use of index statistics is less accurate than index dives but permits faster row estimation for large value lists.

在《高性能MySQL》里面提及用in这种方式可以有效的替代一定的range查询，提升查询效率，因为在一条索引里面，range字段后面的部分是不生效的（in后面的点查还能生效的，但是order by无效，原因就是组合）。使用in这种方式其实MySQL优化器是转化成了n*m种组合方式来进行查询，最终将返回值合并，有点类似union但是更高效。同时它存在这一些问题：

老版本的MySQL在IN()组合条件过多的时候会发生很多问题。查询优化可能需要花很多时间，并消耗大量内存。新版本MySQL在组合数超过一定的数量就不进行计划评估了，这可能导致MySQL不能很好的利用索引。

这里的“一定数量”在MySQL5.6.5以及以后的版本中是由eq_range_index_dive_limit这个参数控制，一直到5.7以后的版本默认会修改成200，当然我们是可以手动设置的。

参数使用

eq_range_index_dive_limit = 0    只能使用index dive
N >= eq_range_index_dive_limit   使用index statistics
N <  eq_range_index_dive_limit   只能使用index dive

index dive与index statistics是MySQL优化器对开销代价的估算方法，前者统计速度慢但是能得到精准的值，后者统计速度快但是数据未必精准。

the optimizer can estimate the row count for each range using dives into the index or index statistics.

在MySQL5.7版本中将默认值从10修改成200目的是为了尽可能的保证范围等值运算（IN()）执行计划尽量精准，因为IN()list的数量很多时候都是超过10的。

下面全部四种情况的查询会跳过dive

A single-index FORCE INDEX index hint is present. The idea is that if index use is forced, there is nothing to be gained from the additional overhead of performing dives into the index.
The index is nonunique and not a FULLTEXT index.
No subquery is present.
No DISTINCT, GROUP BY, or ORDER BY clause is present.

Those dive-skipping conditions apply only for single-table queries. Index dives are not skipped for multiple-table queries (joins).

2 实战

测试表

-- 1000万数据量
CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `t1` int(11) NOT NULL,
  `t2` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx1` (`t1`,`t2`,`k`),
  KEY `idx2` (`t1`,`k`)
) ENGINE=InnoDB AUTO_INCREMENT=10000001 DEFAULT CHARSET=utf8mb4;

-- t1的选择性1W/1000W，t2的选择性1000/1000W
update sbtest1 set t1=FLOOR(RAND() * 10000);
update sbtest1 set t2=FLOOR(RAND() * 1000);

测试SQL

（选择走索引1点查或者走索引2排序）

-- KEY `idx1` (`t1`,`t2`,`k`)
-- KEY `idx2` (`t1`,`k`)
-- 选哪个？

-- 选了idx2避免排序
-- 0.00053350
explain select * from sbtest1 where t1=4949 and t2 in (7101, 384) order by k;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: sbtest1
   partitions: NULL
         type: range
possible_keys: idx1,idx2
          key: idx1
      key_len: 8
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using index condition; Using filesort

-- 强制idx1走前两列索引
-- 0.00517950
explain select * from sbtest1 force index(idx2) where t1=4949 and t2 in (7101, 384) order by k;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: sbtest1
   partitions: NULL
         type: ref
possible_keys: idx2
          key: idx2
      key_len: 4
          ref: const
         rows: 1026
     filtered: 20.00
        Extra: Using index condition; Using where

当前评估使用index dive

mysql> show variables like 'eq_range_index_dive_limit';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| eq_range_index_dive_limit | 200   |
+---------------------------+-------+

确认trace走(t1,t2,k)，看到使用了index_dives_for_eq_ranges，评估的行数是很准确的，走idx1用5行，走idx2用1026行

set optimizer_trace=1;
set optimizer_trace_max_mem_size=16384;
select * from sbtest1 where t1=4949 and t2 in (7101, 384) order by k;

select * from information_schema.optimizer_trace;

"range_scan_alternatives": [
  {
    "index": "idx1",
    "ranges": [
      "4949 <= t1 <= 4949 AND 384 <= t2 <= 384",
      "4949 <= t1 <= 4949 AND 7101 <= t2 <= 7101"
    ],
    "index_dives_for_eq_ranges": true,
    "rowid_ordered": false,
    "using_mrr": false,
    "index_only": false,
    "rows": 5,
    "cost": 8.01,
    "chosen": true
  },
  {
    "index": "idx2",
    "ranges": [
      "4949 <= t1 <= 4949"
    ],
    "index_dives_for_eq_ranges": true,
    "rowid_ordered": false,
    "using_mrr": false,
    "index_only": false,
    "rows": 1026,
    "cost": 1232.2,
    "chosen": false,
    "cause": "cost"
  }
],

那么把eq_range_index_dive_limit改小会发生什么？可以看到评估明显不准了

set eq_range_index_dive_limit=1;
select * from sbtest1 where t1=4949 and t2 in (7101, 384) order by k;
select * from information_schema.optimizer_trace;
"range_scan_alternatives": [
  {
    "index": "idx1",
    "ranges": [
      "4949 <= t1 <= 4949 AND 384 <= t2 <= 384",
      "4949 <= t1 <= 4949 AND 7101 <= t2 <= 7101"
    ],
    "index_dives_for_eq_ranges": false,
    "rowid_ordered": false,
    "using_mrr": false,
    "index_only": false,
    "rows": 2,
    "cost": 4.41,
    "chosen": true
  },
  {
    "index": "idx2",
    "ranges": [
      "4949 <= t1 <= 4949"
    ],
    "index_dives_for_eq_ranges": false,
    "rowid_ordered": false,
    "using_mrr": false,
    "index_only": false,
    "rows": 985,
    "cost": 1183,
    "chosen": false,
    "cause": "cost"
  }
],

对比

对比	Dive	Statistic	真实
idx1(t1,t2,k)	5（误差0%）	2（误差60%）	5
Idx2(t1,k)	1026（误差0%）	985（误差%）	1026（误差4%）
statistics时长	0.188ms	0.107ms

statistic

show profiles;
show profile for query 32;
...
| statistics           | 0.000107 |

dive

show profiles;
show profile for query 30;
...
| statistics           | 0.000188 |

3 内存限制

Limiting Memory Use for Range Optimization

range_optimizer_max_mem_size

0：无限制
其他值：评估超过后会有报错 Memory capacity of *N* bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query.
in或者or的一个值大概需要230字节！
in (1,2,…,N) and in (1,2,…,M)算作M*N个！
and的一个值大概125字节！

For a simple query such as the following, where there is one candidate key for the range access method, each predicate combined with OR uses approximately 230 bytes:

SELECT COUNT(*) FROM t
WHERE a=1 OR a=2 OR a=3 OR .. . a=N;

Before 5.7.11, the number of bytes per predicate combined with OR was higher, approximately 700 bytes.

4 官方文档

https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization

https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

原始发表：2020-04-02，如有侵权请联系 cloudcommunity@tencent.com 删除

数据库

云数据库 SQL Server

sql

编程算法

本文分享自作者个人站点/博客前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

登录后参与评论

0 条评论

热度