Why does Redis Cluster use exactly 16384 slots?

用户3904122

Published 2022-06-29 14:56:54

Included in the column: 光华路程序猿

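Yesterday I was discussing Redis clustering with colleagues, and when Redis Cluster came up I casually showed off my grasp of how it works: "Redis Cluster uses virtual slot partitioning and maps each key through a hash function onto one of 16384 slots...", and so on.

Colleague A immediately asked: "Why does Redis Cluster use 16384 slots?"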

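Good question. Redis Cluster computes the slot as slot = CRC16(key) % 16384 (equivalently, CRC16(key) & 16383). The hash function crc16() produces a 16-bit value, so it can take 2^16 = 65536 distinct values in the range 0-65535. By that logic we would expect the modulus to be 65536, so why 16384?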
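To make the slot formula concrete, here is a minimal sketch in C. It assumes the CRC16 variant is CCITT/XMODEM (polynomial 0x1021, initial value 0), and it computes the checksum bit by bit rather than with a lookup table. The names crc16_xmodem and key_hash_slot are illustrative, not Redis's own, and the sketch deliberately ignores hash tags ({...}).

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Bitwise CRC16, CCITT/XMODEM variant: polynomial 0x1021, initial value 0.
 * Illustrative helper; a production implementation would normally use a
 * precomputed lookup table instead of the inner bit loop. */
uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        /* Mix the next byte into the high 8 bits of the register. */
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Map a key to a slot. Because 16384 is a power of two, masking with
 * CLUSTER_SLOTS - 1 (16383) gives the same result as % 16384.
 * Hash tags ({...}) are not handled in this sketch. */
unsigned int key_hash_slot(const char *key, size_t len) {
    return crc16_xmodem(key, len) & (CLUSTER_SLOTS - 1);
}
```

Note that the mask must be 16383, not 16384: masking with 16384 would keep only a single bit and yield just two possible results, 0 or 16384.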

I looked it up, and sure enough someone had already asked the same question (https://github.com/redis/redis/issues/2576), and the author gave an explanation:

The reason is:

Normal heartbeat packets carry the full configuration of a node, that can be replaced in an idempotent way with the old in order to update an old config. This means they contain the slots configuration for a node, in raw form, that uses 2k of space with 16k slots, but would use a prohibitive 8k of space using 65k slots.
At the same time it is unlikely that Redis Cluster would scale to more than 1000 master nodes because of other design tradeoffs.

So 16k was in the right range to ensure enough slots per master with a max of 1000 masters, but a small enough number to propagate the slot configuration as a raw bitmap easily. Note that in small clusters the bitmap would be hard to compress because when N is small the bitmap would have slots/N bits set that is a large percentage of bits set.

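To summarize, there are two main reasons:

Message size. The more slots there are, the more space the slot information takes up in every message, which wastes bandwidth and can easily lead to network congestion.

To add a node to a Redis Cluster, you run cluster meet ip:port to complete the handshake between nodes. After that, the nodes exchange information through periodic ping-pong messages, whose header struct looks like this: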

#define CLUSTER_SLOTS 16384
typedef struct {
    char sig[4];        /* Signature "RCmb" (Redis Cluster message bus). */
    uint32_t totlen;    /* Total length of this message */
    uint16_t ver;       /* Protocol version, currently set to 1. */
    uint16_t port;      /* TCP base port number. */
    uint16_t type;      /* Message type */
    uint16_t count;     /* Only used for some kind of messages. */
    uint64_t currentEpoch;  /* The epoch accordingly to the sending node. */
    uint64_t configEpoch;   /* The config epoch if it's a master, or the last
                               epoch advertised by its master if it is a
                               slave. */
    uint64_t offset;    /* Master replication offset if node is a master or
                           processed replication offset if node is a slave. */
    char sender[CLUSTER_NAMELEN]; /* Name of the sender node */
    unsigned char myslots[CLUSTER_SLOTS/8];
    char slaveof[CLUSTER_NAMELEN];
    char myip[NET_IP_STR_LEN];    /* Sender IP, if not all zeroed. */
    char notused1[34];  /* 34 bytes reserved for future usage. */
    uint16_t cport;      /* Sender TCP cluster bus port */
    uint16_t flags;      /* Sender node flags */
    unsigned char state; /* Cluster state from the POV of the sender */
    unsigned char mflags[3]; /* Message flags: CLUSTERMSG_FLAG[012]_... */
    union clusterMsgData data;
} clusterMsg;

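Here, unsigned char myslots[CLUSTER_SLOTS/8]; is a bitmap of the slots held by the sending node. Each bit stands for one slot, and a bit set to 1 means that slot belongs to this node. Because of #define CLUSTER_SLOTS 16384, myslots takes up 16384/8/1024 = 2 KB. If CLUSTER_SLOTS were defined as 65536, it would take up 8 KB.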
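The bitmap layout and the size arithmetic can be sketched as follows. The helper names (slot_bitmap_bytes, slot_bitmap_set, slot_bitmap_test) are hypothetical and exist only for this illustration. The bit order within a byte (slot & 7, lowest bit first) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Bytes needed for a bitmap with one bit per slot. */
size_t slot_bitmap_bytes(size_t nslots) {
    return nslots / 8;
}

/* Mark a slot as owned: byte index is slot / 8, bit index is slot & 7. */
void slot_bitmap_set(unsigned char *bitmap, unsigned int slot) {
    assert(slot < CLUSTER_SLOTS);
    bitmap[slot / 8] |= (unsigned char)(1u << (slot & 7));
}

/* Return 1 if the slot's bit is set, 0 otherwise. */
int slot_bitmap_test(const unsigned char *bitmap, unsigned int slot) {
    assert(slot < CLUSTER_SLOTS);
    return (bitmap[slot / 8] >> (slot & 7)) & 1;
}
```

With these helpers, slot_bitmap_bytes(16384) is 2048 bytes (2 KB) and slot_bitmap_bytes(65536) is 8192 bytes (8 KB), matching the figures above. Every heartbeat header carries this array in full, whether the node owns one slot or all of them.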

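In addition, the message body carries information about other nodes to be exchanged. This "other node" information covers roughly 1/10 of the nodes in the cluster, and at least 3 nodes. So the more nodes the cluster has, the more space each message takes up.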
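The "1/10 of the nodes, at least 3" rule can be sketched as a small function. The name gossip_wanted is hypothetical. Capping the result at the number of nodes other than the sender is an assumption added here so the sketch behaves sensibly for very small clusters; it is not taken from the text above.

```c
#include <assert.h>

/* Rough number of other-node entries carried in one ping/pong body:
 * about one tenth of the cluster, but never fewer than 3.
 * Assumption for this sketch: the count cannot exceed the number of
 * nodes other than the sender itself. */
int gossip_wanted(int total_nodes) {
    assert(total_nodes >= 0);
    int wanted = total_nodes / 10;
    if (wanted < 3)
        wanted = 3;
    int others = total_nodes > 0 ? total_nodes - 1 : 0;
    if (wanted > others)
        wanted = others;
    return wanted;
}
```

Under these assumptions, a 1000-node cluster puts about 100 node entries in each heartbeat on top of the fixed 2 KB slot bitmap, while a 20-node cluster carries only 3.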