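Yesterday I was discussing Redis clustering with a colleague. When Redis Cluster came up, I casually showed off how it works: "Redis Cluster uses virtual slot partitioning and maps every key through a hash function onto 16384 slots...", and so on.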
Colleague A immediately asked: "Why does Redis Cluster use 16384 slots?"
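Good question. Redis Cluster computes the slot as slot = CRC16(key) & 16383, which is the same as CRC16(key) mod 16384. The hash function crc16() produces a 16-bit value, so it can yield 2^16 = 65536 distinct values, spread over the range 0-65535. By that logic we should be taking the hash mod 65536, so why 16384?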
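To make the slot calculation concrete, here is a minimal sketch in C. It assumes the CRC16 in question is the XMODEM variant (polynomial 0x1021, initial value 0) and ignores hash tags; the names crc16_xmodem, key_slot and SLOT_COUNT are mine for illustration, not identifiers from the Redis source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 16384 /* same value as CLUSTER_SLOTS in the post */

/* Bitwise CRC16, XMODEM variant: polynomial 0x1021, initial value 0. */
static uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        /* Feed the next byte into the high half of the register. */
        crc ^= (uint16_t)((uint16_t)(unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Map a key to a slot in 0..16383. Hash tags ({...}) are not handled. */
static unsigned int key_slot(const char *key, size_t len) {
    /* SLOT_COUNT is a power of two, so "& (SLOT_COUNT - 1)" equals
     * "% SLOT_COUNT". Masking with 16384 itself would keep a single
     * bit and could only ever produce 0 or 16384. */
    return crc16_xmodem(key, len) & (SLOT_COUNT - 1);
}
```

For example, key_slot("123456789", 9) gives 12739, because the CRC16 of that string is 0x31C3 and the mask leaves it unchanged.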
I looked it up, and sure enough someone had asked exactly this long ago (https://github.com/redis/redis/issues/2576), and the author gave an explanation:
The reason is:
So 16k was in the right range to ensure enough slots per master with a max of 1000 maters, but a small enough number to propagate the slot configuration as a raw bitmap easily. Note that in small clusters the bitmap would be hard to compress because when N is small the bitmap would have slots/N bits set that is a large percentage of bits set.
To summarize, there are two main reasons:
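In Redis Cluster, adding a node to the cluster requires running cluster meet <ip> <port> to complete the handshake between nodes. After that, the nodes exchange information through periodic ping/pong messages, whose header struct looks like this: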
#define CLUSTER_SLOTS 16384

typedef struct {
    char sig[4];            /* Signature "RCmb" (Redis Cluster message bus). */
    uint32_t totlen;        /* Total length of this message */
    uint16_t ver;           /* Protocol version, currently set to 1. */
    uint16_t port;          /* TCP base port number. */
    uint16_t type;          /* Message type */
    uint16_t count;         /* Only used for some kind of messages. */
    uint64_t currentEpoch;  /* The epoch accordingly to the sending node. */
    uint64_t configEpoch;   /* The config epoch if it's a master, or the last
                               epoch advertised by its master if it is a
                               slave. */
    uint64_t offset;        /* Master replication offset if node is a master or
                               processed replication offset if node is a slave. */
    char sender[CLUSTER_NAMELEN]; /* Name of the sender node */
    unsigned char myslots[CLUSTER_SLOTS/8];
    char slaveof[CLUSTER_NAMELEN];
    char myip[NET_IP_STR_LEN];    /* Sender IP, if not all zeroed. */
    char notused1[34];      /* 34 bytes reserved for future usage. */
    uint16_t cport;         /* Sender TCP cluster bus port */
    uint16_t flags;         /* Sender node flags */
    unsigned char state;    /* Cluster state from the POV of the sender */
    unsigned char mflags[3]; /* Message flags: CLUSTERMSG_FLAG[012]_... */
    union clusterMsgData data;
} clusterMsg;
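The field unsigned char myslots[CLUSTER_SLOTS/8]; holds a bitmap of the slots owned by the sending node. Each bit stands for one slot, and a bit set to 1 means that slot belongs to this node. With #define CLUSTER_SLOTS 16384, myslots takes 16384/8/1024 = 2 KB; if CLUSTER_SLOTS were 65536, it would take 8 KB.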
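The size arithmetic and the one-bit-per-slot layout can be sketched in a few lines of C. The helper names below (slot_bitmap_bytes, slot_bitmap_set, slot_bitmap_test) are hypothetical, and the bit order within a byte is my assumption, chosen only to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 16384 /* same value as CLUSTER_SLOTS in the post */

/* Bytes needed for a bitmap that stores one bit per slot. */
static size_t slot_bitmap_bytes(size_t slots) {
    return slots / 8;
}

/* Mark a slot as owned: byte index is slot / 8, bit index is slot % 8. */
static void slot_bitmap_set(unsigned char *bitmap, unsigned int slot) {
    assert(slot < SLOT_COUNT);
    bitmap[slot >> 3] |= (unsigned char)(1u << (slot & 7));
}

/* Return 1 if the slot's bit is set, 0 otherwise. */
static int slot_bitmap_test(const unsigned char *bitmap, unsigned int slot) {
    assert(slot < SLOT_COUNT);
    return (bitmap[slot >> 3] >> (slot & 7)) & 1;
}
```

With these helpers, slot_bitmap_bytes(16384) is 2048 bytes (2 KB) and slot_bitmap_bytes(65536) is 8192 bytes (8 KB), matching the figures above.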
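The message body also carries information about other nodes so that it can be exchanged. The number of other nodes described is roughly 1/10 of the nodes in the cluster, with a minimum of 3, so the more nodes a cluster has, the larger each message becomes.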
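To get a feel for how the message grows, here is a rough sketch in C that follows only the rule as stated here (about one tenth of the nodes, never fewer than 3). The function names are hypothetical, and entry_bytes is an assumed per-node entry size supplied by the caller, not a figure taken from the Redis source.

```c
#include <assert.h>
#include <stddef.h>

/* Number of other nodes described in one ping/pong message:
 * about one tenth of the cluster, but never fewer than 3. */
static int gossip_entries(int cluster_nodes) {
    assert(cluster_nodes >= 0);
    int wanted = cluster_nodes / 10;
    return wanted < 3 ? 3 : wanted;
}

/* Rough payload estimate: the slot bitmap plus the gossip entries.
 * entry_bytes is an assumption passed in by the caller. */
static size_t heartbeat_payload_bytes(int cluster_nodes, size_t slots,
                                      size_t entry_bytes) {
    return slots / 8 + (size_t)gossip_entries(cluster_nodes) * entry_bytes;
}
```

Assuming 100 bytes per entry purely for illustration, a 1000-node cluster comes out at 2048 + 100 * 100 = 12048 bytes with 16384 slots, versus 8192 + 100 * 100 = 18192 bytes with 65536 slots.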
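So more nodes means larger heartbeat messages. On the other hand, each node's slot ownership is kept in a bitmap, and as the author's note says, that bitmap is propagated in raw form. Assuming slots are spread evenly, each master's bitmap has about slots/N bits set (N being the number of master nodes). When N is small, that is a large share of all the bits, so the bitmap would be hard to compress; keeping it small enough to send uncompressed, 2 KB rather than 8 KB, sidesteps the problem. And with at most around 1000 masters, 16384 slots still leave enough slots for each one.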
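The numbers behind this trade-off are easy to check. The sketch below assumes slots are divided evenly among masters; both function names are hypothetical.

```c
#include <assert.h>

/* Slots owned by each master when slots are divided evenly. */
static unsigned int slots_per_master(unsigned int slots, unsigned int masters) {
    assert(masters > 0);
    return slots / masters;
}

/* Percentage of bits set in one master's bitmap (roughly 100 / masters). */
static double bitmap_fill_percent(unsigned int slots, unsigned int masters) {
    return 100.0 * slots_per_master(slots, masters) / slots;
}
```

With 16384 slots, 3 masters own 5461 slots each, so about a third of every bitmap is set, while 1000 masters still get 16 slots each.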
Taking all of this into account, the author decided that 16384 slots are enough in practice.