Redis Sentinel in Practice

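What is Sentinel

Redis Sentinel is a service that monitors Redis instances, sends notifications, and performs automatic failover; it is the high-availability solution for Redis. In a typical distributed database built around a central node, that central node watches the other nodes and recovers from their failures. Redis Sentinel plays this role for Redis, which improves the availability of the whole deployment.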

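How Sentinel works

Before getting to the main topic, a few words on how Sentinel works.

Sentinel monitors the state of the master.
If the master fails, Sentinel performs a master-slave switch: one of the slaves is promoted to master, and the former master becomes a slave once it comes back.
After the switch, master_redis.conf, slave_redis.conf, and sentinel.conf are all rewritten: master_redis.conf gains a slaveof line, and the monitored target in sentinel.conf changes to the new master.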

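How Sentinel operates

Every Sentinel sends a PING command once per second to every master, slave, and other Sentinel instance it knows about.
If the time since an instance's last valid reply to PING exceeds the value of the down-after-milliseconds option, the Sentinel marks that instance as subjectively down (SDOWN).
If a master is marked as subjectively down, every Sentinel monitoring that master checks once per second whether the master really has entered the subjectively down state.
When enough Sentinels (at least the number set in the configuration file) confirm within the given time window that the master is subjectively down, the master is marked as objectively down (ODOWN).
Under normal conditions, every Sentinel sends an INFO command once every 10 seconds to every master and slave it knows about, and uses the replies to learn the master's current state.
When a master is marked as objectively down, Sentinel sends INFO to all slaves of that master once per second instead of once every 10 seconds.
If not enough Sentinels agree that the master is down, the master's objectively down state is cleared. If the master again returns a valid reply to a Sentinel's PING, its subjectively down state is cleared.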
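
The two thresholds described above can be sketched in a few lines of shell. This is only an illustration of the decision rule, not how Sentinel is implemented; the function names is_sdown and is_odown and the vote encoding (1 for "I see it as down", 0 otherwise) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch of the SDOWN / ODOWN decision. Names are illustrative only.

# is_sdown LAST_REPLY_MS NOW_MS DOWN_AFTER_MS
# Succeeds when the instance has been silent for longer than down-after-milliseconds.
is_sdown() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# is_odown QUORUM VOTE...
# Each VOTE is 1 if that Sentinel sees the master as SDOWN, 0 otherwise.
# Succeeds when at least QUORUM Sentinels agree.
is_odown() {
    quorum=$1; shift
    agree=0
    for vote in "$@"; do
        [ "$vote" -eq 1 ] && agree=$((agree + 1))
    done
    [ "$agree" -ge "$quorum" ]
}

# Last valid PING reply at t=0 ms, now t=61000 ms, down-after-milliseconds 60000.
if is_sdown 0 61000 60000; then echo "SDOWN"; fi
# Quorum 2, three Sentinels, two of which report SDOWN.
if is_odown 2 1 1 0; then echo "ODOWN"; fi
```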

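Redis Sentinel illustrated

Sentinel is the high-availability solution for Redis: a Sentinel system made up of one or more Sentinel instances can monitor any number of masters together with all of their slaves, and when a monitored master goes offline, it automatically promotes one of that master's slaves to be the new master.

After Server1 goes offline:

Server2 is promoted to the new master: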

Deploying Redis Sentinel

Preparation

Redis master and slave: ports 6379 and 6380, set up as a master-slave pair.

Redis Sentinel cluster: ports 26379, 26380, and 26381.

IP	Port	Role
127.0.0.1	6379	Redis Master
127.0.0.1	6380	Redis Slave
127.0.0.1	26379	Sentinel
127.0.0.1	26380	Sentinel
127.0.0.1	26381	Sentinel
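
For the master-slave pair in the table, a minimal sketch of the two Redis configs could look like the following. The output directory CONF_DIR, the file names, and the dbfilename values are assumptions for this example; only the ports and the slaveof relationship come from the table above.

```shell
#!/bin/sh
# Sketch: write minimal configs for the master (6379) and the slave (6380).
# CONF_DIR and the file names are assumptions for this example.
CONF_DIR="${CONF_DIR:-./redis-conf}"
mkdir -p "$CONF_DIR"

# Master: nothing replication-specific is required.
cat > "$CONF_DIR/redis-6379.conf" <<EOF
port 6379
dbfilename dump-6379.rdb
EOF

# Slave: slaveof points it at the master (replicaof is the newer spelling in Redis 5+).
cat > "$CONF_DIR/redis-6380.conf" <<EOF
port 6380
dbfilename dump-6380.rdb
slaveof 127.0.0.1 6379
EOF

# To start them (needs a Redis install; redis-server is assumed to sit next to
# the redis-sentinel binary used in this article):
#   /usr/local/redis/bin/redis-server "$CONF_DIR/redis-6379.conf"
#   /usr/local/redis/bin/redis-server "$CONF_DIR/redis-6380.conf"
# Then check replication with:
#   /usr/local/redis/bin/redis-cli -p 6379 info replication
```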

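Configuring sentinel.conf

Setting the port

The port directive in sentinel.conf sets the port the Sentinel listens on. At least three Sentinels are normally needed to monitor Redis, and on a single host we can start several Sentinel services by giving each one a different port.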
```
 port 26381
```
Running as a daemon
```
 daemonize yes
```
Configuring the master
```
 sentinel monitor mymaster 127.0.0.1 6379 2
 sentinel down-after-milliseconds mymaster 60000
 sentinel failover-timeout mymaster 180000
 sentinel parallel-syncs mymaster 1
```
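The directives above configure a master named mymaster. The trailing 2 on the sentinel monitor line is the quorum: the number of Sentinels that must agree the master is down before it is marked objectively down. The configuration file only needs the master's details, not the slaves', because slaves are discovered automatically (the master node holds information about its slaves). Note that Sentinel rewrites this file while it runs: for example, after a failover the master entry is changed to point at the promoted slave. When a Sentinel is restarted later, it can restore the state of the Redis deployment it was monitoring from this file.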
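
Since the three Sentinels differ only in their port, the per-port files can be generated in a loop. The following is a sketch under a few assumptions: SENTINEL_CONF_DIR is a local directory chosen for this example (the article itself keeps its files under /usr/local/redis/etc), and the dir directive is pointed at the per-port working directories under /data/redis_sentinel, which must exist before the Sentinels are started.

```shell
#!/bin/sh
# Sketch: generate one sentinel config per port with identical monitor settings.
# SENTINEL_CONF_DIR is an assumption for this example.
SENTINEL_CONF_DIR="${SENTINEL_CONF_DIR:-./sentinel-conf}"
mkdir -p "$SENTINEL_CONF_DIR"

for port in 26379 26380 26381; do
    # Only "port" and "dir" differ between the three files.
    cat > "$SENTINEL_CONF_DIR/redis-sentinel-$port.conf" <<EOF
port $port
daemonize yes
dir /data/redis_sentinel/$port
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 180000
sentinel parallel-syncs mymaster 1
EOF
done
```
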
A custom script can be run after a switch, for example to send an email or move a VIP:
```
 #sentinel notification-script <master-name> <script-path>
 #sentinel client-reconfig-script <master-name> <script-path>
```
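notification-script: a notification script. Whenever a warning-level event occurs (for example, a Redis instance becoming subjectively or objectively down), Sentinel calls this script, which should notify the system administrators by email, SMS, or similar that the system is not running normally. The script receives two arguments: the event type and the event description. If a script path is configured in sentinel.conf, the script must exist at that path and be executable; otherwise Sentinel will not start.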
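
A minimal sketch of such a script follows. It only appends the two arguments to a log file; the log path NOTIFY_LOG and the function name notify_event are assumptions for this example, and a real script would send mail or SMS at the marked place.

```shell
#!/bin/sh
# Sketch of a notification-script body. Sentinel calls it with two arguments:
#   $1 = event type, $2 = event description.
# NOTIFY_LOG and the function name are assumptions for this example.
NOTIFY_LOG="${NOTIFY_LOG:-./sentinel-notify.log}"

notify_event() {
    event_type=$1
    event_desc=$2
    # Append a timestamped line; a real script could send mail or SMS here instead.
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$event_type" "$event_desc" >> "$NOTIFY_LOG"
}

# When run by Sentinel, the two arguments arrive as $1 and $2.
if [ "$#" -ge 2 ]; then
    notify_event "$1" "$2"
fi
```

To try it, save the file somewhere Sentinel can read it, make it executable with chmod +x, and put its path in the sentinel notification-script line for mymaster.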

client-reconfig-script: this script is called when a master changes because of a failover, so that clients can be told the master's address has changed. The following arguments are passed to the script:
```
<master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
```
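Currently <state> is always "failover", and <role> is either "leader" or "observer". The arguments from-ip, from-port, to-ip, and to-port are the addresses of the old master and the new master (the promoted slave). The script should be generic and safe to call more than once, not tied to one specific case.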
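
As a sketch, a client-reconfig-script can start by unpacking the seven arguments. The function name describe_switch and the printed message are assumptions for this example; the place where a real script would move a VIP or update a proxy is marked with a comment.

```shell
#!/bin/sh
# Sketch of a client-reconfig-script body. Sentinel passes seven arguments:
#   <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
# The function name and output format are assumptions for this example.

describe_switch() {
    if [ "$#" -ne 7 ]; then
        echo "usage: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>" >&2
        return 1
    fi
    master_name=$1; role=$2; state=$3
    from_ip=$4; from_port=$5; to_ip=$6; to_port=$7
    # A real script would update a VIP, a proxy, or a config store here.
    echo "$master_name: $state seen as $role, master moved $from_ip:$from_port -> $to_ip:$to_port"
}

# When run by Sentinel, all seven arguments are present.
if [ "$#" -eq 7 ]; then
    describe_switch "$@"
fi
```

Because the function only reads its arguments and prints a line, calling it twice with the same values is harmless, which matches the requirement that the script can be invoked more than once.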

Creating the redis_sentinel working directories

 $ mkdir -pv /data/redis_sentinel/{26379,26380,26381}

Starting Sentinel

Starting a single Sentinel
```
 $ /usr/local/redis/bin/redis-sentinel /usr/local/redis/etc/redis-sentinel-26379.conf
```
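As the startup output shows, the slave is detected automatically when Sentinel starts.

Starting the Sentinel cluster

Edit sentinel.conf for each port, then start two more Sentinel instances: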

 $ /usr/local/redis/bin/redis-sentinel /usr/local/redis/etc/redis-sentinel-26380.conf

 $ /usr/local/redis/bin/redis-sentinel /usr/local/redis/etc/redis-sentinel-26381.conf
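
Once all three Sentinels are up, one way to check that they agree is to ask each of them for the current master address. The sketch below wraps that in a function; show_master and the REDIS_CLI override are names made up for this example, while SENTINEL get-master-addr-by-name is a standard Sentinel command.

```shell
#!/bin/sh
# Sketch: ask each Sentinel which address it currently considers the master.
# REDIS_CLI defaults to the install path used in this article; override it if yours differs.

show_master() {
    cli="${REDIS_CLI:-/usr/local/redis/bin/redis-cli}"
    for port in 26379 26380 26381; do
        echo "sentinel $port:"
        # SENTINEL get-master-addr-by-name returns the master's IP and port.
        "$cli" -h 127.0.0.1 -p "$port" sentinel get-master-addr-by-name mymaster
    done
}

# Run against live Sentinels:
#   show_master
```

The same loop works with sentinel slaves mymaster or sentinel sentinels mymaster to list the slaves and the other Sentinels each instance has discovered.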

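Checking sentinel.conf again

Reopen sentinel.conf and you will find that Sentinel has written some extra information into it, recording the state changes seen during monitoring.

Simulating failures

Simulating a master failure

Shut the master down directly by running this in a terminal: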

 /usr/local/redis/bin/redis-cli -h 127.0.0.1 -p 6379 shutdown

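After a while, new entries appear in sentinel.log. The events mean the following:

+sdown: the Sentinel subjectively considers the instance down
+odown: the Sentinels objectively consider the instance down
+try-failover: the Sentinel starts the failover
+failover-end: the Sentinel has completed the failover, which involves fairly complex steps such as electing a leader Sentinel and choosing the slave to promote
+switch-master: the master role has moved to another instance
+slave: lists the slaves of the new master. The Sentinel does not wipe the information about instance 6379, because a stopped instance may come back later, and the Sentinel will let it rejoin

The full list of events: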

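 +reset-master <instance details> -- the master was reset.
 +slave <instance details> -- a slave was detected and added to the slave list.
 +failover-state-reconf-slaves <instance details> -- the failover state changed to reconf-slaves.
 +failover-detected <instance details> -- a failover started by another Sentinel or some other external entity was detected.
 +slave-reconf-sent <instance details> -- the leader Sentinel sent the SLAVEOF command to this instance to reconfigure it.
 +slave-reconf-inprog <instance details> -- the slave is being reconfigured as a slave of another master, but replication has not happened yet.
 +slave-reconf-done <instance details> -- the slave has been reconfigured as a slave of another master and is now synchronized with it.
 -dup-sentinel <instance details> -- redundant Sentinels for the given master were removed (this can happen when a Sentinel restarts).
 +sentinel <instance details> -- a new Sentinel for this master was detected and added.
 +sdown <instance details> -- the instance entered the SDOWN state.
 -sdown <instance details> -- the instance left the SDOWN state.
 +odown <instance details> -- the instance entered the ODOWN state.
 -odown <instance details> -- the instance left the ODOWN state.
 +new-epoch <instance details> -- the current configuration epoch was updated.
 +try-failover <instance details> -- the failover conditions are met; waiting to be elected by the other Sentinels.
 +elected-leader <instance details> -- this Sentinel won the election and will perform the failover.
 +failover-state-select-slave <instance details> -- the failover is about to select a slave to promote.
 no-good-slave <instance details> -- there is no suitable slave to promote.
 +selected-slave <instance details> -- a suitable slave to promote was found.
 +failover-state-send-slaveof-noone <instance details> -- the selected slave is being switched to the master role.
 +failover-end-for-timeout <instance details> -- the failover was terminated because of a timeout.
 +failover-end <instance details> -- the failover completed successfully.
 +switch-master <master name> <oldip> <oldport> <newip> <newport> -- the master's address changed. This is usually the message clients care about most.
 +tilt -- Tilt mode entered.
 -tilt -- Tilt mode exited.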
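
Because +switch-master carries both the old and the new address, it is easy to pull the new master out of a log. The following sketch assumes log lines shaped like the samples later in this article (pid, date, time, a marker character, then the event); the function name new_master_from_log and the sample line are made up for this example.

```shell
#!/bin/sh
# Sketch: print the new master address from +switch-master lines in a Sentinel log.
# Assumes log lines of the form:
#   <pid>:X <day> <month> <time> # +switch-master <master-name> <oldip> <oldport> <newip> <newport>

new_master_from_log() {
    # Find the +switch-master token, then print the 4th and 5th fields after it.
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "+switch-master" && i + 5 <= NF)
                print $(i + 4) ":" $(i + 5)
    }'
}

# Example with a made-up log line; prints 127.0.0.1:6380
echo '2208:X 14 Jun 23:20:01.000 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6380' \
    | new_master_from_log
```

The function reads standard input, so it can be fed a whole file with new_master_from_log < sentinel.log.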

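Recovering the failed master

Restart the 6379 instance and check sentinel.log; the following entry has been added:
- -sdown: the Sentinel lets the instance that was down rejoin, now as a slave of the new master
At this point you can connect to the 6379 instance again and see this for yourself.

Troubleshooting (optional)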

2208:X 14 Jun 23:13:09.185 * +sentinel sentinel ebf9b1b4a5cc98bffead5d0996b8f43deb806641 10.0.3.92 16379 @ dxy 10.0.3.110 6379
2208:X 14 Jun 23:13:24.234 # +sdown sentinel ebf9b1b4a5cc98bffead5d0996b8f43deb806641 10.0.3.92 16379 @ dxy 10.0.3.110 6379
2208:X 14 Jun 23:14:18.888 * +sentinel sentinel 07e189ae6c30d4951d3eb48e9effd948de026c3b 10.0.3.66 16379 @ dxy 10.0.3.110 6379
2208:X 14 Jun 23:14:33.962 # +sdown sentinel 07e189ae6c30d4951d3eb48e9effd948de026c3b 10.0.3.66 16379 @ dxy 10.0.3.110 6379

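The log shows that, apart from the local Sentinel, which is fine, the other two Sentinels have been marked subjectively down (SDOWN), each almost exactly 15 seconds after it was discovered (down-after-milliseconds 15000). A Sentinel sends heartbeat PINGs to the master to check whether it is alive; if the master does not reply with PONG within a certain time window, or replies with an error, that Sentinel subjectively (unilaterally) considers the master unavailable (subjectively down, or SDOWN for short). Sentinels apply the same PING check to each other.

The down-after-milliseconds option sets this time window, in milliseconds.

Judging from the timestamps, the Sentinels cannot reach each other, which leads to SDOWN. With no error message to go on, I spent a long time looking for the cause without finding anything. In the end I logged in to a Sentinel to take a look:

$ redis-cli -h 127.0.0.1 -p 26379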
127.0.0.1:26379> info
DENIED Redis is running in protected mode because protected mode is enabled, no bind address was specified, no authentication password is requested to clients. In this mode connections are only accepted from the loopback interface. If you want to connect from external computers to Redis you may adopt one of the following solutions: 1) Just disable protected mode sending the command 'CONFIG SET protected-mode no' from the loopback interface by connecting to Redis from the same host the server is running, however MAKE SURE Redis is not publicly accessible from internet if you do so. Use CONFIG REWRITE to make this change permanent. 2) Alternatively you can just disable the protected mode by editing the Redis configuration file, and setting the protected mode option to 'no', and then restarting the server. 3) If you started the server manually just for testing, restart it with the '--protected-mode no' option. 4) Setup a bind address or an authentication password. NOTE: You only need to do one of the above things in order for the server to start accepting connections from the outside.

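This long message boils down to: when Redis has neither a bind address nor a password configured, protected mode is in effect, and Redis then accepts connections only from the IPv4 and IPv6 loopback addresses. External connections are refused with this message, so that the user knows what went wrong.

The point is to give the user a clue rather than just refusing the connection silently. See the author's discussion for the details; the advice given there in the end is to turn protected mode off with --protected-mode no. This explains the error in our case: the Sentinels had no bind address and no password configured, so protected mode was in effect and rejected connections from the other Sentinels, which is why they were marked SDOWN. Add the following to sentinel.conf: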

protected-mode no

That solved the problem. protected-mode was introduced in Redis 3.2 and is enabled by default.
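
To spot this situation before it causes SDOWN, a config file can be checked against the three conditions named in the DENIED message (protected mode enabled, no bind address, no password). The sketch below only inspects the text of the file, not the running instance; the function name protected_mode_status and its three result strings are assumptions for this example.

```shell
#!/bin/sh
# Sketch: report whether a config file would leave protected mode in effect.
# Follows the DENIED message: protected mode only bites when it is enabled
# and neither a bind address nor a password is configured.
# The function name and result strings are assumptions for this example.

protected_mode_status() {
    conf=$1
    if grep -Eq '^[[:space:]]*protected-mode[[:space:]]+no' "$conf"; then
        echo "disabled"          # explicitly turned off
    elif grep -Eq '^[[:space:]]*(bind|requirepass)[[:space:]]+' "$conf"; then
        echo "not-triggered"     # bind or password present
    else
        echo "active"            # remote connections will get DENIED
    fi
}

# Usage against one of the Sentinel configs from this article:
#   protected_mode_status /usr/local/redis/etc/redis-sentinel-26379.conf
```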
