[ETCD] The event in requested index is outdated and cleared

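I recently ran into a strange problem: everything worked fine when debugging locally, but once the service was deployed to the test environment, the watch kept failing with the error "The event in requested index is outdated and cleared".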

The watch code is simple: do one full pull, record the index of that data, and then watch starting from that index; otherwise events would be lost:

func (m *MachineMgr) watchLoop() {
	// Start watching right after the index recorded by the last full pull.
	watcher := m.keysApi.Watcher(machineDirKey, &client.WatcherOptions{
		AfterIndex: m.watchIndex,
		Recursive:  true,
	})

	for {
		// Quick quit: stop as soon as the manager's context is cancelled.
		select {
		case <-m.ctx.Done():
			return
		default:
		}

		// Bound each wait so the loop re-checks m.ctx at least every etcdTimeout.
		ctx, cfn := context.WithTimeout(context.Background(), etcdTimeout)
		rsp, err := watcher.Next(ctx)
		cfn()
		if err != nil {
			if err == context.DeadlineExceeded {
				// No event arrived within etcdTimeout; keep waiting.
				continue
			}
			log.Errorf("Watch key error: %v", err)
			// If we meet the outdated error, we need to update the watch index
			if watchIndexOutDated(err) {
				time.Sleep(time.Second * 10)
				// Pull all config, which also refreshes m.watchIndex
				if err = m.pullAllNodes(); err != nil {
					log.Errorf("Pull all nodes error: %v", err)
				}
				// Rebuild the watcher from the refreshed index.
				watcher = m.keysApi.Watcher(machineDirKey, &client.WatcherOptions{
					AfterIndex: m.watchIndex,
					Recursive:  true,
				})
			}
			continue
		}
		// Deal with the watch event
		if err = m.handleWatchEvent(rsp); err != nil {
			log.Errorf("Handle watch event error: %v", err)
		}
	}
}

The error kept coming, so I later added the watchIndexOutDated handling: when this error occurs, do another full pull and refresh the data index.
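
The post does not show watchIndexOutDated itself. One way it might look is sketched below, assuming the etcd v2 Go client, where a failed key request comes back as a client.Error value carrying a numeric code, and code 401 (client.ErrorCodeEventIndexCleared) is the one behind this message. To keep the sketch standalone, etcdError and codeEventIndexCleared are local stand-ins for those client names; this is not the author's actual helper.

```go
package main

import (
	"errors"
	"fmt"
)

// codeEventIndexCleared mirrors etcd v2 error code 401
// ("The event in requested index is outdated and cleared").
const codeEventIndexCleared = 401

// etcdError is a local stand-in for the v2 client's error type,
// so this sketch compiles without the etcd dependency.
type etcdError struct {
	Code    int
	Message string
	Cause   string
	Index   uint64
}

func (e etcdError) Error() string {
	return fmt.Sprintf("%d: %s (%s) [%d]", e.Code, e.Message, e.Cause, e.Index)
}

// watchIndexOutDated reports whether err means the requested watch index
// has already been dropped from etcd's event history.
func watchIndexOutDated(err error) bool {
	var e etcdError
	if errors.As(err, &e) {
		return e.Code == codeEventIndexCleared
	}
	return false
}

func main() {
	stale := etcdError{
		Code:    codeEventIndexCleared,
		Message: "The event in requested index is outdated and cleared",
	}
	fmt.Println(watchIndexOutDated(stale))                            // true
	fmt.Println(watchIndexOutDated(errors.New("connection refused"))) // false
}
```

Matching on the numeric code rather than on the message text keeps the check stable if the wording of the message ever changes.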

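Later I looked into it a bit. etcd (the v2 API used here) keeps only the most recent 1000 change events as history, and that is not 1000 per key: the 1000 are shared by all keys. In other words, if the cluster is taking a lot of writes and you do not consume watch events promptly, the index you are watching from falls out of that window, the events you care about have already been discarded, and you get this error.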
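
The effect of a shared, bounded history can be illustrated with a toy model. The sketch below is not etcd's implementation; eventHistory, write, and scan are hypothetical names, and the only facts taken from the text above are the limit of 1000 events and the fact that all keys share it.

```go
package main

import (
	"errors"
	"fmt"
)

// historyCap matches the 1000-event limit described above.
const historyCap = 1000

var errIndexCleared = errors.New("The event in requested index is outdated and cleared")

// event is one recorded write; index is the cluster-wide modification index.
type event struct {
	index uint64
	key   string
}

// eventHistory is a toy model of a single history shared by every key.
type eventHistory struct {
	events []event // oldest first, at most historyCap entries
	next   uint64  // index the next write will receive
}

func newEventHistory() *eventHistory { return &eventHistory{next: 1} }

// write records a change to key and evicts the oldest event when full.
func (h *eventHistory) write(key string) uint64 {
	idx := h.next
	h.next++
	h.events = append(h.events, event{index: idx, key: key})
	if len(h.events) > historyCap {
		h.events = h.events[1:]
	}
	return idx
}

// scan returns the first retained event for key with index > afterIndex.
// It fails when events after afterIndex may already have been evicted.
func (h *eventHistory) scan(key string, afterIndex uint64) (*event, error) {
	if len(h.events) > 0 && afterIndex+1 < h.events[0].index {
		return nil, errIndexCleared
	}
	for i := range h.events {
		if h.events[i].index > afterIndex && h.events[i].key == key {
			return &h.events[i], nil
		}
	}
	return nil, nil // nothing yet; a real watch would block here
}

func main() {
	h := newEventHistory()
	watchIndex := h.write("/machines/a") // index recorded by the full pull

	// Heavy writes to unrelated keys push the recorded index out of the window.
	for i := 0; i < 1500; i++ {
		h.write("/status/report")
	}

	_, err := h.scan("/machines/a", watchIndex)
	fmt.Println(err)
}
```

Note that the watched key itself was never written again; traffic on other keys alone is enough to invalidate the recorded index.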

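Our system originally used etcd as both the configuration center and the status center, and the status reports put very heavy write pressure on etcd. Starting with this iteration, all reported status is being moved to redis, and etcd is used only for configuration and service discovery. So the root cause of this problem was the original architecture.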
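
A rough back-of-the-envelope estimate shows how much the write rate matters. The sketch below assumes the 1000-event shared history described earlier; the machine count and report interval are made-up numbers for illustration, not measurements from this system.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// historyCap is the shared event-history size described earlier.
const historyCap = 1000

// historyWindow estimates how long a recorded index stays watchable when
// the whole cluster sustains writesPerSecond writes.
func historyWindow(writesPerSecond float64) time.Duration {
	if writesPerSecond <= 0 {
		return time.Duration(math.MaxInt64) // no writes: nothing is evicted
	}
	seconds := float64(historyCap) / writesPerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// Hypothetical load: 200 machines, each reporting status every 2 seconds.
	machines, intervalSeconds := 200.0, 2.0
	fmt.Println(historyWindow(machines / intervalSeconds)) // 10s

	// Configuration-only traffic, e.g. roughly one write per minute.
	fmt.Println(historyWindow(1.0 / 60.0))
}
```

Under those assumed numbers a watcher has only about ten seconds to catch up before its index is cleared, while with configuration-only traffic the window stretches to many hours.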

Author

sryan
today is a good day
