About TCP Keepalive

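I have recently been analyzing a strange behavior seen when applications connect to one of our services. I am responsible for a server that exposes a port to provide the service. Between the clients and the server sits a layer-4 load balancer, similar to LVS, that provides the LB service.

I simulated a network failure on the server side. Because LVS has a timeout mechanism, it drops the connection in both directions, and after dropping it sends no RST to either the client (C) or the server (S).

Once the network recovered after a 3-minute outage, all of the connections on the server side were gone.

In theory, these connections should not have disappeared. Using packet captures and other means, I traced the problem to a "Connection timed out" error that caused reads on the client connections to fail.

It turned out that the read timeout error was triggered by the TCP Keepalive I had configured myself. I had set a probe interval of 90 seconds; the relevant code is as follows: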

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int EnableKeepalive(int fd, int interval) {
    int val = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) == -1) {
        return -1;
    }

#ifdef __linux__
    /* Default settings are more or less garbage, with the keepalive time
     * set to 7200 by default on Linux. Modify settings to make the feature
     * actually useful. */

    /* Send first probe after interval. */
    val = interval;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) < 0) {
        return -1;
    }

    /* Send next probes after the specified interval. Note that we set the
     * delay as interval / 3, as we send three probes before detecting
     * an error (see the next setsockopt call). */
    val = interval / 3;
    if (val == 0)
        val = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &val, sizeof(val)) < 0) {
        return -1;
    }

    /* Consider the socket in error state after we send three ACK
     * probes without getting a reply. */
    val = 3;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) < 0) {
        return -1;
    }
#else
    ((void)interval); /* Avoid unused var warning for non Linux systems. */
#endif

    return 0;
}
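
To check that these options really took effect on a given socket, they can be read back with getsockopt. The following is a minimal sketch, not part of the original code: the ReadKeepalive function and the keepalive_settings struct are hypothetical names introduced only for illustration, and the TCP_KEEP* options are read only on Linux, matching the #ifdef above.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper type, introduced for illustration only. */
struct keepalive_settings {
    int enabled; /* SO_KEEPALIVE, normalized to 0 or 1 */
    int idle;    /* TCP_KEEPIDLE in seconds, -1 if not read */
    int intvl;   /* TCP_KEEPINTVL in seconds, -1 if not read */
    int cnt;     /* TCP_KEEPCNT, -1 if not read */
};

/* Read the keepalive settings currently applied to fd.
 * Returns 0 on success, -1 if any getsockopt call fails. */
int ReadKeepalive(int fd, struct keepalive_settings *out) {
    socklen_t len;
    int val = 0;

    out->enabled = 0;
    out->idle = out->intvl = out->cnt = -1;

    len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) == -1)
        return -1;
    /* Some systems report the raw flag value, so normalize it. */
    out->enabled = (val != 0);

#ifdef __linux__
    len = sizeof(val);
    if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, &len) == -1)
        return -1;
    out->idle = val;

    len = sizeof(val);
    if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &val, &len) == -1)
        return -1;
    out->intvl = val;

    len = sizeof(val);
    if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &val, &len) == -1)
        return -1;
    out->cnt = val;
#endif

    return 0;
}
```

Calling it right after EnableKeepalive(fd, 60) on Linux should report an idle time of 60, an interval of 20, and a count of 3.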

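We pass in an interval of 60, so the first probe goes out after the connection has been idle for 60 seconds, and probes are then sent 20 seconds apart. Once three probes have gone unanswered, the kernel gives up and the socket fails with ETIMEDOUT ("Connection timed out").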
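
The arithmetic can be written down as a small sketch. PlanKeepalive and SecondsUntilTimeout are hypothetical helpers, not part of the original code. They only mirror the values EnableKeepalive derives from its interval argument and add them up, assuming the connection stays completely silent and the Linux behavior of declaring the peer dead one probe interval after the last unanswered probe.

```c
/* Hypothetical helper type mirroring the three Linux options. */
struct keepalive_plan {
    int idle;  /* value given to TCP_KEEPIDLE */
    int intvl; /* value given to TCP_KEEPINTVL */
    int cnt;   /* value given to TCP_KEEPCNT */
};

/* Derive the option values the same way EnableKeepalive does. */
struct keepalive_plan PlanKeepalive(int interval) {
    struct keepalive_plan p;

    p.idle = interval;
    p.intvl = interval / 3;
    if (p.intvl == 0)
        p.intvl = 1;
    p.cnt = 3;
    return p;
}

/* Approximate seconds of silence before the socket reports ETIMEDOUT:
 * the idle time plus one probe interval per unanswered probe. */
int SecondsUntilTimeout(struct keepalive_plan p) {
    return p.idle + p.intvl * p.cnt;
}
```

With an interval of 60 this gives 60 + 3 * 20 = 120 seconds, well inside a 3-minute outage. An interval of 90 gives 90 + 3 * 30 = 180 seconds.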

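Of course, this is only a protocol-level liveness probe. In real applications an application-level heartbeat is far more reliable than the TCP-level one, because a reachable network does not mean the service is healthy. On top of that, the various load balancers in the path may well answer the probes on the server's behalf.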
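
As an illustration of what an application-level heartbeat might look like, here is a minimal sketch. Everything in it is an assumption made for the example: the AppHeartbeat name, the "PING\n" / "PONG\n" messages, and the simplification that the reply arrives in a single recv call. A real protocol would define its own heartbeat frames and handle partial reads.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Avoid SIGPIPE on send where the platform offers the flag. */
#ifdef MSG_NOSIGNAL
#define HB_SEND_FLAGS MSG_NOSIGNAL
#else
#define HB_SEND_FLAGS 0
#endif

/* Hypothetical application-level heartbeat: send "PING\n" and wait up to
 * timeout_ms for "PONG\n".
 * Returns 1 if the peer answered, 0 on timeout, and -1 on error, on
 * connection close, or on an unexpected reply. */
int AppHeartbeat(int fd, int timeout_ms) {
    static const char ping[] = "PING\n";
    const size_t ping_len = sizeof(ping) - 1;
    char buf[8];
    struct pollfd pfd;
    ssize_t n;
    int rc;

    if (send(fd, ping, ping_len, HB_SEND_FLAGS) != (ssize_t)ping_len)
        return -1;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0)
        return 0; /* the service did not answer in time */
    if (rc < 0)
        return -1;

    n = recv(fd, buf, sizeof(buf) - 1, 0);
    if (n <= 0)
        return -1; /* read error or connection closed by the peer */
    buf[n] = '\0';

    /* Assumes the whole reply arrived in one read. */
    return strncmp(buf, "PONG\n", 5) == 0 ? 1 : -1;
}
```

Unlike a TCP keepalive probe, the reply here has to come from the service process itself, so a load balancer that acknowledges packets on the server's behalf cannot satisfy it.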

Author

sryan
today is a good day
