From 56bf73716d6de70dfe943e0bf39bb86b9200de02 Mon Sep 17 00:00:00 2001
From: delthas
Date: Sat, 18 Jul 2020 22:14:19 +0200
Subject: [PATCH] Increase downstream TCP keepalive interval to 1 hour

The rationale for increasing the TCP keepalive interval from 15 seconds
(default) to 1 hour follows.

- Why increasing TCP keepalives for downstream connections is not an
  issue with regard to detecting connection interruptions

The use case of TCP keepalives is detecting whether a TCP connection was
forcefully shut down without any TCP FIN or RST segment being received,
when no data is sent from that endpoint to the other peer.

If that endpoint does send data and it is not ACKed because the
connection was interrupted, the socket will be closed after the TCP RTO
(usually a few seconds) anyway, without the need for TCP keepalives.

Therefore the only use of TCP keepalives is making sure that a peer that
is not writing anything to the socket, and is only reading and waiting
for new stream data, can detect the disconnection instead of waiting
forever for packets that will never arrive, close the connection
locally, then try to connect to its peer again.

This only makes sense from a client point of view. When an IRC client is
not write(2)ing anything to the socket but is simply waiting for new
messages to arrive, i.e. read(2)ing on the socket, it must ensure that
the connection is still alive so that any new messages will indeed be
delivered to it. So that IRC client should probably enable TCP
keepalives.

However, when an IRC server is not writing anything to its downstream
socket, it doesn't care if it misses any messages from its downstream
client: in any case, the downstream client will quickly detect that its
messages are not reaching the server, because of the TCP RTO (keepalives
are not even needed in the client in that specific case), and will try
to reconnect to the server.

Thus TCP keepalives should be enabled for upstream connections, in order
to make sure that soju does not miss any messages coming from upstream
servers, but TCP keepalives are not needed for downstream connections.

- Why increasing TCP keepalives for downstream connections is not an
  issue with regard to security, performance, and server socket
  resource exhaustion

TCP keepalives are orthogonal to security. Malicious clients can open
thousands of TCP connections and keep them open with minimal
bookkeeping, and TCP keepalives will not prevent attacks that aim to use
up all sockets available to soju.

It is also unlikely that soju will keep many connections open. Even if
thousands of dead connections were still open in soju, any upstream
message that needs to be sent to downstreams would cause all of those
dead downstreams to be closed after the TCP RTO (a few seconds).
Performance could only be slightly affected, during the few seconds
before a TCP RTO, if many messages were sent to a very large number of
dead connections, which is extremely unlikely and would not have a large
performance impact either.

- Why increasing TCP keepalives could be helpful to some clients running
  on mobile devices

In the current state of IRC, most clients running on mobile devices
(mostly Android and iOS) will probably need to stay connected at all
times, even when the application is in the background, in order to
receive private messages and highlight notifications and a complete chat
history (and possibly to reduce connection traffic by avoiding all the
initial message traffic, including all NAMES and WHO replies, which are
quite large). This means most IRC clients on mobile devices will keep a
socket open at all times, in the background.

When a mobile device runs on a cellular data connection, it uses the
phone's wireless radio to transmit all TCP packets, including TCP
packets without user data, for example TCP keepalives.

On a typical mobile device, the wireless radio consumes significant
power when fully active, so it switches between several energy states in
order to conserve power when not in use. It typically has 3 energy
states: Standby (when no messages are sent), Low Power, and Full Power;
and it switches states on an average time scale of 15s. This means that
whenever any TCP packet is sent from any socket on the device, the radio
switches to a high-power energy state, sends the packet, stays in that
energy state for around 15s, then goes back to Standby. This includes
TCP keepalives.

If a TCP keepalive of 15s was used, the IRC server would force all
clients running on mobile devices to send a TCP keepalive packet at
least once every 15s, which means that the radio would stay in its
high-power energy state at all times. This would consume a very
significant amount of power and drain the battery much faster.

Even though it would seem at first that a mobile device has many
different sockets open at any time, a typical Android device actually
has one background socket open, to Firebase Cloud Messaging, for
receiving instant push notifications (for example, the equivalent of IRC
highlights on other messaging platforms), and perhaps a socket open for
the current foreground app. When the current foreground app does not use
the network, or when no app is in use and the phone is in sleep mode,
and no notifications are sent, the device can effectively have no
wireless radio usage at all. This makes removing TCP keepalives
extremely significant with regard to the mobile device's battery usage.

Increasing the TCP keepalive interval in soju lets downstream clients
choose their own keepalive interval and therefore possibly save battery
on mobile devices.

Most modern mobile devices have complex heuristics for when to sleep the
CPU and wireless radio, and have specific rules for TCP keepalives
depending on the current internet connection, sleep state, etc. By
increasing the downstream TCP keepalive interval to such a long period,
soju lets clients choose their own optimal TCP keepalive period, which
in turn means that clients can let their mobile platform choose the best
keepalive for them, and thus save battery in those cases.
---
 cmd/soju/main.go | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/cmd/soju/main.go b/cmd/soju/main.go
index 7ee179e..f62f36b 100644
--- a/cmd/soju/main.go
+++ b/cmd/soju/main.go
@@ -1,6 +1,7 @@
 package main
 
 import (
+	"context"
 	"crypto/tls"
 	"flag"
 	"log"
@@ -12,6 +13,7 @@ import (
 	"strings"
 	"sync/atomic"
 	"syscall"
+	"time"
 
 	"github.com/pires/go-proxyproto"
 
@@ -19,6 +21,9 @@ import (
 	"git.sr.ht/~emersion/soju/config"
 )
 
+// TCP keep-alive interval for downstream TCP connections
+const downstreamKeepAlive = 1 * time.Hour
+
 func main() {
 	var listen, configPath string
 	var debug bool
@@ -96,10 +101,14 @@ func main() {
 			}
 			ircsTLSCfg := tlsCfg.Clone()
 			ircsTLSCfg.NextProtos = []string{"irc"}
-			ln, err := tls.Listen("tcp", host, ircsTLSCfg)
+			lc := net.ListenConfig{
+				KeepAlive: downstreamKeepAlive,
+			}
+			l, err := lc.Listen(context.Background(), "tcp", host)
 			if err != nil {
 				log.Fatalf("failed to start TLS listener on %q: %v", listen, err)
 			}
+			ln := tls.NewListener(l, ircsTLSCfg)
 			ln = proxyProtoListener(ln, srv)
 			go func() {
 				if err := srv.Serve(ln); err != nil {
@@ -111,7 +120,10 @@ func main() {
 			if _, _, err := net.SplitHostPort(host); err != nil {
 				host = host + ":6667"
 			}
-			ln, err := net.Listen("tcp", host)
+			lc := net.ListenConfig{
+				KeepAlive: downstreamKeepAlive,
+			}
+			ln, err := lc.Listen(context.Background(), "tcp", host)
 			if err != nil {
 				log.Fatalf("failed to start listener on %q: %v", listen, err)
 			}