Increase downstream TCP keepalive interval to 1 hour

The rationale for increasing the TCP keepalive interval from 15 seconds
(default) to 1 hour follows.

- Why increasing the TCP keepalive interval for downstream connections
  is not an issue with regard to detecting connection interruptions

The use case of TCP keepalives is detecting whether a TCP connection was
forcefully shut down without the endpoint receiving any TCP FIN or RST
segment, when that endpoint is not sending any data to its peer.

If an endpoint does send data and that data is not ACKed because the
connection was interrupted, the socket will be closed after the TCP RTO
(usually a few seconds) anyway, without the need for TCP keepalives.

Therefore the only use of TCP keepalives is to let a peer that is not
writing anything to the socket, and is only reading and waiting for new
stream data, detect that the connection was interrupted instead of
waiting forever for packets that will never arrive. That peer can then
close the connection locally and try to connect to its peer again.

This only makes sense from a client point of view. When an IRC client is
not write(2)ing anything to the socket but is simply waiting for new
messages to arrive, i.e. read(2)ing on the socket, it must ensure that
the connection is still alive so that any new messages will indeed be
delivered to it. So that IRC client should probably enable TCP
keepalives.

However, when an IRC server is not writing anything to its downstream
socket, it doesn't care if it misses any messages from its downstream
client: in any case, the downstream client will quickly detect when
its messages are not reaching its server, because of the TCP RTO
(keepalives are not even needed in the client in that specific case),
and will try to reconnect to the server.

Thus TCP keepalives should be enabled for upstream connections, in
order to make sure that soju does not miss any messages coming from
upstream servers, but TCP keepalives are not needed for downstream
connections.

- Why increasing the TCP keepalive interval for downstream connections
  is not an issue with regard to security, performance, and server
  socket exhaustion

TCP keepalives are orthogonal to security. Malicious clients can open
thousands of TCP connections and keep them open with minimal
bookkeeping, and TCP keepalives will not prevent attacks that aim to
exhaust the sockets available to soju.

It is also unlikely that soju will keep many connections open. In the
event that thousands of dead connections do linger in soju, any upstream
message that needs to be sent to downstreams will cause all the dead
downstream connections to be closed after the TCP RTO (a few seconds).
Performance could only be slightly affected in the few seconds before a
TCP RTO if many messages were sent to a very large number of dead
connections, which is extremely unlikely and would not have a large
performance impact either.

- Why increasing the TCP keepalive interval could be helpful to some
  clients running on mobile devices

In the current state of IRC, most clients running on mobile devices
(mostly running Android and iOS) will probably need to stay connected
at all times, even when the application is in the background, in order
to receive private message and highlight notifications and a complete
chat history (and possibly to reduce traffic by avoiding the initial
burst of messages on each new connection, including all the NAMES and
WHO replies, which are quite large).

This means most IRC clients on mobile devices will keep a socket open at
all times, in the background. When a mobile device runs on a cellular
data connection, it uses the phone's wireless radio to transmit all TCP
packets, including TCP packets without user data, for example TCP
keepalives.

On a typical mobile device, a wireless radio consumes significant power
when fully active, so it switches between several energy states in order
to conserve power when not in use. It typically has 3 energy states,
from Standby, when no messages are sent, to Low Power, to Full Power,
and switches modes on an average time scale of 15s. This means that any
time any TCP packet is sent from any socket on the device, the radio
switches to a high-power energy state, sends the packet, then stays in
that energy state for around 15s, then goes back to Standby. This
includes TCP keepalives.

If a TCP keepalive interval of 15s were used, the IRC server would
force all clients running on mobile devices to send a TCP keepalive
packet at least once every 15s, which means that the radio would stay
in its high-power energy state at all times. This would consume a
very significant amount of power and drain the battery much faster.

Even though it might seem at first that a mobile device has many
different sockets open at any time, a typical Android device actually
has just one background socket open, to Firebase Cloud Messaging, for
receiving instant push notifications (for example, for the equivalent
of IRC highlights on other messaging platforms), and perhaps a socket
open for the current foreground app. When the current foreground app
does not use the network, or when no app is currently in use and the
phone is in sleep mode, and no notifications are sent, the device can
effectively have no wireless radio usage at all. This makes removing
TCP keepalives extremely significant with regard to the mobile device's
battery usage.

Increasing soju's TCP keepalive interval lets downstream clients choose
their own keepalive interval and therefore possibly save battery on
mobile devices. Most modern mobile devices have complex heuristics for
when to sleep the CPU and wireless radio, and have specific rules for
TCP keepalives depending on the current internet connection, sleep
state, etc.

By increasing the downstream TCP keepalive to such a long period, soju
lets clients choose their own optimal TCP keepalive period, which in
turn means that clients can let their mobile device platform choose the
best keepalive period for them, thus letting them save battery in those
cases.
delthas 2020-07-18 22:14:19 +02:00 committed by Simon Ser
parent c0513013d5
commit 56bf73716d

@@ -1,6 +1,7 @@
 package main
 
 import (
+	"context"
 	"crypto/tls"
 	"flag"
 	"log"
@@ -12,6 +13,7 @@ import (
 	"strings"
 	"sync/atomic"
 	"syscall"
+	"time"
 
 	"github.com/pires/go-proxyproto"
 
@@ -19,6 +21,9 @@ import (
 	"git.sr.ht/~emersion/soju/config"
 )
 
+// TCP keep-alive interval for downstream TCP connections
+const downstreamKeepAlive = 1 * time.Hour
+
 func main() {
 	var listen, configPath string
 	var debug bool
@@ -96,10 +101,14 @@ func main() {
 			}
 			ircsTLSCfg := tlsCfg.Clone()
 			ircsTLSCfg.NextProtos = []string{"irc"}
-			ln, err := tls.Listen("tcp", host, ircsTLSCfg)
+			lc := net.ListenConfig{
+				KeepAlive: downstreamKeepAlive,
+			}
+			l, err := lc.Listen(context.Background(), "tcp", host)
 			if err != nil {
 				log.Fatalf("failed to start TLS listener on %q: %v", listen, err)
 			}
+			ln := tls.NewListener(l, ircsTLSCfg)
 			ln = proxyProtoListener(ln, srv)
 			go func() {
 				if err := srv.Serve(ln); err != nil {
@@ -111,7 +120,10 @@ func main() {
 			if _, _, err := net.SplitHostPort(host); err != nil {
 				host = host + ":6667"
 			}
-			ln, err := net.Listen("tcp", host)
+			lc := net.ListenConfig{
+				KeepAlive: downstreamKeepAlive,
+			}
+			ln, err := lc.Listen(context.Background(), "tcp", host)
 			if err != nil {
 				log.Fatalf("failed to start listener on %q: %v", listen, err)
 			}