gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers

tl;dr: Enable keepalives and "permit without stream" on both the client and the server, and make sure the keepalive time matches between clients and servers.

We’re using gRPC in containerized applications written in four languages on AWS Fargate. Our service location and load balancing approach is currently a combination of DNS and AWS Network Load Balancers.

We’ve been plagued by “transport is closing” errors with no obvious cause.  Once this bubbled up from “annoyance” to “problem”, we dug in to figure out what was going on and how to fix it.  If you are reading this, hopefully our work will help you. 🙂

Root Cause

The root of the problem is that we need AWS Network Load Balancers in order to fully support HTTP/2. NLBs don’t support configurable timeouts and terminate connections after 350 seconds of idle time. You can confirm this behavior by watching the NLB’s “Load balancer reset count” metric.

The Fix(es)

First, we attempted to get around this by configuring gRPC keepalives on our clients. It seemed like a keepalive sent every minute or so would be enough to keep the NLB connections open.

Unfortunately, this didn’t work. In fact, we found that keepalives weren’t even being sent. Digging into the gRPC implementation, we realized keepalives are only sent on connections with active streams unless the “permit without stream” option is enabled. So, our second attempt included enabling this on the clients.
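A minimal sketch of that client-side configuration in gRPC-Go (the address, timing values, and plaintext credentials here are placeholders rather than our production settings; the other languages expose equivalent options):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive opens a connection that pings the server even when no
// RPCs are in flight, so the NLB sees traffic before its 350-second idle
// timeout expires.
func dialWithKeepalive(address string) (*grpc.ClientConn, error) {
	return grpc.Dial(address,
		grpc.WithInsecure(), // plaintext keeps the sketch short
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a ping after 30s with no activity
			Timeout:             10 * time.Second, // close the connection if no ack within 10s
			PermitWithoutStream: true,             // the option that actually makes pings flow when idle
		}),
	)
}

func main() {
	conn, err := dialWithKeepalive("my-service.example.internal:9138")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}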

This actually worked better, but connections still died, just after a longer delay. At this point, we enabled trace debugging on both a Go client and a Java server, observed the behavior, then dug into the implementations again. What we saw in this particular combination was that the gRPC Java server wouldn’t tolerate more than 2 pings in a 5-minute period from a client:

2019-01-24 22:09:39.163 [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x916edc41, L:/10.0.0.150:9138 - R:/10.0.0.151:10544] INBOUND PING: ack=false bytes=0
2019-01-24 22:09:39.164 [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x916edc41, L:/10.0.0.150:9138 - R:/10.0.0.151:10544] OUTBOUND PING: ack=true bytes=0
2019-01-24 22:11:49.164 [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x916edc41, L:/10.0.0.150:9138 - R:/10.0.0.151:10544] INBOUND PING: ack=false bytes=0
2019-01-24 22:11:49.164 [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x916edc41, L:/10.0.0.150:9138 - R:/10.0.0.151:10544] OUTBOUND PING: ack=true bytes=0
2019-01-24 22:11:49.164 [grpc-default-worker-ELG-3-1] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x916edc41, L:/10.0.0.150:9138 - R:/10.0.0.151:10544] OUTBOUND GO_AWAY: lastStreamId=1 errorCode=11 length=14 bytes=746f6f5f6d616e795f70696e6773

We tried adjusting the keepalive time on the Go client, but gRPC-Go 1.18 didn’t strictly honor it. The first ping came at exactly double the configured keepalive time, and the second ping came a varying amount of time after that; the client was soon disconnected with GO_AWAY ENHANCE_YOUR_CALM. Every time the client reconnected, it doubled the initial ping time. This kept idle connections open longer than having no keepalives at all, but there was no keepalive value that wouldn’t either eventually let the NLB terminate the connection or cause the server to kill it.

The only option was to enable keepalives on both clients and servers and standardize the time values so the server wouldn’t boot off a more aggressive client (a sketch of the matching server-side configuration is at the end of this post). Trace logging now confirmed pings flowing in both directions with no GO_AWAY disconnects from the server:

2019-01-25 17:06:04.179 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] OUTBOUND PING: ack=false bytes=57005
2019-01-25 17:06:04.187 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] INBOUND PING: ack=true bytes=57005
2019-01-25 17:06:34.264 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] OUTBOUND PING: ack=false bytes=57005
2019-01-25 17:06:34.267 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] INBOUND PING: ack=true bytes=57005
2019-01-25 17:07:04.267 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] OUTBOUND PING: ack=false bytes=57005
2019-01-25 17:07:04.269 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] INBOUND PING: ack=true bytes=57005
2019-01-25 17:07:34.270 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] OUTBOUND PING: ack=false bytes=57005
2019-01-25 17:07:34.272 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] INBOUND PING: ack=true bytes=57005
2019-01-25 17:08:04.272 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] OUTBOUND PING: ack=false bytes=57005
2019-01-25 17:08:04.275 [grpc-default-worker-ELG-3-2] DEBUG io.grpc.netty.NettyServerHandler - [id: 0x1e6794c5, L:/10.0.0.150:9138 - R:/10.0.0.151:30606] INBOUND PING: ack=true bytes=57005

We then went through all our clients and servers in Java, Go, Scala, and Python, making sure the client and server values matched for keepalive time and timeout, and that permit without stream was enabled on both sides.
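For reference, this is roughly what the matching server-side configuration looks like in gRPC-Go; the grpc-java, Scala, and Python servers expose equivalent settings. The timing values below are illustrative. What matters is that the server’s enforcement policy allows pings at least as frequently as the client sends them, and that pings without active streams are permitted:

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":9138")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer(
		// Server-initiated pings, matching the client's keepalive time.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // ping the client after 30s of inactivity
			Timeout: 10 * time.Second, // drop the connection if no ack within 10s
		}),
		// Enforcement policy: don't send GO_AWAY "too_many_pings" to clients
		// that ping this often, even when they have no active streams.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             30 * time.Second,
			PermitWithoutStream: true,
		}),
	)

	// Register your services here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}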