What? Connection reset by peer?
We are running Node.js web services behind an AWS Classic Load Balancer. I noticed many 502 errors after I migrated from the Classic Load Balancer to an Application Load Balancer. To understand what happened, I added Nginx in front of the Node.js web server and found more than 100 ‘connection reset’ errors every day in the Nginx logs.
Here are some example logs:
2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
Analyzing the errors
The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-opened connections, while the Application LB only uses the HTTP/1.1 keep-alive feature.
From the AWS Load Balancer documentation:
Possible causes:
- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
- The target response is malformed or contains HTTP headers that are not valid.
- A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy
Because there were no errors on the Node.js side, I’m guessing it was caused by the keep-alive behaviour. The current Nginx version (1.13.6) has no upstream keep-alive timeout setting, so I tried Tengine – an Nginx fork from Taobao that supports an upstream keepalive timeout. After running it for a couple of days, there were no such errors any more. In order to understand what was causing the issue, I tried to reproduce it on my local machine.
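A minimal sketch of what that upstream block looks like with Tengine (the upstream-level keepalive_timeout directive is the Tengine-specific part; the value here is illustrative and just needs to be shorter than the upstream server’s own keep-alive timeout, so nginx closes the idle connection first):

upstream nodejs {
    least_conn;
    server chalice:8000 fail_timeout=1s max_fails=3;
    keepalive 16;
    # Tengine-specific at the time of writing: close idle upstream connections
    # on the nginx side before the Node.js keep-alive timeout fires, so nginx
    # never reuses a connection the upstream is about to tear down.
    # The 400 ms here is illustrative, chosen to sit below the 500 ms timeout
    # used in the test setup further down.
    keepalive_timeout 400ms;
}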
Capture network packets
In order to send a request at the exact moment Node.js closes the connection after the keep-alive timeout, I needed to keep requesting the URL until the issue was reproduced. Here are my settings for the testing environment:
Upstream (Node.js server):
- Set keep-alive timeout to 500 ms
Test client:
- Keeps sending requests, waiting an interval between them
- The interval starts at 500 ms and decreases by 0.1 ms after each request
For normal requests, the upstream sends a [FIN, ACK] to nginx after the keep-alive timeout (500 ms), nginx sends a [FIN, ACK] back, and the upstream then sends an [ACK] to close the connection completely.
No.  Time                        Source      Destination  Protocol  Length  Info
1    2017-11-12 17:11:04.299146  172.18.0.3  172.18.0.2   TCP       74      48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
2    2017-11-12 17:11:04.299171  172.18.0.2  172.18.0.3   TCP       74      8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
3    2017-11-12 17:11:04.299194  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
4    2017-11-12 17:11:04.299259  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
5    2017-11-12 17:11:04.299267  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
6    2017-11-12 17:11:04.299809  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
7    2017-11-12 17:11:04.299852  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
8    2017-11-12 17:11:04.800805  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
9    2017-11-12 17:11:04.801120  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
10   2017-11-12 17:11:04.801151  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355
For the failed requests, the upstream received a new HTTP request just after it had closed the connection at the keep-alive timeout (500 ms), but before it had a chance to send the [FIN] packet. Because the connection was already closed from the upstream’s perspective, it sent a [RST] in response to this request.
No.   Time                        Source      Destination  Protocol  Length  Info
433   2017-11-12 17:11:26.548449  172.18.0.3  172.18.0.2   TCP       74      48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
434   2017-11-12 17:11:26.548476  172.18.0.2  172.18.0.3   TCP       74      8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
435   2017-11-12 17:11:26.548502  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
436   2017-11-12 17:11:26.548609  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
437   2017-11-12 17:11:26.548618  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
438   2017-11-12 17:11:26.549173  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
439   2017-11-12 17:11:26.549230  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
440   2017-11-12 17:11:27.049668  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
441   2017-11-12 17:11:27.050324  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
442   2017-11-12 17:11:27.050378  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
443   2017-11-12 17:11:27.551182  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
444   2017-11-12 17:11:27.551294  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
When nginx receives the [RST] packet, it logs a ‘Connection reset by peer’ error.
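The same race can be reproduced without nginx: any keep-alive client that writes a new request at the instant the server closes the idle connection will see the reset. Below is a minimal Node.js sketch (illustrative, not part of the original test setup) against the test server listed in the next section; a single run may or may not hit the exact window, which is why the Go test client further down sweeps the interval instead:

// Illustrative only: reuse a keep-alive socket right around the server's
// 500 ms keepAliveTimeout. When the write races the server's close, the
// request fails with ECONNRESET -- the same RST that nginx logs as
// "Connection reset by peer".
const http = require('http');

const agent = new http.Agent({ keepAlive: true });

function get(cb) {
  http.get({ host: 'localhost', port: 8000, path: '/_healthcheck', agent }, res => {
    res.resume();               // drain the body so the socket returns to the pool
    res.on('end', cb);
  }).on('error', err => console.error('request failed:', err.code)); // ECONNRESET when the race is hit
}

// First request opens the connection; the second is sent ~500 ms later,
// right when the server may be closing the idle socket.
get(() => setTimeout(() => get(() => console.log('no reset this time')), 500));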
Testing Code
- Upstream Node.js Web Server
const express = require('express');
const app = express();

app.get('/_healthcheck', (req, res) => res.send('OK'));

// app.listen() returns the underlying http.Server; set its keep-alive
// timeout to 500 ms so idle connections are closed quickly.
app.listen(process.env.PORT || 8000)
  .keepAliveTimeout = 500;
- Nginx config
upstream nodejs {
    least_conn;
    server chalice:8000 fail_timeout=1s max_fails=3;
    keepalive 16;
}

server_tokens off;

log_format detailed escape=json '{'
    '"timestamp": "$time_iso8601",'
    '"remote_addr": "$remote_addr",'
    '"upstream_addr": "$upstream_addr",'
    '"connection": "$connection",'
    '"connection_requests": "$connection_requests",'
    '"request_time": "$request_time",'
    '"upstream_response_time": "$upstream_response_time",'
    '"status": "$status",'
    '"upstream_status": "$upstream_status",'
    '"body_bytes_sent": "$body_bytes_sent",'
    '"request": "$request",'
    '"http_user_agent": "$http_user_agent"'
'}';

access_log /var/log/nginx/access.log detailed;

server {
    listen 80;
    server_name localhost;

    location / {
        # HTTP/1.1 and an empty Connection header are required for
        # the upstream "keepalive" directive to take effect.
        proxy_http_version 1.1;
        proxy_redirect off;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-NginX-Proxy true;
        proxy_set_header Connection "";
        proxy_pass http://nodejs;
    }
}
- Test client in Go
package main

import (
	"log"
	"net/http"
	"time"
)

func fetch(url string) (*http.Response, error) {
	resp, err := http.Get(url)
	if err != nil {
		log.Printf("Failed to get url %v", err)
		return nil, err
	}
	log.Printf("Fetched page, response code: %v", resp.StatusCode)
	return resp, nil
}

func main() {
	url := "http://localhost/_healthcheck"
	// Start at the upstream keep-alive timeout (500 ms) and shorten the
	// interval by 0.1 ms per request to sweep across the race window.
	waitPeriod := 500 * time.Millisecond
	for times := 1; times < 20000; times++ {
		log.Printf("Fetch page, wait period: %d", waitPeriod)
		fetch(url)
		time.Sleep(waitPeriod)
		waitPeriod -= 100 * time.Microsecond
	}
}
- Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
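To find the reset connections in the resulting capture, filtering on the RST flag narrows things down quickly; the same display filter also works in the Wireshark UI:

tshark -r /tmp/connection.pcap -Y "tcp.flags.reset == 1"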