Analyze ‘Connection reset’ error in Nginx upstream with keep-alive enabled

What? Connection reset by peer?

We are running Node.js web services behind an AWS Classic Load Balancer. I noticed many 502 errors after I migrated from the Classic Load Balancer to an Application Load Balancer. To understand what happened, I added Nginx in front of the Node.js web server, and then found more than 100 ‘connection reset’ errors every day in the Nginx logs.

Here are some example logs:

2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"

Analyzing the errors

The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-opened connections, while the Application LB only uses the HTTP/1.1 keep-alive feature.

From the AWS Load Balancer documentation:

Possible causes:

  • The load balancer received a TCP RST from the target when attempting to establish a connection.
  • The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
  • The target response is malformed or contains HTTP headers that are not valid.
  • A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy

Because there were no errors on the Node.js side, I'm guessing it was caused by the keep-alive behaviour. There is no upstream keep-alive timeout setting in the current Nginx version (1.13.6), so I tried Tengine, a Taobao fork of nginx that supports an upstream keepalive timeout. After running it for a couple of days, there were no more such errors. In order to understand what was causing the issue, I tried to reproduce it on my local machine.
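For reference, here is a minimal sketch of the kind of upstream block this change amounts to. It assumes Tengine's keepalive_timeout directive for upstream blocks (stock nginx gained the same directive later, in 1.15.3), and the timeout value below is an assumption, not my exact production config; the point is to keep it shorter than the Node.js server's keepAliveTimeout so that the proxy closes idle upstream connections before the upstream does.

upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
  # Assumed value: must be shorter than the upstream's own keep-alive timeout
  keepalive_timeout 3s;
}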

Capture network packets

In order to send a request at exactly the moment Node.js closes the connection after the keep-alive timeout, I had to keep requesting the URL until the issue was reproduced. Here are my settings for the testing environment:

Upstream (Node.js server):

  • Set keep-alive timeout to 500 ms

Test client:

  • Keep sending requests with an interval
  • The interval starts at 500 ms and decreases by 0.1 ms after each request

For normal requests, the upstream sends a [FIN, ACK] to nginx after the keep-alive timeout (500 ms), nginx sends a [FIN, ACK] back, and the upstream then sends an [ACK] to close the connection completely.

No.     Time                          Source                Destination           Protocol Length Info
      1 2017-11-12 17:11:04.299146    172.18.0.3            172.18.0.2            TCP      74     48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
      2 2017-11-12 17:11:04.299171    172.18.0.2            172.18.0.3            TCP      74     8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
      3 2017-11-12 17:11:04.299194    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
      4 2017-11-12 17:11:04.299259    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
      5 2017-11-12 17:11:04.299267    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
      6 2017-11-12 17:11:04.299809    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
      7 2017-11-12 17:11:04.299852    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
      8 2017-11-12 17:11:04.800805    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
      9 2017-11-12 17:11:04.801120    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
     10 2017-11-12 17:11:04.801151    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355

For the failed requests, the upstream received a new HTTP request right after it had closed the connection at the keep-alive timeout (500 ms) but before it had a chance to send the [FIN] packet. Because the connection had already been closed from the upstream's perspective, it sent a [RST] in response to this request.

No.     Time                          Source                Destination           Protocol Length Info
    433 2017-11-12 17:11:26.548449    172.18.0.3            172.18.0.2            TCP      74     48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
    434 2017-11-12 17:11:26.548476    172.18.0.2            172.18.0.3            TCP      74     8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
    435 2017-11-12 17:11:26.548502    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
    436 2017-11-12 17:11:26.548609    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    437 2017-11-12 17:11:26.548618    172.18.0.2            172.18.0.3            TCP      66     8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
    438 2017-11-12 17:11:26.549173    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
    439 2017-11-12 17:11:26.549230    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
    440 2017-11-12 17:11:27.049668    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    441 2017-11-12 17:11:27.050324    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
    442 2017-11-12 17:11:27.050378    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
    443 2017-11-12 17:11:27.551182    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    444 2017-11-12 17:11:27.551294    172.18.0.2            172.18.0.3            TCP      66     8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630

When nginx receives the [RST] packet, it logs a ‘Connection reset’ error.

Testing Code

  • Upstream Node.js Web Server
const express = require('express');

const app = express();

app.get('/_healthcheck', (req, res) => res.send('OK'));

// Close idle keep-alive connections after 500 ms so the race is easy to hit
app.listen(process.env.PORT || 8000)
  .keepAliveTimeout = 500;
  • Nginx config
upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
}

server_tokens off;
log_format detailed escape=json
  '{' 
  '"timestamp": "$time_iso8601",'
  '"remote_addr": "$remote_addr",'
  '"upstream_addr": "$upstream_addr",'
  '"connection": "$connection",'
  '"connection_requests": "$connection_requests",'
  '"request_time": "$request_time",'
  '"upstream_response_time": "$upstream_response_time",'
  '"status": "$status",'
  '"upstream_status": "$upstream_status",'
  '"body_bytes_sent": "$body_bytes_sent ",'
  '"request": "$request",'
  '"http_user_agent": "$http_user_agent"'
  '}';
access_log /var/log/nginx/access.log detailed;

server {
  listen 80;
  server_name localhost;

  location / {
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_set_header Connection "";

    proxy_pass http://nodejs;
  }
}
  • Test client in golang
package main

import (
    "io"
    "io/ioutil"
    "log"
    "net/http"
    "time"
)

func fetch(url string) (*http.Response, error) {
    resp, err := http.Get(url)
    if err != nil {
        log.Printf("Failed to get url %v", err)
        return nil, err
    }
    // Drain and close the body so the keep-alive connection can be reused
    defer resp.Body.Close()
    io.Copy(ioutil.Discard, resp.Body)

    log.Printf("Fetched page, response code: %v", resp.StatusCode)
    return resp, nil
}

func main() {
    url := "http://localhost/_healthcheck"
    waitPeriod := 500 * time.Millisecond

    for times := 1; times < 20000; times++ {
        log.Printf("Fetch page, wait period: %d", waitPeriod)

        fetch(url)
        time.Sleep(waitPeriod)
        waitPeriod -= 100 * time.Microsecond
    }
}
  • Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
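To pick out the reset packets from the capture afterwards, a display filter on the TCP RST flag helps. This is just one way to do it, assuming tshark (the Wireshark command-line tool) is available:

tshark -r /tmp/connection.pcap -Y "tcp.flags.reset == 1"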
