What? Connection reset by peer?
We run Node.js web services behind an AWS Classic Load Balancer. I noticed many 502 errors after I migrated from the Classic Load Balancer to an Application Load Balancer. To understand what happened, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.
Here are some example logs:
2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
Analyzing the errors
The number of errors increased after I migrated from the Classic LB to the Application LB. One difference between them is that the Classic LB uses pre-established connections, while the Application LB only uses the HTTP/1.1 keep-alive feature.
From the AWS Load Balancer documentation:
Possible causes:
- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
- The target response is malformed or contains HTTP headers that are not valid.
- A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.
Because there were no errors on the Node.js side, my guess was the keep-alive behaviour. There is no upstream keep-alive timeout setting in the current Nginx version (1.13.6), so I tried Tengine, a Taobao fork of Nginx that supports an upstream keepalive timeout. After running it for a couple of days, the errors were gone. To understand what was causing the issue, I tried to reproduce it on my local machine.
Capturing network packets
To send a request at the exact moment Node.js closes the connection after its keep-alive timeout, I needed to keep requesting the URL until the issue reproduced. Here are the settings for my test environment:
Upstream (Node.js server):
- Set keep-alive timeout to 500 ms
Test client:
- Keep sending requests with an interval
- Interval starts at 500 ms and decreases by 1 ms after each request
For normal requests, the upstream sends a [FIN, ACK] to Nginx after the keep-alive timeout (500 ms), Nginx sends a [FIN, ACK] back, and the upstream then sends an [ACK] to close the connection completely.
No.  Time                        Source      Destination  Protocol  Length  Info
1    2017-11-12 17:11:04.299146  172.18.0.3  172.18.0.2   TCP       74      48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
2    2017-11-12 17:11:04.299171  172.18.0.2  172.18.0.3   TCP       74      8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
3    2017-11-12 17:11:04.299194  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
4    2017-11-12 17:11:04.299259  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
5    2017-11-12 17:11:04.299267  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
6    2017-11-12 17:11:04.299809  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
7    2017-11-12 17:11:04.299852  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
8    2017-11-12 17:11:04.800805  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
9    2017-11-12 17:11:04.801120  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
10   2017-11-12 17:11:04.801151  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355
For the failed requests, the upstream closed the connection after the keep-alive timeout (500 ms), but the client sent a new HTTP request before it received and processed the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream sent a [RST] in response. This can happen in the following scenarios:
- Upstream hasn't sent the [FIN] packet yet (the packet is pending at the network layer)
- Upstream has sent the [FIN] packet, but the client hasn't received it yet
- The client has received the [FIN] packet, but hasn't processed it yet
Example of the first scenario ([FIN] not sent yet):
No.  Time                        Source      Destination  Protocol  Length  Info
433  2017-11-12 17:11:26.548449  172.18.0.3  172.18.0.2   TCP       74      48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
434  2017-11-12 17:11:26.548476  172.18.0.2  172.18.0.3   TCP       74      8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
435  2017-11-12 17:11:26.548502  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
436  2017-11-12 17:11:26.548609  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
437  2017-11-12 17:11:26.548618  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
438  2017-11-12 17:11:26.549173  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
439  2017-11-12 17:11:26.549230  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
440  2017-11-12 17:11:27.049668  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
441  2017-11-12 17:11:27.050324  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
442  2017-11-12 17:11:27.050378  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
443  2017-11-12 17:11:27.551182  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
444  2017-11-12 17:11:27.551294  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
Example of the second scenario ([FIN] sent at the same time as a new request arrives):
No.  Time                        Source     Destination  Protocol  Length  Info
13   2018-06-15 21:40:00.522110  127.0.0.1  127.0.0.1    TCP       68      50678 > 8000 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=0 SACK_PERM=1
14   2018-06-15 21:40:00.522210  127.0.0.1  127.0.0.1    TCP       68      8000 > 50678 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=1503957438 SACK_PERM=1
15   2018-06-15 21:40:00.522219  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
16   2018-06-15 21:40:00.522228  127.0.0.1  127.0.0.1    TCP       56      [TCP Window Update] 8000 > 50678 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
17   2018-06-15 21:40:00.522315  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
18   2018-06-15 21:40:00.522358  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [ACK] Seq=1 Ack=134 Win=408160 Len=0 TSval=1503957438 TSecr=1503957438
19   2018-06-15 21:40:00.522727  127.0.0.1  127.0.0.1    HTTP      261     HTTP/1.1 200 OK (text/html)
20   2018-06-15 21:40:00.522773  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=134 Ack=206 Win=408064 Len=0 TSval=1503957438 TSecr=1503957438
21   2018-06-15 21:40:01.025685  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
22   2018-06-15 21:40:01.025687  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [FIN, ACK] Seq=206 Ack=134 Win=408160 Len=0 TSval=1503957939 TSecr=1503957438
23   2018-06-15 21:40:01.025748  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=206 Win=0 Len=0
24   2018-06-15 21:40:01.025760  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=267 Ack=207 Win=408064 Len=0 TSval=1503957939 TSecr=1503957939
25   2018-06-15 21:40:01.025769  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=207 Win=0 Len=0
When the client receives the [RST] packet, it logs a 'Connection reset' error.
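On the Node.js side the [RST] surfaces as an error with code ECONNRESET. A small sketch (the helper is hypothetical, not from the original code) of treating it as retryable, which is safe for idempotent requests like the health check above:

```javascript
// Hypothetical helper: decide whether a failed request is worth retrying.
// A connection reset on an idempotent GET is a classic retryable failure,
// since the request may never have reached the application at all.
const isRetryable = (err) => err.code === 'ECONNRESET';

const err = new Error('read ECONNRESET');
err.code = 'ECONNRESET';
console.log(isRetryable(err)); // prints true
```

Retrying only masks the symptom, though; the sections below look for the root cause.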
Testing Code
This is a generic issue that occurs when the server closes a connection while HTTP keep-alive is enabled, so you can easily reproduce it by cloning the example code (Node.js) from https://github.com/weixu365/test-connection-reset
npm install
npm start

# In a separate terminal
npm run client
- Upstream Node.js Web Server
const express = require('express');

const app = express();
app.get('/', (req, res) => res.send('OK'));

const port = process.env.PORT || 8000;
const server = app.listen(port, () => {
  console.log(`Listening on http://localhost:${port}`);
});
server.keepAliveTimeout = 500; // Close idle keep-alive connections after 500 ms
- Test client in Node.js
const axios = require('axios');
const Bluebird = require('bluebird');
const HttpAgent = require('agentkeepalive');

const keepAliveOption = {
  freeSocketKeepAliveTimeout: 30 * 1000, // Should be less than server keep-alive timeout
  socketActiveTTL: 50 * 1000, // Should be less than DNS TTL
};
const httpAgent = new HttpAgent(keepAliveOption);

const host = 'http://localhost:8000';
const path = '/';
const httpClient = axios.create({
  baseURL: host,
  timeout: 5000,
});

const sendRequest = () =>
  httpClient.request({ url: path, httpAgent })
    .then(res => console.log('Received response', res.status))
    .catch(e => console.error('Error occurred', e.message));

// Start at a 501 ms interval and shrink it by 1 ms per request,
// sweeping across the server's 500 ms keep-alive timeout.
let delay = 501;
const start = () =>
  sendRequest()
    .then(() => delay -= 1)
    .then(() => delay > 450 ? Bluebird.delay(delay).then(start) : 'Done');

start();
- Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
- Nginx config (Optional)
upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
}

server_tokens off;

log_format detailed escape=json '{'
  '"timestamp": "$time_iso8601",'
  '"remote_addr": "$remote_addr",'
  '"upstream_addr": "$upstream_addr",'
  '"connection": "$connection",'
  '"connection_requests": "$connection_requests",'
  '"request_time": "$request_time",'
  '"upstream_response_time": "$upstream_response_time",'
  '"status": "$status",'
  '"upstream_status": "$upstream_status",'
  '"body_bytes_sent": "$body_bytes_sent",'
  '"request": "$request",'
  '"http_user_agent": "$http_user_agent"'
'}';

access_log /var/log/nginx/access.log detailed;

server {
  listen 80;
  server_name localhost;

  location / {
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_set_header Connection "";
    proxy_pass http://nodejs;
  }
}
For the failed requests, the upstream received a new HTTP request after it had closed the connection at the keep-alive timeout (500 ms) but before it had a chance to send the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream sent a [RST] in response.
I ran into the same problem. Could you help explain that sentence a bit more? Why "hasn't got a chance to send the [FIN] package"?
Isn't the [FIN] packet always sent out?
Hi, I updated the blog to explain the scenarios in which this happens. Basically, the issue occurs after the server has closed the connection and before the client receives and processes the [FIN] packet.
One question: if the server actually did send a [FIN] packet, is it possible that packet-capture tools like Wireshark or tcpdump simply failed to capture it?
To add to that, my understanding is that the server actively closed the TCP connection, so it no longer sends valid packets back to the client-side Nginx.
The server will resend the [FIN] if it doesn't receive an ACK, see https://tools.ietf.org/html/rfc793
But it is still possible that the client side cannot receive the [FIN] packet, because the network is not reliable (that's also why TCP needs ACK packets).
In my case, the only problem is that the server closes the connection while the client sends a new request before it has handled the [FIN] packet. My solution is to close the connection on the client side.
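To make "close on the client side" concrete, here is an illustrative sketch (the numbers are assumptions, not from the original setup): the client's free-socket timeout must be strictly smaller than the server's keep-alive timeout, so the client always gives up an idle socket before the server resets it. `freeSocketKeepAliveTimeout` is the agentkeepalive option already used in the test client above.

```javascript
// Illustrative values: the invariant is the ordering, not the exact numbers.
const serverKeepAliveMs = 5000;  // Node 8 default keepAliveTimeout
const clientFreeSocketMs = 4000; // client closes idle sockets first

// Option object for agentkeepalive's HttpAgent, as in the test client above.
const keepAliveOption = { freeSocketKeepAliveTimeout: clientFreeSocketMs };

// The client must win the race against the server's [FIN]/[RST].
console.log(keepAliveOption.freeSocketKeepAliveTimeout < serverKeepAliveMs); // prints true
```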
The client cannot close the connection at the exact moment the server sends the [FIN], because it takes time to:
* send the packet on the server side
* deliver the packet through the network
* handle the packet on the client side
444 2017-11-12 17:11:27.551294 172.18.0.2 172.18.0.3 TCP 66 8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
In the capture above that contains the [RST] packet, why doesn't the capture show a [FIN] packet sent from Node to Nginx? This confuses me and I haven't figured it out yet.
From the captured packets, my understanding is that the server received a new request after it had decided to close the connection but before the [FIN] packet was sent out; that's why there is no [FIN] packet.
Instead of a [FIN] packet, there is a [RST] packet for the new request, and no [ACK] for the [RST] packet. See:
https://en.wikipedia.org/wiki/TCP_reset_attack
https://stackoverflow.com/questions/7735618/tcp-rst-packet-details
I added an example of the second scenario (sending [FIN] and receiving a request at the same time) to the blog.
The issue has been fixed in Nginx v1.15.3, which added a keepalive_timeout directive for upstream connections:
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
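A minimal sketch of how the new directive could be used (assuming Nginx 1.15.3 or later; the 300 ms value is an assumption chosen for this post's 500 ms server timeout): keeping Nginx's upstream keepalive timeout strictly below the Node.js server's keepAliveTimeout means Nginx closes idle upstream connections first and never reuses a socket the upstream is about to reset.

```nginx
upstream nodejs {
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
  # Close idle upstream connections after 300 ms, i.e. before the Node.js
  # server's 500 ms keepAliveTimeout fires. Requires Nginx >= 1.15.3.
  keepalive_timeout 300ms;
}
```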
Wei, thanks for your blog post. We are running into the same scenario with Elastic Beanstalk. What would be the solution? To increase the keep-alive timeout on Node's side?
Increasing the keep-alive timeout would help; in my case, I never saw a connection reset error again after I increased the timeout. You could try 1 or 2 minutes; the default value is 5 seconds in Node.js 8.