What? Connection reset by peer?
We run Node.js web services behind an AWS Classic Load Balancer. I noticed many 502 errors after I migrated from the Classic Load Balancer to an Application Load Balancer. To understand what happened, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.
Here are some example logs:
2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
Analyzing the errors
The number of errors increased after I migrated from the Classic LB to the Application LB. One difference between them is that the Classic LB uses pre-established connections, while the Application LB only uses the HTTP/1.1 keep-alive feature.
From the AWS Load Balancer documentation:
Possible causes:
- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
- The target response is malformed or contains HTTP headers that are not valid.
- A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.
Because there were no errors on the Node.js side, my guess was the keep-alive behaviour. There is no upstream keep-alive timeout setting in the current Nginx version (1.13.6), so I tried Tengine, a Taobao fork of Nginx that supports an upstream keepalive timeout. After running it for a couple of days, the errors were gone. To understand what was causing the issue, I tried to reproduce it on my local machine.
Capturing network packets
To send a request at the exact moment Node.js closes the connection after its keep-alive timeout, I needed to keep requesting the URL until the issue reproduced. Here are the settings for my test environment:
Upstream (Node.js server):
- Set keep-alive timeout to 500 ms
Test client:
- Keep sending requests with an interval
- Interval starts at 500 ms and decreases by 1 ms after each request
For normal requests, the upstream sends a [FIN, ACK] to Nginx after the keep-alive timeout (500 ms), Nginx sends a [FIN, ACK] back, and the upstream then sends an [ACK] to close the connection completely.
No.  Time                        Source      Destination  Protocol  Length  Info
1    2017-11-12 17:11:04.299146  172.18.0.3  172.18.0.2   TCP       74      48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
2    2017-11-12 17:11:04.299171  172.18.0.2  172.18.0.3   TCP       74      8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
3    2017-11-12 17:11:04.299194  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
4    2017-11-12 17:11:04.299259  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
5    2017-11-12 17:11:04.299267  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
6    2017-11-12 17:11:04.299809  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
7    2017-11-12 17:11:04.299852  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
8    2017-11-12 17:11:04.800805  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
9    2017-11-12 17:11:04.801120  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
10   2017-11-12 17:11:04.801151  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355
For the failed requests, the upstream closed the connection after the keep-alive timeout (500 ms), but the client sent a new HTTP request before it received and processed the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream sent a [RST] in response. This can happen in the following scenarios:
- Upstream hasn't sent the [FIN] packet yet (the packet is pending at the network layer)
- Upstream has sent the [FIN] packet, but the client hasn't received it yet
- The client has received the [FIN] packet, but hasn't processed it yet
Example of the first scenario ([FIN] not sent yet):
No.  Time                        Source      Destination  Protocol  Length  Info
433  2017-11-12 17:11:26.548449  172.18.0.3  172.18.0.2   TCP       74      48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
434  2017-11-12 17:11:26.548476  172.18.0.2  172.18.0.3   TCP       74      8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
435  2017-11-12 17:11:26.548502  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
436  2017-11-12 17:11:26.548609  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
437  2017-11-12 17:11:26.548618  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
438  2017-11-12 17:11:26.549173  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
439  2017-11-12 17:11:26.549230  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
440  2017-11-12 17:11:27.049668  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
441  2017-11-12 17:11:27.050324  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
442  2017-11-12 17:11:27.050378  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
443  2017-11-12 17:11:27.551182  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
444  2017-11-12 17:11:27.551294  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
Example of the second scenario ([FIN] sent at the same time as a new request arrives):
No.  Time                        Source     Destination  Protocol  Length  Info
13   2018-06-15 21:40:00.522110  127.0.0.1  127.0.0.1    TCP       68      50678 > 8000 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=0 SACK_PERM=1
14   2018-06-15 21:40:00.522210  127.0.0.1  127.0.0.1    TCP       68      8000 > 50678 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=1503957438 SACK_PERM=1
15   2018-06-15 21:40:00.522219  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
16   2018-06-15 21:40:00.522228  127.0.0.1  127.0.0.1    TCP       56      [TCP Window Update] 8000 > 50678 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
17   2018-06-15 21:40:00.522315  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
18   2018-06-15 21:40:00.522358  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [ACK] Seq=1 Ack=134 Win=408160 Len=0 TSval=1503957438 TSecr=1503957438
19   2018-06-15 21:40:00.522727  127.0.0.1  127.0.0.1    HTTP      261     HTTP/1.1 200 OK (text/html)
20   2018-06-15 21:40:00.522773  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=134 Ack=206 Win=408064 Len=0 TSval=1503957438 TSecr=1503957438
21   2018-06-15 21:40:01.025685  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
22   2018-06-15 21:40:01.025687  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [FIN, ACK] Seq=206 Ack=134 Win=408160 Len=0 TSval=1503957939 TSecr=1503957438
23   2018-06-15 21:40:01.025748  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=206 Win=0 Len=0
24   2018-06-15 21:40:01.025760  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=267 Ack=207 Win=408064 Len=0 TSval=1503957939 TSecr=1503957939
25   2018-06-15 21:40:01.025769  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=207 Win=0 Len=0
When the client receives the [RST] packet, it logs a 'Connection reset' error.
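On the Node.js side the [RST] surfaces as an error with code ECONNRESET. A small sketch (the helper is hypothetical, not from the original code) of treating it as retryable, which is safe for idempotent requests like the health check above:

```javascript
// Hypothetical helper: decide whether a failed request is worth retrying.
// A connection reset on an idempotent GET is a classic retryable failure,
// since the request may never have reached the application at all.
const isRetryable = (err) => err.code === 'ECONNRESET';

const err = new Error('read ECONNRESET');
err.code = 'ECONNRESET';
console.log(isRetryable(err)); // prints true
```

Retrying only masks the symptom, though; the sections below look for the root cause.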
Testing Code
This is a generic issue that occurs when the server closes a connection while HTTP keep-alive is enabled, so you can easily reproduce it by cloning the example code (Node.js) from https://github.com/weixu365/test-connection-reset
npm install
npm start

# In a separate terminal
npm run client
- Upstream Node.js Web Server
const express = require('express');

const app = express();
app.get('/', (req, res) => res.send('OK'));

const port = process.env.PORT || 8000;
const server = app.listen(port, () => {
  console.log(`Listening on http://localhost:${port}`);
});
server.keepAliveTimeout = 500; // Close idle keep-alive connections after 500 ms
- Test client in Node.js
const axios = require('axios');
const Bluebird = require('bluebird');
const HttpAgent = require('agentkeepalive');

const keepAliveOption = {
  freeSocketKeepAliveTimeout: 30 * 1000, // Should be less than server keep-alive timeout
  socketActiveTTL: 50 * 1000, // Should be less than DNS TTL
};
const httpAgent = new HttpAgent(keepAliveOption);

const host = 'http://localhost:8000';
const path = '/';
const httpClient = axios.create({
  baseURL: host,
  timeout: 5000,
});

const sendRequest = () =>
  httpClient.request({ url: path, httpAgent })
    .then(res => console.log('Received response', res.status))
    .catch(e => console.error('Error occurred', e.message));

// Start at a 501 ms interval and shrink it by 1 ms per request,
// sweeping across the server's 500 ms keep-alive timeout.
let delay = 501;
const start = () =>
  sendRequest()
    .then(() => delay -= 1)
    .then(() => delay > 450 ? Bluebird.delay(delay).then(start) : 'Done');

start();
- Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
- Nginx config (Optional)
upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
}

server_tokens off;

log_format detailed escape=json '{'
  '"timestamp": "$time_iso8601",'
  '"remote_addr": "$remote_addr",'
  '"upstream_addr": "$upstream_addr",'
  '"connection": "$connection",'
  '"connection_requests": "$connection_requests",'
  '"request_time": "$request_time",'
  '"upstream_response_time": "$upstream_response_time",'
  '"status": "$status",'
  '"upstream_status": "$upstream_status",'
  '"body_bytes_sent": "$body_bytes_sent",'
  '"request": "$request",'
  '"http_user_agent": "$http_user_agent"'
'}';

access_log /var/log/nginx/access.log detailed;

server {
  listen 80;
  server_name localhost;

  location / {
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_set_header Connection "";
    proxy_pass http://nodejs;
  }
}
For the failed requests, the upstream received a new HTTP request after it had closed the connection at the keep-alive timeout (500 ms) but before it had a chance to send the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream sent a [RST] in response.
I ran into the same problem. Could you help explain that sentence a bit more? Why "hasn't got a chance to send the [FIN] package"?
Isn't the [FIN] packet always sent out?
Hi, I updated the blog to explain the scenarios in which this happens. Basically, the issue occurs after the server has closed the connection and before the client receives and processes the [FIN] packet.
One question: if the server actually did send a [FIN] packet, is it possible that packet-capture tools like Wireshark or tcpdump simply failed to capture it?
To add to that, my understanding is that the server actively closed the TCP connection, so it no longer sends valid packets back to the client-side Nginx.
The server will resend the [FIN] if it doesn't receive an ACK, see https://tools.ietf.org/html/rfc793
But it is still possible that the client side cannot receive the [FIN] packet, because the network is not reliable (that's also why TCP needs ACK packets).
In my case, the only problem is that the server closes the connection while the client sends a new request before it has handled the [FIN] packet. My solution is to close the connection on the client side.
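To make "close on the client side" concrete, here is an illustrative sketch (the numbers are assumptions, not from the original setup): the client's free-socket timeout must be strictly smaller than the server's keep-alive timeout, so the client always gives up an idle socket before the server resets it. `freeSocketKeepAliveTimeout` is the agentkeepalive option already used in the test client above.

```javascript
// Illustrative values: the invariant is the ordering, not the exact numbers.
const serverKeepAliveMs = 5000;  // Node 8 default keepAliveTimeout
const clientFreeSocketMs = 4000; // client closes idle sockets first

// Option object for agentkeepalive's HttpAgent, as in the test client above.
const keepAliveOption = { freeSocketKeepAliveTimeout: clientFreeSocketMs };

// The client must win the race against the server's [FIN]/[RST].
console.log(keepAliveOption.freeSocketKeepAliveTimeout < serverKeepAliveMs); // prints true
```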
The client cannot close the connection at the exact moment the server sends the [FIN], because it takes time to:
* send the packet on the server side
* deliver the packet through the network
* handle the packet on the client side
444 2017-11-12 17:11:27.551294 172.18.0.2 172.18.0.3 TCP 66 8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
In the capture above that contains the [RST] packet, why doesn't the capture show a [FIN] packet sent from Node to Nginx? This confuses me and I haven't figured it out yet.
From the captured packets, my understanding is that the server received a new request after it had decided to close the connection but before the [FIN] packet was sent out; that's why there is no [FIN] packet.
Instead of a [FIN] packet, there is a [RST] packet for the new request, and no [ACK] for the [RST] packet. See:
https://en.wikipedia.org/wiki/TCP_reset_attack
https://stackoverflow.com/questions/7735618/tcp-rst-packet-details
I added an example of the second scenario (sending [FIN] and receiving a request at the same time) to the blog.
The issue has been fixed in Nginx v1.15.3, which added a keepalive_timeout directive for upstream connections:
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
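A minimal sketch of how the new directive could be used (assuming Nginx 1.15.3 or later; the 300 ms value is an assumption chosen for this post's 500 ms server timeout): keeping Nginx's upstream keepalive timeout strictly below the Node.js server's keepAliveTimeout means Nginx closes idle upstream connections first and never reuses a socket the upstream is about to reset.

```nginx
upstream nodejs {
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
  # Close idle upstream connections after 300 ms, i.e. before the Node.js
  # server's 500 ms keepAliveTimeout fires. Requires Nginx >= 1.15.3.
  keepalive_timeout 300ms;
}
```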
Wei, thanks for your blog post. We are running into the same scenario with Elastic Beanstalk. What would be the solution? To increase the keep-alive timeout on Node's side?
Increasing the keep-alive timeout would help; in my case, I never saw a connection reset error again after I increased the timeout. You could try 1 or 2 minutes; the default value is 5 seconds in Node.js 8.