Nginx

Metrics Driven Development – What I did to reduce AWS EC2 costs to 27% and improve 25% in latency

Recently, I did some work on auto-scaling and performance tuning. As a result, costs dropped to 27% of their previous level and service latency improved by 25%.

Figure: Overall instance count and service latency change

Takeaways

React server-side rendering performs poorly under Node.js Cluster; consider using a reverse proxy such as Nginx instead
React 16 server-side rendering is much faster than React 15, by 40% in our case
Use smaller instances to get finer scaling granularity if possible, e.g. switch from c4.2xlarge to c4.large
AWS t2.large is 3 times slower than c4.large at React server-side rendering
AWS Lambda is 3 times slower than c4.large at React server-side rendering
There is a race condition in the Nginx HTTP upstream keepalive module that produces 502 Bad Gateway errors (104: Connection reset by peer)
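
To make the scaling-granularity takeaway concrete, here is a small sketch of the arithmetic. It assumes that capacity scales linearly with vCPU count (c4.2xlarge has 8 vCPUs, c4.large has 2) and uses a made-up demand of 26 vCPUs; the function names are illustrative only.

```javascript
// Smallest number of instances whose combined vCPUs cover the demand.
function instancesNeeded(requiredVcpus, vcpusPerInstance) {
  return Math.ceil(requiredVcpus / vcpusPerInstance);
}

// vCPUs that are paid for but not needed at the current demand.
function idleVcpus(requiredVcpus, vcpusPerInstance) {
  const provisioned = instancesNeeded(requiredVcpus, vcpusPerInstance) * vcpusPerInstance;
  return provisioned - requiredVcpus;
}

// Hypothetical demand, chosen only to illustrate the rounding effect.
const requiredVcpus = 26;

for (const [name, vcpus] of [['c4.2xlarge', 8], ['c4.large', 2]]) {
  const count = instancesNeeded(requiredVcpus, vcpus);
  const idle = idleVcpus(requiredVcpus, vcpus);
  console.log(`${name}: ${count} instances, ${idle} idle vCPUs`);
}
```

With larger instances, every scaling step adds or removes 8 vCPUs at once, so the fleet tends to overshoot the demand by more than it does with 2-vCPU steps.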

Background

Here’s the background of the service before optimization:

Serving 6000 requests per minute
Using AWS Classic Load Balancer
Running 25 c3.2xlarge EC2 instances, each with an 8-core CPU
Using PM2 as the Process Manager and the Cluster Manager
Written in Node.js, using React 15 server-side rendering

Analyze ‘Connection reset’ error in Nginx upstream with keep-alive enabled

What? Connection reset by peer?

We run Node.js web services behind an AWS Classic Load Balancer. After migrating from the Classic Load Balancer to an Application Load Balancer, I noticed many 502 errors. To understand what was happening, I put Nginx in front of the Node.js web server and found more than 100 ‘connection reset’ errors every day in the Nginx logs.

Here are some example logs:

2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"

Analyzing the errors

The number of errors increased after I migrated from the Classic LB to the Application LB. One difference between the two is that the Classic LB uses pre-opened connections, whereas the Application LB relies only on HTTP/1.1 keep-alive.

From the documentation of AWS Load Balancer:

Possible causes:

The load balancer received a TCP RST from the target when attempting to establish a connection.

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.

The target response is malformed or contains HTTP headers that are not valid.

A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.
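
The second cause in the list above matches a keep-alive race: the target closes an idle connection at the same moment the proxy reuses it for a new request. One commonly used mitigation on the Node.js side is to keep idle connections open longer than the proxy does, so that the proxy is always the side that closes them. The sketch below is an illustration under assumptions, not necessarily the fix this article arrives at: it assumes the proxy's idle timeout is 60 seconds (the Application Load Balancer default) and adds an arbitrary 5-second margin. `createUpstreamServer` is a hypothetical helper name; `server.keepAliveTimeout` is the standard Node.js `http.Server` property, which defaults to 5 seconds.

```javascript
const http = require('http');

// Assumption: the proxy in front (ALB or Nginx) drops idle upstream
// connections after 60 seconds.
const PROXY_IDLE_TIMEOUT_MS = 60 * 1000;

// Hypothetical helper: builds a server whose idle keep-alive sockets
// outlive the proxy's idle timeout by a safety margin.
function createUpstreamServer(proxyIdleTimeoutMs, marginMs = 5000) {
  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok\n');
  });

  // With the default of 5 seconds, Node.js may close a socket that the
  // proxy still considers reusable, which the proxy then reports as a 502.
  server.keepAliveTimeout = proxyIdleTimeoutMs + marginMs;
  return server;
}

const server = createUpstreamServer(PROXY_IDLE_TIMEOUT_MS);

// Port 8000 is taken from the upstream address in the logs above.
// Uncomment to actually serve requests:
// server.listen(8000);
```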


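An Analysis of DNS Resolution in Nginx

How does Nginx resolve domain names? How can you use the functions Nginx provides to resolve domain names in a module you write yourself? And how is it implemented internally?

Using Nginx 1.5.1 as an example, this article starts from how the nginx_mail_smtp module resolves domain names and analyzes how Nginx performs DNS resolution. To keep the flow simple and highlight the key points, the sample code omits some error handling, such as memory allocation failures. There are two kinds of DNS queries: looking up an address by domain name, and looking up a domain name by address. The two are very similar in code structure, so only the name-to-address lookup is covered here. The article covers the following topics:

The function interface for domain name queries
Analysis of the domain name resolution flow
Query scenarios and their implementation

1. The function interface for domain name queries

With synchronous I/O, calling gethostbyname() or gethostbyname_r() is enough to look up the IP address for a domain name. However, because the lookup may involve a remote query over the network, it can take a long time.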
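
The alternative to a blocking call is a callback-based lookup: the caller hands over a name and a handler, returns immediately, and the handler runs once the answer arrives. The sketch below shows that pattern in Node.js rather than in Nginx's C, and it is not Nginx's resolver API. `resolveName`, the `RECORDS` table, the name `example.test`, and the address `192.0.2.10` are all hypothetical, and `setImmediate` stands in for the network round trip.

```javascript
// Hypothetical in-memory stand-in for a DNS server.
const RECORDS = new Map([['example.test', '192.0.2.10']]);

// Callback-style lookup: returns immediately and invokes
// handler(err, address) later, once the "response" is available.
function resolveName(name, handler) {
  setImmediate(() => {
    const address = RECORDS.get(name);
    if (address === undefined) {
      handler(new Error(`cannot resolve ${name}`), null);
    } else {
      handler(null, address);
    }
  });
}

// Records the order in which things happen.
const events = [];

resolveName('example.test', (err, address) => {
  events.push(err ? `failed: ${err.message}` : `resolved: ${address}`);
});

// This line runs before the handler above: the caller was not blocked
// while the lookup was in flight.
events.push('caller continues');
```

The caller's own work is recorded first and the lookup result second, which is the property an event-driven server needs in order to keep serving other connections during a slow DNS query.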

To avoid blocking the current thread, Nginx performs domain name lookups asynchronously. The whole lookup process consists of three main steps.