Troubleshooting of blocked requests when fetching messages from hundreds SQS queues

I’m working on a project which needs to fetch messages from hundreds of SQS queues. We’re using SQS long polling to reduce the number of empty responses. It was very quick to get response at first when there are only dozen queues. As we added more and more queues, the performance getting worse and worse. It takes 60 seconds to get the response when there’s 300 queues and WaitTimeSeconds set to 10 seconds.

We are using Node.js in single thread mode, and I believe that it could handle 10 thousands connections without any problem because most of the tasks are IO processing. We also created an AWS support case, but nobody clearly answered the question.

Using AWS SDK to reproduce the issue

I start to troubleshoot the problem, the first step is reproduce the issue using a simple code which makes it easier to find out the issue.

const AWS = require("aws-sdk");
const Promise = require("bluebird");
const DynamoDB = require('./lib/dynamodb');
const Benchmark = require('./lib/benchmark');
const config = require("./config/application");

const Logger = require("./lib/logger");
const logger = new Logger();

AWS.config.setPromisesDependency(Promise);

const dynamoDB = new DynamoDB(config, logger);

class SDKWorker {
  constructor(index) {
    this.index = index;
    this.sqs = new AWS.SQS({ region: 'ap-southeast-2' });
  }

  start(subscription) {
    return this.sqs.getQueueUrl({ QueueName: subscription.queueName }).promise()
      .then(result => result.QueueUrl)
      .tap(queueUrl => logger.info(`Starting polling queue: ${queueUrl}`))
      .tap(queueUrl => this.poll(queueUrl));
  }

  poll(queueUrl) {
    const benchmark = new Benchmark('Received response');
    benchmark.start();

    return this.sqs.receiveMessage({
      AttributeNames: ['All'],
      MessageAttributeNames: ['All'],
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      VisibilityTimeout: 60,
      WaitTimeSeconds: 10 // How long to wait for a message in seconds.
    }).promise()
      .tap(() => logger.info(benchmark.stopAndReport(), { worker: this.index }))
      .then(() => this.poll(queueUrl));
  }
}

dynamoDB.loadAll('subscriptions-sit')
  .map((subscription, index) => new SDKWorker(index).start(subscription));

For the above test, the average response time is 60 seconds which reproduced the issue in our real environment.

Using Https endpoint to query messages

I started to query SQS https endpoint directly, using aws4 to sign the requests and using the xml node parser from aws sdk to parse the xml responses.

const aws4 = require('aws4');
const request = Promise.promisify(require('request'));
const parseAWS = require('aws-sdk/lib/xml/node_parser');

class HttpsWorker {
  constructor(index) {
    this.index = index;
    this.sqs = new AWS.SQS({ region: 'ap-southeast-2' });
  }

  start(subscription) {
    return this.sqs.getQueueUrl({ QueueName: subscription.queueName }).promise()
      .then(result => result.QueueUrl)
      .tap(queueUrl => logger.info(`Starting polling queue: ${queueUrl}`))
      .tap(queueUrl => this.pollHttps(queueUrl, subscription.queueName));
  }

  pollHttps(queueUrl, queueName) {
    const benchmark = new Benchmark('Received response');
    benchmark.start();

    return this.receiveMessage(queueUrl, queueName)
      .tap(() => logger.info(benchmark.stopAndReport(), { worker: this.index }))
      .then(() => this.pollHttps(queueUrl, queueName));
  }

  receiveMessage(queueUrl, queueName) {
    const awsOption = {
      host: 'sqs.ap-southeast-2.amazonaws.com',
      path: `/${accountId}/${queueName}/?Action=ReceiveMessage&AttributeName.1=All&MaxNumberOfMessages=3&MessageAttributeName.1=All&QueueUrl=${queueUrl}&Version=2012-11-05&VisibilityTimeout=60&WaitTimeSeconds=10`,
    };

    awsOption.url = `https://${awsOption.host}${awsOption.path}`;
    const signedOptions = aws4.sign(awsOption, {
      accessKeyId: 'accessKeyId',
      secretAccessKey: 'secretAccessKey'
    });

    return request(signedOptions)
      .then(response => new parseAWS().parse(response.body));
  }
}

dynamoDB.loadAll('subscriptions-sit')
  .map((subscription, index) => new HttpsWorker(index).start(subscription));

This time everything works as expected, it takes about 10 – 10.100 seconds to get the responses which is reasonable.

AWS SDK settings for max connections

Now I believe the problem happens in AWS SDK. I started to analyse the polling logs, then found an interesting pattern:
The response time keep increasing until it reaches 60 seconds. At the beginning, some requests get response in 10 seconds, and later, some other requests get response in 20 seconds, then each time after 10 seconds, some requests get response until all the requests start to get response in 60 seconds.

It’s an interesting pattern because it seems that requests are processed in batch and there’s a batch size. Then I googled ‘aws sdk javascript max concurrency’ and found that there’s a maximum number of connections per origin which default is 50.

Things are clear now. The max connection is 50, for the 300 queues, they need to be processed in 6 batches, and each request takes 10 seconds to finish (let’s assume all of them are empty), then it would takes 6 * 10 seconds to finish all the requests.

Final Solution

Based on the documentation, I changed the maxSockets to 10000, then by using AWS sdk, it takes 10 seconds to get the response as we expected.

const agent = new https.Agent({
  maxSockets: 10000
});

AWS.config.update({
  httpOptions:{
    agent: agent
  }
});

In conclusion, if you need more than 50 concurrency requests accessing AWS resources, then you need to change the maxSockets option, else the requests will be queued and be processed slowly.

Leave a Reply

Your email address will not be published.