Best fit for long-running task

Authors
  • Pursue

1 Foreword

When it comes to long-running tasks on AWS, the two most common choices are AWS Lambda and Amazon EC2. Lambda is popular for its low cost and ease of use, but each execution is limited to a maximum of 15 mins. EC2 offers much more flexibility and customization, but requires operational effort to maintain the infrastructure.

Generally, EC2 becomes the first choice once Lambda turns into a bottleneck. In this article, I'm going to explore the potential workarounds before we jump to EC2.

2 Start with Lambda

Imagine we have a task that needs to run for a few mins every day. It's hard to pass over the Lambda option because of the benefits it provides:

  • Supports most mainstream programming languages
  • Easy to scale and engineer
  • Serverless, so you only focus on your code logic
  • Pay as you go at a very low cost, with no charge for idle time
  • Supports both Zip and container image deployment

With these benefits, Lambda is a perfect fit for most scenarios at the beginning.

However, as your data volume grows, your daily task becomes more and more complex and takes longer to complete, until one day it exceeds the 15 mins execution time limit.

How can we solve this problem? Especially when the total time is only a little over the limit (e.g. 20 or 30 mins) and our code has been running perfectly on Lambda, are we willing to abandon everything we've built?

3 Jump to Recursive Lambda

Fortunately, Lambda supports recursive invocation, which means your Lambda function can call itself again and again until the task is eventually done.

To achieve this, the following steps are required:

  • Break down the task into smaller pieces
  • Each Lambda execution should handle only one piece of the task
  • The result of each execution can be passed to the next one as its input
  • The ending condition must be precise to ensure there is no infinite loop
  • Your Lambda IAM role needs permission to invoke the Lambda function itself

Recursive Lambda keeps your existing code and infrastructure the same, with only a little extra logic to split your task. It's a good choice when you don't want to change too much.
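The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the event fields `offset`, `total` and `depth`, the constants `CHUNK_SIZE` and `MAX_DEPTH`, and the `process_chunk` helper are all hypothetical names, and the `invoke` parameter exists only so the handler can be exercised locally without AWS. The sketch also assumes asynchronous invocation (`InvocationType="Event"`), so that one run does not spend its own 15 mins waiting for the next.

```python
import json

CHUNK_SIZE = 100   # hypothetical: items handled per invocation
MAX_DEPTH = 16     # hypothetical guard: never chain more invocations than this


def process_chunk(start, end):
    """Placeholder for the real work on items [start, end)."""
    return end - start


def handler(event, context, invoke=None):
    """Handle one piece of the task, then re-invoke this function if work remains."""
    offset = event.get("offset", 0)
    total = event["total"]
    depth = event.get("depth", 1)

    end = min(offset + CHUNK_SIZE, total)
    process_chunk(offset, end)

    # Ending condition: nothing is left to process.
    if end >= total:
        return {"done": True, "offset": end, "depth": depth}

    # Guard against an infinite loop or an unexpectedly long chain.
    if depth >= MAX_DEPTH:
        raise RuntimeError(f"stopped after {depth} invocations at offset {end}")

    next_event = {"offset": end, "total": total, "depth": depth + 1}
    if invoke is None:
        import boto3  # imported lazily so local runs do not need boto3

        def invoke(payload):
            boto3.client("lambda").invoke(
                FunctionName=context.function_name,
                InvocationType="Event",  # asynchronous: do not wait for the next run
                Payload=json.dumps(payload).encode("utf-8"),
            )

    # Pass the progress so far to the next execution as its input.
    invoke(next_event)
    return {"done": False, "offset": end, "depth": depth}
```

Because `offset` strictly increases on every hop and `depth` is capped, the chain cannot loop forever even if the ending condition is computed incorrectly.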

However, there is one critical limitation of recursive Lambda you should know before you decide to use it: the maximum number of recursive invocations is 16.

With that being said, your maximum task time is limited to 16 * 15 mins = 240 mins. However, we can't be sure when AWS will tighten this restriction again (there used to be no invocation limit for Recursive Lambda), so there is no guarantee you can make full use of the 240 mins. You'd better keep the number of recursive invocations as low as possible (well below 16).

Adopting recursive Lambda carries quite a mental burden, so how should we iterate from here ....
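As a quick sanity check on that arithmetic, the sketch below estimates how many invocations a task needs and whether it fits under the limit. The function names and the 12-minute budget in the last line are illustrative assumptions; only the 15 mins and 16 figures come from the text above.

```python
import math

LAMBDA_TIMEOUT_MIN = 15   # maximum duration of one Lambda invocation
MAX_CHAIN_LENGTH = 16     # recursive invocation limit quoted above


def invocations_needed(total_minutes, per_invocation_minutes=LAMBDA_TIMEOUT_MIN):
    """Number of invocations required if each one works for the given budget."""
    return math.ceil(total_minutes / per_invocation_minutes)


def fits_in_chain(total_minutes, per_invocation_minutes=LAMBDA_TIMEOUT_MIN):
    """True if the task can finish without exceeding the invocation limit."""
    return invocations_needed(total_minutes, per_invocation_minutes) <= MAX_CHAIN_LENGTH


print(MAX_CHAIN_LENGTH * LAMBDA_TIMEOUT_MIN)  # upper bound in minutes
print(invocations_needed(30))                 # a 30-minute task
print(fits_in_chain(300))                     # a 300-minute task
print(invocations_needed(30, 12))             # same task with a 12-minute budget per run
```

In practice each run should stop well before the 15 mins timeout, so the real budget per invocation is smaller and the invocation count is higher than the ideal figure.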

3 Adopt Step Functions

Recursive Lambda is a good direction: it extends the execution time by invoking the same Lambda function many times. If we could eliminate the upper limit on invocations, it might be a perfect option for most cases.

Step Functions fits this purpose. You can use it to orchestrate the whole process as follows:

  • Create your state machine to outline the steps involved in your workflow
  • Break down the task into smaller pieces
  • The result of each execution can be passed into the context of your workflow, and the next step can read it as input
  • Define a condition that determines when your task is finished. If it is met, the workflow is done; otherwise, the next step invokes your Lambda function again until the ending condition is met
  • Step Functions has a retry mechanism for errors (compare this with Recursive Lambda: when the function invokes itself synchronously, Lambda does not retry on error)
  • Your Step Functions IAM role needs permission to invoke the Lambda function

It looks pretty much the same as Recursive Lambda, but the key difference is this: instead of the Lambda function invoking itself under a limit, Step Functions invokes the Lambda function on behalf of the workflow, so you can invoke it "infinite" times as long as the task hasn't reached your ending condition.
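A state machine for this loop could look roughly like the sketch below, written as a Python dictionary in Amazon States Language and printed as JSON. It assumes the Lambda function returns an object with a boolean `done` field plus whatever progress the next run needs; the state names, the retry settings and the function ARN are hypothetical placeholders.

```python
import json

# Hypothetical ARN; replace it with your own function.
FUNCTION_ARN = "arn:aws:lambda:ap-southeast-2:123456789012:function:daily-task"

definition = {
    "Comment": "Invoke the same Lambda function until it reports done",
    "StartAt": "ProcessChunk",
    "States": {
        "ProcessChunk": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            # Pass the current workflow state to the function as its event.
            "Parameters": {"FunctionName": FUNCTION_ARN, "Payload.$": "$"},
            # Keep only the function's return value as the new workflow state.
            "OutputPath": "$.Payload",
            # Retry failed invocations with backoff (values are illustrative).
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "IsDone",
        },
        "IsDone": {
            "Type": "Choice",
            # Ending condition: stop when the function reports done == true.
            "Choices": [
                {"Variable": "$.done", "BooleanEquals": True, "Next": "Finished"}
            ],
            # Otherwise loop back and invoke the function again.
            "Default": "ProcessChunk",
        },
        "Finished": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document you would supply as the state machine definition, for example through the `definition` argument of boto3's `create_state_machine` together with a `name` and a `roleArn`.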

The diagram below illustrates how Step Functions works in this scenario:

Step Functions has been a great solution so far for long-running tasks, but some of its cons are hidden behind the scenes and worth mentioning:

  • It doesn't fundamentally solve the problem of Lambda's 15 mins execution time limit. It requires you to break down your task and manage the pieces within a workflow. As a result, a very time-consuming task still takes a long time to complete the whole workflow, i.e. depending on how many 15-mins pieces you split out.
  • Step Functions mainly focuses on orchestrating your tasks. It is especially powerful for workflow management across different AWS services, so it might be overkill if you only need to extend the Lambda execution time.

4 Re-engineer towards ECS Fargate (task standalone)

Now suppose our original task is running exceedingly slowly. We can increase the Lambda memory to speed up the execution, but because of the 15 mins execution time limit, the number of child tasks is still high, so the whole task takes a long time to complete.

In this case, we have to move our task to ECS. More specifically, ECS with the Fargate launch type running a standalone task suits this scenario. You can achieve the same goal by following the steps below:

  • Create an ECS cluster and use the Fargate launch type, as it is serverless and you don't need to manage the underlying infrastructure
  • Create a Fargate task definition based on the same container image as your task
  • Run it as a standalone task, without the need for a service
  • Schedule your task to run daily with the built-in scheduled tasks feature in ECS

Note: Fargate Spot is recommended for cost saving if your task can tolerate interruption

By doing so, you can still schedule many task instances to run concurrently, but the execution time is no longer limited.

The ECS option is very close to the EC2 direction, except that it is serverless and requires only minor operational effort. It is a good choice when you need to run a long-running task at high frequency.
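Launching one standalone task on demand could be sketched as below. The helper names are hypothetical; the keys in the returned dictionary are the arguments that boto3's ECS `run_task` accepts. The sketch assumes the cluster and task definition already exist, and that the `FARGATE_SPOT` capacity provider is associated with the cluster when `use_spot` is set.

```python
def build_run_task_args(cluster, task_definition, subnets, security_groups,
                        use_spot=False, count=1):
    """Build the keyword arguments for an ECS run_task call on Fargate."""
    args = {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": count,  # number of task instances to start concurrently
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                # Assumes the subnets can reach the image registry privately.
                "assignPublicIp": "DISABLED",
            }
        },
    }
    if use_spot:
        # A capacity provider strategy replaces launchType; do not send both.
        args["capacityProviderStrategy"] = [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1}
        ]
    else:
        args["launchType"] = "FARGATE"
    return args


def run_standalone_task(ecs_client, **kwargs):
    """Start the task(s) and return the ARNs that ECS reports."""
    response = ecs_client.run_task(**build_run_task_args(**kwargs))
    return [task["taskArn"] for task in response["tasks"]]
```

With real credentials, `ecs_client` would be `boto3.client("ecs")`. Keeping the argument building separate from the call makes the Spot and non-Spot variants easy to compare and test without touching AWS.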

5 Go with AWS Batch

AWS Batch is the last option I'd like to introduce. It can leverage ECS Fargate under the hood, but the subtle difference is that AWS Batch is more suitable for batch processing tasks. Let's imagine the scenario again:

You have a large number of tasks to run concurrently (say 100 tasks). Every 10 tasks form a group that needs to complete or fail together, and the groups need to be scheduled to run one after another.

AWS Batch can achieve this with the steps below:

  • Define the max vCPU and memory available to all your tasks as a resource pool
  • Create a job queue to manage job submission and scheduling
  • Create a job definition to define the job's requirements (similar to an ECS task definition)
  • Submit a job to the queue via the ECS Fargate option as an array job of size 10; the job stays pending if there are not enough resources to run it
  • Supposing the submitted job ID is job-1, once it starts you can see in the ECS cluster that 10 tasks are running concurrently
  • While job-1 is running, you can submit another job, job-2, to the queue in the same way, but mark job-1 as a dependency of job-2 so that job-2 only starts when job-1 is done
  • Repeat the above steps until all your tasks are done

As you can see, AWS Batch is more about batch job management and scheduling. It is a good choice when you have a large number of tasks to run concurrently and need to manage job dependencies, which is much more difficult to accomplish with ECS Fargate alone.
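The submit-and-chain steps above could be sketched like this. The helper name and the group names are hypothetical; `submit_job`, `arrayProperties` and `dependsOn` are the boto3 Batch client call and parameters, and the sketch assumes the job queue and job definition already exist.

```python
def submit_job_groups(batch_client, group_names, job_queue, job_definition,
                      array_size=10):
    """Submit one array job per group, each depending on the previous group."""
    job_ids = []
    previous_job_id = None
    for name in group_names:
        request = {
            "jobName": name,
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # One array job fans out into array_size child tasks.
            "arrayProperties": {"size": array_size},
        }
        if previous_job_id is not None:
            # Start this group only after the previous array job has finished.
            request["dependsOn"] = [{"jobId": previous_job_id}]
        response = batch_client.submit_job(**request)
        previous_job_id = response["jobId"]
        job_ids.append(previous_job_id)
    return job_ids
```

With real credentials, `batch_client` would be `boto3.client("batch")`, and ten group names would cover the 100 tasks from the scenario. All groups can be submitted up front, since the dependencies alone keep them running one after another.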

6 Summary

In this article, I've introduced 4 options to consider before jumping to EC2 when Lambda can't meet your requirements. Each of them has its pros and cons, and you can choose the best fit based on your specific scenario:

  • Recursive Lambda
    • Pros: Keep your existing code and infrastructure the same
    • Cons: Limited to 16 recursive invocations
  • Step Functions
    • Pros: No invocation limit
    • Cons: Overkill for extending Lambda execution time
  • ECS Fargate (task standalone)
    • Pros: Serverless and no need to manage the underlying infrastructure
    • Cons: Minor operational effort
  • AWS Batch
    • Pros: Batch job management and scheduling
    • Cons: A bit more operational effort

7 Cost Comparison

Given the following task requirements:

Region    Memory    Duration    Arch    Task count
Sydney    8 GB      15 mins     ARM     1

  • AWS Lambda

Cost: ~0.096 USD per run (about 2.92 USD per month at one run per day)

Price calculation:

0.0001067 USD per second (8 GB) x 900 seconds = 0.096 USD
  • AWS ECS Fargate

Cost: Non-spot ~0.56 USD per month at one run per day (about 0.018 USD per run), Spot is cheaper

Price calculation (Non-spot, monthly):

30.42 tasks x 1 vCPU x 0.25 hours x 0.03885 USD per hour = 0.30 USD for vCPU hours
30.42 tasks x 8.00 GB x 0.25 hours x 0.00426 USD per GB per hour = 0.26 USD for GB hours
20 GB - 20 GB (no additional charge) = 0.00 GB billable ephemeral storage per task
0.30 USD for vCPU hours + 0.26 USD for GB hours = 0.56 USD total
Fargate cost (monthly): 0.56 USD
  • AWS Batch

There is no additional charge for AWS Batch itself. You pay for the resources it uses behind the scenes, e.g. the price is the same as ECS Fargate if the job runs as an ECS Fargate task.
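Note that the Lambda calculation above is for a single run, while the Fargate calculation multiplies by 30.42 runs (one run per day, averaged over a month). The sketch below puts both on the same basis using only the unit prices quoted above; the variable names are illustrative, and the prices should be re-checked against the current AWS pricing pages.

```python
RUNS_PER_MONTH = 30.42        # one run per day, averaged over a month
DURATION_SECONDS = 15 * 60    # 15 mins per run
MEMORY_GB = 8
VCPU = 1

# Unit prices as quoted in the calculations above (Sydney, ARM).
LAMBDA_USD_PER_SECOND_8GB = 0.0001067
FARGATE_USD_PER_VCPU_HOUR = 0.03885
FARGATE_USD_PER_GB_HOUR = 0.00426

# Cost of one 15-minute run on each service.
lambda_per_run = LAMBDA_USD_PER_SECOND_8GB * DURATION_SECONDS
hours = DURATION_SECONDS / 3600
fargate_per_run = hours * (
    VCPU * FARGATE_USD_PER_VCPU_HOUR + MEMORY_GB * FARGATE_USD_PER_GB_HOUR
)

# Scale both to a month of daily runs.
lambda_per_month = lambda_per_run * RUNS_PER_MONTH
fargate_per_month = fargate_per_run * RUNS_PER_MONTH

print(f"Lambda:  {lambda_per_run:.4f} USD per run, {lambda_per_month:.2f} USD per month")
print(f"Fargate: {fargate_per_run:.4f} USD per run, {fargate_per_month:.2f} USD per month")
```

The unrounded Fargate monthly total comes out slightly below 0.56 USD, because the calculation above rounds the vCPU and memory components to two decimals before adding them.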
