In my recent exploration of GPU-based machine learning tasks on AWS, I delved into two prominent services: AWS Batch and Amazon SageMaker. While both services serve distinct purposes, I found it intriguing that they can address similar challenges, such as creating an API endpoint that processes user data through a GPU model and returns results. So, which service is best for our needs? Let's examine the strengths of each to gain a clearer perspective for our future decisions.
When I first started exploring AWS Batch, I came to see it as a robust worker queue system. For those familiar with Rails, you can liken it to Sidekiq or Solid Queue. The primary advantage of AWS Batch lies in its scalability: you select the appropriate EC2 compute instance and configure the necessary resources for each job to run effectively, and AWS automatically manages the scaling of these EC2 instances through a cluster as requests are received.
This can be achieved using a Dockerfile that executes a task, with the built image pushed to Amazon ECR.
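As an example, a minimal Dockerfile might look like the sketch below; the base image, dependencies, and file layout are assumptions for illustration, not the original setup.

```dockerfile
# Hypothetical GPU worker image; base image and paths are placeholders.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/ src/

# Default command; the Batch job definition can override or parameterize it.
CMD ["python3", "src/main.py"]
```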
TL;DR If you can Dockerize it, you can batch it.
When defining the Job for our Batch process, we can specify a `CMD` to execute against the Docker image.
CMD ["python3", "src/main.py", "--video_url", "Ref::video_url", "--callback_url", "Ref::callback_url"]
The `Ref::` notation allows us to pass parameters when initiating a Job Request.
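Those parameters are declared in the Batch job definition itself. Here's a hedged sketch of what that definition could look like; the image URI and resource values below are placeholders, not the actual configuration.

```json
{
  "jobDefinitionName": "yolo-dog-detection-gpu-job",
  "type": "container",
  "parameters": {
    "video_url": "",
    "callback_url": ""
  },
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/yolo-dog-detection:latest",
    "command": ["python3", "src/main.py", "--video_url", "Ref::video_url", "--callback_url", "Ref::callback_url"],
    "resourceRequirements": [
      { "type": "GPU", "value": "1" },
      { "type": "VCPU", "value": "4" },
      { "type": "MEMORY", "value": "16384" }
    ]
  }
}
```

With a definition like this registered, submitting a job with concrete values looks like: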
```bash
aws batch submit-job \
  --job-name yolo-dog-detection \
  --job-queue YoloDogDetectionGPUQueue \
  --job-definition yolo-dog-detection-gpu-job \
  --parameters video_url="SIGNED_AWS_URL or PUBLIC_HOSTED_URL",callback_url="https://d31e23cab59b.ngrok.app"
```
Once the job is submitted, it enters the queue, and AWS automatically scales the resources as required, allowing users to start processing their tasks. This setup can accommodate ANY style of task; in our comparison, we utilized a YOLO model running on an EC2 instance equipped with a GPU.
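For a concrete picture of the worker itself, here's a minimal sketch of what src/main.py could look like; the ultralytics package, the pretrained weights, and the callback payload shape are all assumptions, not details from the original project.

```python
# src/main.py -- hypothetical worker entry point (illustrative sketch)
import argparse

import requests               # assumption: used to notify the callback URL
from ultralytics import YOLO  # assumption: an off-the-shelf YOLO implementation


def main() -> None:
    parser = argparse.ArgumentParser(description="Detect dogs in a video")
    parser.add_argument("--video_url", required=True)
    parser.add_argument("--callback_url", required=True)
    args = parser.parse_args()

    # Load pretrained weights and stream inference over the video URL.
    model = YOLO("yolov8n.pt")
    results = model(args.video_url, stream=True)

    # Count frames containing a dog (COCO class 16) as placeholder post-processing.
    dog_frames = sum(
        1 for r in results if any(int(box.cls) == 16 for box in r.boxes)
    )

    # Report back to the caller once the job finishes.
    requests.post(args.callback_url, json={"dog_frames": dog_frames}, timeout=30)


if __name__ == "__main__":
    main()
```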
It's important to note that AWS Batch does not come with built-in ML tools; it operates as a "bring your own image/model" worker queue. This approach does introduce some overhead related to DevOps and setup, as users need to build and manage their models or tasks independently of Batch. However, for many users, this flexibility is a significant advantage. If you seek complete control over your environment, you can dockerize your requirements and allow Batch to manage the scaling effectively.
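To tie this back to the original goal of an API endpoint that kicks off GPU work, here's a minimal sketch of how a backend could enqueue jobs with boto3; it assumes the queue and job definition names from the CLI example above.

```python
import boto3

# Hypothetical API-side helper that enqueues one GPU job per user request.
batch = boto3.client("batch")


def enqueue_detection(video_url: str, callback_url: str) -> str:
    """Submit a Batch job and return its id so the caller can track it."""
    response = batch.submit_job(
        jobName="yolo-dog-detection",
        jobQueue="YoloDogDetectionGPUQueue",
        jobDefinition="yolo-dog-detection-gpu-job",
        parameters={"video_url": video_url, "callback_url": callback_url},
    )
    return response["jobId"]
```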
SageMaker is a powerful platform for machine learning development, offering a comprehensive set of features that simplify the entire ML workflow. Instead of training models locally, you can seamlessly initiate and iterate through your entire process within SageMaker: create notebooks, train models, manage your training data with data lakes and storage solutions, deploy models, and establish endpoints, all integrated within the SageMaker environment. Leveraging SageMaker as a one-stop shop lets your whole team get involved. No more "I can't run this model on my machine": SageMaker, like Batch, will spin up an EC2 instance to run and test these models within AWS. You'll also be able to tap into other AWS resources: S3, Redshift, Bedrock, open-source HuggingFace models, full ML workflows, and more. When it comes to machine learning and AI specifically, SageMaker becomes your full one-stop shop.
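As a taste of the SDK, here's a minimal sketch of deploying a trained PyTorch model to a managed GPU endpoint; the artifact path, IAM role, instance type, and inference script are placeholders, and it assumes inference.py implements SageMaker's standard input/predict/output handlers.

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Placeholder values; substitute your own artifact, role, and inference script.
model = PyTorchModel(
    model_data="s3://my-bucket/models/yolo/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
)

# SageMaker provisions the instance and exposes a managed HTTPS endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Invoke it like any other endpoint.
result = predictor.predict({"video_url": "https://example.com/video.mp4"})
```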
When considering SageMaker for your machine learning needs, there are several key factors to keep in mind. First and foremost, SageMaker offers a comprehensive suite of built-in ML tools that streamline the entire workflow, from data preparation to model deployment. This integration significantly reduces the overhead associated with managing separate tools and environments, allowing teams to focus more on model development rather than infrastructure management.
Ease of use is another critical advantage of SageMaker. The platform is designed to cater to users with varying levels of expertise, featuring a user-friendly interface and pre-configured environments. This accessibility enables teams to quickly get started with their projects, regardless of their technical background. However, it is essential to consider the cost implications; while SageMaker provides convenience, it also comes with a premium pricing model. Evaluating whether the benefits of speed and ease of use justify the additional costs for your specific use case is crucial.
Additionally, organizations should be aware of the potential for vendor lock-in when relying heavily on SageMaker. This dependency may complicate future transitions to other platforms, so it's important to consider your long-term strategy and flexibility needs. On the positive side, SageMaker seamlessly integrates with other AWS services, such as S3 for data storage and Redshift for data warehousing, which enhances your overall machine learning capabilities.
Many companies are finding success with a hybrid approach:
SageMaker for:

- Model development: notebooks, experiments, and training runs
- Managed deployment and always-on inference endpoints
- Teams that want built-in ML tooling without heavy DevOps

AWS Batch for:

- Bring-your-own-image workloads you can Dockerize
- Full control over the environment, instance types, and dependencies
- Cost-sensitive or bursty GPU jobs that don't justify an always-on endpoint
Remember: Your choice isn't permanent. Many successful companies start with SageMaker for its ease of use and ML features, then gradually incorporate AWS Batch for specific workloads as they scale.
Want to discuss your specific ML infrastructure needs? Let's talk about building the right solution for your team.