Set up for SageMaker

You can use W&B Launch to submit launch jobs to Amazon SageMaker to train machine learning models using provided or custom algorithms on the SageMaker platform. SageMaker takes care of spinning up and releasing compute resources, so it can be a good choice for teams without an EKS cluster.

Launch jobs sent to a W&B Launch queue connected to Amazon SageMaker are executed as SageMaker Training Jobs with the CreateTrainingJob API. Use the launch queue configuration to control arguments sent to the CreateTrainingJob API.

Amazon SageMaker uses Docker images to execute training jobs, and it can only pull images from the Amazon Elastic Container Registry (ECR). The image you use for training must therefore be stored in ECR.

note

This guide shows how to execute SageMaker Training Jobs. For information on how to deploy models for inference on Amazon SageMaker, see this example Launch job.

Prerequisites

Before you get started, ensure you satisfy the following prerequisites:

Decide if you want the Launch agent to build a Docker image for you.
Set up AWS resources and gather information about S3, ECR, and SageMaker IAM roles.
Create an IAM role for the Launch agent.

Decide if you want the Launch agent to build a Docker image

Decide if you want the W&B Launch agent to build a Docker image for you. There are two options you can choose from:

Permit the launch agent to build a Docker image, push the image to Amazon ECR, and submit SageMaker Training jobs for you. This option is the simpler one for ML engineers who are rapidly iterating on training code.
Have the launch agent use an existing Docker image that contains your training or inference scripts. This option works well with existing CI systems. If you choose this option, you must manually upload your Docker image to your container registry on Amazon ECR.

Set up AWS resources

Ensure you have the following AWS resources configured in your preferred AWS region:

An ECR repository to store container images.
One or more S3 buckets to store inputs and outputs for your SageMaker Training jobs.
An IAM role for Amazon SageMaker that permits SageMaker to run training jobs and interact with Amazon ECR and Amazon S3.

Make a note of the ARNs for these resources. You will need the ARNs when you define the Launch queue configuration.

Create an IAM Policy for Launch agent

From the IAM screen in AWS, create a new policy.

Toggle to the JSON policy editor, then paste the following policy based on your use case. Substitute values enclosed with <> with your own values:

Agent submits pre-built Docker image:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
    {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "sagemaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

Agent builds and submits Docker image:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "SageMaker:AddTags",
        "SageMaker:CreateTrainingJob",
        "SageMaker:DescribeTrainingJob"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<RoleArn-from-queue-config>"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:CreateRepository",
        "ecr:UploadLayerPart",
        "ecr:PutImage",
        "ecr:CompleteLayerUpload",
        "ecr:InitiateLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchDeleteImage"
      ],
      "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repository>"
    },
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "kms:CreateGrant",
      "Resource": "<ARN-OF-KMS-KEY>",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "sagemaker.<region>.amazonaws.com",
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}

Click Next.
Give the policy a name and description.
Click Create policy.
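
If you prefer to generate the policy document instead of editing placeholders by hand, the substitution can be sketched in a few lines of Python. This is a minimal sketch, not part of W&B Launch or the AWS SDK: the function name `render_agent_policy` and its parameters are hypothetical, and it renders only the pre-built-image variant of the policy. It adds the `kms:CreateGrant` statement only when you pass a KMS key ARN, since that permission is needed only when the queue's ResourceConfig sets a VolumeKmsKeyId.

```python
import json


def render_agent_policy(region, account_id, sagemaker_role_name, kms_key_arn=None):
    """Fill in the <...> placeholders of the pre-built-image agent policy.

    Hypothetical helper for illustration; all argument names are assumptions.
    """
    statements = [
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "SageMaker:AddTags",
                "SageMaker:CreateTrainingJob",
                "SageMaker:DescribeTrainingJob",
            ],
            "Resource": f"arn:aws:sagemaker:{region}:{account_id}:*",
        },
        {
            # Lets the agent pass the SageMaker execution role named in the queue config.
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": f"arn:aws:iam::{account_id}:role/{sagemaker_role_name}",
        },
    ]
    if kms_key_arn:
        # Only needed when the queue's ResourceConfig specifies a VolumeKmsKeyId.
        statements.append(
            {
                "Effect": "Allow",
                "Action": "kms:CreateGrant",
                "Resource": kms_key_arn,
                "Condition": {
                    "StringEquals": {
                        # StringEquals is case-sensitive, so the service name is lowercase.
                        "kms:ViaService": f"sagemaker.{region}.amazonaws.com",
                        "kms:GrantIsForAWSResource": "true",
                    }
                },
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Example values are placeholders, not real account details.
    policy = render_agent_policy("us-east-1", "123456789012", "my-sagemaker-role")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the JSON policy editor described above.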

Create an IAM role for Launch agent

The Launch agent needs permission to create Amazon SageMaker training jobs. Follow the procedure below to create an IAM role:

From the IAM screen in AWS, create a new role.
For Trusted Entity, select AWS Account (or another option that suits your organization's policies).
Scroll through the permissions screen and select the policy name you just created above.
Give the role a name and description.
Select Create role.
Note the ARN for the role. You will specify the ARN when you set up the launch agent.

For more information on how to create an IAM role, see the AWS Identity and Access Management Documentation.

info

If you want the launch agent to build images, see the Advanced agent set up for additional permissions required.
The kms:CreateGrant permission for SageMaker queues is required only if the associated ResourceConfig has a specified VolumeKmsKeyId and the associated role does not have a policy that permits this action.

Configure launch queue for SageMaker

Next, create a queue in the W&B App that uses SageMaker as its compute resource:

Navigate to the Launch App.
Click on the Create Queue button.
Select the Entity you would like to create the queue in.
Provide a name for your queue in the Name field.
Select SageMaker as the Resource.
Within the Configuration field, provide information about your SageMaker job. By default, W&B will populate a YAML and JSON CreateTrainingJob request body:

{
  "RoleArn": "<REQUIRED>", 
  "ResourceConfig": {
      "InstanceType": "ml.m4.xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 2
  },
  "OutputDataConfig": {
      "S3OutputPath": "<REQUIRED>"
  },
  "StoppingCondition": {
      "MaxRuntimeInSeconds": 3600
  }
}

You must at minimum specify:

RoleArn : ARN of the SageMaker execution IAM role (see prerequisites). Not to be confused with the launch agent IAM role.
OutputDataConfig.S3OutputPath : An Amazon S3 URI specifying where SageMaker outputs will be stored.
ResourceConfig: Required specification of a resource config. Options for resource config are outlined here.
StoppingCondition: Required specification of the stopping conditions for the training job. Options outlined here.

Click on the Create Queue button.
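
Before you paste a configuration into the queue form, it can help to check that the required fields are filled in. The following is a minimal sketch of such a check, assuming the configuration is available as a Python dictionary. The helper `missing_required_fields` is hypothetical and only mirrors the minimum fields listed above; it does not reproduce the validation that W&B or the CreateTrainingJob API performs.

```python
# Required fields from this guide, written as dotted paths into the queue config.
REQUIRED_FIELDS = [
    "RoleArn",
    "OutputDataConfig.S3OutputPath",
    "ResourceConfig",
    "StoppingCondition",
]


def missing_required_fields(config):
    """Return the dotted paths that are absent, empty, or still '<REQUIRED>'."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = config
        for key in dotted.split("."):
            # Walk one level down; stop descending if the parent is not a dict.
            node = node.get(key) if isinstance(node, dict) else None
        if node in (None, "", "<REQUIRED>") or node == {}:
            missing.append(dotted)
    return missing


if __name__ == "__main__":
    default_config = {
        "RoleArn": "<REQUIRED>",
        "ResourceConfig": {
            "InstanceType": "ml.m4.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 2,
        },
        "OutputDataConfig": {"S3OutputPath": "<REQUIRED>"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
    print(missing_required_fields(default_config))
```

Run against the default configuration, the check reports the two placeholder fields, `RoleArn` and `OutputDataConfig.S3OutputPath`, that you still need to fill in.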

Set up the launch agent

The following section describes where you can deploy your agent and how to configure your agent based on where it is deployed.

You can deploy the Launch agent for an Amazon SageMaker queue in several ways: on a local machine, on an EC2 instance, or in an EKS cluster. Configure your launch agent according to where you deploy it.

Decide where to run the Launch agent

For production workloads and for customers who already have an EKS cluster, W&B recommends deploying the Launch agent to the EKS cluster using this Helm chart.

For production workloads without an existing EKS cluster, an EC2 instance is a good option. Although the launch agent instance runs continuously, the agent does not need more than a t2.micro sized EC2 instance, which is relatively affordable.

For experimental or solo use cases, running the Launch agent on your local machine can be a fast way to get started.

Based on your use case, follow the instructions for your deployment target to configure your launch agent:

EKS

W&B strongly recommends using the W&B managed Helm chart to install the agent in an EKS cluster.

EC2

Navigate to the Amazon EC2 Dashboard and complete the following steps:

Click Launch instance.
Provide a name for the Name field. Optionally add a tag.
From Instance type, select an instance type for your EC2 instance. You do not need more than 1 vCPU and 1 GiB of memory (for example, a t2.micro).
Create a key pair for your organization within the Key pair (login) field. You will use this key pair to connect to your EC2 instance with SSH client at a later step.
Within Network settings, select an appropriate security group for your organization.
Expand Advanced details. For IAM instance profile, select the launch agent IAM role you created above.
Review the Summary field. If correct, select Launch instance.

Navigate to Instances within the left panel of the EC2 Dashboard on AWS. Ensure that the EC2 instance you created is running (see the Instance state column). Once you confirm your EC2 instance is running, complete the following steps:

Select Connect.
Select the SSH client tab and follow the instructions outlined there to connect to your EC2 instance from your local machine's terminal.
Within your EC2 instance, install the following packages:

sudo yum install python311 -y && python3 -m ensurepip --upgrade && pip3 install wandb && pip3 install wandb[launch]

Next, install and start Docker within your EC2 instance:

sudo yum update -y && sudo yum install -y docker python3 && sudo systemctl start docker && sudo systemctl enable docker && sudo usermod -a -G docker ec2-user

newgrp docker

Now you can proceed to setting up the Launch agent config.

Local machine

Use the AWS config files located at ~/.aws/config and ~/.aws/credentials to associate a role with an agent that is polling on a local machine. Provide the IAM role ARN that you created for the launch agent in the previous step.

~/.aws/config
[profile SageMaker-agent]
role_arn = arn:aws:iam::<account-id>:role/<agent-role-name>
source_profile = default                                                                   

~/.aws/credentials
[default]
aws_access_key_id=<access-key-id>
aws_secret_access_key=<secret-access-key>
aws_session_token=<session-token>

Note that session tokens expire. Their maximum lifetime is 1 hour or 3 days, depending on the principal they are associated with.
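
A mistyped profile name or a `source_profile` that does not exist in the credentials file is an easy mistake to make here. The following minimal sketch uses Python's standard `configparser` module to check that the two files are consistent with each other. The helper `check_agent_profile` is hypothetical, and the default profile name is the one used in the example above. The sketch checks only the structure of the files and does not contact AWS or verify that the credentials are valid.

```python
import configparser


def check_agent_profile(config_text, credentials_text, profile="SageMaker-agent"):
    """Return a list of problems found in the two AWS config files (empty if none)."""
    # Disable interpolation so '%' characters in values are read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    credentials = configparser.ConfigParser(interpolation=None)
    credentials.read_string(credentials_text)

    section = f"profile {profile}"
    if not config.has_section(section):
        return [f"missing [{section}] section in config"]

    problems = []
    role_arn = config.get(section, "role_arn", fallback="")
    if not role_arn.startswith("arn:aws:iam::"):
        problems.append("role_arn does not look like an IAM role ARN")

    source_profile = config.get(section, "source_profile", fallback="")
    if not credentials.has_section(source_profile):
        problems.append(f"source_profile '{source_profile}' not found in credentials")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    config_text = (
        "[profile SageMaker-agent]\n"
        "role_arn = arn:aws:iam::123456789012:role/my-agent-role\n"
        "source_profile = default\n"
    )
    credentials_text = (
        "[default]\n"
        "aws_access_key_id=EXAMPLEKEYID\n"
        "aws_secret_access_key=EXAMPLESECRET\n"
    )
    print(check_agent_profile(config_text, credentials_text))
```

To check your real files, read `~/.aws/config` and `~/.aws/credentials` as text and pass their contents to the function.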

Configure a launch agent

Configure the launch agent with a YAML config file named launch-config.yaml.

By default, W&B checks for the config file at ~/.config/wandb/launch-config.yaml. To use a config file in a different location, pass its path with the -c flag when you start the launch agent.

The following YAML snippet demonstrates how to specify the core agent config options:

launch-config.yaml
max_jobs: -1
queues:
  - <queue-name>
environment:
  type: aws
  region: <your-region>
registry:
  type: ecr
  uri: <ecr-repo-arn>
builder: 
  type: docker

Now start the agent with wandb launch-agent.
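
If you manage several agents, you may want to generate launch-config.yaml from a script instead of editing it by hand. The following minimal sketch renders the same core options shown above as YAML text, using plain string formatting so it needs no third-party packages. The function `render_launch_config` and its parameters are hypothetical, and the values passed to it are placeholders.

```python
from pathlib import Path

# Default location that W&B checks for the agent config, as described above.
DEFAULT_CONFIG_PATH = Path.home() / ".config" / "wandb" / "launch-config.yaml"


def render_launch_config(queues, region, registry_uri, max_jobs=-1):
    """Render the core launch agent options from this guide as YAML text."""
    lines = [f"max_jobs: {max_jobs}", "queues:"]
    lines += [f"  - {queue}" for queue in queues]
    lines += [
        "environment:",
        "  type: aws",
        f"  region: {region}",
        "registry:",
        "  type: ecr",
        f"  uri: {registry_uri}",
        "builder:",
        "  type: docker",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; substitute your own queue, region, and registry.
    text = render_launch_config(["my-queue"], "us-east-1", "<ecr-repo-arn>")
    print(text)
    # To write the file to the default location, uncomment the lines below:
    # DEFAULT_CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
    # DEFAULT_CONFIG_PATH.write_text(text)
```

The sketch assumes queue names and other values need no YAML quoting. If yours contain special characters such as colons, use a YAML library to write the file instead.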

(Optional) Push your launch job Docker image to Amazon ECR

info

This section applies only if your launch agent uses existing Docker images that contain your training or inference logic. For the two ways your launch agent can behave, see the section on deciding whether the Launch agent builds a Docker image.

Upload your Docker image that contains your launch job to your Amazon ECR repo. Your Docker image needs to be in your ECR registry before you submit new launch jobs if you are using image-based jobs.
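
The upload usually comes down to three commands: authenticate Docker against your ECR registry, tag the local image with the repository URI, and push it. The following minimal sketch composes those commands from your account details so that you can review them before running them. The helper `ecr_push_commands` and its parameters are hypothetical. The sketch only builds the command strings and assumes the AWS CLI and Docker are installed on the machine where you run them.

```python
def ecr_push_commands(account_id, region, repository, tag="latest", local_image=None):
    """Build the shell commands that log in to ECR, then tag and push an image."""
    # Private ECR registries use this hostname format.
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    remote_image = f"{registry}/{repository}:{tag}"
    # Assume the local image has the same name and tag unless told otherwise.
    local_image = local_image or f"{repository}:{tag}"
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker tag {local_image} {remote_image}",
        f"docker push {remote_image}",
    ]


if __name__ == "__main__":
    # Placeholder account, region, and repository values.
    for command in ecr_push_commands("123456789012", "us-east-1", "my-training-image"):
        print(command)
```

Run the printed commands in order from a shell whose AWS credentials are allowed to push to the repository.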
