mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. By default, it creates a cluster just for your job. You can also run your job in a specific cluster by setting cluster_id, or have the runner automatically choose a waiting cluster (creating one if none exists) by setting pool_clusters.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
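As a rough illustration of the local-versus-S3 distinction, the sketch below classifies a path by its URL scheme using only the standard library. The helpers looks_like_s3_uri and split_s3_uri are hypothetical names defined in this snippet; they are not presented as how mrjob itself parses URIs.

```python
from urllib.parse import urlparse


def looks_like_s3_uri(path):
    """Return True if path uses the s3:// scheme (hypothetical helper)."""
    return urlparse(path).scheme == 's3'


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: %r' % uri)
    return parsed.netloc, parsed.path.lstrip('/')


# Made-up example paths: one local, one on S3
inputs = ['data/input.txt', 's3://example-bucket/logs/2024/part-00000']
classified = [(p, 's3' if looks_like_s3_uri(p) else 'local') for p in inputs]
```

Anything without the s3 scheme is treated as local here, which is a simplification for the purposes of the example.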

This class also provides utilities for talking directly to S3 and EMR, so it can be worth instantiating it without a script:

from mrjob.emr import EMRJobRunner

emr_client = EMRJobRunner().make_emr_client()
clusters = emr_client.list_clusters()
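The value returned by list_clusters() is a plain dictionary, so it can be filtered with ordinary Python. The sketch below assumes the response shape documented by boto3 (a 'Clusters' list whose entries carry 'Id' and 'Status' -> 'State'); the helper cluster_ids_in_state and the sample data are made up for illustration, so the snippet runs without AWS credentials.

```python
def cluster_ids_in_state(response, state):
    """Return IDs of clusters in the given state from a list_clusters() response."""
    return [
        cluster['Id']
        for cluster in response.get('Clusters', [])
        if cluster['Status']['State'] == state
    ]


# Made-up sample data in the shape boto3 documents for list_clusters()
sample_response = {
    'Clusters': [
        {'Id': 'j-EXAMPLE1', 'Name': 'nightly', 'Status': {'State': 'WAITING'}},
        {'Id': 'j-EXAMPLE2', 'Name': 'adhoc', 'Status': {'State': 'TERMINATED'}},
    ]
}

waiting = cluster_ids_in_state(sample_response, 'WAITING')

# With real credentials, the response would come from the client instead:
#   from mrjob.emr import EMRJobRunner
#   emr_client = EMRJobRunner().make_emr_client()
#   waiting = cluster_ids_in_state(emr_client.list_clusters(), 'WAITING')
```

Note that list_clusters() is paginated, so a single response may not include every cluster in the account.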

EMR Utilities


EMRJobRunner.get_cluster_id()

Get the ID of the cluster our job is running on, or None.

EMRJobRunner.get_image_version()

Get the version of the AMI that our cluster is running, or None.

EMRJobRunner.get_job_steps()

Fetch the steps submitted by this runner from the EMR API.

Deprecated since version 0.7.4.

New in version 0.6.1.


EMRJobRunner.make_emr_client()

Create a boto3 EMR client.

Returns: a botocore.client.EMR wrapped in a mrjob.retry.RetryWrapper
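To give a sense of what wrapping a client in a retry layer means, here is a minimal sketch of a proxy object that retries method calls on selected exceptions, with exponential backoff. SimpleRetryWrapper and FlakyClient are hypothetical classes written for this example; this is not mrjob.retry.RetryWrapper's actual implementation or constructor signature.

```python
import time


class SimpleRetryWrapper:
    """Proxy that retries method calls on the wrapped object (illustrative only)."""

    def __init__(self, wrapped, retry_if, max_tries=3, backoff=0.0):
        self._wrapped = wrapped
        self._retry_if = retry_if    # callable: exception -> bool
        self._max_tries = max_tries
        self._backoff = backoff      # seconds to sleep before the first retry

    def __getattr__(self, name):
        # only reached for attributes not set on the wrapper itself
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call_with_retries(*args, **kwargs):
            delay = self._backoff
            for attempt in range(1, self._max_tries + 1):
                try:
                    return attr(*args, **kwargs)
                except Exception as ex:
                    # give up on the last attempt or on non-retryable errors
                    if attempt == self._max_tries or not self._retry_if(ex):
                        raise
                    time.sleep(delay)
                    delay *= 2  # exponential backoff

        return call_with_retries


class FlakyClient:
    """Stand-in for an API client that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def list_clusters(self):
        self.calls += 1
        if self.calls < 3:
            raise IOError('throttled')
        return {'Clusters': []}


flaky = FlakyClient()
client = SimpleRetryWrapper(flaky, retry_if=lambda ex: isinstance(ex, IOError))
result = client.list_clusters()
```

The caller uses the wrapped client exactly like the underlying one; the two transient failures are absorbed by the wrapper.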

S3 Utilities

class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None, part_size=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

Parameters:

  • aws_access_key_id – Your AWS access key ID
  • aws_secret_access_key – Your AWS secret access key
  • aws_session_token – Session token for use with temporary AWS credentials
  • s3_endpoint – If set, always use this endpoint
  • s3_region – Default region for connections to the S3 API and newly created buckets
  • part_size – Part size for multi-part uploading, in bytes, or None

Changed in version 0.6.8: added part_size
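To make the effect of part_size concrete, the sketch below shows how a file of a given size could be divided into parts of at most part_size bytes. The function plan_parts is a hypothetical helper, not part of mrjob; the two constants reflect S3's documented multipart limits (at most 10,000 parts, and at least 5 MiB for every part except the last). How S3Filesystem behaves when part_size is None is not covered here.

```python
import math

MAX_PARTS = 10000                 # S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part but the last


def plan_parts(file_size, part_size):
    """Return a list of (offset, length) pairs covering file_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError('part_size is below the S3 minimum')
    num_parts = max(1, math.ceil(file_size / part_size))
    if num_parts > MAX_PARTS:
        raise ValueError('too many parts; increase part_size')
    return [
        (i * part_size, min(part_size, file_size - i * part_size))
        for i in range(num_parts)
    ]


# A 250 MiB file with 100 MiB parts: two full parts and one 50 MiB remainder
parts = plan_parts(file_size=250 * 1024 * 1024, part_size=100 * 1024 * 1024)
```

A larger part_size means fewer requests per upload, at the cost of more data to resend if a single part fails.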

S3Filesystem.create_bucket(bucket_name, region=None)

Create a bucket on S3 with a location constraint matching the given region.
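For readers unfamiliar with location constraints, the sketch below builds the keyword arguments that a boto3 S3 client's create_bucket() call expects for a given region. The helper create_bucket_kwargs is hypothetical and is not claimed to mirror mrjob's internals; the special case reflects S3's rule that the default region, us-east-1, must not be passed as a LocationConstraint.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for a boto3 S3 client's create_bucket() call."""
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is S3's default region and must not be passed as a constraint
    if region and region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


# 'example-bucket' is a made-up bucket name
west = create_bucket_kwargs('example-bucket', region='us-west-2')
default = create_bucket_kwargs('example-bucket')
```

With a real client, these would be passed as s3_client.create_bucket(**west).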


S3Filesystem.get_all_bucket_names()

Get a list of the names of all buckets owned by this user on S3.

S3Filesystem.get_bucket(bucket_name)

Get the (boto3) bucket, connecting through the appropriate endpoint.

S3Filesystem.make_s3_client(region_name=None)

Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose S3 endpoint

S3Filesystem.make_s3_resource(region_name=None)

Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose S3 endpoint

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().
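One piece of bucket metadata that matters for endpoint selection is the bucket's region. The sketch below interprets the dictionary that a boto3 S3 client's get_bucket_location() call returns; region_from_location is a hypothetical helper, and the sample responses are made up so the snippet runs without credentials. It relies on two documented S3 behaviors: buckets in us-east-1 report a LocationConstraint of None, and some older buckets report the legacy value 'EU' for eu-west-1.

```python
def region_from_location(response):
    """Map a get_bucket_location() response to a region name."""
    location = response.get('LocationConstraint')
    if not location:
        return 'us-east-1'   # default region is reported as None
    if location == 'EU':
        return 'eu-west-1'   # legacy alias
    return location


# Made-up sample responses in the shape boto3 documents
regions = [
    region_from_location({'LocationConstraint': None}),
    region_from_location({'LocationConstraint': 'EU'}),
    region_from_location({'LocationConstraint': 'us-west-2'}),
]

# With real credentials, the response would come from the client instead:
#   from mrjob.emr import EMRJobRunner
#   s3_client = EMRJobRunner().fs.make_s3_client()
#   response = s3_client.get_bucket_location(Bucket='example-bucket')
```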

Other AWS clients


EMRJobRunner.make_ec2_client()

Create a boto3 EC2 client.

Returns: a botocore.client.EC2 wrapped in a mrjob.retry.RetryWrapper

EMRJobRunner.make_iam_client()

Create a boto3 IAM client.

Returns: a botocore.client.IAM wrapped in a mrjob.retry.RetryWrapper