mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. Normally, it creates a cluster just for your job. You can also run your job in a specific cluster by setting cluster_id, or have the runner automatically choose a waiting cluster (creating one if none exists) by setting pool_clusters.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
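As a rough illustration of how these options fit together on the command line, the sketch below builds an argument list for a job run with -r emr. The helper name emr_args is hypothetical, and the example assumes the cluster_id and pool_clusters options are exposed as the --cluster-id and --pool-clusters switches:

```python
def emr_args(input_paths, cluster_id=None, pool_clusters=False):
    """Build command-line arguments for running an MRJob on EMR.

    emr_args is a hypothetical helper, not part of mrjob.
    """
    args = ['-r', 'emr']
    if cluster_id:
        # run in a specific, already-running cluster
        args += ['--cluster-id', cluster_id]
    if pool_clusters:
        # reuse a waiting cluster, or create one if none exists
        args.append('--pool-clusters')
    # input paths may be local files or s3://... URLs
    return args + list(input_paths)


# Possible usage, assuming a job class named MRWordCount (hypothetical):
#
#   job = MRWordCount(args=emr_args(['s3://my-bucket/input/'],
#                                   pool_clusters=True))
#   with job.make_runner() as runner:
#       runner.run()
```

The helper only assembles strings, so it can be checked locally without AWS credentials.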

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

from mrjob.emr import EMRJobRunner

emr_client = EMRJobRunner().make_emr_client()
clusters = emr_client.list_clusters()
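The response from list_clusters() is a dictionary and may be paginated. The sketch below is one way to walk all pages, assuming the client behaves like a boto3 EMR client (a ClusterStates filter on the request, and Clusters and Marker keys in the response). The helper name iter_clusters is hypothetical:

```python
def iter_clusters(emr_client, states=('WAITING',)):
    """Yield cluster summaries in the given states, following pagination.

    emr_client is assumed to behave like a boto3 EMR client, for example
    the object returned by EMRJobRunner().make_emr_client().
    """
    kwargs = {'ClusterStates': list(states)}
    while True:
        resp = emr_client.list_clusters(**kwargs)
        for cluster in resp.get('Clusters', []):
            yield cluster
        # boto3 returns a Marker when there are more results to fetch
        marker = resp.get('Marker')
        if not marker:
            break
        kwargs['Marker'] = marker


# Possible usage:
#
#   from mrjob.emr import EMRJobRunner
#   emr_client = EMRJobRunner().make_emr_client()
#   for cluster in iter_clusters(emr_client):
#       print(cluster['Id'], cluster['Name'])
```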

EMR Utilities


EMRJobRunner.get_cluster_id()

Get the ID of the cluster our job is running on, or None.


EMRJobRunner.get_image_version()

Get the version of the AMI that our cluster is running, or None.

Changed in version 0.5.4: This used to be called get_ami_version().


EMRJobRunner.get_job_steps()

Fetch the steps submitted by this runner from the EMR API.

New in version 0.6.1.
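Once you have the steps, a quick tally of their states can show how far a job has progressed. This is a minimal sketch that assumes the method is named get_job_steps() and that each step is a dictionary shaped like an entry in boto3's list_steps() response, with the state under Status and then State. The helper name count_step_states is hypothetical:

```python
from collections import Counter


def count_step_states(steps):
    """Count steps by state (for example PENDING, RUNNING, COMPLETED).

    steps is assumed to be a list of dicts with a ['Status']['State']
    entry, as in boto3's list_steps() response.
    """
    return Counter(step['Status']['State'] for step in steps)


# Possible usage, given a runner that has already submitted steps:
#
#   states = count_step_states(runner.get_job_steps())
#   print(dict(states))
```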


EMRJobRunner.make_emr_client()

Create a boto3 EMR client.

Returns: a botocore.client.EMR wrapped in a mrjob.retry.RetryWrapper

S3 Utilities

class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

S3Filesystem.create_bucket(bucket_name, region=None)

Create a bucket on S3 with a location constraint matching the given region.

Changed in version 0.6.0: The region argument used to be called location.


S3Filesystem.get_all_bucket_names()

Get a list of the names of all buckets owned by this user on S3.

New in version 0.6.0.
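Listing and creating buckets can be combined into a "create if missing" check. The sketch below assumes the listing method is named get_all_bucket_names() and uses create_bucket() as documented above. The helper name ensure_bucket is hypothetical:

```python
def ensure_bucket(fs, bucket_name, region=None):
    """Create bucket_name unless this user already owns it.

    fs is assumed to behave like an S3Filesystem. Returns True if the
    bucket was created, False if it already existed.
    """
    if bucket_name in fs.get_all_bucket_names():
        return False
    fs.create_bucket(bucket_name, region=region)
    return True


# Possible usage (the bucket name is a placeholder):
#
#   from mrjob.fs.s3 import S3Filesystem
#   ensure_bucket(S3Filesystem(), 'my-mrjob-scratch', region='us-west-2')
```

Note that this only checks buckets owned by the current user. Creating a bucket whose name is taken by another account would still fail.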


S3Filesystem.get_bucket(bucket_name)

Get the (boto3) bucket, connecting through the appropriate endpoint.


S3Filesystem.make_s3_client(region_name=None)

Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters: region – region to use to choose S3 endpoint

New in version 0.6.0.


S3Filesystem.make_s3_resource(region_name=None)

Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters: region – region to use to choose S3 endpoint

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().

New in version 0.6.0.
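As an example of working with the bucket object that get_bucket() returns, the sketch below sums the sizes of all keys under a prefix. It assumes the bucket behaves like a boto3 Bucket resource, whose objects.filter(Prefix=...) collection yields object summaries with a size attribute. The helper name prefix_size is hypothetical:

```python
def prefix_size(bucket, prefix=''):
    """Return the total size in bytes of all keys under prefix.

    bucket is assumed to behave like a boto3 Bucket resource, for
    example the object returned by S3Filesystem.get_bucket().
    """
    return sum(obj.size for obj in bucket.objects.filter(Prefix=prefix))


# Possible usage (bucket and prefix names are placeholders):
#
#   from mrjob.fs.s3 import S3Filesystem
#   bucket = S3Filesystem().get_bucket('my-mrjob-scratch')
#   print(prefix_size(bucket, 'logs/'))
```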