mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. By default, it creates a cluster just for your job. You can instead run the job in a specific cluster by setting cluster_id, or set pool_clusters to automatically reuse a waiting cluster, creating one if none exists.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
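As a rough illustration of the local-versus-S3 distinction, the sketch below tells s3://... URIs apart from local paths and splits a URI into its bucket and key. The helper names split_s3_uri and is_s3_uri are hypothetical and not part of mrjob; mrjob's own URI handling may differ.

```python
S3_PREFIX = 's3://'


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key).

    Hypothetical helper for illustration; raises ValueError for
    anything that is not an s3:// URI with a bucket name.
    """
    if not uri.startswith(S3_PREFIX):
        raise ValueError('not an S3 URI: %r' % (uri,))
    # everything up to the first slash is the bucket, the rest is the key
    bucket, _, key = uri[len(S3_PREFIX):].partition('/')
    if not bucket:
        raise ValueError('S3 URI has no bucket: %r' % (uri,))
    return bucket, key


def is_s3_uri(path):
    """Return True for s3://... URIs; anything else is treated as local."""
    try:
        split_s3_uri(path)
    except ValueError:
        return False
    return True
```

With this sketch, is_s3_uri('s3://walrus/data/input.txt') is true and is_s3_uri('/tmp/input.txt') is false; the bucket name walrus is only an example.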

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

from mrjob.emr import EMRJobRunner

# No job script is needed just to talk to EMR.
emr_conn = EMRJobRunner().make_emr_conn()
clusters = emr_conn.list_clusters()
...

EMR Utilities

EMRJobRunner.get_cluster_id()

Get the ID of the cluster our job is running on, or None.

EMRJobRunner.get_image_version()

Get the version of the AMI that our cluster is running, or None.

Changed in version 0.5.4: This used to be called get_ami_version().

EMRJobRunner.make_emr_conn()

Create a connection to EMR.

Returns: a boto.emr.connection.EmrConnection, wrapped in a mrjob.retry.RetryWrapper.
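To give a sense of what wrapping a connection in a retry wrapper means, here is a minimal sketch of the general pattern: method calls on the wrapped object are retried, with a growing delay, when an exception looks transient. SimpleRetryWrapper and FlakyClient are hypothetical names for illustration only. This is not the implementation of mrjob.retry.RetryWrapper, and the parameter names and defaults shown are assumptions.

```python
import time


class SimpleRetryWrapper:
    """Hypothetical stand-in for a retry wrapper (not mrjob's class)."""

    def __init__(self, wrapped, retry_if, backoff=0.0, multiplier=1.5,
                 max_tries=3):
        self._wrapped = wrapped
        self._retry_if = retry_if      # callable: exception -> bool
        self._backoff = backoff        # initial delay in seconds
        self._multiplier = multiplier  # delay growth factor per retry
        self._max_tries = max_tries    # total attempts, must be >= 1

    def __getattr__(self, name):
        # only reached for attributes not defined on the wrapper itself
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call_with_retries(*args, **kwargs):
            delay = self._backoff
            for attempt in range(1, self._max_tries + 1):
                try:
                    return attr(*args, **kwargs)
                except Exception as ex:
                    # give up on the last attempt or on non-retryable errors
                    if attempt == self._max_tries or not self._retry_if(ex):
                        raise
                    time.sleep(delay)
                    delay *= self._multiplier

        return call_with_retries


class FlakyClient:
    """Toy client that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def list_clusters(self):
        self.calls += 1
        if self.calls < 3:
            raise IOError('throttled')
        return ['j-EXAMPLE']


client = FlakyClient()
wrapped = SimpleRetryWrapper(
    client, retry_if=lambda ex: isinstance(ex, IOError), max_tries=5)
clusters = wrapped.list_clusters()  # succeeds on the third attempt
```

The caller never sees the two transient failures, which is the point of handing out a wrapped connection rather than a bare one.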

S3 Utilities

class mrjob.emr.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.
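The idea of composing filesystems can be sketched as a dispatcher that hands each path to the first filesystem willing to handle it. All class names below (ToyS3Filesystem, ToyLocalFilesystem, ToyCompositeFilesystem) and the describe() method are hypothetical; this is an illustration of the pattern under those assumptions, not mrjob's actual filesystem classes.

```python
class ToyS3Filesystem:
    """Hypothetical filesystem that only claims s3:// URIs."""

    def can_handle_path(self, path):
        return path.startswith('s3://')

    def describe(self, path):
        return 's3: ' + path


class ToyLocalFilesystem:
    """Hypothetical filesystem that claims anything without a URI scheme."""

    def can_handle_path(self, path):
        return '://' not in path

    def describe(self, path):
        return 'local: ' + path


class ToyCompositeFilesystem:
    """Try each filesystem in order; use the first that accepts the path."""

    def __init__(self, *filesystems):
        self.filesystems = filesystems

    def describe(self, path):
        for fs in self.filesystems:
            if fs.can_handle_path(path):
                return fs.describe(path)
        raise IOError('no filesystem can handle %r' % (path,))


fs = ToyCompositeFilesystem(ToyS3Filesystem(), ToyLocalFilesystem())
```

Code that holds the composite object can then pass it local paths and S3 URIs alike without checking which kind it has.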

S3Filesystem.create_bucket(bucket_name, region=None)

Create a bucket on S3 with a location constraint matching the given region.

Changed in version 0.6.0: The region argument used to be called location.
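In boto3 terms, a location constraint is passed to the S3 client's create_bucket() through its CreateBucketConfiguration argument, and the default us-east-1 region is normally requested by leaving the constraint out. The sketch below builds those keyword arguments. The helper create_bucket_kwargs is hypothetical, and whether S3Filesystem.create_bucket() constructs its request this way is an assumption.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for a boto3 S3 client's create_bucket().

    Hypothetical helper for illustration, not part of mrjob.
    """
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is selected by omitting the location constraint
    if region and region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs
```

Given a boto3 S3 client named client (an assumption here), the result would be used as client.create_bucket(**create_bucket_kwargs('walrus', 'us-west-2')).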

S3Filesystem.get_all_bucket_names()

Get a stream of the names of all buckets owned by this user on S3.

New in version 0.6.0.

S3Filesystem.get_bucket(bucket_name)

Get the bucket, connecting through the appropriate endpoint.

S3Filesystem.make_s3_client(region_name=None)

Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose S3 endpoint.

New in version 0.6.0.

S3Filesystem.make_s3_resource(region_name=None)

Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose S3 endpoint.

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().

New in version 0.6.0.