mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. By default, it creates a cluster just for your job. You can instead run your job in a specific existing cluster by setting cluster_id, or have mrjob automatically choose a waiting cluster (creating one if none exists) by setting pool_clusters.
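
The three ways of choosing a cluster can be sketched as keyword arguments for the runner. This is a minimal illustration, not part of mrjob: the helper name emr_runner_kwargs is hypothetical, and it assumes cluster_id and pool_clusters are passed to EMRJobRunner as keyword arguments.

```python
def emr_runner_kwargs(cluster_id=None, pool_clusters=False):
    """Build keyword arguments for EMRJobRunner (hypothetical helper)."""
    if cluster_id is not None:
        # run in one specific, already-existing cluster
        return {'cluster_id': cluster_id}
    if pool_clusters:
        # reuse a waiting cluster, or create one if none exists
        return {'pool_clusters': True}
    # default: a new cluster is created just for this job
    return {}
```

With mrjob installed and AWS credentials configured, the result could then be passed along as EMRJobRunner(**emr_runner_kwargs(pool_clusters=True)).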

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
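
To make the local-versus-S3 distinction concrete, here is a small sketch that splits an s3:// URL into bucket and key using only the Python 3 standard library. The function name split_s3_uri is hypothetical and is not mrjob's own parser; it only recognizes the s3:// scheme.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3:// URI, or None for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        # no s3:// scheme, so treat the path as local
        return None
    # netloc holds the bucket name; the path (minus its leading slash) is the key
    return parsed.netloc, parsed.path.lstrip('/')
```

For example, split_s3_uri('s3://foo/bar') gives ('foo', 'bar'), while a plain path such as 'data/input.txt' gives None.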

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

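from mrjob.emr import EMRJobRunner

# no job script is needed just to talk to EMR
emr_conn = EMRJobRunner().make_emr_conn()
# list_clusters() is a method of the underlying boto EMR connection
clusters = emr_conn.list_clusters()
...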

EMR Utilities

EMRJobRunner.get_cluster_id()

EMRJobRunner.get_image_version()

Get the AMI version that our cluster is running.

Changed in version 0.5.4: This used to be called get_ami_version().

EMRJobRunner.make_emr_conn()

Create a connection to EMR.

Returns: a boto.emr.connection.EmrConnection, wrapped in a mrjob.retry.RetryWrapper
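
As a rough sketch of what can be done with the returned connection, the function below collects the IDs of clusters in a given state. The helper name cluster_ids_in_state is hypothetical. It assumes boto 2's list_clusters() accepts a cluster_states argument and returns an object with a clusters list whose items have an id attribute. Only the first page of results is read.

```python
def cluster_ids_in_state(emr_conn, state='WAITING'):
    """Return IDs of clusters in the given state (first page of results only)."""
    # emr_conn is expected to be the object returned by make_emr_conn()
    summary_list = emr_conn.list_clusters(cluster_states=[state])
    return [cluster.id for cluster in summary_list.clusters]
```

With credentials configured, this might be called as cluster_ids_in_state(EMRJobRunner().make_emr_conn()).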

S3 Utilities

mrjob.fs.s3.s3_key_to_uri(s3_key)

Convert a boto Key object into an s3:// URI
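
The conversion can be illustrated without touching AWS. The sketch below uses stand-in objects rather than real boto classes. FakeBucket, FakeKey and key_to_uri_sketch are all hypothetical names. It assumes the URI is built from the key's bucket name and key name.

```python
from collections import namedtuple

# Stand-ins for boto's Bucket and Key, exposing only the attributes used here
FakeBucket = namedtuple('FakeBucket', ['name'])
FakeKey = namedtuple('FakeKey', ['bucket', 'name'])


def key_to_uri_sketch(s3_key):
    """Approximate the s3:// URI that corresponds to a boto-style key."""
    return 's3://%s/%s' % (s3_key.bucket.name, s3_key.name)


# a key named 'bar' in bucket 'foo' maps to s3://foo/bar
example_uri = key_to_uri_sketch(FakeKey(bucket=FakeBucket(name='foo'), name='bar'))
```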

class mrjob.emr.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_security_token=None, s3_endpoint=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

S3Filesystem.make_s3_conn(region='')

Create a connection to S3.

Parameters: region – region used to choose the S3 endpoint.

If you are doing anything with buckets other than creating them or fetching basic metadata (name and location), it’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically.

Returns: a boto.s3.connection.S3Connection, wrapped in a mrjob.retry.RetryWrapper

S3Filesystem.get_s3_key(uri)

Get the boto Key object matching the given S3 uri, or return None if that key doesn’t exist.

uri is an S3 URI: s3://foo/bar
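
Because get_s3_key() returns None for a missing key, an existence check follows directly. This is a minimal sketch: s3_uri_exists is a hypothetical helper, and fs is assumed to be an S3Filesystem, or any object with a compatible get_s3_key() method.

```python
def s3_uri_exists(fs, uri):
    """Return True if a key exists at the given S3 URI."""
    # get_s3_key() returns None when the key doesn't exist
    return fs.get_s3_key(uri) is not None
```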

S3Filesystem.get_s3_keys(uri)

Get a stream of boto Key objects, one for each key inside the given directory on S3.

uri is an S3 URI: s3://foo/bar
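
One possible use of the key stream is totalling the size of everything under a directory. The helper below is a hypothetical sketch. It assumes each yielded boto Key has its size attribute populated in bytes, and that fs is an S3Filesystem or a compatible object.

```python
def total_size_bytes(fs, dir_uri):
    """Sum the sizes, in bytes, of all keys under the given S3 directory."""
    # get_s3_keys() yields keys lazily, so they are consumed one at a time
    return sum(key.size for key in fs.get_s3_keys(dir_uri))
```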

S3Filesystem.make_s3_key(uri)

Create the given S3 key, and return the corresponding boto Key object.

uri is an S3 URI: s3://foo/bar
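
A newly created key is typically given contents right away. The sketch below shows one way to do that with boto 2's Key.set_contents_from_string(). The helper name write_s3_text is hypothetical, and fs is assumed to be an S3Filesystem or a compatible object.

```python
def write_s3_text(fs, uri, text):
    """Create a key at the given S3 URI and store text in it."""
    key = fs.make_s3_key(uri)
    # set_contents_from_string() is a method of boto's Key object
    key.set_contents_from_string(text)
    return key
```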