mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. By default, it creates a cluster just for your job. You can instead run your job in a specific cluster by setting cluster_id, or set pool_clusters to automatically choose a waiting cluster, creating one if none exists.
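The three ways of choosing a cluster can be sketched as command-line arguments. This is only an illustration: it assumes the cluster_id and pool_clusters options are set through the --cluster-id and --pool-clusters switches, and emr_args is a hypothetical helper, not part of mrjob.

```python
def emr_args(input_paths, cluster_id=None, pool_clusters=False):
    """Build an argument list that runs a job with -r emr.

    Hypothetical helper for illustration; not part of mrjob.
    """
    args = ['-r', 'emr']
    if cluster_id is not None:
        # run in one specific, already-running cluster
        args += ['--cluster-id', cluster_id]
    elif pool_clusters:
        # reuse a waiting cluster, or create one if none exists
        args.append('--pool-clusters')
    # with neither option, the runner creates a cluster just for this job
    return args + list(input_paths)
```

The resulting list could be passed to a job class, for example MRYourJob(args=emr_args(['s3://your-bucket/input/'])), where MRYourJob stands in for your own MRJob subclass and the S3 path is a placeholder.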

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.
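To make the local-versus-S3 distinction concrete, here is a small sketch that sorts a mixed list of paths by where they live. It only looks at the s3:// prefix described above; split_by_location is a hypothetical helper and the paths in the usage note are placeholders.

```python
def split_by_location(paths):
    """Separate local paths from s3:// URLs (hypothetical helper)."""
    local_paths = []
    s3_paths = []
    for path in paths:
        if path.startswith('s3://'):
            s3_paths.append(path)
        else:
            local_paths.append(path)
    return local_paths, s3_paths
```

For example, split_by_location(['data/input.txt', 's3://your-bucket/input/']) returns the local file in the first list and the S3 URL in the second.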

This class also has utilities for talking directly to S3 and EMR, so you may want to instantiate it without a script:

from mrjob.emr import EMRJobRunner

emr_client = EMRJobRunner().make_emr_client()
clusters = emr_client.list_clusters()
...
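As one possible continuation of the snippet above, the following sketch collects the IDs of every cluster in the WAITING state, following the Marker field that the EMR ListClusters API uses for pagination. waiting_cluster_ids is a hypothetical helper; it accepts any object with a boto3-style list_clusters() method, so it can be tried without AWS credentials.

```python
def waiting_cluster_ids(emr_client):
    """Return the IDs of all clusters in the WAITING state.

    emr_client is anything with a boto3-style list_clusters() method,
    such as the object returned by EMRJobRunner().make_emr_client().
    """
    cluster_ids = []
    kwargs = {'ClusterStates': ['WAITING']}
    while True:
        response = emr_client.list_clusters(**kwargs)
        for summary in response.get('Clusters', []):
            cluster_ids.append(summary['Id'])
        marker = response.get('Marker')
        if not marker:
            # no more pages
            return cluster_ids
        # more pages remain; ask for the next one
        kwargs['Marker'] = marker
```

With real credentials configured, this would be called as waiting_cluster_ids(EMRJobRunner().make_emr_client()).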

EMR Utilities

EMRJobRunner.get_cluster_id(): Get the ID of the cluster our job is running on, or None.

EMRJobRunner.get_image_version(): Get the version of the AMI that our cluster is running, or None.

EMRJobRunner.get_job_steps(): Fetch the steps submitted by this runner from the EMR API.

Deprecated since version 0.7.4.

New in version 0.6.1.

EMRJobRunner.make_emr_client()

Create a boto3 EMR client.

Returns:	a `botocore.client.EMR` wrapped in a `mrjob.retry.RetryWrapper`
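The client can be combined with get_cluster_id() to check on a cluster. The sketch below assumes the standard boto3 describe_cluster() response shape; cluster_state is a hypothetical helper and takes the client as an argument so it can be tested with a stand-in object.

```python
def cluster_state(emr_client, cluster_id):
    """Look up a cluster's current state, such as 'WAITING'.

    Returns None when there is no cluster ID, mirroring how
    get_cluster_id() returns None when no cluster is in use.
    """
    if cluster_id is None:
        return None
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response['Cluster']['Status']['State']
```

Given a runner, this would be called as cluster_state(runner.make_emr_client(), runner.get_cluster_id()).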

S3 Utilities

class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None, part_size=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

Parameters:

aws_access_key_id – Your AWS access key ID
aws_secret_access_key – Your AWS secret access key
aws_session_token – session token for use with temporary AWS credentials
s3_endpoint – If set, always use this endpoint
s3_region – Default region for connections to the S3 API and newly created buckets.
part_size – Part size for multi-part uploading, in bytes, or None

Changed in version 0.6.8: added part_size

S3Filesystem.create_bucket(bucket_name, region=None): Create a bucket on S3 with a location constraint matching the given region.

S3Filesystem.get_all_bucket_names(): Get a list of the names of all buckets owned by this user on S3.

S3Filesystem.get_bucket(bucket_name): Get the (boto3) bucket, connecting through the appropriate endpoint.

S3Filesystem.make_s3_client(region_name=None)

Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters:	region_name – region to use to choose S3 endpoint

S3Filesystem.make_s3_resource(region_name=None)

Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters:	region_name – region to use to choose S3 endpoint

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().
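The division of labor between the two can be sketched as follows: a client call for bucket metadata, and a bucket object for listing keys. Both helpers are hypothetical and rely only on standard boto3 calls (get_bucket_location() on a client, objects.filter() on a bucket); they take their AWS objects as arguments so they can be exercised with stand-ins.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region a bucket lives in, using client-level metadata."""
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    # S3 reports an empty location constraint for buckets in us-east-1
    return response.get('LocationConstraint') or 'us-east-1'


def keys_with_prefix(bucket, prefix):
    """List the keys in a boto3 bucket that start with the given prefix."""
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
```

With an S3Filesystem instance fs, these would be called as bucket_region(fs.make_s3_client(), 'your-bucket') and keys_with_prefix(fs.get_bucket('your-bucket'), 'input/'), where the bucket name and prefix are placeholders.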

Other AWS clients

EMRJobRunner.make_ec2_client()

Create a boto3 EC2 client.

Returns:	a `botocore.client.EC2` wrapped in a `mrjob.retry.RetryWrapper`

EMRJobRunner.make_iam_client()

Create a boto3 IAM client.

Returns:	a `botocore.client.IAM` wrapped in a `mrjob.retry.RetryWrapper`

mrjob v0.7.4 documentation
