mrjob.emr - run on EMR

Job Runner

class mrjob.emr.EMRJobRunner(**kwargs)

Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is essentially a temporary Hadoop cluster. By default, it creates a cluster just for your job. You can instead run your job in a specific cluster by setting cluster_id, or have the runner automatically choose a waiting cluster (creating one if none exists) by setting pool_clusters.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

from mrjob.emr import EMRJobRunner

emr_client = EMRJobRunner().make_emr_client()
clusters = emr_client.list_clusters()
...
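
As a follow-up to the snippet above, here is a minimal sketch of walking every page of list_clusters() results. The helper names iter_cluster_summaries and print_active_clusters are hypothetical (they are not part of mrjob). The sketch assumes the client behaves like a boto3 EMR client, whose responses carry a 'Clusters' list and, when more pages remain, a 'Marker' token.

```python
def iter_cluster_summaries(emr_client, **list_kwargs):
    """Yield each cluster summary dict, following the 'Marker' token.

    Hypothetical helper, not part of mrjob. emr_client is anything with
    a boto3-style list_clusters() method, such as the object returned by
    EMRJobRunner().make_emr_client().
    """
    kwargs = dict(list_kwargs)
    while True:
        response = emr_client.list_clusters(**kwargs)
        for summary in response.get('Clusters', []):
            yield summary
        marker = response.get('Marker')
        if not marker:
            break
        # ask for the next page on the following call
        kwargs['Marker'] = marker


def print_active_clusters():
    """Example driver; needs mrjob, boto3, and AWS credentials."""
    from mrjob.emr import EMRJobRunner

    emr_client = EMRJobRunner().make_emr_client()
    for summary in iter_cluster_summaries(
            emr_client, ClusterStates=['WAITING', 'RUNNING']):
        print(summary['Id'], summary['Name'], summary['Status']['State'])
```

Calling print_active_clusters() talks to the real EMR API, so it is left uninvoked here. Because iter_cluster_summaries() only calls list_clusters(), it can be exercised with a stub client.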

EMR Utilities

EMRJobRunner.get_cluster_id()

Get the ID of the cluster our job is running on, or None.

EMRJobRunner.get_image_version()

Get the version of the AMI that our cluster is running, or None.

EMRJobRunner.get_job_steps()

Fetch the steps submitted by this runner from the EMR API.

Deprecated since version 0.7.4.

New in version 0.6.1.

EMRJobRunner.make_emr_client()

Create a boto3 EMR client.

Returns: a botocore.client.EMR wrapped in a mrjob.retry.RetryWrapper

S3 Utilities

class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None, part_size=None)

Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

Parameters:
  • aws_access_key_id – Your AWS access key ID
  • aws_secret_access_key – Your AWS secret access key
  • aws_session_token – Session token for use with temporary AWS credentials
  • s3_endpoint – If set, always use this endpoint
  • s3_region – Default region for connections to the S3 API and newly created buckets
  • part_size – Part size for multi-part uploading, in bytes, or None

Changed in version 0.6.8: added part_size

S3Filesystem.create_bucket(bucket_name, region=None)

Create a bucket on S3 with a location constraint matching the given region.

S3Filesystem.get_all_bucket_names()

Get a list of the names of all buckets owned by this user on S3.

S3Filesystem.get_bucket(bucket_name)

Get the (boto3) bucket, connecting through the appropriate endpoint.
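
A rough sketch of what can be done with the bucket returned by get_bucket(), assuming it is a boto3 Bucket resource (so bucket.objects.filter(Prefix=...) yields object summaries with key and size attributes). The function names summarize_prefix and example_usage, and the bucket name and prefix, are placeholders invented for illustration.

```python
def summarize_prefix(s3_fs, bucket_name, prefix=''):
    """Return (object_count, total_bytes) for keys starting with prefix.

    Hypothetical helper, not part of mrjob. s3_fs is expected to be an
    S3Filesystem (or anything with a compatible get_bucket() method).
    """
    bucket = s3_fs.get_bucket(bucket_name)
    count = 0
    total_bytes = 0
    for obj in bucket.objects.filter(Prefix=prefix):
        count += 1
        total_bytes += obj.size
    return count, total_bytes


def example_usage():
    """Example driver; needs mrjob, boto3, and AWS credentials."""
    from mrjob.fs.s3 import S3Filesystem

    # every constructor argument defaults to None, so credentials are
    # presumably resolved the usual boto3 way
    s3_fs = S3Filesystem()
    # 'example-bucket' and 'logs/' are placeholders
    print(summarize_prefix(s3_fs, 'example-bucket', 'logs/'))
```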

S3Filesystem.make_s3_client(region_name=None)

Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose the S3 endpoint

S3Filesystem.make_s3_resource(region_name=None)

Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose the S3 endpoint

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().
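
As one illustration of bucket metadata, the sketch below looks up a bucket's region with a client such as the one make_s3_client() returns. It assumes the standard boto3 get_bucket_location() call, which reports an empty LocationConstraint for us-east-1 and may report the legacy value 'EU' for eu-west-1. The names bucket_region and example_usage are hypothetical.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region name a bucket lives in.

    Hypothetical helper, not part of mrjob. s3_client is anything with
    a boto3-style get_bucket_location() method.
    """
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    constraint = response.get('LocationConstraint')
    if not constraint:
        # S3 reports no constraint for buckets in us-east-1
        return 'us-east-1'
    if constraint == 'EU':
        # legacy alias for eu-west-1
        return 'eu-west-1'
    return constraint


def example_usage():
    """Example driver; needs mrjob, boto3, and AWS credentials."""
    from mrjob.fs.s3 import S3Filesystem

    s3_client = S3Filesystem().make_s3_client()
    # 'example-bucket' is a placeholder
    print(bucket_region(s3_client, 'example-bucket'))
```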

Other AWS clients

EMRJobRunner.make_ec2_client()

Create a boto3 EC2 client.

Returns: a botocore.client.EC2 wrapped in a mrjob.retry.RetryWrapper

EMRJobRunner.make_iam_client()

Create a boto3 IAM client.

Returns: a botocore.client.IAM wrapped in a mrjob.retry.RetryWrapper
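
These clients can be used like any other boto3 client. As a small sketch, the following collects IAM role names, assuming the standard boto3 list_roles() response shape ('Roles', 'IsTruncated', 'Marker'). The names iter_role_names and example_usage are hypothetical and not part of mrjob.

```python
def iter_role_names(iam_client):
    """Yield the name of every IAM role visible to the caller.

    Hypothetical helper, not part of mrjob. iam_client is anything with
    a boto3-style list_roles() method, such as the object returned by
    EMRJobRunner().make_iam_client().
    """
    kwargs = {}
    while True:
        response = iam_client.list_roles(**kwargs)
        for role in response.get('Roles', []):
            yield role['RoleName']
        if not response.get('IsTruncated'):
            break
        # more results remain; continue from the returned marker
        kwargs['Marker'] = response['Marker']


def example_usage():
    """Example driver; needs mrjob, boto3, and AWS credentials."""
    from mrjob.emr import EMRJobRunner

    iam_client = EMRJobRunner().make_iam_client()
    for name in iter_role_names(iam_client):
        print(name)
```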