mrjob.emr - run on EMR
Job Runner
-
class mrjob.emr.EMRJobRunner(**kwargs)
Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is basically a temporary Hadoop cluster. Normally, it creates a cluster just for your job; it’s also possible to run your job in a specific cluster by setting cluster_id, or to automatically choose a waiting cluster, creating one if none exists, by setting pool_clusters.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:

    from mrjob.emr import EMRJobRunner
    emr_client = EMRJobRunner().make_emr_client()
    clusters = emr_client.list_clusters()
    ...
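As a follow-up to the snippet above, here is a minimal sketch of one thing you might do with the returned client: collect the IDs of clusters in the WAITING state. It assumes the client behaves like a boto3 EMR client, whose list_clusters() accepts ClusterStates and Marker and returns a dict with a 'Clusters' list. The helper name waiting_cluster_ids is hypothetical and is not part of mrjob.

```python
def waiting_cluster_ids(emr_client):
    """Return the IDs of all clusters in the WAITING state.

    emr_client is assumed to behave like a boto3 EMR client.
    This helper is illustrative only; it is not part of mrjob.
    """
    cluster_ids = []
    kwargs = {'ClusterStates': ['WAITING']}

    while True:
        response = emr_client.list_clusters(**kwargs)

        for cluster in response.get('Clusters', []):
            cluster_ids.append(cluster['Id'])

        # boto3 returns a 'Marker' when there are more pages of results
        marker = response.get('Marker')
        if not marker:
            return cluster_ids
        kwargs['Marker'] = marker
```

With AWS credentials configured, this could be called as waiting_cluster_ids(EMRJobRunner().make_emr_client()).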
EMR Utilities
-
EMRJobRunner.get_cluster_id()
Get the ID of the cluster our job is running on, or None.
-
EMRJobRunner.get_image_version()
Get the version of the AMI that our cluster is running, or None.
-
EMRJobRunner.get_job_steps()
Fetch the steps submitted by this runner from the EMR API.
Deprecated since version 0.7.4.
New in version 0.6.1.
-
EMRJobRunner.make_emr_client()
Create a boto3 EMR client.
Returns: a botocore.client.EMR wrapped in a mrjob.retry.RetryWrapper
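To illustrate what it can mean for a client to be "wrapped" in a retrying object, here is a conceptual sketch of a proxy that retries method calls with a growing delay. This is not mrjob's RetryWrapper code; the class name RetryingProxy and its parameters are hypothetical and chosen only for the example.

```python
import time


class RetryingProxy:
    """Hypothetical stand-in showing the general idea of a retrying wrapper.

    Forwards attribute access to the wrapped object; callables are retried
    when retry_if(exception) returns true.
    """

    def __init__(self, wrapped, retry_if, backoff=0.1, multiplier=1.5,
                 max_tries=5):
        self._wrapped = wrapped
        self._retry_if = retry_if
        self._backoff = backoff
        self._multiplier = multiplier
        self._max_tries = max_tries

    def __getattr__(self, name):
        # only reached for names not set in __init__
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call_with_retries(*args, **kwargs):
            delay = self._backoff
            for attempt in range(1, self._max_tries + 1):
                try:
                    return attr(*args, **kwargs)
                except Exception as ex:
                    # give up on the last try or on non-retryable errors
                    if attempt == self._max_tries or not self._retry_if(ex):
                        raise
                    time.sleep(delay)
                    delay *= self._multiplier

        return call_with_retries
```

Code that uses such a proxy calls the same methods it would call on the bare client; the retry behavior stays out of the caller's way.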
S3 Utilities
-
class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None, part_size=None)
Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.
Parameters:
- aws_access_key_id – Your AWS access key ID
- aws_secret_access_key – Your AWS secret access key
- aws_session_token – session token for use with temporary AWS credentials
- s3_endpoint – If set, always use this endpoint
- s3_region – Default region for connections to the S3 API and newly created buckets.
- part_size – Part size for multi-part uploading, in bytes, or None
Changed in version 0.6.8: added part_size
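To make the meaning of part_size concrete, here is a small sketch of how a file of a given size divides into parts of at most part_size bytes. It only does the arithmetic; it does not describe how S3Filesystem performs uploads internally, and the function name part_ranges is hypothetical.

```python
def part_ranges(total_size, part_size):
    """Split total_size bytes into (start, end) byte ranges.

    Each range covers at most part_size bytes; end is exclusive.
    Illustrative only; not part of mrjob.
    """
    if part_size <= 0:
        raise ValueError('part_size must be a positive number of bytes')

    return [
        (start, min(start + part_size, total_size))
        for start in range(0, total_size, part_size)
    ]
```

For example, a 10-byte file with a part size of 4 bytes splits into three ranges, the last one shorter than the others.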
-
S3Filesystem.create_bucket(bucket_name, region=None)
Create a bucket on S3 with a location constraint matching the given region.
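For readers unfamiliar with location constraints, the sketch below builds the keyword arguments that a boto3 S3 client's create_bucket() call expects for a given region. It reflects the boto3 API as I understand it (us-east-1 is requested by omitting the constraint) and is not taken from mrjob's source; the helper name create_bucket_kwargs and the bucket name in the usage note are hypothetical.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for a boto3 S3 client's create_bucket().

    Illustrative only; not part of mrjob.
    """
    kwargs = {'Bucket': bucket_name}

    # us-east-1 is selected by leaving the location constraint out
    if region and region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}

    return kwargs
```

With a real client, this would be used as s3_client.create_bucket(**create_bucket_kwargs('my-example-bucket', 'us-west-2')).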
-
S3Filesystem.get_all_bucket_names()
Get a list of the names of all buckets owned by this user on S3.
-
S3Filesystem.get_bucket(bucket_name)
Get the (boto3) bucket, connecting through the appropriate endpoint.
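Since get_bucket() takes a bucket name rather than a full s3://... URL, you may need to split a URL into its bucket and key first. The following is a minimal sketch of that step in plain Python; the function name split_s3_uri is hypothetical, and mrjob may already provide its own helper for this.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key').

    Illustrative only; not part of mrjob.
    """
    prefix = 's3://'
    if not uri.startswith(prefix):
        raise ValueError('not an S3 URI: %r' % (uri,))

    # everything up to the first slash is the bucket; the rest is the key
    bucket, _, key = uri[len(prefix):].partition('/')
    if not bucket:
        raise ValueError('S3 URI has no bucket name: %r' % (uri,))

    return bucket, key
```

The bucket name from the returned pair is what you would pass to get_bucket().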
-
S3Filesystem.make_s3_client(region_name=None)
Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.
Parameters: region_name – region to use to choose S3 endpoint.
-
S3Filesystem.make_s3_resource(region_name=None)
Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.
Parameters: region_name – region to use to choose S3 endpoint.
It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().
Other AWS clients
-
EMRJobRunner.make_ec2_client()
Create a boto3 EC2 client.
Returns: a botocore.client.EC2 wrapped in a mrjob.retry.RetryWrapper
-
EMRJobRunner.make_iam_client()
Create a boto3 IAM client.
Returns: a botocore.client.IAM wrapped in a mrjob.retry.RetryWrapper