mrjob.emr - run on EMR
Job Runner
class mrjob.emr.EMRJobRunner(**kwargs)
Runs an MRJob on Amazon Elastic MapReduce. Invoked when you run your job with -r emr.

EMRJobRunner runs your job in an EMR cluster, which is basically a temporary Hadoop cluster. Normally, it creates a cluster just for your job. It’s also possible to run your job in a specific cluster by setting cluster_id, or to automatically choose a waiting cluster (creating one if none exists) by setting pool_clusters.

Input, support, and jar files can be either local or on S3; use s3://... URLs to refer to files on S3.

This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:
    from mrjob.emr import EMRJobRunner

    emr_client = EMRJobRunner().make_emr_client()
    clusters = emr_client.list_clusters()
    ...
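The cluster options described above (cluster_id and pool_clusters) can also be given on the command line. The sketch below only builds an argument list and does not talk to AWS. build_emr_args is a hypothetical helper, not part of mrjob, and it assumes the --cluster-id and --pool-clusters switches that correspond to those two options; the cluster ID and bucket name are placeholders.

```python
def build_emr_args(input_paths, cluster_id=None, pool_clusters=False):
    """Build command-line args for running an MRJob on EMR.

    Hypothetical helper (not part of mrjob). Assumes the --cluster-id and
    --pool-clusters switches that correspond to the cluster_id and
    pool_clusters options.
    """
    args = ['-r', 'emr']
    if cluster_id is not None:
        # run in a specific, already-running cluster
        args += ['--cluster-id', cluster_id]
    elif pool_clusters:
        # reuse a waiting cluster, or create one if none exists
        args.append('--pool-clusters')
    # input files may be local paths or s3:// URLs
    args.extend(input_paths)
    return args


if __name__ == '__main__':
    # 'j-EXAMPLE' and 'example-bucket' are placeholders
    print(build_emr_args(['s3://example-bucket/input/'],
                         cluster_id='j-EXAMPLE'))
```

A list like this could be passed as args when constructing your MRJob subclass, for example YourJob(args=build_emr_args([...])), where YourJob stands in for your own job class.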
EMR Utilities

EMRJobRunner.get_cluster_id()
Get the ID of the cluster our job is running on, or None.

EMRJobRunner.get_image_version()
Get the version of the AMI that our cluster is running, or None.

EMRJobRunner.get_job_steps()
Fetch the steps submitted by this runner from the EMR API.
Deprecated since version 0.7.4.
New in version 0.6.1.
EMRJobRunner.make_emr_client()
Create a boto3 EMR client.

Returns: a botocore.client.EMR wrapped in a mrjob.retry.RetryWrapper
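As an illustration of what you might do with the client, here is a minimal sketch that collects the IDs of all waiting clusters. waiting_cluster_ids is a hypothetical helper, not part of mrjob; it assumes only that the object passed in behaves like a boto3 EMR client, whose list_clusters() responses are paginated with a Marker.

```python
def waiting_cluster_ids(emr_client):
    """Return the IDs of all clusters in the WAITING state.

    Hypothetical helper. emr_client is assumed to behave like a boto3
    EMR client, e.g. the object returned by
    EMRJobRunner().make_emr_client().
    """
    cluster_ids = []
    kwargs = {'ClusterStates': ['WAITING']}
    while True:
        resp = emr_client.list_clusters(**kwargs)
        cluster_ids.extend(c['Id'] for c in resp.get('Clusters', []))
        # a Marker in the response means there are more pages to fetch
        marker = resp.get('Marker')
        if not marker:
            return cluster_ids
        kwargs['Marker'] = marker
```

Taking the client as an argument keeps the helper easy to try out against a stub object, without AWS credentials.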
S3 Utilities

class mrjob.fs.s3.S3Filesystem(aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, s3_endpoint=None, s3_region=None, part_size=None)
Filesystem for Amazon S3 URIs. Typically you will get one of these via EMRJobRunner().fs, composed with SSHFilesystem and LocalFilesystem.

Parameters:
- aws_access_key_id – Your AWS access key ID
- aws_secret_access_key – Your AWS secret access key
- aws_session_token – session token for use with temporary AWS credentials
- s3_endpoint – If set, always use this endpoint
- s3_region – Default region for connections to the S3 API and newly created buckets.
- part_size – Part size for multi-part uploading, in bytes, or None

Changed in version 0.6.8: added part_size
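To get a feel for what a given part_size implies, the sketch below computes how many parts a multi-part upload would need. It is not mrjob code: num_parts is a hypothetical helper, and the two limits are S3's own multipart constraints (parts other than the last must be at least 5 MiB, and an upload may have at most 10,000 parts). Whether mrjob checks these itself is not covered here.

```python
# S3 multipart limits: every part but the last must be at least 5 MiB,
# and a single upload may have at most 10,000 parts.
MIN_PART_SIZE = 5 * 1024 * 1024
MAX_PARTS = 10000


def num_parts(file_size, part_size):
    """Number of parts needed to upload file_size bytes in chunks of
    part_size bytes. Hypothetical helper, not part of mrjob."""
    if part_size < MIN_PART_SIZE:
        raise ValueError('part_size must be at least 5 MiB')
    # integer ceiling division; an empty file still counts as one part
    parts = max(1, -(-file_size // part_size))
    if parts > MAX_PARTS:
        raise ValueError('part_size is too small for a file this large')
    return parts
```

For example, a 100 MiB file with a 10 MiB part size needs 10 parts; one extra byte makes it 11.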
S3Filesystem.create_bucket(bucket_name, region=None)
Create a bucket on S3 with a location constraint matching the given region.
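In boto3 terms, a location constraint is passed through CreateBucketConfiguration. The sketch below shows one way the keyword arguments for a boto3 create_bucket() call could be built; it is an illustration under that assumption, not mrjob's actual implementation, and create_bucket_kwargs is a hypothetical name. Note that S3 does not accept us-east-1 as an explicit constraint, so it is left out in that case.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for a boto3 S3 client's create_bucket().

    Hypothetical helper, not part of mrjob.
    """
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is the default location and must not be given as an
    # explicit LocationConstraint
    if region and region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {
            'LocationConstraint': region,
        }
    return kwargs
```

With a boto3 S3 client in hand, this would be used as s3_client.create_bucket(**create_bucket_kwargs('example-bucket', 'us-west-2')), where the bucket name is a placeholder.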
S3Filesystem.get_all_bucket_names()
Get a list of the names of all buckets owned by this user on S3.

S3Filesystem.get_bucket(bucket_name)
Get the (boto3) bucket, connecting through the appropriate endpoint.
S3Filesystem.make_s3_client(region_name=None)
Create a boto3 S3 client, wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose the S3 endpoint.

S3Filesystem.make_s3_resource(region_name=None)
Create a boto3 S3 resource, with its client wrapped in a mrjob.retry.RetryWrapper.

Parameters: region_name – region to use to choose the S3 endpoint.

It’s best to use get_bucket() because it chooses the appropriate S3 endpoint automatically. If you are trying to get bucket metadata, use make_s3_client().
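One piece of bucket metadata you might want from the client is the bucket's region. A minimal sketch, assuming only that the object passed in behaves like a boto3 S3 client; bucket_region is a hypothetical helper, not part of mrjob:

```python
def bucket_region(s3_client, bucket_name):
    """Look up the region a bucket lives in.

    Hypothetical helper. s3_client is assumed to behave like a boto3 S3
    client, e.g. the object returned by make_s3_client().
    """
    resp = s3_client.get_bucket_location(Bucket=bucket_name)
    # S3 reports buckets in us-east-1 with an empty LocationConstraint
    return resp.get('LocationConstraint') or 'us-east-1'
```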
Other AWS clients

EMRJobRunner.make_ec2_client()
Create a boto3 EC2 client.

Returns: a botocore.client.EC2 wrapped in a mrjob.retry.RetryWrapper

EMRJobRunner.make_iam_client()
Create a boto3 IAM client.

Returns: a botocore.client.IAM wrapped in a mrjob.retry.RetryWrapper
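As with the EMR client, these are ordinary boto3 clients, so the usual boto3 calls apply. The sketch below lists IAM role names, following the IsTruncated/Marker pagination that boto3's list_roles() uses. role_names is a hypothetical helper, not part of mrjob, and assumes only that the object passed in behaves like a boto3 IAM client.

```python
def role_names(iam_client):
    """Return the names of all IAM roles visible to this account.

    Hypothetical helper. iam_client is assumed to behave like a boto3
    IAM client, e.g. the object returned by
    EMRJobRunner().make_iam_client().
    """
    names = []
    kwargs = {}
    while True:
        resp = iam_client.list_roles(**kwargs)
        names.extend(role['RoleName'] for role in resp.get('Roles', []))
        # IsTruncated signals that another page is available
        if not resp.get('IsTruncated'):
            return names
        kwargs['Marker'] = resp['Marker']
```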