EMR runner options

All options from "Options available to all runners" and "Hadoop-related options" are available to the emr runner.

Amazon credentials

See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.

aws_access_key_id : string

Default: None

“Username” for Amazon web services.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_ACCESS_KEY_ID instead.

aws_secret_access_key : string

Default: None

Your “password” on AWS.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SECRET_ACCESS_KEY instead.

aws_security_token : string

Default: None

Temporary AWS session token, used along with aws_access_key_id and aws_secret_access_key when using temporary credentials.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SECURITY_TOKEN instead.

ec2_key_pair (--ec2-key-pair) : string

Default: None

name of the SSH key you set up for EMR.

ec2_key_pair_file (--ec2-key-pair-file) : path

Default: None

path to file containing the SSH key for EMR
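
For example, both SSH options can be set together in mrjob.conf (the key pair name and path below are placeholders for whatever you created in the EMR console):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem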

Cluster creation and configuration

additional_emr_info (--additional-emr-info) : special

Default: None

Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).

emr_api_params (--emr-api-param, --no-emr-api-param) : dict

Default: {}

Additional raw parameters to pass directly to the EMR API when creating a cluster. This allows old versions of mrjob to access new API features. See the API documentation for RunJobFlow for the full list of options.

Option names and values are strings. On the command line, to set an option use --emr-api-param KEY=VALUE:

--emr-api-param Instances.Ec2SubnetId=someID

and to suppress a value that would normally be passed to the API, use --no-emr-api-param:

--no-emr-api-param VisibleToAllUsers

In the config file, emr_api_params is a dict; params can be suppressed by setting them to null:

runners:
  emr:
    emr_api_params:
      Instances.Ec2SubnetId: someID
      VisibleToAllUsers: null

emr_applications (--emr-application) : string list

Default: []

Additional applications to run on 4.x AMIs (e.g. 'Ganglia', 'Mahout', 'Spark'). You do not need to specify 'Hadoop'; mrjob will always include it automatically.
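
For example, in mrjob.conf:

runners:
  emr:
    emr_applications:
    - Ganglia
    - Spark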

See Applications in the EMR docs for more details.

New in version 0.5.2.

emr_configurations (--emr-configuration) : list of dicts

Default: []

Configurations for 4.x AMIs. For example:

runners:
  emr:
    emr_configurations:
    - Classification: core-site
      Properties:
        hadoop.security.groups.cache.secs: 250

On the command line, configurations should be JSON-encoded:

--emr-configuration '{"Classification": "core-site", ...}'

See Configuring Applications in the EMR docs for more details.

New in version 0.5.3.

emr_endpoint (--emr-endpoint) : string

Default: infer from region

Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com).

Mostly exists as a workaround for network issues.

hadoop_streaming_jar_on_emr (--hadoop-streaming-jar-on-emr) : string

Default: AWS default

Deprecated since version 0.5.4: Prepend file:// and pass that to hadoop_streaming_jar instead.

iam_endpoint (--iam-endpoint) : string

Default: (automatic)

Force mrjob to connect to IAM on this endpoint (e.g. iam.us-gov.amazonaws.com).

Mostly exists as a workaround for network issues.

iam_instance_profile (--iam-instance-profile) : string

Default: (automatic)

Name of an IAM instance profile to use for EC2 clusters created by EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.

iam_service_role (--iam-service-role) : string

Default: (automatic)

Name of an IAM role for the EMR service to use. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.

image_version (--image-version) : string

Default: '4.8.2'

EMR AMI (Amazon Machine Image) version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things; see the AWS docs on specifying the AMI version for details.

This works for 4.x AMIs as well; mrjob will just prepend emr- and use that as the release_label.

Changed in version 0.5.4: This used to be called ami_version.

Changed in version 0.5.7: Default used to be '3.11.0'.

Warning

The deprecated ami_version alias for this option is completely ignored by mrjob 0.5.4 (it works in later 0.5.x versions).

Warning

The 2.x series of AMIs is deprecated by Amazon and not recommended.

Warning

The 1.x series of AMIs is no longer supported because they use Python 2.5.

max_hours_idle (--max-hours-idle) : float

Default: None

If we create a persistent cluster, have it automatically terminate itself after it’s been idle this many hours AND we’re within mins_to_end_of_hour of an EC2 billing hour.

mins_to_end_of_hour (--mins-to-end-of-hour) : float

Default: 5.0

If max_hours_idle is set, controls how close to the end of an EC2 billing hour the cluster can automatically terminate itself.
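
For example, to have a persistent cluster terminate itself after roughly an hour of idleness (the values below are just an illustration):

runners:
  emr:
    max_hours_idle: 1
    mins_to_end_of_hour: 10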

region (--region) : string

Default: 'us-west-2'

region to run EMR jobs on (e.g. us-west-1). Also used by mrjob to create temporary buckets if you don’t set cloud_tmp_dir explicitly.

Changed in version 0.5.4: This option used to be named aws_region.

release_label (--release-label) : string

Default: None

EMR Release to use (e.g. emr-4.0.0). This overrides image_version.
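
For example, in mrjob.conf (the release shown is just an example):

runners:
  emr:
    release_label: emr-4.8.2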

For more information about Release Labels, see Differences Introduced in 4.x.

New in version 0.5.0.

subnet (--subnet) : string

Default: None

ID of Amazon VPC subnet to launch cluster in (e.g. 'subnet-12345678'). If this is not set, or an empty string, cluster will be launched in the normal AWS cloud.

New in version 0.5.3.

tags (--tag) : dict

Default: {}

Metadata tags to apply to the EMR cluster after its creation. See Tagging Amazon EMR Clusters for more information on applying metadata tags to EMR clusters.

Tag names and values are strings. On the command line, to set a tag use --tag KEY=VALUE:

--tag team=development

In the config file, tags is a dict:

runners:
  emr:
    tags:
      team: development
      project: mrjob

Changed in version 0.5.4: This option used to be named emr_tags

visible_to_all_users (--visible-to-all-users, --no-visible-to-all-users) : boolean

Default: True

If true (the default) EMR clusters will be visible to all IAM users. Otherwise, the cluster will only be visible to the IAM user that created it.

Warning

You should almost certainly not set this to False if you are Pooling Clusters with other users; other users will not be able to reuse your clusters, and mrjob terminate-idle-clusters won’t be able to shut them down when they become idle.

zone (--zone) : string

Default: AWS default

Availability zone to run the job in

Changed in version 0.5.4: This option used to be named aws_availability_zone

Bootstrapping

These options apply at bootstrap time, before the Hadoop cluster has started. Bootstrap time is a good time to install Debian packages or compile and install another Python binary.

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your cluster, at bootstrap time.

This option is complex and powerful; the best way to get started is to read the EMR Bootstrapping Cookbook.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script.

path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
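
For example, to install a package on every node at bootstrap time (a minimal sketch; the default 3.x/4.x AMIs use yum, and the package name here is just an illustration):

runners:
  emr:
    bootstrap:
    - sudo yum install -y git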

bootstrap_actions (--bootstrap-actions) : string list

Default: []

A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.

This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.
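
For example, in mrjob.conf (the script path and argument below are hypothetical):

runners:
  emr:
    bootstrap_actions:
    - s3://yourbucket/bootstrap/install-stuff.sh some-arg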

bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean

Default: (automatic)

Attempt to install a compatible (major) version of Python at bootstrap time, including header files and pip (see Installing Python packages with pip).

In Python 2, this never does anything.

In Python 3, this runs sudo yum install -y python34 python34-devel python34-pip by default on AMIs prior to 4.6.0 (starting with AMI 4.6.0, Python 3 is pre-installed).

New in version 0.5.0.

Changed in version 0.5.4: no longer installs Python 3 on AMI version 4.6.0+ by default

bootstrap_spark (--bootstrap-spark, --no-bootstrap-spark) : boolean

Default: (automatic)

Install Spark on the cluster. This works on AMI version 3.x and later.

By default, we automatically install Spark only if our job has Spark steps.

New in version 0.5.7.

In case you’re curious: mrjob decides you’re using Spark by checking whether any of your job’s steps are Spark steps.

Monitoring the cluster

check_cluster_every (--check-cluster-every) : float

Default: 30

How often to check on the status of EMR jobs, in seconds. If you set this too low, AWS will throttle you.

Changed in version 0.5.4: This used to be called check_emr_status_every

enable_emr_debugging (--enable-emr-debugging) : boolean

Default: False

store Hadoop logs in SimpleDB

Number and type of instances

instance_type (--instance-type) : string

Default: (automatic)

By default, mrjob picks the cheapest instance type that will work at all. This is usually m1.medium, with two exceptions:

  • m1.large if you’re running Spark (see bootstrap_spark)
  • m1.small if you’re running on the (deprecated) 2.x AMIs

Once you’ve tested a job and want to run it at scale, it’s usually a good idea to use instances larger than the default; see http://aws.amazon.com/ec2/instance-types/ for options.

If you’re running multiple nodes (see num_core_instances), this option doesn’t apply to the master node because it’s just coordinating tasks, not running them. Use master_instance_type instead.
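
For example, to run every node on a larger instance type (the type shown is just an example):

runners:
  emr:
    instance_type: c3.xlarge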

Changed in version 0.5.6: Used to default to m1.medium in all cases.

Changed in version 0.5.4: This option used to be named ec2_instance_type.

core_instance_type (--core-instance-type) : string

Default: value of instance_type

like instance_type, but only for the core Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.

Changed in version 0.5.4: This replaces the ec2_core_instance_type and ec2_slave_instance_type options.

core_instance_bid_price (--core-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.

Changed in version 0.5.4: This option used to be named ec2_core_instance_bid_price.

master_instance_type (--master-instance-type) : string

Default: (automatic)

like instance_type, but only for the master Hadoop node. This node hosts the task tracker/resource manager and HDFS, and runs tasks if there are no other nodes.

If you’re running a single node (no num_core_instances or num_task_instances), this will default to the value of instance_type.

Otherwise, defaults to m1.medium (exception: m1.small on the deprecated 2.x AMIs), which is usually adequate for all but the largest jobs.

Changed in version 0.5.4: This option used to be named ec2_master_instance_type.

master_instance_bid_price (--master-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.

Changed in version 0.5.4: This option used to be named ec2_master_instance_bid_price.

task_instance_type (--task-instance-type) : string

Default: value of core_instance_type

like instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.

Changed in version 0.5.4: This option used to be named ec2_task_instance_type.

task_instance_bid_price (--task-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)

Changed in version 0.5.4: This option used to be named ec2_task_instance_bid_price.

num_core_instances (--num-core-instances) : integer

Default: 0

Number of core instances to start up. These run your job and host HDFS. This is in addition to the single master instance.

Changed in version 0.5.4: This option used to be named num_ec2_core_instances.

num_task_instances (--num-task-instances) : integer

Default: 0

Number of task instances to start up. These run your job but do not host HDFS. If you use this, you must set num_core_instances; EMR does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
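
For example, a cluster with both core and task instances might look like this in mrjob.conf (the counts and bid price are placeholders):

runners:
  emr:
    num_core_instances: 4
    num_task_instances: 8
    task_instance_bid_price: '0.06'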

Changed in version 0.5.4: This used to be called num_ec2_task_instances.

Choosing/creating a cluster to join

cluster_id (--cluster-id) : string

Default: automatically create a cluster and use it

The ID of a persistent EMR cluster to run jobs in. It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.

emr_action_on_failure (--emr-action-on-failure) : string

Default: (automatic)

What happens if a step of your job fails:

  • 'CANCEL_AND_WAIT' cancels all steps on the cluster

  • 'CONTINUE' continues to the next step (useful when submitting several jobs to the same cluster)

  • 'TERMINATE_CLUSTER' shuts down the cluster entirely

The default is 'CANCEL_AND_WAIT' when using pooling (see pool_clusters) or an existing cluster (see cluster_id), and 'TERMINATE_CLUSTER' otherwise.
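
For example, when submitting several jobs to the same cluster, you might set this in mrjob.conf:

runners:
  emr:
    emr_action_on_failure: CONTINUE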

pool_name (--pool-name) : string

Default: 'default'

Specify a pool name to join. Does not imply pool_clusters.

pool_clusters (--pool-clusters) : boolean

Default: False

Try to run the job on a WAITING pooled cluster with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to “lock” the cluster and ensure that the job is not scheduled behind another job. If no suitable cluster is WAITING, create a new pooled cluster.

Warning

Do not run this without either setting max_hours_idle (recommended) or putting mrjob terminate-idle-clusters in your crontab; clusters left idle can quickly become expensive!

Changed in version 0.5.4: Pooling now gracefully recovers from joining a cluster that was in the process of shutting down (see max_hours_idle).

pool_wait_minutes (--pool-wait-minutes) : integer

Default: 0

If pooling is enabled and no cluster is available, retry finding a cluster every 30 seconds until this many minutes have passed, then start a new cluster instead of joining one.
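
A minimal pooling setup in mrjob.conf might look like this (the wait time and idle limit are placeholders), with max_hours_idle set as recommended in the warning above:

runners:
  emr:
    pool_clusters: true
    pool_wait_minutes: 10
    max_hours_idle: 1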

S3 paths and options

MRJob uses boto to manipulate/access S3.

cloud_log_dir (--cloud-log-dir) : string

Default: append logs to cloud_tmp_dir

Where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your cluster will go into a subdirectory, e.g. s3://yourbucket/logs/j-CLUSTERID/.

Changed in version 0.5.4: This option used to be named s3_log_uri

cloud_tmp_dir (--cloud-tmp-dir) : string

Default: (automatic)

S3 directory (URI ending in /) to use as temp space, e.g. s3://yourbucket/tmp/.

By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. s3://mrjob-01234567890abcdef/tmp/).
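
For example, to use an existing bucket of your own for both temp space and logs (the bucket name is a placeholder):

runners:
  emr:
    cloud_tmp_dir: s3://yourbucket/tmp/
    cloud_log_dir: s3://yourbucket/logs/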

Changed in version 0.5.4: This used to be called s3_tmp_dir.

Changed in version 0.5.0: This used to be called s3_scratch_uri.

cloud_fs_sync_secs (--cloud-fs-sync-secs) : float

Default: 5.0

How long to wait for S3 to reach eventual consistency. This is typically less than a second (zero in U.S. West), but the default is 5.0 to be safe.

Changed in version 0.5.4: This used to be called s3_sync_wait_time

cloud_upload_part_size (--cloud-upload-part-size) : integer

Default: 100

Upload files to S3 in parts no bigger than this many megabytes (technically, mebibytes). Default is 100 MiB, as recommended by Amazon. Set to 0 to disable multipart uploading entirely.

Currently, Amazon requires parts to be between 5 MiB and 5 GiB. mrjob does not enforce these limits.

Changed in version 0.5.4: This used to be called s3_upload_part_size.

s3_endpoint (--s3-endpoint) : string

Default: (automatic)

Force mrjob to connect to S3 on this endpoint, rather than letting it choose the appropriate endpoint for each S3 bucket.

Mostly exists as a workaround for network issues.

Warning

If you set this to a region-specific endpoint (e.g. 's3-us-west-1.amazonaws.com') mrjob will not be able to access buckets located in other regions.

SSH access and tunneling

ssh_bin (--ssh-bin) : command

Default: 'ssh'

Path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']).

ssh_bind_ports (--ssh-bind-ports) : special

Default: range(40001, 40841)

A list of ports that are safe to listen on. The command line syntax looks like 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.

ssh_tunnel (--ssh-tunnel, --no-ssh-tunnel) : boolean

Default: False

If True, create an ssh tunnel to the job tracker/resource manager and listen on a randomly chosen port. This requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.
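
For example, enabling the tunnel along with the SSH credentials it requires (the key pair name and path are placeholders):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem
    ssh_tunnel: true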

Changed in version 0.5.0: This option used to be named ssh_tunnel_to_job_tracker.

ssh_tunnel_is_open (--ssh-tunnel-is-open) : boolean

Default: False

If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.