EMR runner options

All options from Options available to all runners, Hadoop-related options, and Cloud runner options are available when running jobs on Amazon Elastic MapReduce.

Amazon credentials

See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.

aws_access_key_id : string

Default: None

“Username” for Amazon Web Services.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_ACCESS_KEY_ID instead.

aws_secret_access_key : string

Default: None

Your “password” on AWS.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SECRET_ACCESS_KEY instead.

aws_session_token : string

Default: None

Temporary AWS session token, used along with aws_access_key_id and aws_secret_access_key when using temporary credentials.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SESSION_TOKEN instead.

ec2_key_pair (--ec2-key-pair) : string

Default: None

Name of the SSH key you set up for EMR.

ec2_key_pair_file (--ec2-key-pair-file) : path

Default: None

Path to the file containing the SSH key for EMR.
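
For example, to point mrjob at your key pair (the key pair name and file path below are placeholders):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem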

iam_instance_profile (--iam-instance-profile) : string

Default: (automatic)

Name of an IAM instance profile to use for EC2 clusters created by EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.

iam_service_role (--iam-service-role) : string

Default: (automatic)

Name of an IAM role for the EMR service to use. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.
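
For example, to use the AWS-managed default roles (this assumes you have already created them, e.g. with the AWS CLI's aws emr create-default-roles):

runners:
  emr:
    iam_instance_profile: EMR_EC2_DefaultRole
    iam_service_role: EMR_DefaultRole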

Instance configuration

On EMR, there are three ways to configure instances:

  • set instance_fleets

  • set instance_groups

  • set individual instance options (see Cloud runner options)

If there is a conflict, whichever comes later in the config files takes precedence, and the command line beats config files. In the case of a tie, instance_fleets beats instance_groups beats other instance options.

You may set ebs_root_volume_gb regardless of which style of instance configuration you use.
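
A minimal sketch of the third style, using instance options documented under Cloud runner options (the instance type and count below are placeholders):

runners:
  emr:
    instance_type: m5.xlarge
    num_core_instances: 4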

instance_fleets (--instance-fleet) :

Default: None

A list of instance fleet definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON). For example:

runners:
  emr:
    instance_fleets:
    - InstanceFleetType: MASTER
      InstanceTypeConfigs:
      - InstanceType: m1.medium
      TargetOnDemandCapacity: 1
    - InstanceFleetType: CORE
      TargetSpotCapacity: 2
      TargetOnDemandCapacity: 2
      LaunchSpecifications:
        SpotSpecification:
          TimeoutDurationMinutes: 20
          TimeoutAction: SWITCH_TO_ON_DEMAND
      InstanceTypeConfigs:
      - InstanceType: m1.medium
        BidPriceAsPercentageOfOnDemandPrice: 50
        WeightedCapacity: 1
      - InstanceType: m1.large
        BidPriceAsPercentageOfOnDemandPrice: 50
        WeightedCapacity: 2

instance_groups (--instance-groups) :

Default: None

A list of instance group definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).

This allows for more fine-tuned EBS volume configuration than ebs_root_volume_gb. For example:

runners:
  emr:
    instance_groups:
    - InstanceRole: MASTER
      InstanceCount: 1
      InstanceType: m1.medium
    - InstanceRole: CORE
      InstanceCount: 10
      InstanceType: c1.xlarge
      EbsConfiguration:
        EbsOptimized: true
        EbsBlockDeviceConfigs:
        - VolumeSpecification:
            SizeInGB: 100
            VolumeType: gp2

instance_groups is incompatible with instance_fleets and other instance options. See instance_fleets for details.

core_instance_bid_price (--core-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.

master_instance_bid_price (--master-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.

task_instance_bid_price (--task-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)
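
For example, to run task instances as spot instances (num_task_instances is documented under Cloud runner options; the count and bid price below are placeholders):

runners:
  emr:
    num_task_instances: 4
    task_instance_bid_price: '0.25'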

ebs_root_volume_gb (--ebs-root-volume-gb) : integer

Default: None

When specified (and not zero), sets the size of the root EBS volume, in GiB.

New in version 0.6.5.
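
For example, to request a 100 GiB root volume:

runners:
  emr:
    ebs_root_volume_gb: 100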

Cluster software configuration

See also bootstrap, image_id, and image_version.

applications (--application, --applications) : string list

Default: []

Additional applications to run on 4.x AMIs (e.g. 'Ganglia', 'Mahout', 'Spark').

You do not need to specify 'Hadoop'; mrjob will always include it automatically. In most cases it’ll auto-detect when to include 'Spark' as well.

See Applications in the EMR docs for more details.

Changed in version 0.6.7: Added --applications switch
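
For example, to add Ganglia and Spark on top of Hadoop:

runners:
  emr:
    applications:
    - Ganglia
    - Spark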

bootstrap_actions (--bootstrap-actions) : string list

Default: []

A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.
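
For example (the script path and arguments below are placeholders):

runners:
  emr:
    bootstrap_actions:
    - s3://yourbucket/install-libs.sh --verbose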

This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.

bootstrap_spark (--bootstrap-spark, --no-bootstrap-spark) : boolean

Default: (automatic)

Install Spark on the cluster. This works on AMI version 3.x and later.

By default, we automatically install Spark only if our job has Spark steps.

In case you’re curious: mrjob determines you’re using Spark by looking at your job’s steps; any Spark step means Spark will be installed.

emr_configurations (--emr-configuration) : list of dicts

Default: []

Cluster configs for AMI version 4.x and later. For example:

runners:
  emr:
    emr_configurations:
    - Classification: core-site
      Properties:
        hadoop.security.groups.cache.secs: 250

On the command line, configurations should be JSON-encoded:

--emr-configuration '{"Classification": "core-site", ...}'

See Configuring Applications in the EMR docs for more details.

Changed in version 0.6.11: The !clear tag now works. Later config dicts overwrite earlier ones with the same Classification; if the later dict has empty Properties and Configurations, the earlier dict is simply deleted.

max_concurrent_steps (--max-concurrent-steps) : integer

Default: 1

How many steps may an EMR cluster run at the same time? This affects both the clusters our job launches and, if cluster pooling is enabled, which clusters our job will join.

Prior to AMI 5.28.0, EMR clusters could only ever run one step at a time.

New in version 0.7.4.
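
For example, to allow up to five concurrent steps:

runners:
  emr:
    max_concurrent_steps: 5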

release_label (--release-label) : string

Default: None

EMR Release to use (e.g. emr-4.0.0). This overrides image_version.

For more information about Release Labels, see Differences Introduced in 4.x.

Monitoring your job

See also check_cluster_every, ssh_tunnel.

enable_emr_debugging (--enable-emr-debugging) : boolean

Default: False

Store Hadoop logs in SimpleDB.

Cluster pooling

max_clusters_in_pool (--max-clusters-in-pool) : integer

Default: 0 (disabled)

Don’t create a new pooled cluster if there are already this many active (not terminated) clusters in our pool; instead wait until one of the clusters is available to join or terminates.

To deal with the situation where several jobs start at once, we wait a random number of seconds (see pool_jitter_seconds) and double-check the pool before creating a new cluster.

New in version 0.7.4.

min_available_mb (--min-available-mb) : integer

Default: 0 (disabled)

When joining a pooled cluster, connect to its YARN resource manager’s metrics API and make sure that availableMB is at least this high.

This requires SSH to work, so ec2_key_pair and ec2_key_pair_file must be set.

If you enable this option, pooling will no longer query clusters about their instance groups/fleets, since this information is mostly redundant.

New in version 0.7.4.

min_available_virtual_cores (--min-available-virtual-cores) : integer

Default: 0 (disabled)

When joining a pooled cluster, connect to its YARN resource manager’s metrics API and make sure that availableVirtualCores is at least this high.

Like with min_available_mb, this requires SSH to work and disables querying clusters about their instances.

New in version 0.7.4.

pool_clusters (--pool-clusters) : boolean

Default: True

Try to run the job on a WAITING pooled cluster with the same bootstrap configuration. Prefer the one with the most compute units. If we can’t join an existing cluster, create our own (unless max_clusters_in_pool or pool_wait_minutes disallow it).

pool_jitter_seconds (--pool-jitter-seconds) : integer

Default: 60

Wait a random number of seconds between 0 and this many before double-checking active clusters in the pool for max_clusters_in_pool or to bypass pool_wait_minutes.

The main point of this option is that, if several jobs start simultaneously, they can check whether another job has launched a cluster before launching one themselves. You may wish to adjust this based on your maximum pool size and the number of jobs you expect to launch simultaneously.

New in version 0.7.4.

pool_name (--pool-name) : string

Default: 'default'

Specify a pool name to join. Does not imply pool_clusters.

pool_timeout_minutes (--pool-timeout-minutes) : integer

Default: 0 (disabled)

If we can’t create or join a cluster after this many minutes, raise an exception and bail out.

New in version 0.7.4.

pool_wait_minutes (--pool-wait-minutes) : integer

Default: 0

If pooling is enabled and no cluster is available, retry finding a cluster every 30 seconds until this many minutes have passed, then start a new cluster instead of joining one.

Changed in version 0.7.4: If there aren’t any active clusters with a matching pool name and hash, we may create our own cluster before pool_wait_minutes is up. We first wait a random number of seconds and double-check that other clusters have not been created (see pool_jitter_seconds).
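
Putting the pooling options together, a config sketch might look like this (the pool name and limits below are placeholders):

runners:
  emr:
    pool_clusters: true
    pool_name: analytics
    max_clusters_in_pool: 4
    pool_wait_minutes: 10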

S3 Filesystem

See also cloud_tmp_dir, cloud_part_size_mb

cloud_log_dir (--cloud-log-dir) : string

Default: append logs to cloud_tmp_dir

Where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your cluster will go into a subdirectory, e.g. s3://yourbucket/logs/j-CLUSTERID/.
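
For example:

runners:
  emr:
    cloud_log_dir: s3://yourbucket/logs/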

Docker

docker_client_config (--docker-client-config) : string

Default: None

An hdfs:// URI pointing to the client config, which is used to authenticate with a private Docker registry (e.g. ECR). This is mostly useful on AMIs prior to 6.1.0; otherwise you can use auto-authentication (see this page).

See “Using ECR” on this page for information about how to fetch working credentials. Because ECR credentials only last 12 hours, if you want to use ECR and Docker for multiple jobs on a long-running cluster, you may wish to set up a cron job at bootstrap time.

New in version 0.7.4.

docker_image (--docker-image, --no-docker) : string

Default: None

The registry (optional), repository, and tag (optional) of a Docker image, in the format registry/repository:tag. If registry/ is omitted, we assume the default registry on Docker Hub (library). If registry is a hostname, we connect to that host instead (e.g. to use ECR).

Other docker_* options will do nothing if this is not set.

Note that you must be running at least AMI 6.0.0 to use Docker on EMR.

New in version 0.7.4.

docker_mounts (--docker-mount) : string list

Default: []

Optional mounting instructions to pass to Docker, in the format /local/path:/path/inside/docker:ro_or_rw.

New in version 0.7.4.
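
A combined sketch (the registry host, image name, HDFS path, and mount below are placeholders):

runners:
  emr:
    docker_image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest
    docker_client_config: hdfs:///user/hadoop/.docker/config.json
    docker_mounts:
    - /mnt/local-data:/data:ro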

API Endpoints

Note

You usually don’t want to set *_endpoint options unless you have a challenging network situation (e.g. you have to use a proxy to get around a firewall).

ec2_endpoint (--ec2-endpoint) : string

Default: (automatic)

New in version 0.6.5.

Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-gov-west-1.amazonaws.com).

emr_endpoint (--emr-endpoint) : string

Default: infer from region

Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com).

iam_endpoint (--iam-endpoint) : string

Default: (automatic)

Force mrjob to connect to IAM on this endpoint (e.g. iam.us-gov.amazonaws.com).

s3_endpoint (--s3-endpoint) : string

Default: (automatic)

Force mrjob to connect to S3 on this endpoint, rather than letting it choose the appropriate endpoint for each S3 bucket.

Warning

If you set this to a region-specific endpoint (e.g. 's3-us-west-1.amazonaws.com') mrjob may not be able to access buckets located in other regions.
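
For example, to pin both EMR and S3 to us-west-1 (subject to the warning above):

runners:
  emr:
    emr_endpoint: us-west-1.elasticmapreduce.amazonaws.com
    s3_endpoint: s3-us-west-1.amazonaws.com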

Other rarely used options

add_steps_in_batch (--add-steps-in-batch, --no-add-steps-in-batch) : boolean

Default: True for AMIs before 5.28.0, False otherwise

For a multi-step job, should we submit all steps at once, or one at a time? By default, we only submit steps all at once if the AMI doesn’t support running concurrent steps (that is, before AMI 5.28.0).

New in version 0.7.4.

additional_emr_info (--additional-emr-info) : special

Default: None

Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).

emr_action_on_failure (--emr-action-on-failure) : string

Default: (automatic)

What happens if a step of your job fails:

  • 'CANCEL_AND_WAIT' cancels all steps on the cluster

  • 'CONTINUE' continues to the next step (useful when submitting several jobs to the same cluster)

  • 'TERMINATE_CLUSTER' shuts down the cluster entirely

The default is 'CANCEL_AND_WAIT' when using pooling (see pool_clusters) or an existing cluster (see cluster_id), and 'TERMINATE_CLUSTER' otherwise.
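
For example, when submitting several jobs to the same cluster:

runners:
  emr:
    emr_action_on_failure: CONTINUE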

hadoop_streaming_jar_on_emr (--hadoop-streaming-jar-on-emr) : string

Default: AWS default

Path of the Hadoop streaming jar on the EMR cluster's local filesystem, if you want to use one other than the jar EMR provides by default.

ssh_add_bin (--ssh-add-bin) : command

Default: 'ssh-add'

Path to the ssh-add binary. Used on EMR to access logs on non-master nodes without copying your SSH key to the master node.

New in version 0.7.2.

ssh_bin (--ssh-bin) : command

Default: 'ssh'

Path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']). Defaults to ssh.

On EMR, mrjob uses SSH to tunnel to the job tracker (see ssh_tunnel), as a fallback way of fetching job progress, and as a quicker way of accessing your job’s logs.

Changed in version 0.6.8: Setting this to an empty value (--ssh-bin '') instructs mrjob to use the default value; previously, an empty value effectively disabled SSH.

tags (--tag) : dict

Default: {}

Metadata tags to apply to the EMR cluster after its creation. See Tagging Amazon EMR Clusters for more information on applying metadata tags to EMR clusters.

Tag names and values are strings. On the command line, to set a tag use --tag KEY=VALUE:

--tag team=development

In the config file, tags is a dict:

runners:
  emr:
    tags:
      team: development
      project: mrjob