EMR runner options

All options from Options available to all runners, Hadoop-related options, and Cloud runner options are available when running jobs on Amazon Elastic MapReduce.

Amazon credentials

See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.

aws_access_key_id : string

Default: None

“Username” for Amazon web services.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_ACCESS_KEY_ID instead.

aws_secret_access_key : string

Default: None

Your “password” on AWS.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SECRET_ACCESS_KEY instead.

aws_session_token : string

Default: None

Temporary AWS session token, used along with aws_access_key_id and aws_secret_access_key when using temporary credentials.

There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable AWS_SESSION_TOKEN instead.

Changed in version 0.5.10: this used to be called aws_security_token.
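
As a minimal sketch of the environment-variable approach described above, the following Python helper builds an environment for a child process. The helper name aws_credential_env and the placeholder values are hypothetical and not part of mrjob; only the three variable names come from this section.

```python
import os


def aws_credential_env(access_key_id, secret_access_key, session_token=None,
                       base_env=None):
    """Return a copy of the environment with AWS credentials set.

    Hypothetical helper for illustration; not part of mrjob.
    """
    env = dict(os.environ if base_env is None else base_env)
    env['AWS_ACCESS_KEY_ID'] = access_key_id
    env['AWS_SECRET_ACCESS_KEY'] = secret_access_key
    if session_token is not None:
        # Only needed when using temporary credentials.
        env['AWS_SESSION_TOKEN'] = session_token
    return env


# Placeholder values, not real credentials.
env = aws_credential_env('EXAMPLE_KEY_ID', 'EXAMPLE_SECRET', base_env={})
print(sorted(env))
```

The resulting dict could be passed as the env argument of subprocess.run() when launching a job, which keeps the secrets off the command line.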

ec2_key_pair (--ec2-key-pair) : string

Default: None

Name of the SSH key you set up for EMR.

ec2_key_pair_file (--ec2-key-pair-file) : path

Default: None

Path to the file containing the SSH key for EMR.

iam_instance_profile (--iam-instance-profile) : string

Default: (automatic)

Name of an IAM instance profile to use for EC2 clusters created by EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.

iam_service_role (--iam-service-role) : string

Default: (automatic)

Name of an IAM role for the EMR service to use. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.

Instance configuration

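On EMR, there are three ways to configure instances: instance_fleets, instance_groups, and the other individual instance options (for example, core_instance_bid_price).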

If there is a conflict, whichever comes later in the config files takes precedence, and the command line beats config files. In the case of a tie, instance_fleets beats instance_groups beats other instance options.

You may set ebs_root_volume_gb regardless of which style of instance configuration you use.
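
The tie-breaking rule above can be sketched in a few lines of Python. This is an illustration only, not mrjob's implementation; the function name and the assumption that opts holds options set at the same precedence level are hypothetical.

```python
def winning_instance_style(opts):
    """Return which style of instance configuration takes effect.

    `opts` is a dict of option name -> value, assumed (for illustration)
    to hold only options set at the same level of precedence.
    """
    if opts.get('instance_fleets'):
        return 'instance_fleets'
    if opts.get('instance_groups'):
        return 'instance_groups'
    return 'other instance options'


print(winning_instance_style({
    'instance_fleets': [{'InstanceFleetType': 'MASTER'}],
    'instance_groups': [{'InstanceRole': 'MASTER'}],
}))
```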

instance_fleets (--instance-fleet) :

Default: None

A list of instance fleet definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON). For example:

runners:
  emr:
    instance_fleets:
    - InstanceFleetType: MASTER
      InstanceTypeConfigs:
      - InstanceType: m1.medium
      TargetOnDemandCapacity: 1
    - InstanceFleetType: CORE
      TargetSpotCapacity: 2
      TargetOnDemandCapacity: 2
      LaunchSpecifications:
        SpotSpecification:
          TimeoutDurationMinutes: 20
          TimeoutAction: SWITCH_TO_ON_DEMAND
      InstanceTypeConfigs:
      - InstanceType: m1.medium
        BidPriceAsPercentageOfOnDemandPrice: 50
        WeightedCapacity: 1
      - InstanceType: m1.large
        BidPriceAsPercentageOfOnDemandPrice: 50
        WeightedCapacity: 2
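
To pass the same kind of definition on the command line, the data structure has to be serialized to a JSON string. The sketch below does this with Python's standard library; the switch name is copied from the heading of this option, and the variable names are illustrative.

```python
import json
import shlex

# Switch name as listed for instance_fleets in this section.
SWITCH = '--instance-fleet'

# Same structure as the MASTER fleet in the config-file example above.
instance_fleets = [
    {
        'InstanceFleetType': 'MASTER',
        'InstanceTypeConfigs': [{'InstanceType': 'm1.medium'}],
        'TargetOnDemandCapacity': 1,
    },
]

# Serialize to JSON, then quote so a POSIX shell passes it as one argument.
fleets_json = json.dumps(instance_fleets)
command_fragment = SWITCH + ' ' + shlex.quote(fleets_json)
print(command_fragment)
```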

instance_groups (--instance-groups) :

Default: None

A list of instance group definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).

This allows for more fine-tuned EBS volume configuration than ebs_root_volume_gb. For example:

runners:
  emr:
    instance_groups:
    - InstanceRole: MASTER
      InstanceCount: 1
      InstanceType: m1.medium
    - InstanceRole: CORE
      InstanceCount: 10
      InstanceType: c1.xlarge
      EbsConfiguration:
        EbsOptimized: true
        EbsBlockDeviceConfigs:
        - VolumeSpecification:
            SizeInGB: 100
            VolumeType: gp2

instance_groups is incompatible with instance_fleets and other instance options. See instance_fleets for details.

core_instance_bid_price (--core-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.

Changed in version 0.5.4: This option used to be named ec2_core_instance_bid_price.

master_instance_bid_price (--master-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.

Changed in version 0.5.4: This option used to be named ec2_master_instance_bid_price.

task_instance_bid_price (--task-instance-bid-price) : string

Default: None

When specified and not “0”, this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.)

Changed in version 0.5.4: This option used to be named ec2_task_instance_bid_price.

ebs_root_volume_gb (--ebs-root-volume-gb) : integer

Default: None

When specified (and not zero), sets the size of the root EBS volume, in GiB.

New in version 0.6.5.

Cluster software configuration

See also bootstrap, image_id, and image_version.

applications (--application) : string list

Default: []

Additional applications to run on 4.x AMIs (e.g. 'Ganglia', 'Mahout', 'Spark').

You do not need to specify 'Hadoop'; mrjob will always include it automatically. In most cases it’ll auto-detect when to include 'Spark' as well.

See Applications in the EMR docs for more details.

New in version 0.5.2.

Changed in version 0.5.9: This used to be called emr_applications.

bootstrap_actions (--bootstrap-actions) : string list

Default: []

A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use shlex.split()). If the action is on the local filesystem, we’ll automatically upload it to S3.

This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.
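
To see how a bootstrap action string is divided into a command and its arguments, here is a small demonstration of shlex.split(). The S3 path and arguments are hypothetical.

```python
import shlex

# Hypothetical bootstrap action: a script on S3 followed by its arguments.
action = 's3://yourbucket/bootstrap/install.sh --version 1.2 "two words"'

parts = shlex.split(action)
path, args = parts[0], parts[1:]

print(path)  # the script to run
print(args)  # its arguments; the quoted phrase stays a single argument
```

Because splitting follows shell-like rules, an argument containing spaces needs to be quoted.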

bootstrap_spark (--bootstrap-spark, --no-bootstrap-spark) : boolean

Default: (automatic)

Install Spark on the cluster. This works on AMI version 3.x and later.

By default, we automatically install Spark only if our job has Spark steps.

New in version 0.5.7.

In case you’re curious, here’s how mrjob determines you’re using Spark:

emr_configurations (--emr-configuration) : list of dicts

Default: []

Configurations for 4.x AMIs. For example:

runners:
  emr:
    emr_configurations:
    - Classification: core-site
      Properties:
        hadoop.security.groups.cache.secs: 250

On the command line, configurations should be JSON-encoded:

--emr-configuration '{"Classification": "core-site", ...}'

See Configuring Applications in the EMR docs for more details.

New in version 0.5.3.
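
One way to avoid hand-writing the JSON is to generate it from the same data structure used in the config file. The following is a sketch under that assumption; it builds an argument list suitable for subprocess, so no shell quoting is needed.

```python
import json

# Same configuration as the config-file example above.
configuration = {
    'Classification': 'core-site',
    'Properties': {'hadoop.security.groups.cache.secs': 250},
}

# One switch followed by one JSON-encoded configuration.
argv = ['--emr-configuration', json.dumps(configuration)]
print(argv)
```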

release_label (--release-label) : string

Default: None

EMR Release to use (e.g. emr-4.0.0). This overrides image_version.

For more information about Release Labels, see Differences Introduced in 4.x.

New in version 0.5.0.

Monitoring your job

See also check_cluster_every, ssh_tunnel.

enable_emr_debugging (--enable-emr-debugging) : boolean

Default: False

Store Hadoop logs in SimpleDB.

Cluster pooling

pool_clusters (--pool-clusters) : boolean

Default: True

Try to run the job on a WAITING pooled cluster with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to “lock” the cluster and ensure that the job is not scheduled behind another job. If no suitable cluster is WAITING, create a new pooled cluster.

Warning

If you use this in mrjob versions prior to 0.6.0, make sure to set max_hours_idle too, or your pooled clusters will run (costing you money) forever.

Changed in version 0.5.4: Pooling now gracefully recovers from joining a cluster that was in the process of shutting down (see max_hours_idle).
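
The selection rule described above (a WAITING cluster with the same bootstrap configuration, preferring the most compute units) can be sketched as follows. The dict keys and function name are hypothetical, and the S3 locking step is omitted; this is not mrjob's actual code.

```python
def pick_pooled_cluster(clusters, bootstrap_config):
    """Return the best matching cluster, or None if nothing is suitable.

    Each cluster is a dict with hypothetical keys 'id', 'state',
    'bootstrap_config', and 'compute_units'.
    """
    candidates = [
        c for c in clusters
        if c['state'] == 'WAITING'
        and c['bootstrap_config'] == bootstrap_config
    ]
    if not candidates:
        return None  # the caller would create a new pooled cluster
    return max(candidates, key=lambda c: c['compute_units'])


clusters = [
    {'id': 'j-A', 'state': 'WAITING', 'bootstrap_config': 'x', 'compute_units': 4},
    {'id': 'j-B', 'state': 'WAITING', 'bootstrap_config': 'x', 'compute_units': 8},
    {'id': 'j-C', 'state': 'RUNNING', 'bootstrap_config': 'x', 'compute_units': 16},
]
print(pick_pooled_cluster(clusters, 'x')['id'])
```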

pool_name (--pool-name) : string

Default: 'default'

Specify a pool name to join. Does not imply pool_clusters.

pool_wait_minutes (--pool-wait-minutes) : integer

Default: 0

If pooling is enabled and no cluster is available, retry finding a cluster every 30 seconds until this many minutes have passed, then start a new cluster instead of joining one.
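
The retry behavior can be pictured as a simple polling loop. The sketch below is an illustration, not mrjob's implementation; find_cluster is a hypothetical callable that returns a cluster ID or None, and the clock and sleep parameters exist only so the loop can be exercised without waiting.

```python
import time


def find_cluster_with_wait(find_cluster, pool_wait_minutes,
                           retry_seconds=30,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll for a pooled cluster until the wait time has passed.

    Returns a cluster ID, or None if the caller should start a new cluster.
    """
    deadline = clock() + pool_wait_minutes * 60
    while True:
        cluster_id = find_cluster()
        if cluster_id is not None:
            return cluster_id
        if clock() >= deadline:
            return None  # give up and start a new cluster instead
        sleep(retry_seconds)
```

With the default of 0 minutes, this loop makes a single attempt and then gives up immediately.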

S3 Filesystem

See also cloud_tmp_dir, cloud_part_size_mb.

cloud_log_dir (--cloud-log-dir) : string

Default: append logs to cloud_tmp_dir

Where on S3 to put logs, for example s3://yourbucket/logs/. Logs for your cluster will go into a subdirectory, e.g. s3://yourbucket/logs/j-CLUSTERID/.

Changed in version 0.5.4: This option used to be named s3_log_uri.

API Endpoints

Note

You usually don’t want to set *_endpoint options unless you have a challenging network situation (e.g. you have to use a proxy to get around a firewall).

ec2_endpoint (--ec2-endpoint) : string

Default: (automatic)

Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-gov-west-1.amazonaws.com).

New in version 0.6.5.

emr_endpoint (--emr-endpoint) : string

Default: infer from region

Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com).

iam_endpoint (--iam-endpoint) : string

Default: (automatic)

Force mrjob to connect to IAM on this endpoint (e.g. iam.us-gov.amazonaws.com).

s3_endpoint (--s3-endpoint) : string

Default: (automatic)

Force mrjob to connect to S3 on this endpoint, rather than letting it choose the appropriate endpoint for each S3 bucket.

Warning

If you set this to a region-specific endpoint (e.g. 's3-us-west-1.amazonaws.com') mrjob may not be able to access buckets located in other regions.

Other rarely used options

additional_emr_info (--additional-emr-info) : special

Default: None

Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).

emr_action_on_failure (--emr-action-on-failure) : string

Default: (automatic)

What happens if a step of your job fails:

  • 'CANCEL_AND_WAIT' cancels all steps on the cluster

  • 'CONTINUE' continues to the next step (useful when submitting several jobs to the same cluster)

  • 'TERMINATE_CLUSTER' shuts down the cluster entirely

The default is 'CANCEL_AND_WAIT' when using pooling (see pool_clusters) or an existing cluster (see cluster_id), and 'TERMINATE_CLUSTER' otherwise.
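
The rule for the automatic default can be written as a short function. This is a sketch of the rule as stated here, with a hypothetical function name; it is not taken from mrjob's source.

```python
def default_action_on_failure(pool_clusters=False, cluster_id=None):
    """Return the automatic emr_action_on_failure for the given options."""
    if pool_clusters or cluster_id:
        # Keep a pooled or pre-existing cluster alive for other jobs.
        return 'CANCEL_AND_WAIT'
    return 'TERMINATE_CLUSTER'


print(default_action_on_failure(cluster_id='j-CLUSTERID'))
```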

hadoop_streaming_jar_on_emr (--hadoop-streaming-jar-on-emr) : string

Default: AWS default

Deprecated since version 0.5.4: Prepend file:// and pass that to hadoop_streaming_jar instead.

mins_to_end_of_hour (--mins-to-end-of-hour) : float

Default: 5.0

Deprecated since version 0.6.0: This option was created back when EMR billed by the full hour, and does nothing as of v0.6.0. If using versions prior to v0.6.0, it’s recommended you set this to 60.0 to effectively disable this feature.

ssh_bin (--ssh-bin) : command

Default: 'ssh'

Path to the ssh binary; may include switches (e.g. 'ssh -v' or ['ssh', '-v']).

On EMR, mrjob uses SSH to tunnel to the job tracker (see ssh_tunnel), as a fallback way of fetching job progress, and as a quicker way of accessing your job’s logs.

tags (--tag) : dict

Default: {}

Metadata tags to apply to the EMR cluster after its creation. See Tagging Amazon EMR Clusters for more information on applying metadata tags to EMR clusters.

Tag names and values are strings. On the command line, to set a tag use --tag KEY=VALUE:

--tag team=development

In the config file, tags is a dict:

runners:
  emr:
    tags:
      team: development
      project: mrjob

Changed in version 0.5.4: This option used to be named emr_tags.
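
To illustrate how repeated KEY=VALUE arguments correspond to the dict form used in the config file, here is a minimal parsing sketch. The function name parse_tags and its error handling are hypothetical and may differ from what mrjob does internally.

```python
def parse_tags(tag_args):
    """Turn a list of 'KEY=VALUE' strings into a dict of tags."""
    tags = {}
    for arg in tag_args:
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = arg.partition('=')
        if not sep or not key:
            raise ValueError('expected KEY=VALUE, got %r' % (arg,))
        tags[key] = value
    return tags


print(parse_tags(['team=development', 'project=mrjob']))
```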

visible_to_all_users (--visible-to-all-users, --no-visible-to-all-users) : boolean

Default: True

If true (the default), EMR clusters will be visible to all IAM users. Otherwise, the cluster will only be visible to the IAM user that created it.

Deprecated since version 0.6.0: Hiding clusters from other users on the same account is not very useful. If you don’t want to share pooled clusters, try pool_name.