EMR runner options¶
All options from Options available to all runners, Hadoop-related options, and Cloud runner options are available when running jobs on Amazon Elastic MapReduce.
Amazon credentials¶
See Configuring AWS credentials and Configuring SSH credentials for specific instructions about setting these options.
- aws_access_key_id : string
Default:
None
“Username” for Amazon web services.
There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable
AWS_ACCESS_KEY_ID
instead.
- aws_secret_access_key (
--aws-secret-access-key
) : string Default:
None
Your “password” on AWS.
There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable
AWS_SECRET_ACCESS_KEY
instead.
- aws_session_token : string
Default:
None
Temporary AWS session token, used along with aws_access_key_id and aws_secret_access_key when using temporary credentials.
There isn’t a command-line switch for this option because credentials are supposed to be secret! Use the environment variable
AWS_SESSION_TOKEN
instead.
- ec2_key_pair (
--ec2-key-pair
) : string Default:
None
name of the SSH key you set up for EMR.
- ec2_key_pair_file (
--ec2-key-pair-file
) : path Default:
None
path to file containing the SSH key for EMR
- iam_instance_profile (
--iam-instance-profile
) : string Default: (automatic)
Name of an IAM instance profile to use for EC2 clusters created by EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.
- iam_service_role (
--iam-service-role
) : string Default: (automatic)
Name of an IAM role for the EMR service to use. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html for more details on using IAM with EMR.
Instance configuration¶
On EMR, there are three ways to configure instances:
- instance_fleets
- instance_groups
- individual instance options:
If there is a conflict, whichever comes later in the config files takes precedence, and the command line beats config files. In the case of a tie, instance_fleets beats instance_groups beats other instance options.
You may set ebs_root_volume_gb regardless of which style of instance configuration you use.
- instance_fleets (
--instance-fleet
) : Default:
None
A list of instance fleet definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON). For example:
runners: emr: instance_fleets: - InstanceFleetType: MASTER InstanceTypeConfigs: - InstanceType: m1.medium TargetOnDemandCapacity: 1 - InstanceFleetType: CORE TargetSpotCapacity: 2 TargetOnDemandCapacity: 2 LaunchSpecifications: SpotSpecification: TimeoutDurationMinutes: 20 TimeoutAction: SWITCH_TO_ON_DEMAND InstanceTypeConfigs: - InstanceType: m1.medium BidPriceAsPercentageOfOnDemandPrice: 50 WeightedCapacity: 1 - InstanceType: m1.large BidPriceAsPercentageOfOnDemandPrice: 50 WeightedCapacity: 2
- instance_groups (
--instance-groups
) : Default:
None
A list of instance group definitions to pass to the EMR API. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).
This allows for more fine-tuned EBS volume configuration than ebs_root_volume_gb. For example:
runners: emr: instance_groups: - InstanceRole: MASTER InstanceCount: 1 InstanceType: m1.medium - InstanceRole: CORE InstanceCount: 10 InstanceType: c1.xlarge EbsConfiguration: EbsOptimized: true EbsBlockDeviceConfigs: - VolumeSpecification: SizeInGB: 100 VolumeType: gp2
instance_groups is incompatible with instance_fleets and other instance options. See instance_fleets for details.
- core_instance_bid_price (
--core-instance-bid-price
) : string Default:
None
When specified and not “0”, this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances.
- master_instance_bid_price (
--master-instance-bid-price
) : string Default:
None
When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance.
- task_instance_bid_price (
--task-instance-bid-price
) : string Default:
None
When specified and not “0”, this creates the master Hadoop node as a spot instance at this bid price. (You usually only want to set bid price for task instances.)
- ebs_root_volume_gb (
--ebs-root-volume-gb
) : integer Default:
None
When specified (and not zero), sets the size of the root EBS volume, in GiB.
New in version 0.6.5.
Cluster software configuration¶
See also bootstrap, image_id, and image_version.
- applications (
--application
,--applications
) : string list Default:
[]
Additional applications to run on 4.x AMIs (e.g.
'Ganglia'
,'Mahout'
,'Spark'
).You do not need to specify
'Hadoop'
; mrjob will always include it automatically. In most cases it’ll auto-detect when to include'Spark'
as well.See Applications in the EMR docs for more details.
Changed in version 0.6.7: Added
--applications
switch
- bootstrap_actions (
--bootstrap-actions
) : string list Default:
[]
A list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use
shlex.split()
). If the action is on the local filesystem, we’ll automatically upload it to S3.This has little advantage over bootstrap; it is included in order to give direct access to the EMR API.
- bootstrap_spark (
--bootstrap-spark
,--no-bootstrap-spark
) : boolean Default: (automatic)
Install Spark on the cluster. This works on AMI version 3.x and later.
By default, we automatically install Spark only if our job has Spark steps.
In case you’re curious, here’s how mrjob determines you’re using Spark:
- any
SparkStep
orSparkScriptStep
in your job’s steps (including implicitly through thespark
method) - “Spark” included in applications option
- any bootstrap action (see bootstrap_actions) ending in
/spark-install
(this is how you install Spark on 3.x AMIs)
- any
- emr_configurations (
--emr-configuration
) : list of dicts Default:
[]
Cluster configs for AMI version 4.x and later. For example:
runners: emr: emr_configurations: - Classification: core-site Properties: hadoop.security.groups.cache.secs: 250
On the command line, configurations should be JSON-encoded:
--emr-configuration '{"Classification": "core-site", ...}
See Configuring Applications in the EMR docs for more details.
Changed in version 0.6.11:
!clear
tag works. Later config dicts will overwrite earlier ones with the sameClassification
. If the later dict has emptyProperties
andConfigurations
, the earlier dict will be simply deleted.
- max_concurrent_steps (
--max-concurrent-steps
) : string Default: 1
How many steps may an EMR cluster run at the same time? This affects both clusters launched by our job, and, if using cluster pooling, which clusters our job will join.
Prior to AMI 5.28.0, EMR clusters could only ever run one step at a time.
New in version 0.7.4.
- release_label (
--release-label
) : string Default:
None
EMR Release to use (e.g.
emr-4.0.0
). This overrides image_version.For more information about Release Labels, see Differences Introduced in 4.x.
Monitoring your job¶
See also check_cluster_every, ssh_tunnel.
- enable_emr_debugging (
--enable-emr-debugging
) : boolean Default:
False
store Hadoop logs in SimpleDB
Cluster pooling¶
- max_clusters_in_pool (
--max-clusters-in-pool
) : integer Default: 0 (disabled)
Don’t create a new pooled cluster if there are already this many active (not terminated) clusters in our pool; instead wait until one of the clusters is available to join or terminates.
To deal with the situation where several jobs start at once, before creating a cluster, we wait a random number of seconds (see pool_jitter_seconds and double-check before creating a new cluster).
New in version 0.7.4.
- min_available_mb (
--min-available-mb
) : integer Default: 0 (disabled)
When joining a pooled cluster, connect to its YARN resource manager’s metrics API and make sure that
availableMB
is at least this high.This requires SSH to work, so ec2_key_pair and ec2_key_pair_file must be set.
If you enable this option, pooling will no longer query clusters about their instance groups/fleets, since this information is mostly redundant.
New in version 0.7.4.
- min_available_virtual_cores (
--min-available-virtual-cores
) : integer Default: 0 (disabled)
When joining a pooled cluster, connect to its YARN resource manager’s metrics API and make sure that
availableVirtualCores
is at least this high.Like with min_available_mb, this requires SSH to work and disables querying clusters about their instances.
New in version 0.7.4.
- pool_clusters (
--pool-clusters
) : string Default:
True
Try to run the job on a
WAITING
pooled cluster with the same bootstrap configuration. Prefer the one with the most compute units. If we can’t join an existing cluster, create our own (unless max_clusters_in_pool or pool_wait_minutes disallow it).
- pool_jitter_seconds (
--pool-jitter-seconds
) : string Default: 60
Wait a random number of seconds between 0 and this many before double-checking active clusters in the pool for max_clusters_in_pool or to bypass pool_wait_minutes.
The main point of this option is so that if several jobs start simultaneously, they can double-check if the other jobs have launched a cluster before launching one themselves. You may need wish to adjust this based on your maximum pool size and the number of jobs you expect to launch simultaneously.
New in version 0.7.4.
- pool_name (
--pool-name
) : string Default:
'default'
Specify a pool name to join. Does not imply pool_clusters.
- pool_timeout_minutes (
--pool-timeout-minutes
) : string Default: 0 (disabled)
If we can’t create or join a cluster after this many minutes, raise an exception and bail out.
New in version 0.7.4.
- pool_wait_minutes (
--pool-wait-minutes
) : string Default: 0
If pooling is enabled and no cluster is available, retry finding a cluster every 30 seconds until this many minutes have passed, then start a new cluster instead of joining one.
Changed in version 0.7.4: If there aren’t any active clusters with a matching pool name and hash, we may create our own cluster before pool_wait_minutes is up. We first wait a random number of seconds and double-check that other clusters have not been created (see pool_jitter_seconds).
S3 Filesystem¶
See also cloud_tmp_dir, cloud_part_size_mb
- cloud_log_dir (
--cloud-log-dir
) : string Default: append
logs
to cloud_tmp_dirWhere on S3 to put logs, for example
s3://yourbucket/logs/
. Logs for your cluster will go into a subdirectory, e.g.s3://yourbucket/logs/j-CLUSTERID/
.
Docker¶
- docker_client_config (
--docker-client-config
) : string Default:
None
An
hdfs://
URI pointing to the client config, which is used to authenticate with a private Docker registry (e.g. ECR). This is mostly useful on AMIs prior to 6.1.0; otherwise you can use auto-authentication (see this page).See “Using ECR” on this page for information about how to fetch working credentials. Because ECR credentials only last 12 hours, if you want to use ECR and Docker for multiple jobs on a long-running cluster, you may wish to set up a cron job at bootstrap time.
New in version 0.7.4.
- docker_image (
--docker-image
,--no-docker
) : string Default:
None
The repository, name, and optionally, tag of a docker image, in the format
registry/repository:tag
. Ifregistry/
is omitted, we assume the default registry on Docker Hub (library
). If registry is a hostname, we connect to that host instead (e.g. for use of ECR).Other
docker_*
options will do nothing if this is not set.Note that you must be running at least AMI 6.0.0 to use Docker on EMR.
New in version 0.7.4.
- docker_mounts (
--docker-mount
) : string list Default:
[]
Optional mounting instructions to pass to Docker, in the format
/local/path:/path/inside/docker:ro_or_rw
.New in version 0.7.4.
API Endpoints¶
Note
You usually don’t want to set *_endpoint
options unless you have a
challenging network situation (e.g. you have to use a proxy to get around
a firewall).
- ec2_endpoint (
--ec2-endpoint
) : string Default: (automatic)
New in version 0.6.5.
Force mrjob to connect to EC2 on this endpoint (e.g.
ec2.us-gov-west-1.amazonaws.com
).
- emr_endpoint (
--emr-endpoint
) : string Default: infer from region
Force mrjob to connect to EMR on this endpoint (e.g.
us-west-1.elasticmapreduce.amazonaws.com
).
- iam_endpoint (
--iam-endpoint
) : string Default: (automatic)
Force mrjob to connect to IAM on this endpoint (e.g.
iam.us-gov.amazonaws.com
).
- s3_endpoint (
--s3-endpoint
) : string Default: (automatic)
Force mrjob to connect to S3 on this endpoint, rather than letting it choose the appropriate endpoint for each S3 bucket.
Warning
If you set this to a region-specific endpoint (e.g.
's3-us-west-1.amazonaws.com'
) mrjob may not be able to access buckets located in other regions.
Other rarely used options¶
- add_steps_in_batch (
--add-steps-in-batch
,--no-add-steps-in-batch
) : boolean Default:
True
for AMIs before 5.28.0,False
otherwiseFor a multi-step job, should we submit all steps at once, or one at a time? By default, we only submit steps all at once if the AMI doesn’t support running concurrent steps (that is, before AMI 5.28.0).
New in version 0.7.4.
- additional_emr_info (
--additional-emr-info
) : special Default:
None
Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON).
- emr_action_on_failure (
--emr-action-on-failure
) : string Default: (automatic)
What happens if step of your job fails
'CANCEL_AND_WAIT'
cancels all steps on the cluster'CONTINUE'
continues to the next step (useful when submitting severaljobs to the same cluster)
'TERMINATE_CLUSTER'
shuts down the cluster entirely
The default is
'CANCEL_AND_WAIT'
when using pooling (see pool_clusters) or an existing cluster (see cluster_id), and'TERMINATE_CLUSTER'
otherwise.
- hadoop_streaming_jar_on_emr (
--hadoop-streaming-jar-on-emr
) : string Default: AWS default
- ssh_add_bin (
--ssh-add-bin
) : command Default:
'ssh-add'
Path to the
ssh-add
binary. Used on EMR to access logs on the non-master node, without copying your SSH key to the master node.New in version 0.7.2.
- ssh_bin (
--ssh-bin
) : command Default:
'ssh'
Path to the ssh binary; may include switches (e.g.
'ssh -v'
or['ssh', '-v']
). Defaults to ssh.On EMR, mrjob uses SSH to tunnel to the job tracker (see ssh_tunnel), as a fallback way of fetching job progress, and as a quicker way of accessing your job’s logs.
Changed in version 0.6.8: Setting this to an empty value (
--ssh-bin ''
) instructs mrjob to use the default value (used to effectively disable SSH).
- tags (
--tag
) : dict Default:
{}
Metadata tags to apply to the EMR cluster after its creation. See Tagging Amazon EMR Clusters for more information on applying metadata tags to EMR clusters.
Tag names and values are strings. On the command line, to set a tag use
--tag KEY=VALUE
:--tag team=development
In the config file,
tags
is a dict:runners: emr: tags: team: development project: mrjob