Cloud runner options

These options are generally available whenever you run your job on a Hadoop cloud service (AWS Elastic MapReduce or Google Cloud Dataproc).

All options from Options available to all runners and Hadoop-related options are also available on cloud services.

Google credentials

See Getting started with Google Cloud for specific instructions about setting these options.

Choosing/creating a cluster to join

cluster_id (--cluster-id) : string

Default: automatically create a cluster and use it

The ID of a persistent cluster to run jobs in (on Dataproc, this is the same thing as “cluster name”).

It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.
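
For example, to run all of your jobs on one long-running EMR cluster, you could put something like this in mrjob.conf (the cluster ID shown is just a placeholder; on Dataproc, use the cluster name instead):

runners:
  emr:
    cluster_id: j-CLUSTERID1234567   # placeholder cluster ID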

Job placement

region (--region) : string

Default: 'us-west-2' on EMR, 'us-west1' on Dataproc

Geographic region to run jobs in (e.g. us-east-1 on EMR, us-central1 on Dataproc).

If you set region, you do not need to set zone; a zone will be chosen for you automatically.

subnet (--subnet) : string

Default: None

Optional subnet(s) to launch your job in.

On Amazon EMR, this is the ID of a VPC subnet to launch the cluster in (e.g. 'subnet-12345678'). This can also be a list of possible subnets if you are using instance_fleets.

On Google Cloud Dataproc, this is the name of a subnetwork (e.g. 'default'). Specifying subnet rather than network will ensure that your cluster only has access to one specific geographic region, rather than the entire VPC.

Changed in version 0.6.3: Works on Google Cloud Dataproc as well as AWS Elastic MapReduce.

New in version 0.5.3.

zone (--zone) : string

Default: None

Zone within a specific geographic region to run your job in.

If you set this, you do not need to set region.

Changed in version 0.5.4: This option used to be named aws_availability_zone.
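
Putting the placement options together, a config that pins EMR jobs to a specific region and VPC subnet might look like this (the subnet ID is a placeholder):

runners:
  emr:
    region: us-east-1
    subnet: subnet-12345678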

Number and type of instances

instance_type (--instance-type) : string

Default: m1.medium (usually) on EMR, n1-standard-1 on Dataproc

Type of instance that runs your Hadoop tasks.

On Amazon EMR, mrjob picks the cheapest instance type that will work at all. This is m1.medium, with two exceptions:

  • m1.large if you’re running Spark (see bootstrap_spark)
  • m1.small if you’re running on the (deprecated) 2.x AMIs

Once you’ve tested a job and want to run it at scale, it’s usually a good idea to use instances larger than the default. For EMR, see Amazon EC2 Instance Types and for Dataproc, see Machine Types.

Note

Many EC2 instance types can only run in a VPC (see subnet).

If you’re running multiple nodes (see num_core_instances), this option doesn’t apply to the master node because it’s just coordinating tasks, not running them. Use master_instance_type instead.

Changed in version 0.5.6: Used to default to m1.medium in all cases.

Changed in version 0.5.4: This option used to be named ec2_instance_type.
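
For example, once you’re ready to run at scale, you might pick a larger instance type in your config file (the instance types shown here are only illustrations; choose whatever suits your job):

runners:
  emr:
    instance_type: c1.xlarge
  dataproc:
    instance_type: n1-highcpu-8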

core_instance_type (--core-instance-type) : string

Default: value of instance_type

Like instance_type, but only for the core (worker) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.

Changed in version 0.5.4: This replaces the ec2_core_instance_type and ec2_slave_instance_type options.

num_core_instances (--num-core-instances) : integer

Default: 0 on EMR, 2 on Dataproc

Number of core (worker) instances to start up. These run your job and host HDFS. This is in addition to the single master instance.

On Google Cloud Dataproc, this must be at least 2.

Changed in version 0.5.4: This option used to be named num_ec2_core_instances.

master_instance_type (--master-instance-type) : string

Default: (automatic)

Like instance_type, but only for the master Hadoop node. This node hosts the task tracker/resource manager and HDFS, and runs tasks if there are no other nodes.

If you’re running a single node (no num_core_instances or num_task_instances), this will default to the value of instance_type.

Otherwise, this defaults to n1-standard-1 on Dataproc and to m1.medium on EMR (m1.small on the deprecated 2.x AMIs), which is usually adequate for all but the largest jobs.

Changed in version 0.5.4: This option used to be named ec2_master_instance_type.

task_instance_type (--task-instance-type) : string

Default: value of core_instance_type

Like instance_type, but only for the task (secondary worker) Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.

Changed in version 0.5.4: This option used to be named ec2_task_instance_type.

num_task_instances (--num-task-instances) : integer

Default: 0

Number of task (secondary worker) instances to start up. These run your job but do not host HDFS.

You must have at least one core instance (see num_core_instances) to run task instances; otherwise there’s nowhere to host HDFS.

Changed in version 0.5.4: This used to be called num_ec2_task_instances.
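
As a sketch, a larger EMR cluster with dedicated core and task instances might be configured like this (the instance type and counts are illustrative only):

runners:
  emr:
    core_instance_type: c1.xlarge
    num_core_instances: 4
    num_task_instances: 8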

Cluster software configuration

image_version (--image-version) : string

Default: '5.8.0' on EMR, '1.0' on Dataproc

Machine image version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things.

See the AMI version docs (EMR) or the Dataproc version docs for more details.

You can use this instead of release_label on EMR, even for 4.x+ AMIs; mrjob will just prepend emr- to form the release label.
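
For example, setting image_version to '5.8.0' on EMR results in the release label emr-5.8.0:

runners:
  emr:
    image_version: '5.8.0'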

Changed in version 0.5.7: Default on EMR used to be '3.11.0'.

Changed in version 0.5.4: This used to be called ami_version.

Warning

The deprecated ami_version alias for this option is completely ignored by mrjob 0.5.4 (it works in later 0.5.x versions).

Warning

The 2.x series of AMIs is deprecated by Amazon and not recommended.

Warning

The 1.x series of AMIs is no longer supported because it uses Python 2.5.

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your cluster, at bootstrap time.

This option is complex and powerful. On EMR, the best way to get started is to read the EMR Bootstrapping Cookbook.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script.

path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
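
As a minimal sketch, a bootstrap section that installs a system package and then installs Python packages from a local requirements file might look like this (the package name and file path are placeholders); the trailing # tells mrjob to upload the local file and interpolate its path into the command:

runners:
  emr:
    bootstrap:
    - sudo yum install -y some-package   # placeholder package name
    - sudo pip install -r /local/path/to/requirements.txt#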

bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean

Default: True on Dataproc, as needed on EMR.

Attempt to install a compatible (major) version of Python at bootstrap time, including header files and pip (see Installing Python packages with pip).

The only reason to set this to False is if you want to customize Python/pip installation using bootstrap.

New in version 0.5.0.

Changed in version 0.5.4: No longer installs Python 3 on AMI version 4.6.0+ by default.

extra_cluster_params (--extra-cluster-param) : dict

Default: {}

An escape hatch that allows you to pass extra parameters to the EMR/Dataproc API at cluster create time, to access API features that mrjob does not yet support.

For EMR, see the API documentation for RunJobFlow for the full list of options.

Option names are strings, and values are data structures. On the command line, pass --extra-cluster-param name=value:

--extra-cluster-param SupportedProducts='["mapr-m3"]'
--extra-cluster-param AutoScalingRole=HankPym

value can be either JSON or a string; a value that doesn’t parse as JSON is treated as a string, unless it starts with {, [, or " (this keeps malformed JSON from being silently converted to a string). Parameters can be suppressed by setting them to null:

--extra-cluster-param LogUri=null

This also works with Google Cloud Dataproc:

--extra-cluster-param labels='{"name": "wrench"}'

In the config file, extra_cluster_params is a dict:

runners:
  emr:
    extra_cluster_params:
      AutoScalingRole: HankPym
      LogUri: null  # !clear works too
      SupportedProducts:
      - mapr-m3

New in version 0.6.0: This replaces the old emr_api_params option, which only worked with boto 2.

Monitoring your job

check_cluster_every (--check-cluster-every) : float

Default: 10 seconds on Dataproc, 30 seconds on EMR

How often to check on the status of your job, in seconds.

(Higher on EMR to keep the API from throttling you.)

Changed in version 0.5.4: This used to be called check_emr_status_every.

ssh_tunnel (--ssh-tunnel, --no-ssh-tunnel) : boolean

Default: False

If True, create an ssh tunnel to the job tracker/resource manager and listen on a randomly chosen port.

On EMR, this requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.

On Dataproc, you don’t need to set a key, but you do need to have the gcloud utility installed and set up (make sure you ran gcloud auth login and gcloud config set project <project_id>). See Installing gcloud, gsutil, and other utilities.

Changed in version 0.6.3: Enabled on Google Cloud Dataproc.

Changed in version 0.5.0: This option used to be named ssh_tunnel_to_job_tracker.
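
On EMR, a working ssh_tunnel setup might look like this (the key pair name and path are placeholders; see Configuring SSH credentials):

runners:
  emr:
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem
    ssh_tunnel: true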

ssh_tunnel_is_open (--ssh-tunnel-is-open) : boolean

Default: False

If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.

Does nothing unless ssh_tunnel is set.

ssh_bind_ports (--ssh-bind-ports) : list of integers

Default: range(40001, 40841)

A list of ports that are safe to listen on.

The main reason to set this is if your firewall blocks the default range of ports, or if you want to pick a single port for consistency.

On the command line, this looks like --ssh-bind-ports 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.
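
For example, to always bind the tunnel to one specific port (chosen arbitrarily here):

--ssh-bind-ports 40005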

Cloud Filesystem

cloud_fs_sync_secs (--cloud-fs-sync-secs) : float

Default: 5.0

How long to wait (in seconds) for the cloud filesystem (e.g. S3, GCS) to reach eventual consistency. This is typically less than a second, but the default is 5 seconds to be safe.

Changed in version 0.5.4: This used to be called s3_sync_wait_time.

cloud_part_size_mb (--cloud-part-size-mb) : integer

Default: 100

Upload files to the cloud filesystem in parts no bigger than this many megabytes (technically, mebibytes). The default is 100 MiB.

Set to 0 to disable multipart uploading entirely.

Currently, Amazon requires parts to be between 5 MiB and 5 GiB. mrjob does not enforce these limits.

Changed in version 0.6.3: Enabled on Google Cloud Storage. This used to be called cloud_upload_part_size.

Changed in version 0.5.4: This used to be called s3_upload_part_size.

cloud_tmp_dir (--cloud-tmp-dir) : string

Default: (automatic)

Directory on your cloud filesystem to use as temp space (e.g. s3://yourbucket/tmp/, gs://yourbucket/tmp/).

By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. s3://mrjob-01234567890abcdef/tmp/).
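
To use a specific bucket instead, point this option at a tmp/ path inside it (the bucket names below are placeholders):

runners:
  emr:
    cloud_tmp_dir: s3://your-bucket/tmp/
  dataproc:
    cloud_tmp_dir: gs://your-bucket/tmp/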

Changed in version 0.5.4: This used to be called s3_tmp_dir.

Changed in version 0.5.0: This used to be called s3_scratch_uri.

Auto-termination

max_mins_idle (--max-mins-idle) : float

Default: 10.0

Automatically terminate your cluster after it has been idle at least this many minutes. You cannot turn this off (clusters left idle rack up billing charges).

If your cluster is only running a single job, mrjob will attempt to terminate it as soon as your job finishes. This acts as an additional safeguard, as well as affecting Cluster Pooling on EMR.

Changed in version 0.6.3: Uses Dataproc’s built-in cluster termination feature rather than a custom script. The API will not allow you to set an idle time of less than 10 minutes.

Changed in version 0.6.2: No matter how small a value you set this to, there is a grace period of 10 minutes between when the idle termination daemon launches and when it may first terminate the cluster, to allow Hadoop to accept your first job.

Changed in version 0.6.0: All clusters launched by mrjob now auto-terminate when idle. In previous versions, you needed to set max_hours_idle, set this option explicitly, or use terminate-idle-clusters.

max_hours_idle (--max-hours-idle) : float

Default: None

Deprecated since version 0.6.0: Use max_mins_idle instead.