Cloud runner options

These options are generally available whenever you run your job on a Hadoop cloud service (AWS Elastic MapReduce or Google Cloud Dataproc).

All options from Options available to all runners and Hadoop-related options are also available on cloud services.

Google credentials

See Getting started with Google Cloud for specific instructions about setting these options.

Choosing/creating a cluster to join

cluster_id (--cluster-id) : string

Default: automatically create a cluster and use it

The ID of a persistent cluster to run jobs in (on Dataproc, this is the same thing as “cluster name”).

It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.
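
For example, a minimal sketch of joining an existing cluster from mrjob.conf (the cluster ID and name shown are placeholders; substitute your own):

runners:
  emr:
    cluster_id: j-1234567890ABC        # placeholder EMR cluster ID
  dataproc:
    cluster_id: my-persistent-cluster  # placeholder Dataproc cluster name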

Job placement

region (--region) : string

Default: 'us-west-2' on EMR, 'us-west1' on Dataproc

Geographic region to run jobs in (e.g. us-east-1 on EMR, us-central1 on Dataproc).

If mrjob creates a temporary bucket, it will be created in this region as well.

If you set region, you do not need to set zone; a zone will be chosen for you automatically.

subnet (--subnet) : string

Default: None

Optional subnet(s) to launch your job in.

On Amazon EMR, this is the ID of a VPC subnet to launch your cluster in (e.g. 'subnet-12345678'). This can also be a list of possible subnets if you are using instance_fleets.

On Google Cloud Dataproc, this is the name of a subnetwork (e.g. 'default'). Specifying subnet rather than network will ensure that your cluster only has access to one specific geographic region, rather than the entire VPC.

Changed in version 0.6.8: --subnet '' un-sets the subnet on EMR (used to be ignored).

Changed in version 0.6.3: Works on Google Cloud Dataproc as well as AWS Elastic MapReduce.

zone (--zone) : string

Default: None

Zone within a specific geographic region to run your job in.

If you set this, you do not need to set region.
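
As an illustration, placement options can be combined in mrjob.conf like this (the region and subnet values are only placeholders):

runners:
  emr:
    region: us-east-1
    subnet: subnet-12345678   # placeholder VPC subnet ID
  dataproc:
    region: us-central1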

Number and type of instances

instance_type (--instance-type) : string

Default: m4.large or m5.xlarge on EMR, n1-standard-1 on Dataproc

Type of instance that runs your Hadoop tasks.

Once you’ve tested a job and want to run it at scale, it’s usually a good idea to use instances larger than the default. For EMR, see Amazon EC2 Instance Types and for Dataproc, see Machine Types.

Note

Many EC2 instance types can only run in a VPC (see subnet).

If you’re running multiple nodes (see num_core_instances), this option doesn’t apply to the master node because it’s just coordinating tasks, not running them. Use master_instance_type instead.

Changed in version 0.6.11: Default on EMR is m5.xlarge on AMI version 5.13.0 and later, m4.large on earlier versions

Changed in version 0.6.10: Default on EMR changed to m5.xlarge

Changed in version 0.6.6: Default on EMR changed to m4.large. Was previously m1.large if running Spark, m1.small if running on the (deprecated) 2.x AMIs, and m1.medium otherwise.

core_instance_type (--core-instance-type) : string

Default: value of instance_type

Like instance_type, but only for the core (worker) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.

num_core_instances (--num-core-instances) : integer

Default: 0 on EMR, 2 on Dataproc

Number of core (worker) instances to start up. These run your job and host HDFS. This is in addition to the single master instance.

On Google Cloud Dataproc, this must be at least 2.

master_instance_type (--master-instance-type) : string

Default: (automatic)

Like instance_type, but only for the master Hadoop node. This node hosts the task tracker/resource manager and HDFS, and runs tasks if there are no other nodes.

If you’re running a single node (no num_core_instances or num_task_instances), this will default to the value of instance_type.

Otherwise, this defaults to n1-standard-1 on Dataproc, and to m1.medium on EMR (exception: m1.small on the deprecated 2.x AMIs), which is usually adequate for all but the largest jobs.

task_instance_type (--task-instance-type) : string

Default: value of core_instance_type

Like instance_type, but only for the task (secondary worker) Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.

num_task_instances (--num-task-instances) : integer

Default: 0

Number of task (secondary worker) instances to start up. These run your job but do not host HDFS.

You must have at least one core instance (see num_core_instances) to run task instances; otherwise there’s nowhere to host HDFS.
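
For instance, a small multi-node EMR setup might look like the following sketch (the instance types and counts are only illustrative):

runners:
  emr:
    instance_type: m5.xlarge        # core and task nodes
    master_instance_type: m5.large  # the master mostly coordinates, so it can often be smaller
    num_core_instances: 4           # run tasks and host HDFS
    num_task_instances: 2           # run tasks only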

Cluster software configuration

image_id (--image-id) : string

Default: None

ID of a custom machine image.

On EMR, this is complementary with image_version: you can install packages and libraries on your custom AMI, but it’s up to EMR to install Hadoop, create the hadoop user, etc. image_version must be 5.7.0 or later.

You can use describe_base_emr_images() to identify Amazon Linux images that are compatible with EMR.

For more details about how to create a custom AMI that works with EMR, see Best Practices and Considerations.

Note

This is not yet implemented in the Dataproc runner.

New in version 0.6.5.
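
A minimal sketch of using a custom AMI on EMR (the AMI ID is a placeholder; pair it with a supported image_version):

runners:
  emr:
    image_id: ami-0123456789abcdef0   # placeholder custom AMI ID
    image_version: 5.30.0             # any version from 5.7.0 onward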

image_version (--image-version) : string

Default: '6.0.0' on EMR, '1.3' on Dataproc

Machine image version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things.

See the AMI version docs (EMR) or the Dataproc version docs for more details.

You can use this instead of release_label on EMR, even for 4.x+ AMIs; mrjob will just prepend emr- to form the release label.

Changed in version 0.6.12: Default on Dataproc changed from 1.0 to 1.3

Changed in version 0.6.11: Default on EMR is now 5.27.0

Changed in version 0.6.5: Default on EMR is now 5.16.0 (was 5.8.0)

Warning

The 2.x series of AMIs is deprecated by Amazon and not recommended.

Warning

The 1.x series of AMIs is no longer supported because they use Python 2.5.
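
For example, setting image_version under the emr runner is equivalent to requesting the corresponding emr- release label; a minimal sketch (the exact versions are only illustrative):

runners:
  emr:
    image_version: 6.0.0   # mrjob requests release label emr-6.0.0
  dataproc:
    image_version: '1.5'   # quoted so YAML keeps it a string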

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your cluster, at bootstrap time.

This option is complex and powerful. On EMR, the best way to get started is to read the EMR Bootstrapping Cookbook.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script as its absolute path on the machine running the script.

path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
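
For example, a hedged sketch of a bootstrap section (the package and requirements file are hypothetical; requirements.txt#requirements.txt uploads the local file and interpolates its absolute path on the node):

runners:
  emr:
    bootstrap:
    - sudo yum install -y htop    # hypothetical package; note the sudo
    - sudo python3 -m pip install -r requirements.txt#requirements.txt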

bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean

Default: True on Dataproc, as needed on EMR.

Attempt to install a compatible (major) version of Python at bootstrap time, including header files and pip (see Installing Python packages with pip).

The only reason to set this to False is if you want to customize Python/pip installation using bootstrap.

extra_cluster_params (--extra-cluster-param) : dict

Default: {}

An escape hatch that allows you to pass extra parameters to the EMR/Dataproc API at cluster create time, to access API features that mrjob does not yet support.

For EMR, see the API documentation for RunJobFlow for the full list of options.

Option names are strings, and values are data structures. On the command line, use --extra-cluster-param name=value:

--extra-cluster-param SupportedProducts='["mapr-m3"]'
--extra-cluster-param AutoScalingRole=HankPym

value can be either JSON or a string; however, if it starts with {, [, or ", it must be valid JSON (so that malformed JSON isn’t silently converted to a string). Parameters can be suppressed by setting them to null:

--extra-cluster-param LogUri=null

This also works with Google Cloud Dataproc:

--extra-cluster-param labels='{"name": "wrench"}'

In the config file, extra_cluster_params is a dict:

runners:
  emr:
    extra_cluster_params:
      AutoScalingRole: HankPym
      LogUri: null  # !clear works too
      SupportedProducts:
      - mapr-m3

Changed in version 0.7.2: Dictionaries will be recursively merged into existing parameters. For example:

runners:
  emr:
    extra_cluster_params:
      Instances:
        EmrManagedMasterSecurityGroup: sg-foo

Changed in version 0.6.8: You may use a name with dots in it to set (or unset) nested properties. For example: --extra-cluster-param Instances.EmrManagedMasterSecurityGroup=sg-foo.

Monitoring your job

check_cluster_every (--check-cluster-every) : float

Default: 10 seconds on Dataproc, 30 seconds on EMR

How often to check on the status of your job, in seconds.

Changed in version 0.6.5: When the EMR client encounters a transient error, it will wait at least this many seconds before trying again.
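
For example, to poll less aggressively on a long-running job (the value is arbitrary):

runners:
  emr:
    check_cluster_every: 60   # check once a minute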

ssh_tunnel (--ssh-tunnel, --no-ssh-tunnel) : boolean

Default: False

If True, create an SSH tunnel to the job tracker/resource manager and listen on a randomly chosen port.

On EMR, this requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.

On Dataproc, you don’t need to set a key, but you do need to have the gcloud utility installed and set up (make sure you ran gcloud auth login and gcloud config set project <project_id>). See Installing gcloud, gsutil, and other utilities.

Changed in version 0.6.3: Enabled on Google Cloud Dataproc
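
A sketch of enabling the tunnel on EMR, assuming you have already created an EC2 key pair (the key pair name and file path are placeholders):

runners:
  emr:
    ssh_tunnel: true
    ec2_key_pair: EMR                  # placeholder key pair name
    ec2_key_pair_file: ~/.ssh/EMR.pem  # placeholder path to the private key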

ssh_tunnel_is_open (--ssh-tunnel-is-open) : boolean

Default: False

If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.

Does nothing unless ssh_tunnel is set.

ssh_bind_ports (--ssh-bind-ports) : list of integers

Default: range(40001, 40841)

A list of ports that are safe to listen on.

The main reason to set this is if your firewall blocks the default range of ports, or if you want to pick a single port for consistency.

On the command line, this looks like --ssh-bind-ports 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.
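
In mrjob.conf, the same option is written as a list of port numbers; a sketch pinning the tunnel to a single (arbitrary) port:

runners:
  emr:
    ssh_bind_ports: [40001]   # always use this one port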

Cloud Filesystem

cloud_fs_sync_secs (--cloud-fs-sync-secs) : float

Default: 5.0

How long to wait, in seconds, for the cloud filesystem (e.g. S3, GCS) to reach eventual consistency. This is typically less than a second, but the default is 5 seconds to be safe.

cloud_part_size_mb (--cloud-part-size-mb) : integer

Default: 100

Upload files to the cloud filesystem in parts no bigger than this many megabytes (technically, mebibytes). The default is 100 MiB.

Set to 0 to disable multipart uploading entirely.

Currently, Amazon requires parts to be between 5 MiB and 5 GiB. mrjob does not enforce these limits.

Changed in version 0.6.3: Enabled on Google Cloud Storage. This used to be called cloud_upload_part_size.

cloud_tmp_dir (--cloud-tmp-dir) : string

Default: (automatic)

Directory on your cloud filesystem to use as temp space (e.g. s3://yourbucket/tmp/, gs://yourbucket/tmp/).

By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. s3://mrjob-01234567890abcdef/tmp/).
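
For example (the bucket names are placeholders):

runners:
  emr:
    cloud_tmp_dir: s3://yourbucket/tmp/
  dataproc:
    cloud_tmp_dir: gs://yourbucket/tmp/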

Auto-termination

max_mins_idle (--max-mins-idle) : float

Default: 10.0

Automatically terminate your cluster after it has been idle for at least this many minutes. You cannot turn this off (clusters left idle rack up billing charges).

If your cluster is only running a single job, mrjob will attempt to terminate it as soon as your job finishes. This acts as an additional safeguard, as well as affecting Cluster Pooling on EMR.

Changed in version 0.6.5: EMR’s idle termination script is more robust against sudo shutdown -h now being ignored, and logs the script’s stdout and stderr to /var/log/bootstrap-actions/mrjob-idle-termination.log.

Changed in version 0.6.3: Uses Dataproc’s built-in cluster termination feature rather than a custom script. The API will not allow you to set an idle time of less than 10 minutes.

Changed in version 0.6.2: No matter how small a value you set this to, there is a grace period of 10 minutes between when the idle termination daemon launches and when it may first terminate the cluster, to allow Hadoop to accept your first job.
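
For example, to keep a pooled cluster around a bit longer between jobs (the value is arbitrary; Dataproc requires at least 10 minutes):

runners:
  emr:
    max_mins_idle: 30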