Dataproc runner options

All options from Options available to all runners and Hadoop-related options are also available to the Dataproc runner.

Google credentials

See Configuring Google Cloud Platform (GCP) credentials for specific instructions about setting these options.

Choosing/creating a cluster to join

cluster_id (--cluster-id) : string

Default: automatically create a cluster and use it

The ID of a persistent Dataproc cluster to run jobs in. It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.
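
For example, to run jobs in an existing cluster rather than creating a new one per job, you might put something like this in mrjob.conf (the cluster ID below is a placeholder):

    runners:
      dataproc:
        cluster_id: my-persistent-cluster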

Cluster creation and configuration

zone (--zone) : string

Default: gcloud SDK default

Availability zone to run the job in (e.g. us-central1-b).

region (--region) : string

Default: gcloud SDK default

Region to run Dataproc jobs in (e.g. us-central1). Also used by mrjob to create temporary buckets if you don’t set cloud_tmp_dir explicitly.

image_version (--image-version) : string

Default: 1.0

Cloud Image version to run Dataproc jobs on. See the Dataproc docs on specifying the Dataproc version for details.
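
Taken together, a cluster-creation section of mrjob.conf might look like this (the region, zone, and image version shown are only illustrative values):

    runners:
      dataproc:
        region: us-central1
        zone: us-central1-b
        image_version: '1.0'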

Bootstrapping

These options apply at bootstrap time, before the Hadoop cluster has started. Bootstrap time is a good time to install Debian packages or compile and install another Python binary.

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your cluster, at bootstrap time.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script. path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
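
As a sketch, a bootstrap section that installs Debian packages and then installs Python dependencies from a local file (requirements.txt here is a placeholder path) might look like:

    runners:
      dataproc:
        bootstrap:
        # apt packages need sudo; python-pip provides pip for the next line
        - sudo apt-get install -y python-pip python-dev
        # requirements.txt# uploads the local file and interpolates its absolute path
        - sudo pip install -r requirements.txt#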

bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean

Default: True

Attempt to install a compatible version of Python at bootstrap time, including pip and development libraries (so you can build Python packages written in C).

This is useful even in Python 2, which is installed by default, but without pip and development libraries.
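
If your image already has the Python you need, or you prefer to handle Python setup yourself in bootstrap, you can turn this off:

    runners:
      dataproc:
        bootstrap_python: false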

Monitoring the cluster

check_cluster_every (--check-cluster-every) : float

Default: 10

How often to check on the status of Dataproc jobs, in seconds. If you set this too low, GCP will throttle you.
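
For example, to poll less often than the default:

    runners:
      dataproc:
        check_cluster_every: 30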

Number and type of instances

instance_type (--instance-type) : string

Default: 'n1-standard-1'

What sort of GCE instance(s) to use on the nodes that actually run tasks (see https://cloud.google.com/compute/docs/machine-types). When you run multiple instances (see num_core_instances), the master node is just coordinating the other nodes, so usually the default instance type (n1-standard-1) is fine, and using larger instances is wasteful.

master_instance_type (--master-instance-type) : string

Default: 'n1-standard-1'

Like instance_type, but only for the master Hadoop node. This node hosts the task tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use instance_type.

core_instance_type (--core-instance-type) : string

Default: value of instance_type

Like instance_type, but only for worker Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.

task_instance_type (--task-instance-type) : string

Default: value of instance_type

Like instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.

num_core_instances (--num-core-instances) : integer

Default: 2

Number of worker instances to start up. These run your job and host HDFS.

num_task_instances (--num-task-instances) : integer

Default: 0

Number of task instances to start up. These run your job but do not host HDFS. If you use this, you must set num_core_instances; Dataproc does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
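
Putting the instance options together, a configuration for a somewhat larger cluster might look like this (the types and counts are illustrative, not recommendations):

    runners:
      dataproc:
        instance_type: n1-standard-4          # used by core and task instances
        master_instance_type: n1-standard-1   # master only coordinates
        num_core_instances: 4                 # run tasks and host HDFS
        num_task_instances: 8                 # run tasks only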

FS paths and options

mrjob uses google-api-python-client to access and manipulate the filesystem (GCS).

cloud_tmp_dir (--cloud-tmp-dir) : string

Default: (automatic)

GCS directory (URI ending in /) to use as temp space, e.g. gs://yourbucket/tmp/.

By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. gs://mrjob-01234567890abcdef/tmp/).

cloud_fs_sync_secs (--cloud-fs-sync-secs) : float

Default: 5.0

How long to wait for GCS to reach eventual consistency. This is typically less than a second, but the default is 5.0 to be safe.
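
For example, to pin temp space to your own bucket and give GCS a little more time to reach consistency:

    runners:
      dataproc:
        cloud_tmp_dir: gs://yourbucket/tmp/   # must be a URI ending in /
        cloud_fs_sync_secs: 10.0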