Dataproc runner options

All options from Options available to all runners, Hadoop-related options, and Cloud runner options are available when running jobs on Google Cloud Dataproc.

Google credentials

Basic credentials are not set in the config file; see Getting started with Google Cloud for details.

project_id (--project-id) : string

Default: read from credentials config file

The ID of the Google Cloud Project to run under.
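For example, to set the project explicitly (the project ID below is hypothetical), add this to your config file:

runners:
  dataproc:
    project_id: my-project-id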

Changed in version 0.6.2: This option used to be called gcp_project.

service_account (--service-account) :

Default: None

Optional service account to use when creating a cluster. For more information see Service Accounts.
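For example, to run clusters as a specific service account (the account email below is hypothetical), add this to your config file:

runners:
  dataproc:
    service_account: cluster-runner@my-project-id.iam.gserviceaccount.com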

New in version 0.6.3.

service_account_scopes (--service-account-scopes) :

Default: (automatic)

Optional service account scopes to pass to the API when creating a cluster.

Generally it’s suggested that you instead create a service_account with the scopes you want.

New in version 0.6.3.

Job placement

See also subnet, region, zone

network (--network) : string

Default: None

Name or URI of the network to launch the cluster in. Incompatible with subnet.
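For example, to launch clusters in a specific network (the network name below is hypothetical), add this to your config file:

runners:
  dataproc:
    network: my-network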

New in version 0.6.3.

Cluster configuration

cluster_properties (--cluster-property) :

Default: None

A dictionary of properties to set in the cluster’s config files (e.g. mapred-site.xml). For details, see Cluster properties.
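For example, assuming property keys use Dataproc's file-prefix format (e.g. mapred: for mapred-site.xml), you could raise map task memory like this (the property and value are illustrative):

runners:
  dataproc:
    cluster_properties:
      'mapred:mapreduce.map.memory.mb': '1024'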

core_instance_config (--core-instance-config) :

Default: None

A dictionary of additional parameters to pass as config.worker_config when creating the cluster. Follows the format of InstanceGroupConfig except that it uses snake_case instead of camel_case.

For example, to specify 100GB of disk space on core instances, add this to your config file:

runners:
  dataproc:
    core_instance_config:
      disk_config:
        boot_disk_size_gb: 100

To set this option on the command line, pass in JSON:

--core-instance-config '{"disk_config": {"boot_disk_size_gb": 100}}'

This option can be used to set the number of core instances (num_instances) or the instance type (machine_type_uri), but usually you’ll want to use num_core_instances and core_instance_type along with this option.

New in version 0.6.3.

master_instance_config (--master-instance-config) :

Default: None

A dictionary of additional parameters to pass as config.master_config when creating the cluster. See core_instance_config for more details.
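For example, to give the master instance 100GB of disk space (mirroring the core_instance_config example above), add this to your config file:

runners:
  dataproc:
    master_instance_config:
      disk_config:
        boot_disk_size_gb: 100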

New in version 0.6.3.

task_instance_config (--task-instance-config) :

Default: None

A dictionary of additional parameters to pass as config.secondary_worker_config when creating the cluster. See core_instance_config for more details.

To make task instances preemptible, add this to your config file:

runners:
  dataproc:
    task_instance_config:
      is_preemptible: true

Note that this config won’t be applied unless you specify at least one task instance (either through num_task_instances or by passing num_instances to this option).
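For example, to launch two preemptible task instances through this option alone (rather than via num_task_instances), you could add:

runners:
  dataproc:
    task_instance_config:
      num_instances: 2
      is_preemptible: true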

New in version 0.6.3.

Other rarely used options

gcloud_bin (--gcloud-bin) : command

Default: 'gcloud'

Path to the gcloud binary; may include switches (e.g. 'gcloud -v' or ['gcloud', '-v']).

Used only as a way to create an SSH tunnel to the Resource Manager.
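For example, to point mrjob at a specific gcloud install (the path below is hypothetical), add this to your config file:

runners:
  dataproc:
    gcloud_bin: /opt/google-cloud-sdk/bin/gcloud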

Changed in version 0.6.8: Setting this to an empty value (--gcloud-bin '') instructs mrjob to use the default; previously, it disabled the SSH tunnel.