Dataproc runner options

All options from Options available to all runners and Hadoop-related options are also available to the Dataproc runner.

Google credentials

See Configuring Google Cloud Platform (GCP) credentials for specific instructions about setting these options.

Choosing/creating a cluster to join

cluster_id (--cluster-id) : string

Default: automatically create a cluster and use it

The ID of a persistent Dataproc cluster to run jobs in. It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.
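
For example, to run jobs in an existing cluster rather than creating a new one per job, you might put something like this in mrjob.conf (the cluster ID below is a placeholder):

    runners:
      dataproc:
        cluster_id: my-persistent-cluster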

Cluster creation and configuration

zone (--zone) : string

Default: gcloud SDK default

Availability zone to run the job in (e.g. us-central1-b).

region (--region) : string

Default: gcloud SDK default

Region to run Dataproc jobs in (e.g. us-central1). Also used by mrjob to create temporary buckets if you don’t set cloud_tmp_dir explicitly.

image_version (--image-version) : string

Default: 1.0

Cloud Image version to run Dataproc jobs on. See the Dataproc docs on specifying the Dataproc version for details.
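
Taken together, a cluster-creation section of mrjob.conf might look like this (the region, zone, and image version shown are only illustrative values):

    runners:
      dataproc:
        region: us-central1
        zone: us-central1-b
        image_version: '1.0'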

Bootstrapping

These options apply at bootstrap time, before the Hadoop cluster has started. Bootstrap time is a good time to install Debian packages or compile and install another Python binary.

bootstrap (--bootstrap) : string list

Default: []

A list of lines of shell script to run once on each node in your cluster, at bootstrap time.

Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script. path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().

Unlike with setup, archives are not supported (unpack them yourself).

Remember to put sudo before commands requiring root privileges!
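
As a sketch, a bootstrap section that installs Debian packages and then installs Python dependencies from a local file (requirements.txt here is a placeholder path) might look like:

    runners:
      dataproc:
        bootstrap:
        # apt packages need sudo; python-pip provides pip for the next line
        - sudo apt-get install -y python-pip python-dev
        # requirements.txt# uploads the local file and interpolates its absolute path
        - sudo pip install -r requirements.txt#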

bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean

Default: True

Attempt to install a compatible version of Python at bootstrap time, including pip and development libraries (so you can build Python packages written in C).

This is useful even in Python 2, which is installed by default, but without pip and development libraries.
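
If your image already has the Python you need, or you prefer to handle Python setup yourself in bootstrap, you can turn this off:

    runners:
      dataproc:
        bootstrap_python: false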

Monitoring the cluster

check_cluster_every (--check-cluster-every) : float

Default: 10

How often to check on the status of Dataproc jobs, in seconds. If you set this too low, GCP will throttle you.
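
For example, to poll less often than the default:

    runners:
      dataproc:
        check_cluster_every: 30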

Number and type of instances

instance_type (--instance-type) : string

Default: 'n1-standard-1'

What sort of GCE instance(s) to use on the nodes that actually run tasks (see https://cloud.google.com/compute/docs/machine-types). When you run multiple instances (see num_core_instances), the master node is just coordinating the other nodes, so usually the default instance type (n1-standard-1) is fine, and using larger instances is wasteful.

master_instance_type (--master-instance-type) : string

Default: 'n1-standard-1'

Like instance_type, but only for the master Hadoop node. This node hosts the task tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use instance_type.

core_instance_type (--core-instance-type) : string

Default: value of instance_type

Like instance_type, but only for worker Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.

task_instance_type (--task-instance-type) : string

Default: value of instance_type

Like instance_type, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.

num_core_instances (--num-core-instances) : integer

Default: 2

Number of worker instances to start up. These run your job and host HDFS.

num_task_instances (--num-task-instances) : integer

Default: 0

Number of task instances to start up. These run your job but do not host HDFS. If you use this, you must set num_core_instances; Dataproc does not allow you to run task instances without core instances (because there’s nowhere to host HDFS).
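
Putting the instance options together, a configuration for a somewhat larger cluster might look like this (the types and counts are illustrative, not recommendations):

    runners:
      dataproc:
        instance_type: n1-standard-4          # used by core and task instances
        master_instance_type: n1-standard-1   # master only coordinates
        num_core_instances: 4                 # run tasks and host HDFS
        num_task_instances: 8                 # run tasks only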

FS paths and options

mrjob uses google-api-python-client to access and manipulate the filesystem (GCS).

cloud_tmp_dir (--cloud-tmp-dir) : string

Default: (automatic)

GCS directory (URI ending in /) to use as temp space, e.g. gs://yourbucket/tmp/.

By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. gs://mrjob-01234567890abcdef/tmp/).

cloud_fs_sync_secs (--cloud-fs-sync-secs) : float

Default: 5.0

How long to wait for GCS to reach eventual consistency. This is typically less than a second, but the default is 5.0 to be safe.
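
For example, to pin temp space to your own bucket and give GCS a little more time to reach consistency:

    runners:
      dataproc:
        cloud_tmp_dir: gs://yourbucket/tmp/   # must be a URI ending in /
        cloud_fs_sync_secs: 10.0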