Dataproc runner options
All options from “Options available to all runners”, “Hadoop-related options”, and “Cloud runner options” are available when running jobs on Google Cloud Dataproc.
Google credentials
Basic credentials are not set in the config file; see Getting started with Google Cloud for details.
- project_id (--project-id) : string Default: read from credentials config file
The ID of the Google Cloud Project to run under.
Changed in version 0.6.2: This used to be called gcp_project.
- service_account (--service-account) : Default: None
Optional service account to use when creating a cluster. For more information see Service Accounts.
New in version 0.6.3.
- service_account_scopes (--service-account-scopes) : Default: (automatic)
Optional service account scopes to pass to the API when creating a cluster.
Generally it’s suggested that you instead create a service_account with the scopes you want.
New in version 0.6.3.
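As a rough illustration of how the credential options above map onto command-line switches, here is a minimal sketch. The helper name build_credential_args and the example values are hypothetical and not part of mrjob. Only the switches --project-id and --service-account, plus the standard -r dataproc runner switch, come from the documentation.

```python
def build_credential_args(project_id=None, service_account=None):
    """Return command-line switches for the Dataproc credential options.

    Hypothetical helper for illustration only; it just assembles a list of
    strings and does not call mrjob or Google Cloud.
    """
    args = ['-r', 'dataproc']
    if project_id is not None:
        # Overrides the project ID read from the credentials config file.
        args += ['--project-id', project_id]
    if service_account is not None:
        # Optional service account used when creating a cluster.
        args += ['--service-account', service_account]
    return args


if __name__ == '__main__':
    # Placeholder values; substitute your own project and service account.
    print(build_credential_args('my-project', 'my-sa@example.com'))
```

Leaving both arguments unset yields only the runner switch, in which case the project ID would be read from the credentials config file as described above.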
Job placement
Cluster configuration
- cluster_properties (--cluster-property) : Default: None
A dictionary of properties to set in the cluster’s config files (e.g. mapred-site.xml). For details, see Cluster properties.
- core_instance_config (--core-instance-config) : Default: None
A dictionary of additional parameters to pass as config.worker_config when creating the cluster. Follows the format of InstanceGroupConfig, except that it uses snake_case instead of camelCase.
For example, to specify 100GB of disk space on core instances, add this to your config file:
runners:
  dataproc:
    core_instance_config:
      disk_config:
        boot_disk_size_gb: 100
To set this option on the command line, pass in JSON:
--core-instance-config '{"disk_config": {"boot_disk_size_gb": 100}}'
This option can be used to set the number of core instances (num_instances) or the instance type (machine_type_uri), but usually you’ll want to use num_core_instances and core_instance_type along with this option.
New in version 0.6.3.
- master_instance_config (--master-instance-config) : Default: None
A dictionary of additional parameters to pass as config.master_config when creating the cluster. See core_instance_config for more details.
New in version 0.6.3.
- task_instance_config (--task-instance-config) : Default: None
A dictionary of additional parameters to pass as config.secondary_worker_config when creating the cluster. See core_instance_config for more details.
To make task instances preemptible, add this to your config file:
runners:
  dataproc:
    task_instance_config:
      is_preemptible: true
Note that this config won’t be applied unless you specify at least one task instance (either through num_task_instances or by passing num_instances to this option).
New in version 0.6.3.
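The instance config options above take InstanceGroupConfig-style dictionaries with snake_case keys, and expect JSON on the command line. The following is a minimal sketch of both steps: converting camelCase keys to snake_case and producing a shell-quoted switch. The helper names to_snake_case, snake_case_keys, and instance_config_switch are hypothetical and not part of mrjob.

```python
import json
import re
import shlex


def to_snake_case(name):
    """Convert a camelCase key such as 'bootDiskSizeGb' to 'boot_disk_size_gb'."""
    # Insert an underscore before every capital letter that is not at the start.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


def snake_case_keys(value):
    """Recursively convert dictionary keys from camelCase to snake_case."""
    if isinstance(value, dict):
        return {to_snake_case(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(v) for v in value]
    return value


def instance_config_switch(option, config):
    """Format a config dictionary as '--<option> <shell-quoted JSON>'."""
    return '--%s %s' % (option, shlex.quote(json.dumps(config)))


if __name__ == '__main__':
    # Keys written in the camelCase style of the InstanceGroupConfig format.
    api_style = {'diskConfig': {'bootDiskSizeGb': 100}}
    config = snake_case_keys(api_style)
    print(instance_config_switch('core-instance-config', config))
```

Quoting with shlex.quote keeps the JSON intact when the switch is pasted into a POSIX shell; other shells may need different quoting.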
Other rarely used options
- gcloud_bin (--gcloud-bin) : command Default: 'gcloud'
Path to the gcloud binary; may include switches (e.g. 'gcloud -v' or ['gcloud', '-v']). Defaults to gcloud.
Used only as a way to create an SSH tunnel to the Resource Manager.
Changed in version 0.6.8: Setting this to an empty value (--gcloud-bin '') instructs mrjob to use the default (used to disable SSH).
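The two accepted forms of gcloud_bin (a single string with switches, or a list of arguments) can be thought of as equivalent after shell-style splitting. The sketch below illustrates that idea with a hypothetical helper, normalize_command. It is an assumption about the general behavior, not mrjob's actual parsing code.

```python
import shlex


def normalize_command(cmd, default=('gcloud',)):
    """Return a command as a list of arguments.

    Hypothetical helper: accepts a string such as 'gcloud -v', a list such
    as ['gcloud', '-v'], or an empty value, which falls back to the default.
    """
    if not cmd:
        # Empty string, empty list, or None: use the default command.
        return list(default)
    if isinstance(cmd, str):
        # Split the string the way a POSIX shell would.
        return shlex.split(cmd)
    return list(cmd)


if __name__ == '__main__':
    print(normalize_command('gcloud -v'))
    print(normalize_command(['gcloud', '-v']))
    print(normalize_command(''))
```

The empty-value branch mirrors the behavior described for version 0.6.8 and later, where an empty setting means "use the default".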