Configuration quick reference¶
Setting configuration options¶
You can set an option by:

- Passing it on the command line with the switch version (like `--some-option`)
- Passing it as a keyword argument to the runner constructor, if you are creating the runner programmatically
- Putting it in one of the included config files under a runner name, like this:

```yaml
runners:
  local:
    python_bin: python3.6  # only used in local runner
  emr:
    python_bin: python3    # only used in Elastic MapReduce runner
```
See Config file format and location for information on where to put config files.
Options that can’t be set from mrjob.conf (all runners)¶
There are some options that it makes no sense to set in the config file.
These options can be set via command-line switches:
| Config | Command line | Default | Type |
|---|---|---|---|
| cat_output | --cat-output, --no-cat-output | output if output_dir is not set | boolean |
| conf_paths | -c, --conf-path, --no-conf | see find_mrjob_conf() | path list |
| output_dir | --output-dir | (automatic) | string |
| step_output_dir | --step-output-dir | (automatic) | string |
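These switch-only options are given when invoking the job script; for instance (`my_job.py`, the config path, and the output directory are hypothetical placeholders):

```sh
python my_job.py -c custom.conf --output-dir /tmp/job-output --no-cat-output input.txt
```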
These options can be set by overriding attributes or methods in your job class:
| Option | Attribute | Method | Default |
|---|---|---|---|
| hadoop_input_format | HADOOP_INPUT_FORMAT | hadoop_input_format() | None |
| hadoop_output_format | HADOOP_OUTPUT_FORMAT | hadoop_output_format() | None |
| partitioner | PARTITIONER | partitioner() | None |
These options can be set by overriding your job’s configure_args() to call the appropriate method:

| Option | Method | Default |
|---|---|---|
| extra_args | add_passthru_arg() | [] |
All of the above can be passed as keyword arguments to
MRJobRunner.__init__()
(this is what makes them runner options), but you usually don’t want to
instantiate runners directly.
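The way these sources combine can be pictured as a simple dict merge in which later sources override earlier ones. This is only an illustrative sketch (the merge below is hypothetical and simplified, not mrjob's actual option-resolution code), assuming that constructor keyword arguments take precedence over config-file values:

```python
# Illustrative sketch: combining runner options from a config file,
# command-line switches, and runner constructor keyword arguments.
# Assumption: later sources override earlier ones.

config_file_opts = {"python_bin": "python3.6", "cmdenv": {"TZ": "UTC"}}
command_line_opts = {"output_dir": "hdfs:///tmp/out"}
constructor_kwargs = {"python_bin": "python3"}  # overrides the config file

runner_opts = {}
for source in (config_file_opts, command_line_opts, constructor_kwargs):
    runner_opts.update(source)

print(runner_opts["python_bin"])   # -> python3
print(runner_opts["output_dir"])   # -> hdfs:///tmp/out
```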
Other options for all runners¶
These options can be passed to any runner without an error, though some runners may ignore some options. See the text after the table for specifics.
LocalMRJobRunner takes no additional options, but:

- bootstrap_mrjob is `False` by default
- cmdenv uses the local system path separator instead of `:` all the time (so `;` on Windows, no change elsewhere)
- python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.
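The path-separator behavior described above matches Python's own `os.pathsep`, which is `;` on Windows and `:` elsewhere. A minimal sketch of joining path-like values with the local separator (illustrative paths, not mrjob code):

```python
import os

# os.pathsep is the local system path separator:
# ';' on Windows, ':' on other platforms.
paths = ["/usr/local/bin", "/usr/bin"]
joined = os.pathsep.join(paths)
print(joined)  # e.g. "/usr/local/bin:/usr/bin" on Unix-like systems
```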
InlineMRJobRunner works like LocalMRJobRunner, only it also ignores bootstrap_mrjob, cmdenv, python_bin, upload_archives, and upload_files.
Additional options for DataprocJobRunner¶
| Config | Command line | Default | Type |
|---|---|---|---|
| cluster_properties | --cluster-property | None | |
| core_instance_config | --core-instance-config | None | |
| gcloud_bin | --gcloud-bin | 'gcloud' | command |
| master_instance_config | --master-instance-config | None | |
| network | --network | None | string |
| project_id | --project-id | read from credentials config file | string |
| service_account | --service-account | None | |
| service_account_scopes | --service-account-scopes | (automatic) | |
| task_instance_config | --task-instance-config | None | |
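Several of these can be set in a config file under the dataproc runner name, for example (the values below are illustrative placeholders, not defaults):

```yaml
runners:
  dataproc:
    project_id: my-project-id   # otherwise read from credentials config file
    network: default
    gcloud_bin: gcloud
```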
Additional options for EMRJobRunner¶
Additional options for HadoopJobRunner¶
| Config | Command line | Default | Type |
|---|---|---|---|
| hadoop_bin | --hadoop-bin | (automatic) | command |
| hadoop_log_dirs | --hadoop-log-dir | (automatic) | path list |
| hadoop_tmp_dir | --hadoop-tmp-dir | tmp/mrjob | path |
| spark_deploy_mode | --spark-deploy-mode | 'client' | string |
| spark_master | --spark-master | 'yarn' | string |
| spark_submit_bin | --spark-submit-bin | (automatic) | command |
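As with the other runners, these go under the hadoop runner name in a config file, for example (the hadoop_bin path is an illustrative placeholder):

```yaml
runners:
  hadoop:
    hadoop_bin: /usr/local/hadoop/bin/hadoop  # illustrative path
    hadoop_tmp_dir: tmp/mrjob
    spark_master: yarn
```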