Configuration quick reference

Setting configuration options

You can set an option by:

Passing it on the command line as a switch (like --some-option)
Passing it as a keyword argument to the runner constructor, if you are creating the runner programmatically
Putting it in one of the included config files under a runner name, like this:

runners:
    local:
        python_bin: python3.6  # only used in local runner
    emr:
        python_bin: python3  # only used in Elastic MapReduce runner

See Config file format and location for information on where to put config files.
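The config-file route can also be sketched from Python. The snippet below is a minimal illustration, not part of mrjob itself: it rebuilds the example above as a plain dict and renders it as JSON, on the assumption that a JSON document is acceptable wherever YAML is expected. The helper name `render_conf` is hypothetical.

```python
import json

# Same settings as the YAML example above, expressed as a plain dict.
config = {
    'runners': {
        'local': {'python_bin': 'python3.6'},  # only used in local runner
        'emr': {'python_bin': 'python3'},      # only used in Elastic MapReduce runner
    }
}


def render_conf(conf):
    """Serialize a config dict as indented JSON text (hypothetical helper)."""
    return json.dumps(conf, indent=4, sort_keys=True)


conf_text = render_conf(config)
print(conf_text)
```

Where to save the resulting text is covered by Config file format and location; this sketch deliberately does not write to any path.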

Options that can’t be set from mrjob.conf (all runners)

Some options make no sense to set in a config file.

These options can be set via command-line switches:

Config	Command line	Default	Type
cat_output	--cat-output, --no-cat-output	output if output_dir is not set	boolean
conf_paths	-c, --conf-path, --no-conf	see `find_mrjob_conf()`	path list
output_dir	--output-dir	(automatic)	string
step_output_dir	--step-output-dir	(automatic)	string

These options can be set by overriding attributes or methods in your job class:

Option	Attribute	Method	Default
hadoop_input_format	`HADOOP_INPUT_FORMAT`	`hadoop_input_format()`	`None`
hadoop_output_format	`HADOOP_OUTPUT_FORMAT`	`hadoop_output_format()`	`None`
partitioner	`PARTITIONER`	`partitioner()`	`None`

These options can be set by overriding your job’s configure_args() to call the appropriate method:

Option	Method	Default
extra_args	`add_passthru_arg()`	`[]`

All of the above can be passed as keyword arguments to MRJobRunner.__init__() (this is what makes them runner options), but you usually don’t want to instantiate runners directly.
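The two job-class mechanisms above can be sketched together. This is a minimal example under stated assumptions, not a recommended configuration: the class name `MRLongWordCount`, the `--min-length` switch, and the choice of Hadoop class names are all illustrative. The `ImportError` fallback exists only so the snippet can be imported where mrjob is not installed.

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # Stand-in so this sketch still imports where mrjob is absent;
    # the job can only actually run with the real MRJob class.
    MRJob = object


class MRLongWordCount(MRJob):
    # Runner options set by overriding class attributes.
    # Both values are standard Hadoop class names, chosen only as examples.
    HADOOP_OUTPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
    PARTITIONER = 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'

    def configure_args(self):
        # Keep mrjob's built-in switches, then add a hypothetical one.
        super().configure_args()
        self.add_passthru_arg(
            '--min-length', type=int, default=1,
            help='ignore words shorter than this (illustrative option)')

    def mapper(self, _, line):
        # Emit (word, 1) for every word that is long enough.
        for word in line.split():
            if len(word) >= self.options.min_length:
                yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts emitted by the mapper.
        yield word, sum(counts)
```

To run it as a script you would add the usual `if __name__ == '__main__': MRLongWordCount.run()` guard and pass `--min-length` on the command line like any other switch.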

Other options for all runners

These options can be passed to any runner without an error, though some runners may ignore some options. See the text after the table for specifics.

Config	Command line	Default	Type
bootstrap_mrjob	--bootstrap-mrjob, --no-bootstrap-mrjob	`True`	boolean
check_input_paths	--check-input-paths, --no-check-input-paths	`True`	boolean
cleanup	--cleanup	`'ALL'`	string
cleanup_on_failure	--cleanup-on-failure	`'NONE'`	string
cmdenv	--cmdenv	`{}`	environment variable dict
hadoop_extra_args	--hadoop-args	`[]`	string list
hadoop_streaming_jar	--hadoop-streaming-jar	(automatic)	string
jobconf	-D, --jobconf	`{}`	jobconf dict
label	--label	script’s module name, or `no_script`	string
libjars	--libjars	`[]`	string list
local_tmp_dir	--local-tmp-dir	value of `tempfile.gettempdir()`	path
owner	--owner	`getpass.getuser()`, or `no_user` if that fails	string
py_files	--py-files	`[]`	path list
python_bin	--python-bin	(automatic)	command
read_logs	--read-logs, --no-read-logs	`True`	boolean
setup	--setup	`[]`	string list
sh_bin	--sh-bin	/bin/sh -ex	command
spark_args	--spark-args	`[]`	string list
task_python_bin	--task-python-bin	same as python_bin	command
upload_archives	--archives	`[]`	path list
upload_dirs	--dirs	`[]`	path list
upload_files	--files	`[]`	path list

LocalMRJobRunner takes no additional options, but:

bootstrap_mrjob is False by default
cmdenv always uses the local system’s path separator rather than : (so ; on Windows, no change elsewhere)
python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.

InlineMRJobRunner works like LocalMRJobRunner, only it also ignores bootstrap_mrjob, cmdenv, python_bin, upload_archives, and upload_files.
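The path-separator remark for cmdenv can be illustrated with the standard library. This sketch does not call mrjob; the helper `combine_path_values` is hypothetical, and the choice to place later values ahead of earlier ones is an assumption of the example rather than something stated above.

```python
import os


def combine_path_values(*values, sep=os.pathsep):
    """Join PATH-like values with a separator, later values first.

    Hypothetical helper: empty values are skipped, and `sep` defaults to
    the local system's path separator.
    """
    return sep.join(v for v in reversed(values) if v)


# With the default separator the result depends on the local platform.
print(combine_path_values('/usr/bin', '/opt/tools/bin'))
```

Passing `sep=':'` or `sep=';'` explicitly shows what the same values would look like on a non-Windows or Windows system respectively.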

Additional options for `DataprocJobRunner`

Config	Command line	Default	Type
cluster_properties	--cluster-property	`None`
core_instance_config	--core-instance-config	`None`
gcloud_bin	--gcloud-bin	`'gcloud'`	command
master_instance_config	--master-instance-config	`None`
network	--network	`None`	string
project_id	--project-id	read from credentials config file	string
service_account	--service-account	`None`
service_account_scopes	--service-account-scopes	(automatic)
task_instance_config	--task-instance-config	`None`

Additional options for `EMRJobRunner`

Config	Command line	Default	Type
add_steps_in_batch	--add-steps-in-batch, --no-add-steps-in-batch	`True` for AMIs before 5.28.0, `False` otherwise	boolean
additional_emr_info	--additional-emr-info	`None`	special
applications	--application, --applications	`[]`	string list
aws_access_key_id		`None`	string
aws_secret_access_key	--aws-secret-access-key	`None`	string
aws_session_token		`None`	string
bootstrap_actions	--bootstrap-actions	`[]`	string list
bootstrap_spark	--bootstrap-spark, --no-bootstrap-spark	(automatic)	boolean
cloud_log_dir	--cloud-log-dir	append `logs` to cloud_tmp_dir	string
core_instance_bid_price	--core-instance-bid-price	`None`	string
docker_client_config	--docker-client-config	`None`	string
docker_image	--docker-image, --no-docker	`None`	string
docker_mounts	--docker-mount	`[]`	string list
ebs_root_volume_gb	--ebs-root-volume-gb	`None`	integer
ec2_endpoint	--ec2-endpoint	(automatic)	string
ec2_key_pair	--ec2-key-pair	`None`	string
ec2_key_pair_file	--ec2-key-pair-file	`None`	path
emr_action_on_failure	--emr-action-on-failure	(automatic)	string
emr_configurations	--emr-configuration	`[]`	list of dicts
emr_endpoint	--emr-endpoint	infer from region	string
enable_emr_debugging	--enable-emr-debugging	`False`	boolean
hadoop_streaming_jar_on_emr	--hadoop-streaming-jar-on-emr	AWS default	string
iam_endpoint	--iam-endpoint	(automatic)	string
iam_instance_profile	--iam-instance-profile	(automatic)	string
iam_service_role	--iam-service-role	(automatic)	string
instance_fleets	--instance-fleet	`None`
instance_groups	--instance-groups	`None`
master_instance_bid_price	--master-instance-bid-price	`None`	string
max_clusters_in_pool	--max-clusters-in-pool	0 (disabled)	integer
max_concurrent_steps	--max-concurrent-steps	1	integer
min_available_mb	--min-available-mb	0 (disabled)	integer
min_available_virtual_cores	--min-available-virtual-cores	0 (disabled)	integer
pool_clusters	--pool-clusters	`True`	boolean
pool_jitter_seconds	--pool-jitter-seconds	60	integer
pool_name	--pool-name	`'default'`	string
pool_timeout_minutes	--pool-timeout-minutes	0 (disabled)	integer
pool_wait_minutes	--pool-wait-minutes	0	integer
release_label	--release-label	`None`	string
s3_endpoint	--s3-endpoint	(automatic)	string
ssh_add_bin	--ssh-add-bin	`'ssh-add'`	command
ssh_bin	--ssh-bin	`'ssh'`	command
tags	--tag	`{}`	dict
task_instance_bid_price	--task-instance-bid-price	`None`	string
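As a rough illustration of how a few of these options might sit together under the emr runner name, the sketch below builds a config dict and checks each value against the kind listed in the table. The values themselves (the tag, the volume size, and so on) are made up for the example, and `EXPECTED_KINDS` and `check_kinds` are hypothetical names, not part of mrjob.

```python
import json

# Illustrative values only; option names and kinds come from the table above.
emr_options = {
    'applications': ['Spark'],       # string list
    'ebs_root_volume_gb': 20,        # integer
    'enable_emr_debugging': False,   # boolean
    'pool_name': 'default',          # string
    'tags': {'team': 'analytics'},   # dict
}

# Python types standing in for the table's "Type" column (an assumption).
EXPECTED_KINDS = {
    'applications': list,
    'ebs_root_volume_gb': int,
    'enable_emr_debugging': bool,
    'pool_name': str,
    'tags': dict,
}


def check_kinds(options, expected):
    """Return the names of options whose value has an unexpected type."""
    return sorted(
        name for name, value in options.items()
        if name in expected and not isinstance(value, expected[name])
    )


config = {'runners': {'emr': emr_options}}
print(json.dumps(config, indent=4, sort_keys=True))
print('mismatched options:', check_kinds(emr_options, EXPECTED_KINDS))
```

A check like this only catches obvious type slips before the file is used; it says nothing about whether the values are sensible for a given cluster.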

Additional options for `HadoopJobRunner`

Config	Command line	Default	Type
hadoop_bin	--hadoop-bin	(automatic)	command
hadoop_log_dirs	--hadoop-log-dir	(automatic)	path list
hadoop_tmp_dir	--hadoop-tmp-dir	`tmp/mrjob`	path
spark_deploy_mode	--spark-deploy-mode	`'client'`	string
spark_master	--spark-master	`'yarn'`	string
spark_submit_bin	--spark-submit-bin	(automatic)	command

mrjob v0.7.4 documentation
