Configuration quick reference

Setting configuration options

You can set an option by:

  • Passing it on the command line as a switch (like --some-option)

  • Passing it as a keyword argument to the runner constructor, if you are creating the runner programmatically

  • Putting it in one of the included config files under a runner name, like this:

    runners:
        local:
            python_bin: python3.6  # only used in local runner
        emr:
            python_bin: python3  # only used in Elastic MapReduce runner
    

    See Config file format and location for information on where to put config files.
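Since JSON is essentially a subset of YAML, the same per-runner structure can also be produced from a plain Python dict. The following is a minimal sketch using only the standard library. options_for is a hypothetical helper written for illustration and is not part of mrjob, and the option values simply mirror the example above.

```python
import json

# Same structure as the YAML example: a top-level "runners" key,
# then one section per runner name, then option names and values.
config = {
    "runners": {
        "local": {"python_bin": "python3.6"},  # only used in local runner
        "emr": {"python_bin": "python3"},      # only used in EMR runner
    }
}


def options_for(config, runner_name):
    """Return the options a config dict defines for one runner name.

    Hypothetical helper for illustration only; mrjob has its own loader.
    """
    return config.get("runners", {}).get(runner_name, {})


# Serialize the dict; the result could be saved as a config file.
conf_text = json.dumps(config, indent=4)
print(conf_text)
print(options_for(config, "emr"))
```

A runner name with no section in the file (for example "hadoop" here) simply yields an empty dict in this sketch, which matches the idea that each runner only sees the options listed under its own name.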

Options that can’t be set from mrjob.conf (all runners)

Some options make no sense to set in a config file.

These options can be set via command-line switches:

Config           Command line                   Default                          Type
cat_output       --cat-output, --no-cat-output  output if output_dir is not set  boolean
conf_paths       -c, --conf-path, --no-conf     see find_mrjob_conf()            path list
output_dir       --output-dir                   (automatic)                      string
step_output_dir  --step-output-dir              (automatic)                      string

These options can be set by overriding attributes or methods in your job class:

Option                Attribute             Method                  Default
hadoop_input_format   HADOOP_INPUT_FORMAT   hadoop_input_format()   None
hadoop_output_format  HADOOP_OUTPUT_FORMAT  hadoop_output_format()  None
partitioner           PARTITIONER           partitioner()           None

These options can be set by overriding your job’s configure_args() to call the appropriate method:

Option      Method              Default
extra_args  add_passthru_arg()  []

All of the above can be passed as keyword arguments to MRJobRunner.__init__() (this is what makes them runner options), but you usually don’t want to instantiate runners directly.
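The choice between overriding an attribute and overriding a method can be sketched without mrjob installed. JobBase below is a stand-in class, not mrjob's MRJob, and it assumes that each method returns the matching attribute by default, as the table of attributes and methods suggests. The Hadoop class names are illustrative values only.

```python
class JobBase:
    """Stand-in for a job base class (NOT mrjob's MRJob)."""

    HADOOP_INPUT_FORMAT = None
    PARTITIONER = None

    def hadoop_input_format(self):
        # Assumed default behavior: the method returns the attribute.
        return self.HADOOP_INPUT_FORMAT

    def partitioner(self):
        return self.PARTITIONER


class AttributeJob(JobBase):
    # Simplest form: the value is fixed, so set the class attribute.
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'


class MethodJob(JobBase):
    def __init__(self, partition_by_key_field=False):
        self.partition_by_key_field = partition_by_key_field

    def partitioner(self):
        # Override the method when the value depends on runtime state.
        if self.partition_by_key_field:
            return 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
        return super().partitioner()


print(AttributeJob().hadoop_input_format())
print(MethodJob(partition_by_key_field=True).partitioner())
print(MethodJob().partitioner())
```

In a real job the base class would be mrjob.job.MRJob rather than this stand-in; the sketch only shows why both an attribute and a method exist for each option.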

Other options for all runners

These options can be passed to any runner without an error, though some runners may ignore some options. See the text after the table for specifics.

Config                Command line                                 Default                                      Type
bootstrap_mrjob       --bootstrap-mrjob, --no-bootstrap-mrjob      True                                         boolean
check_input_paths     --check-input-paths, --no-check-input-paths  True                                         boolean
cleanup               --cleanup                                    'ALL'                                        string
cleanup_on_failure    --cleanup-on-failure                         'NONE'                                       string
cmdenv                --cmdenv                                     {}                                           environment variable dict
hadoop_extra_args     --hadoop-args                                []                                           string list
hadoop_streaming_jar  --hadoop-streaming-jar                       (automatic)                                  string
jobconf               -D, --jobconf                                {}                                           jobconf dict
label                 --label                                      script’s module name, or no_script           string
libjars               --libjars                                    []                                           string list
local_tmp_dir         --local-tmp-dir                              value of tempfile.gettempdir()               path
owner                 --owner                                      getpass.getuser(), or no_user if that fails  string
py_files              --py-files                                   []                                           path list
python_bin            --python-bin                                 (automatic)                                  command
read_logs             --read-logs, --no-read-logs                  True                                         boolean
setup                 --setup                                      []                                           string list
sh_bin                --sh-bin                                     /bin/sh -ex                                  command
spark_args            --spark-args                                 []                                           string list
task_python_bin       --task-python-bin                            same as python_bin                           command
upload_archives       --archives                                   []                                           path list
upload_dirs           --dirs                                       []                                           path list
upload_files          --files                                      []                                           path list
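Two of the defaults in the table are described in terms of standard-library calls. The sketch below shows one way such defaults could be computed; default_owner and default_local_tmp_dir are hypothetical names chosen for illustration, not functions from mrjob.

```python
import getpass
import tempfile


def default_owner():
    """Current user name, or 'no_user' if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        # getpass.getuser() can fail, e.g. when no user name is
        # available in the environment or the password database.
        return 'no_user'


def default_local_tmp_dir():
    """Directory the platform uses for temporary files."""
    return tempfile.gettempdir()


print(default_owner())
print(default_local_tmp_dir())
```

Setting owner or local_tmp_dir explicitly would bypass fallbacks like these.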

LocalMRJobRunner takes no additional options, but:

  • bootstrap_mrjob is False by default
  • cmdenv uses the local system path separator rather than always using : (so ; on Windows, no change elsewhere)
  • python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.

InlineMRJobRunner works like LocalMRJobRunner, only it also ignores bootstrap_mrjob, cmdenv, python_bin, upload_archives, and upload_files.
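The LocalMRJobRunner differences for python_bin and cmdenv map onto two standard-library values. This is an illustrative sketch only: it assumes "the current Python interpreter" means sys.executable, and join_path_values is a hypothetical helper, not mrjob's own code.

```python
import os
import sys


def local_python_bin():
    # Assumption: the current interpreter is the one running this code.
    return sys.executable


def join_path_values(*values):
    """Join path-like values with the local separator.

    os.pathsep is ';' on Windows and ':' elsewhere. Empty values are skipped.
    """
    return os.pathsep.join(v for v in values if v)


print(local_python_bin())
print(join_path_values('/opt/tools/bin', '/usr/local/bin'))
```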

Additional options for EMRJobRunner

Config                       Command line                                   Default                                       Type
add_steps_in_batch           --add-steps-in-batch, --no-add-steps-in-batch  True for AMIs before 5.28.0, False otherwise  boolean
additional_emr_info          --additional-emr-info                          None                                          special
applications                 --application, --applications                  []                                            string list
aws_access_key_id                                                           None                                          string
aws_secret_access_key        --aws-secret-access-key                        None                                          string
aws_session_token                                                           None                                          string
bootstrap_actions            --bootstrap-actions                            []                                            string list
bootstrap_spark              --bootstrap-spark, --no-bootstrap-spark        (automatic)                                   boolean
cloud_log_dir                --cloud-log-dir                                append logs to cloud_tmp_dir                  string
core_instance_bid_price      --core-instance-bid-price                      None                                          string
docker_client_config         --docker-client-config                         None                                          string
docker_image                 --docker-image, --no-docker                    None                                          string
docker_mounts                --docker-mount                                 []                                            string list
ebs_root_volume_gb           --ebs-root-volume-gb                           None                                          integer
ec2_endpoint                 --ec2-endpoint                                 (automatic)                                   string
ec2_key_pair                 --ec2-key-pair                                 None                                          string
ec2_key_pair_file            --ec2-key-pair-file                            None                                          path
emr_action_on_failure        --emr-action-on-failure                        (automatic)                                   string
emr_configurations           --emr-configuration                            []                                            list of dicts
emr_endpoint                 --emr-endpoint                                 infer from region                             string
enable_emr_debugging         --enable-emr-debugging                         False                                         boolean
hadoop_streaming_jar_on_emr  --hadoop-streaming-jar-on-emr                  AWS default                                   string
iam_endpoint                 --iam-endpoint                                 (automatic)                                   string
iam_instance_profile         --iam-instance-profile                         (automatic)                                   string
iam_service_role             --iam-service-role                             (automatic)                                   string
instance_fleets              --instance-fleet                               None
instance_groups              --instance-groups                              None
master_instance_bid_price    --master-instance-bid-price                    None                                          string
max_clusters_in_pool         --max-clusters-in-pool                         0 (disabled)                                  integer
max_concurrent_steps         --max-concurrent-steps                         1                                             string
min_available_mb             --min-available-mb                             0 (disabled)                                  integer
min_available_virtual_cores  --min-available-virtual-cores                  0 (disabled)                                  integer
pool_clusters                --pool-clusters                                True                                          string
pool_jitter_seconds          --pool-jitter-seconds                          60                                            string
pool_name                    --pool-name                                    'default'                                     string
pool_timeout_minutes         --pool-timeout-minutes                         0 (disabled)                                  string
pool_wait_minutes            --pool-wait-minutes                            0                                             string
release_label                --release-label                                None                                          string
s3_endpoint                  --s3-endpoint                                  (automatic)                                   string
ssh_add_bin                  --ssh-add-bin                                  'ssh-add'                                     command
ssh_bin                      --ssh-bin                                      'ssh'                                         command
tags                         --tag                                          {}                                            dict
task_instance_bid_price      --task-instance-bid-price                      None                                          string
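Most switches in the EMRJobRunner table are the option name with underscores turned into hyphens, with a handful of exceptions. The sketch below encodes that pattern for this table only. switches_for and SPECIAL_SWITCHES are hypothetical names for illustration, this is not how mrjob itself defines its switches, and the function does not check that a name is a real option.

```python
# Options from the EMRJobRunner table whose switches do not follow the
# plain "--" + name-with-hyphens pattern. An empty list means the table
# lists no command-line switch for that option.
SPECIAL_SWITCHES = {
    'add_steps_in_batch': ['--add-steps-in-batch', '--no-add-steps-in-batch'],
    'applications': ['--application', '--applications'],
    'aws_access_key_id': [],
    'aws_session_token': [],
    'bootstrap_spark': ['--bootstrap-spark', '--no-bootstrap-spark'],
    'docker_image': ['--docker-image', '--no-docker'],
    'docker_mounts': ['--docker-mount'],
    'emr_configurations': ['--emr-configuration'],
    'instance_fleets': ['--instance-fleet'],
    'tags': ['--tag'],
}


def switches_for(option):
    """Return the switches the table lists for an EMR option name."""
    if option in SPECIAL_SWITCHES:
        return list(SPECIAL_SWITCHES[option])
    # Default pattern: release_label -> --release-label
    return ['--' + option.replace('_', '-')]


print(switches_for('release_label'))      # ['--release-label']
print(switches_for('tags'))               # ['--tag']
print(switches_for('aws_session_token'))  # []
```

Options with an empty list here have no switch in the table, so they would need to be set some other way, such as in a config file.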