Options available to all runners

The format of each item in this document is:

mrjob_conf_option_name (--command-line-option-name) : option_type

Default: default value

Description of option behavior

Options that take multiple values can be passed multiple times on the command line. All options can be passed as keyword arguments to the runner if initialized programmatically.
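
For example, here is a minimal sketch of creating a runner directly and passing options as keyword arguments (my_job.py and input.txt are hypothetical paths; more commonly you would let MRJob.make_runner() construct the runner for you):

from mrjob.local import LocalMRJobRunner

# each keyword argument corresponds to an option described in this document
runner = LocalMRJobRunner(
    mr_job_script='my_job.py',
    input_paths=['input.txt'],
    cmdenv={'TZ': 'America/Los_Angeles'},
    cleanup=['TMP'],
)
runner.run()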

Making files available to tasks

Most jobs have dependencies of some sort: Python packages, Debian packages, data files, etc. This section covers the options, available to all runners, that mrjob uses to upload files to your job’s execution environments. See File options if you want to write your own command-line options related to file uploading.

Warning

You must wait to read files until after class initialization. That means you should use the *_init() methods to read files. Trying to read files into class variables will not work.
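
For example, here is a sketch of a job that reads a support file in mapper_init() rather than at class-definition time (stop_words.txt is a hypothetical file assumed to have been made available with upload_files / --file):

from mrjob.job import MRJob


class MRNonStopwordCount(MRJob):

    # DON'T do this -- stop_words.txt isn't available when the class is defined:
    # STOP_WORDS = set(open('stop_words.txt'))

    def mapper_init(self):
        # the uploaded file appears in the task's working directory
        with open('stop_words.txt') as f:
            self.stop_words = set(line.strip() for line in f)

    def mapper(self, _, line):
        for word in line.split():
            if word.lower() not in self.stop_words:
                yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRNonStopwordCount.run()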

bootstrap_mrjob (--bootstrap-mrjob, --no-bootstrap-mrjob) : boolean

Default: (automatic)

Should we automatically zip up the mrjob library and install it when we run your job? By default, we do, unless interpreter is set.

Set this to False if you’ve already installed mrjob on your Hadoop cluster or plan to install it by some other method.

In previous versions, mrjob was bootstrapped as a tarball rather than a zip file.

py_files (--py-file) : path list

Default: []

List of .egg or .zip files to add to your job’s PYTHONPATH.

This is based on a Spark feature, but it works just as well with streaming jobs.
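
For example (my_lib.egg is a hypothetical path), in the config file:

py_files:
  - my_lib.egg

On the command line:

--py-file my_lib.egg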

New in version 0.5.7.

upload_archives (--archive) : path list

Default: []

A list of archives (e.g. tarballs) to unpack in the local directory of the mr_job script when it runs. You can set the name of the directory we unpack into (within the job’s working directory) by appending #nameinworkingdir to the path; otherwise we just use the name of the archive file (e.g. foo.tar.gz is unpacked to the directory foo.tar.gz/, and foo.tar.gz#stuff is unpacked to stuff/).
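
For example, to unpack a (hypothetical) lookup.tar.gz as lookup/ in the job’s working directory, in the config file:

upload_archives:
  - lookup.tar.gz#lookup

On the command line:

--archive lookup.tar.gz#lookup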

Changed in version 0.5.7: This works with Spark as well.

upload_dirs (--dir) : path list

Default: []

A list of directories to copy to the local directory of the mr_job script when it runs (mrjob does this by tarballing the directory and submitting the tarball to Hadoop as an archive).

You can set the name the copied directory gets in the job’s working directory by appending #nameinworkingdir to the path; otherwise we just use the directory’s own name.

This works with Spark as well.
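
For example, to copy a (hypothetical) local directory data/ into the job’s working directory as stuff/, in the config file:

upload_dirs:
  - data#stuff

On the command line:

--dir data#stuff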

New in version 0.5.8.

upload_files (--file) : path list

Default: []

Files to copy to the local directory of the mr_job script when it runs. You can set the name of the file in the job’s working directory by appending #nameinworkingdir to the path; otherwise we just use the name of the file.

In the config file:

upload_files:
  - file_1.txt
  - file_2.sqlite

On the command line:

--file file_1.txt --file file_2.sqlite

Changed in version 0.5.7: This works with Spark as well.

Temp files and cleanup

cleanup (--cleanup) : string

Default: 'ALL'

List of which kinds of directories to delete when a job succeeds. Valid choices are:

  • 'ALL': delete logs and local and remote temp files; stop cluster if on EMR and the job is not done when cleanup is run

  • 'CLUSTER': terminate EMR cluster if job not done when cleanup is run

  • 'JOB': stop job if not done when cleanup runs (temporarily disabled)

  • 'LOCAL_TMP': delete local temp files only

  • 'LOGS': delete logs only

  • 'NONE': delete nothing

  • 'REMOTE_TMP': delete remote temp files only

  • 'TMP': delete local and remote temp files, but not logs

In the config file:

cleanup: [LOGS, JOB]

On the command line:

--cleanup=LOGS,JOB

Changed in version 0.5.0: Options ending in TMP used to end in SCRATCH.

cleanup_on_failure (--cleanup-on-failure) : string

Default: 'NONE'

Which kinds of directories to clean up when a job fails. Valid choices are the same as cleanup.
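
For example, to delete temp files but keep logs when a job fails, in the config file:

cleanup_on_failure: [TMP]

On the command line:

--cleanup-on-failure=TMP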

local_tmp_dir : path

Default: value of tempfile.gettempdir()

Alternate local temp directory.

There isn’t a command-line switch for this option; just set TMPDIR or any other environment variable respected by tempfile.gettempdir().
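
You can still set it in the config file (the path below is hypothetical):

local_tmp_dir: /scratch/mrjob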

Changed in version 0.5.0: This option used to be named base_tmp_dir.

output_dir (--output-dir) : string

Default: (automatic)

An empty/non-existent directory where Hadoop streaming should put the final output from the job. If you don’t specify an output directory, we’ll output into a subdirectory of this job’s temporary directory. You can control this from the command line with --output-dir. This option cannot be set from configuration files. If used with the hadoop runner, this path does not need to be fully qualified with hdfs:// URIs because it’s understood that it has to be on HDFS.
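
For example (the paths are hypothetical):

python my_job.py -r hadoop --output-dir /user/me/my_output input.txt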

no_output (--no-output) : boolean

Default: False

Don’t stream output to STDOUT after job completion. This is often used in conjunction with --output-dir to store output only in HDFS or S3.
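
For example, to leave output on S3 only (the bucket and paths are hypothetical):

python my_job.py -r emr --no-output --output-dir s3://my-bucket/my_output input.txt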

step_output_dir (--step-output-dir) : string

Default: (automatic)

For a multi-step job, where to put output from job steps other than the last one. Each step’s output will go into a numbered subdirectory of this one (0000/, 0001/, etc.)

This option can be useful for debugging. By default, intermediate output goes into HDFS, which is fastest but not easily accessible on EMR or Dataproc.

This option currently does nothing on local and inline runners.

Job execution context

cmdenv (--cmdenv) : environment variable dict

Default: {}

Dictionary of environment variables to pass to the job inside Hadoop streaming.

In the config file:

cmdenv:
    PYTHONPATH: $HOME/stuff
    TZ: America/Los_Angeles

On the command line:

--cmdenv PYTHONPATH=$HOME/stuff,TZ=America/Los_Angeles

Changed in version 0.5.7: This works with Spark too. In client mode (hadoop runner), these environment variables are passed directly to spark-submit.

interpreter (--interpreter) : string

Default: None

Non-Python command to launch your script with (e.g. 'ruby'). This will also be used to query the script about steps unless you set steps_interpreter.

If you want to use an alternate Python command to run the job, use python_bin.

This takes precedence over python_bin and steps_python_bin.
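
For example, to launch a (non-Python) job script with ruby, in the config file:

interpreter: ruby

On the command line:

--interpreter ruby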

python_bin (--python-bin) : command

Default: (automatic)

Name/path of alternate Python binary for wrapper scripts and mappers/reducers (e.g. 'python -v').

If you’re on Python 3, this always defaults to 'python3'.

If you’re on Python 2, this defaults to 'python' (except on EMR AMIs prior to 4.3.0, where it will be 'python2.7').

This option also affects which Python binary is used for file locking in setup scripts, so it might be useful to set even if you’re using a non-Python interpreter. It’s also used by EMRJobRunner to compile mrjob after bootstrapping it (see bootstrap_mrjob).

Unlike interpreter, this does not affect the binary used to query the job about its steps (use steps_python_bin).
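
For example, to run mappers and reducers with verbose Python output, in the config file:

python_bin: python -v

On the command line:

--python-bin 'python -v'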

setup (--setup) : string list

Default: []

A list of lines of shell script to run before each task (mapper/reducer).

This option is complex and powerful; the best way to get started is to read the Job Environment Setup Cookbook.

Using this option replaces your task with a shell “wrapper” script that executes the setup commands, and then executes the task as the last line of the script. This means that environment variables set by Hadoop (e.g. $mapred_job_id) are available to setup commands, and that you can pass environment variables to the task (e.g. $PYTHONPATH) using export.

We use file locking around the setup commands (not the task) to ensure that multiple tasks running on the same node won’t run them simultaneously (it’s safe to run make). Before running the task, we cd back to the original working directory.

In addition, passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script.

path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional.

You can indicate that an archive should be unarchived into a directory by putting a / after name (e.g. foo.tar.gz#foo/).

You can indicate that a directory should be copied into the job’s working directory by putting a / after path (e.g. src-tree/#). You may optionally put a / after name as well (e.g. cd src-tree/#/subdir).

New in version 0.5.8: support for directories (above)

For more details of parsing, see parse_setup_cmd().
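
For example, here is a sketch (my_code.tar.gz is a hypothetical archive of your source tree) that unpacks the archive in each task’s working directory and puts it on $PYTHONPATH. In the config file:

setup:
  - 'export PYTHONPATH=$PYTHONPATH:my_code.tar.gz#/'

On the command line:

--setup 'export PYTHONPATH=$PYTHONPATH:my_code.tar.gz#/'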

sh_bin (--sh-bin) : command

Default: sh -ex (with exceptions below)

Name/path of alternate shell binary to use for setup and bootstrap. Needs to be backwards compatible with Bourne Shell (e.g. 'sh', 'bash', 'zsh').

This is also used to wrap mappers, reducers, etc. that require piping one command into another (see e.g. mapper_pre_filter()).

On Dataproc and EMR, this defaults to /bin/sh -ex.

Changed in version 0.5.9: Starting with EMR AMI 5.2.0, sh -e is broken, so we emulate the -e switch by using /bin/sh -x as our shell, and prepending set -e to any shell script generated by mrjob. set -e is not prepended if you set sh_bin yourself; you could add it with setup.

steps_interpreter (--steps-interpreter) : command

Default: current Python interpreter

Alternate (non-Python) command to use to query the job about its steps. Usually it’s good enough to set interpreter.

If you want to use an alternate Python command to get the job’s steps, use steps_python_bin.

This takes precedence over steps_python_bin.

steps_python_bin (--steps-python-bin) : command

Default: (current Python interpreter)

Name/path of alternate Python binary to use to query the job about its steps. Rarely needed. If not set, we use sys.executable (the current Python interpreter).

task_python_bin (--task-python-bin) : command

Default: same as python_bin

Name/path of alternate Python binary used to run the job (i.e. when invoking it with --mapper, --spark, or anything other than --steps).

In most cases, you’re better off setting python_bin, which this defaults to. This option exists mostly to support running tasks inside Docker while using a normal Python binary in setup wrapper scripts.

Other

conf_paths (-c, --conf-path, --no-conf) : path list

Default: see find_mrjob_conf()

List of paths to configuration files. This option cannot be used in configuration files, because that would cause a universe-ending causality paradox. Use --no-conf on the command line or conf_paths=[] to force mrjob to load no configuration files at all. If no config path flags are given, mrjob will look for one in the locations specified in Config file format and location.

Config path flags can be used multiple times to combine config files, much like the include config file directive. Using --no-conf will cause mrjob to ignore all preceding config path flags.

For example, this line will cause mrjob to combine settings from left.conf and right.conf:

python my_job.py -c left.conf -c right.conf

This line will cause mrjob to read no config file at all:

python my_job.py --no-conf

This line will cause mrjob to read only right.conf, because --no-conf nullifies -c left.conf:

python my_job.py -c left.conf --no-conf -c right.conf

Options ignored by the local and inline runners

These options are ignored because they require a real instance of Hadoop:

Options ignored by the inline runner

These options are ignored because the inline runner does not invoke the job as a subprocess: