mrjob v0.7.4 documentation

Hadoop-related options

Since mrjob is geared toward Hadoop, there are a few Hadoop-specific options. However, because the runners target different platforms (local simulation, your own Hadoop cluster, and Elastic MapReduce), not all of these options are available on every runner.

Options specific to the local and inline runners

hadoop_version (--hadoop-version) : string

Default: None

Set the version of Hadoop to simulate (this currently only matters for jobconf).

If you don’t set this, the local and inline runners will run in a version-agnostic mode, where anytime the runner sets a simulated jobconf variable, it’ll use every possible name for it (e.g. user.name and mapreduce.job.user.name).
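
For example, to simulate Hadoop 2's jobconf variable names, you might set this in your mrjob.conf (the version number here is illustrative):

    runners:
      inline:
        hadoop_version: '2.7.1'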

num_cores (--num-cores) : integer

Default: None

Maximum number of tasks to handle at one time. If not set, defaults to the number of CPUs on your system.

This also affects the number of input file splits the runner makes (in inline mode, this is its only effect).

New in version 0.6.2.
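
For example, to cap the local runner at two concurrent tasks (the value is illustrative):

    runners:
      local:
        num_cores: 2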

Options available to local, hadoop, and emr runners

These options are both used by Hadoop and simulated by the local and inline runners to some degree.

jobconf (-D, --jobconf) : jobconf dict

Default: {}

-D args to pass to hadoop streaming. This should be a map from property name to value. Equivalent to passing ['-D', 'KEY1=VALUE1', '-D', 'KEY2=VALUE2', ...] to hadoop_extra_args.

Changed in version 0.6.6: added the -D switch on the command line, to match Hadoop.

Changed in version 0.6.6: boolean true and false values in config files are passed correctly to Hadoop (see JobConf dicts)
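
For example, this mrjob.conf snippet sets the number of reducers and a boolean streaming property (the values are illustrative); the first is equivalent to -D mapreduce.job.reduces=4 on the command line:

    runners:
      hadoop:
        jobconf:
          mapreduce.job.reduces: 4
          stream.non.zero.exit.is.failure: false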

Options available to hadoop and emr runners

hadoop_extra_args (--hadoop-args) : string list

Default: []

Extra arguments to pass to hadoop streaming.
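
For example, to pass Hadoop streaming's -verbose switch (any streaming switch could appear here):

    runners:
      hadoop:
        hadoop_extra_args:
        - -verbose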

hadoop_streaming_jar (--hadoop-streaming-jar) : string

Default: (automatic)

Path to a custom hadoop streaming jar.

On EMR, this can be either a local path or a URI (s3://...). If you want to use a jar at a path on the master node, use a file:// URI.

On Hadoop, mrjob tries its best to find your hadoop streaming jar, searching these directories (recursively) for a .jar file with hadoop followed by streaming in its name:

  • $HADOOP_PREFIX
  • $HADOOP_HOME
  • $HADOOP_INSTALL
  • $HADOOP_MAPRED_HOME
  • the parent of the directory containing the Hadoop binary (see hadoop_bin), unless it’s one of /, /usr or /usr/local
  • $HADOOP_*_HOME (in alphabetical order by environment variable name)
  • /home/hadoop/contrib
  • /usr/lib/hadoop-mapreduce

(The last two paths allow the Hadoop runner to work out of the box inside EMR.)
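
For example, on EMR you might point at a custom jar stored in S3 (the bucket and path here are hypothetical):

    runners:
      emr:
        hadoop_streaming_jar: s3://my-bucket/jars/hadoop-streaming-custom.jar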

libjars (--libjars) : string list

Default: []

List of paths of JARs to be passed to Hadoop with the -libjars switch.

~ and environment variables within paths will be resolved based on the local environment.

Changed in version 0.6.7: Deprecated --libjar in favor of --libjars

Note

mrjob does not yet support libjars on Google Cloud Dataproc.
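
For example (the jar paths are hypothetical; ~ and environment variables are resolved locally, as noted above):

    runners:
      hadoop:
        libjars:
        - ~/jars/custom-input-format.jar
        - $MY_JAR_DIR/other.jar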

label (--label) : string

Default: script’s module name, or no_script

Alternate label for the job.

owner (--owner) : string

Default: getpass.getuser(), or no_user if that fails

Who is running this job (if different from the current user).

check_input_paths (--check-input-paths, --no-check-input-paths) : boolean

Default: True

Option to skip the input path check. With --no-check-input-paths, input paths to the runner will be passed straight through, without checking if they exist.
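
In mrjob.conf, the same setting looks like:

    runners:
      hadoop:
        check_input_paths: false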

spark_args (--spark-args) : string list

Default: []

Extra arguments to pass to spark-submit.

Warning

Don’t use this to set --master or --deploy-mode. On the Hadoop runner, you can change these with spark_master and spark_deploy_mode. Other runners don’t allow you to set these because they can only handle the defaults.
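
For example, to pass extra memory settings through to spark-submit (the value is illustrative):

    runners:
      hadoop:
        spark_args:
        - --executor-memory
        - 4g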

Options available to hadoop runner only

hadoop_bin (--hadoop-bin) : command

Default: (automatic)

Name/path of your hadoop binary (may include arguments).

mrjob tries its best to find hadoop, checking all of the following places for an executable file named hadoop:

  • $HADOOP_PREFIX/bin
  • $HADOOP_HOME/bin
  • $HADOOP_INSTALL/bin
  • $HADOOP_INSTALL/hadoop/bin
  • $PATH
  • $HADOOP_*_HOME/bin (in alphabetical order by environment variable name)

If all else fails, we just use hadoop and hope for the best.

Changed in version 0.6.8: Setting this to an empty value (--hadoop-bin '') now means to search for the Hadoop binary as usual (previously, an empty value effectively disabled use of the hadoop command).
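
If the automatic search picks the wrong binary, you can point mrjob at yours explicitly (the path below is hypothetical):

    runners:
      hadoop:
        hadoop_bin: /opt/hadoop/bin/hadoop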

hadoop_log_dirs (--hadoop-log-dir) : path list

Default: (automatic)

Where to look for Hadoop logs (to find counters and probable cause of job failure). These can be (local) paths or URIs (hdfs:///...).

If this is not set, mrjob will try its best to find the logs, searching in:

  • $HADOOP_LOG_DIR
  • $YARN_LOG_DIR (on YARN only)
  • hdfs:///tmp/hadoop-yarn/staging (on YARN only)
  • <job output dir>/_logs (usually this is on HDFS)
  • $HADOOP_PREFIX/logs
  • $HADOOP_HOME/logs
  • $HADOOP_INSTALL/logs
  • $HADOOP_MAPRED_HOME/logs
  • <dir containing hadoop bin>/logs (see hadoop_bin), unless the hadoop binary is in /bin, /usr/bin, or /usr/local/bin
  • $HADOOP_*_HOME/logs (in alphabetical order by environment variable name)
  • /var/log/hadoop-yarn (on YARN only)
  • /mnt/var/log/hadoop-yarn (on YARN only)
  • /var/log/hadoop
  • /mnt/var/log/hadoop
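
To skip the search and check only specific locations, list them yourself (these two happen to appear in the search list above; yours may differ):

    runners:
      hadoop:
        hadoop_log_dirs:
        - /var/log/hadoop-yarn
        - hdfs:///tmp/hadoop-yarn/staging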

hadoop_tmp_dir (--hadoop-tmp-dir) : path

Default: tmp/mrjob

Scratch space on HDFS. This path does not need to be fully qualified with an hdfs:// URI; it is understood to be on HDFS.
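
For example (the directory name is illustrative):

    runners:
      hadoop:
        hadoop_tmp_dir: tmp/mrjob-scratch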

spark_deploy_mode (--spark-deploy-mode) : string

Default: 'client'

Deploy mode (client or cluster) to pass to the --deploy-mode argument of spark-submit.

New in version 0.6.6.
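
For example, to run the Spark driver on the cluster rather than on the machine that launched the job:

    runners:
      hadoop:
        spark_deploy_mode: cluster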

spark_master (--spark-master) : string

Default: 'yarn'

Name or URL to pass to the --master argument of spark-submit (e.g. spark://host:port, yarn).

Note that archives (see upload_archives) only work when this is set to yarn.
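
For example, to submit to a standalone Spark master instead of YARN (host and port are hypothetical; remember that archives then stop working):

    runners:
      hadoop:
        spark_master: spark://spark-master.example.com:7077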

spark_submit_bin (--spark-submit-bin) : command

Default: (automatic)

Name/path of your spark-submit binary (may include arguments).

mrjob tries its best to find spark-submit, checking all of the following places for an executable file named spark-submit:

  • $SPARK_HOME/bin
  • $PATH
  • your pyspark installation’s bin/ directory
  • /usr/lib/spark/bin
  • /usr/local/spark/bin
  • /usr/local/lib/spark/bin

If all else fails, we just use spark-submit and hope for the best.

Changed in version 0.6.8: Searches for spark-submit in pyspark installation.
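
As with hadoop_bin, you can set this explicitly if the search finds the wrong binary (the path below is one of the locations listed above):

    runners:
      hadoop:
        spark_submit_bin: /usr/lib/spark/bin/spark-submit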
