Hadoop-related options¶
Since mrjob is geared toward Hadoop, there are a few Hadoop-specific options. However, because the runners differ from one another (local simulation, a Hadoop cluster, Elastic MapReduce), not every option is available for every runner.
Options specific to the local and inline runners¶
- hadoop_version (--hadoop-version) : string Default: None
  Set the version of Hadoop to simulate (this currently only matters for jobconf).
  If you don’t set this, the local and inline runners will run in a version-agnostic mode, where anytime the runner sets a simulated jobconf variable, it’ll use every possible name for it (e.g. user.name and mapreduce.job.user.name).
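The version-agnostic behavior can be pictured with a small sketch. This is not mrjob's implementation: the alias table, the function names, and the rule that Hadoop 2 and later use the newer name are all assumptions made for illustration. Only the user.name / mapreduce.job.user.name pair comes from the description above.

```python
# Illustrative alias table of (older name, newer name) pairs. Only the
# user.name pair is taken from the documentation; a real table would
# cover many more variables.
JOBCONF_ALIASES = [
    ('user.name', 'mapreduce.job.user.name'),
]


def _alias_pair(variable):
    """Return the (older, newer) pair containing variable, if any."""
    for pair in JOBCONF_ALIASES:
        if variable in pair:
            return pair
    return (variable, variable)


def simulated_jobconf(variable, value, hadoop_version=None):
    """Return the jobconf entries a simulating runner might set.

    With hadoop_version=None (version-agnostic mode), every known name
    for the variable is set. Otherwise exactly one name is chosen.
    """
    old_name, new_name = _alias_pair(variable)
    if hadoop_version is None:
        names = {old_name, new_name}
    else:
        # Assumption for this sketch: Hadoop 2 and later use the newer name.
        major = int(hadoop_version.split('.')[0])
        names = {new_name if major >= 2 else old_name}
    return {name: value for name in names}
```

Calling simulated_jobconf('user.name', 'alice') sets both names, while passing a hadoop_version narrows the result to a single name.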
- num_cores (--num-cores) : integer Default: None
  Maximum number of tasks to handle at one time. If not set, defaults to the number of CPUs on your system.
  This also affects the number of input file splits the runner makes (the only impact in inline mode).
  New in version 0.6.2.
Options available to local, hadoop, and emr runners¶
These options are both used by Hadoop and simulated by the local
and inline runners to some degree.
- jobconf (-D, --jobconf) : jobconf dict Default: {}
  -D args to pass to hadoop streaming. This should be a map from property name to value. Equivalent to passing ['-D', 'KEY1=VALUE1', '-D', 'KEY2=VALUE2', ...] to hadoop_extra_args.
  Changed in version 0.6.6: added the -D switch on the command line, to match Hadoop.
  Changed in version 0.6.6: boolean true and false values in config files are passed correctly to Hadoop (see JobConf dicts).
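The equivalence between a jobconf dict and a list of -D arguments can be sketched as follows. The function name jobconf_to_args is hypothetical, and sorting the keys is only a choice made here to keep the output stable; this is a minimal illustration rather than mrjob's own code.

```python
def jobconf_to_args(jobconf):
    """Turn a jobconf dict into ['-D', 'KEY=VALUE', ...] arguments."""
    args = []
    for key in sorted(jobconf):  # sorted only to make the output stable
        value = jobconf[key]
        if isinstance(value, bool):
            # Hadoop expects lowercase true/false, not Python's True/False
            value = 'true' if value else 'false'
        args.extend(['-D', '%s=%s' % (key, value)])
    return args


print(jobconf_to_args({
    'mapreduce.job.reduces': 2,
    'mapreduce.map.speculative': False,
}))
# ['-D', 'mapreduce.job.reduces=2', '-D', 'mapreduce.map.speculative=false']
```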
Options available to hadoop and emr runners¶
- hadoop_extra_args (--hadoop-args) : string list Default: []
  Extra arguments to pass to hadoop streaming.
- hadoop_streaming_jar (--hadoop-streaming-jar) : string Default: (automatic)
  Path to a custom hadoop streaming jar.
  On EMR, this can be either a local path or a URI (s3://...). If you want to use a jar at a path on the master node, use a file:// URI.
  On Hadoop, mrjob tries its best to find your hadoop streaming jar, searching these directories (recursively) for a .jar file with hadoop followed by streaming in its name:
    - $HADOOP_PREFIX
    - $HADOOP_HOME
    - $HADOOP_INSTALL
    - $HADOOP_MAPRED_HOME
    - the parent of the directory containing the Hadoop binary (see hadoop_bin), unless it’s one of /, /usr or /usr/local
    - $HADOOP_*_HOME (in alphabetical order by environment variable name)
    - /home/hadoop/contrib
    - /usr/lib/hadoop-mapreduce
  (The last two paths allow the Hadoop runner to work out of the box inside EMR.)
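A recursive search like the one described for the Hadoop runner might look roughly like this. The regular expression and the function name find_streaming_jar are assumptions for this sketch; the exact matching rule mrjob applies may differ.

```python
import os
import re

# Assumed pattern for this sketch: "hadoop" followed by "streaming" in a
# .jar file name. The exact rule mrjob applies may differ.
STREAMING_JAR_RE = re.compile(r'hadoop.*streaming.*\.jar$')


def find_streaming_jar(search_dirs):
    """Return the first matching jar found under search_dirs, or None."""
    for search_dir in search_dirs:
        # os.walk() yields nothing for a directory that doesn't exist
        for dirpath, _, filenames in os.walk(search_dir):
            for filename in sorted(filenames):
                if STREAMING_JAR_RE.search(filename):
                    return os.path.join(dirpath, filename)
    return None
```

The caller would pass the candidate directories in priority order, so a jar under an earlier directory wins over one found later.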
- libjars (--libjars) : string list Default: []
  List of paths of JARs to be passed to Hadoop with the -libjars switch.
  ~ and environment variables within paths will be resolved based on the local environment.
  Changed in version 0.6.7: Deprecated --libjar in favor of --libjars.
  Note: mrjob does not yet support libjars on Google Cloud Dataproc.
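Resolving ~ and environment variables locally before handing paths to Hadoop can be sketched with the standard library. The function name libjars_args is hypothetical; the only outside fact relied on is that Hadoop's generic -libjars option takes a single comma-separated list.

```python
import os


def libjars_args(libjars):
    """Expand ~ and environment variables, then build -libjars arguments."""
    if not libjars:
        return []
    resolved = [os.path.expanduser(os.path.expandvars(p)) for p in libjars]
    # Hadoop's generic -libjars option takes one comma-separated list
    return ['-libjars', ','.join(resolved)]
```

Because expansion happens on the machine launching the job, a variable such as $MY_JAR_DIR (an example name) is read from the local environment, not from the cluster.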
- label (--label) : string Default: script’s module name, or no_script
  Alternate label for the job.
- owner (--owner) : string Default: getpass.getuser(), or no_user if that fails
  Who is running this job (if different from the current user).
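The defaults for label and owner can be approximated in a few lines. The helper names default_label and default_owner are hypothetical, and deriving the module name from the script's file name is an assumption of this sketch.

```python
import getpass
import os


def default_label(script_path=None):
    """Script's module name (file name without extension), or no_script."""
    if not script_path:
        return 'no_script'
    return os.path.splitext(os.path.basename(script_path))[0]


def default_owner():
    """Current user name, or no_user if it can't be determined."""
    try:
        return getpass.getuser()
    except Exception:
        # getpass.getuser() can fail, e.g. when no login name is available
        return 'no_user'
```

For example, default_label('/home/me/mr_word_count.py') gives mr_word_count under this sketch's assumption.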
- check_input_paths (--check-input-paths, --no-check-input-paths) : boolean Default: True
  Option to skip the input path check. With --no-check-input-paths, input paths to the runner will be passed straight through, without checking if they exist.
- spark_args (--spark-args) : string list Default: []
  Extra arguments to pass to spark-submit.
  Warning: Don’t use this to set --master or --deploy-mode. On the Hadoop runner, you can change these with spark_master and spark_deploy_mode. Other runners don’t allow you to set these because they can only handle the defaults.
Options available to hadoop runner only¶
- hadoop_bin (--hadoop-bin) : command Default: (automatic)
  Name/path of your hadoop binary (may include arguments).
  mrjob tries its best to find hadoop, checking all of the following places for an executable file named hadoop:
    - $HADOOP_PREFIX/bin
    - $HADOOP_HOME/bin
    - $HADOOP_INSTALL/bin
    - $HADOOP_INSTALL/hadoop/bin
    - $PATH
    - $HADOOP_*_HOME/bin (in alphabetical order by environment variable name)
  If all else fails, we just use hadoop and hope for the best.
  Changed in version 0.6.8: Setting this to an empty value (--hadoop-bin '') means to search for the Hadoop binary (previously, this effectively disabled use of the hadoop command).
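The search order for the hadoop binary can be sketched as a generator of candidate directories plus a loop that checks each one. Function names are hypothetical, and the environment is passed in as a plain dict so the sketch is easy to try; it follows the documented order but is not mrjob's code.

```python
import os
import re


def hadoop_bin_search_dirs(env):
    """Yield directories to check, following the documented order."""
    for name in ('HADOOP_PREFIX', 'HADOOP_HOME', 'HADOOP_INSTALL'):
        if env.get(name):
            yield os.path.join(env[name], 'bin')
    if env.get('HADOOP_INSTALL'):
        yield os.path.join(env['HADOOP_INSTALL'], 'hadoop', 'bin')
    for path_dir in env.get('PATH', '').split(os.pathsep):
        if path_dir:
            yield path_dir
    # $HADOOP_*_HOME, in alphabetical order by variable name
    for name in sorted(env):
        if re.match(r'HADOOP_\w+_HOME$', name) and env[name]:
            yield os.path.join(env[name], 'bin')


def find_hadoop_bin(env=None):
    """Return the first executable named hadoop, else the bare name."""
    env = os.environ if env is None else env
    for search_dir in hadoop_bin_search_dirs(env):
        candidate = os.path.join(search_dir, 'hadoop')
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return 'hadoop'  # fall back to the bare command and hope for the best
```

Returning the bare name as a last resort mirrors the documented fallback: the command may still resolve at run time even though the search found nothing.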
- hadoop_log_dirs (--hadoop-log-dir) : path list Default: (automatic)
  Where to look for Hadoop logs (to find counters and probable cause of job failure). These can be (local) paths or URIs (hdfs:///...).
  If this is not set, mrjob will try its best to find the logs, searching in:
    - $HADOOP_LOG_DIR
    - $YARN_LOG_DIR (on YARN only)
    - hdfs:///tmp/hadoop-yarn/staging (on YARN only)
    - <job output dir>/_logs (usually this is on HDFS)
    - $HADOOP_PREFIX/logs
    - $HADOOP_HOME/logs
    - $HADOOP_INSTALL/logs
    - $HADOOP_MAPRED_HOME/logs
    - <dir containing hadoop bin>/logs (see hadoop_bin), unless the hadoop binary is in /bin, /usr/bin, or /usr/local/bin
    - $HADOOP_*_HOME/logs (in alphabetical order by environment variable name)
    - /var/log/hadoop-yarn (on YARN only)
    - /mnt/var/log/hadoop-yarn (on YARN only)
    - /var/log/hadoop
    - /mnt/var/log/hadoop
- hadoop_tmp_dir (--hadoop-tmp-dir) : path Default: tmp/mrjob
  Scratch space on HDFS. This path does not need to be fully qualified with hdfs:// URIs because it’s understood that it has to be on HDFS.
- spark_deploy_mode (--spark-deploy-mode) : string Default: 'client'
  Deploy mode (client or cluster) to pass to the --deploy-mode argument of spark-submit.
  New in version 0.6.6.
- spark_master (--spark-master) : string Default: 'yarn'
  Name or URL to pass to the --master argument of spark-submit (e.g. spark://host:port, yarn).
  Note that archives (see upload_archives) only work when this is set to yarn.
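How spark_master, spark_deploy_mode, and spark_args might combine into a spark-submit argument list is sketched below. The function name spark_submit_args and the validation step are assumptions for illustration; the defaults match the ones documented above.

```python
def spark_submit_args(spark_master='yarn', spark_deploy_mode='client',
                      spark_args=()):
    """Build the leading arguments for a spark-submit command line."""
    if spark_deploy_mode not in ('client', 'cluster'):
        raise ValueError(
            'spark_deploy_mode must be client or cluster, not %r'
            % (spark_deploy_mode,))
    # --master and --deploy-mode come from their own options; any extra
    # spark_args are appended after them.
    return (['--master', spark_master, '--deploy-mode', spark_deploy_mode]
            + list(spark_args))


print(spark_submit_args(spark_args=['--executor-memory', '2g']))
# ['--master', 'yarn', '--deploy-mode', 'client', '--executor-memory', '2g']
```

Keeping --master and --deploy-mode out of spark_args leaves a single place where each value is decided.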
- spark_submit_bin (--spark-submit-bin) : command Default: (automatic)
  Name/path of your spark-submit binary (may include arguments).
  mrjob tries its best to find spark-submit, checking all of the following places for an executable file named spark-submit:
    - $SPARK_HOME/bin
    - $PATH
    - your pyspark installation’s bin/ directory
    - /usr/lib/spark/bin
    - /usr/local/spark/bin
    - /usr/local/lib/spark/bin
  If all else fails, we just use spark-submit and hope for the best.
  Changed in version 0.6.8: Searches for spark-submit in pyspark installation.