Hadoop-related options

Since mrjob is geared toward Hadoop, there are a few Hadoop-specific options. However, because the local/inline simulation, the Hadoop platform, and Elastic MapReduce all behave differently, not every option is available for every runner.

Options specific to the local and inline runners

- hadoop_version (--hadoop-version) : string
  Default: None

  Set the version of Hadoop to simulate (this currently only matters for jobconf).

  If you don't set this, the local and inline runners will run in a version-agnostic mode, where anytime the runner sets a simulated jobconf variable, it'll use every possible name for it (e.g. user.name and mapreduce.job.user.name).

- num_cores (--num-cores) : integer
  Default: None

  Maximum number of tasks to handle at one time. If not set, defaults to the number of CPUs on your system.

  This also affects the number of input file splits the runner makes (the only impact in inline mode).

  New in version 0.6.2.

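  For example, a minimal mrjob.conf sketch that sets both of the options above for the inline runner (the version number and core count are placeholders, not recommendations):

      runners:
        inline:
          hadoop_version: '2.7.1'   # pick one set of jobconf variable names instead of all of them
          num_cores: 2              # also affects the number of input file splits
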
Options available to local, hadoop, and emr runners

These options are both used by Hadoop and simulated, to some degree, by the local and inline runners.

- jobconf (-D, --jobconf) : jobconf dict
  Default: {}

  -D args to pass to Hadoop Streaming. This should be a map from property name to value. Equivalent to passing ['-D', 'KEY1=VALUE1', '-D', 'KEY2=VALUE2', ...] to hadoop_extra_args.

  Changed in version 0.6.6: added the -D switch on the command line, to match Hadoop.

  Changed in version 0.6.6: boolean true and false values in config files are passed correctly to Hadoop (see JobConf dicts).

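  As a sketch (assuming the standard runners:/<runner name> layout of mrjob.conf; mapreduce.job.reduces is just an example Hadoop property), the config below sets a single property:

      # in mrjob.conf -- same effect as passing
      #   -D mapreduce.job.reduces=2
      # on the command line
      runners:
        hadoop:
          jobconf:
            mapreduce.job.reduces: 2
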
Options available to hadoop and emr runners

- hadoop_extra_args (--hadoop-args) : string list
  Default: []

  Extra arguments to pass to Hadoop Streaming (a configuration sketch appears after hadoop_streaming_jar below).

- hadoop_streaming_jar (--hadoop-streaming-jar) : string
  Default: (automatic)

  Path to a custom Hadoop Streaming jar.

  On EMR, this can be either a local path or a URI (s3://...). If you want to use a jar at a path on the master node, use a file:// URI.

  On Hadoop, mrjob tries its best to find your Hadoop Streaming jar, searching these directories (recursively) for a .jar file with hadoop followed by streaming in its name:

  - $HADOOP_PREFIX
  - $HADOOP_HOME
  - $HADOOP_INSTALL
  - $HADOOP_MAPRED_HOME
  - the parent of the directory containing the Hadoop binary (see hadoop_bin), unless it's one of /, /usr, or /usr/local
  - $HADOOP_*_HOME (in alphabetical order by environment variable name)
  - /home/hadoop/contrib
  - /usr/lib/hadoop-mapreduce

  (The last two paths allow the Hadoop runner to work out of the box inside EMR.)

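  To skip the jar search entirely and pass an extra Streaming argument at the same time, a configuration sketch along these lines should work (the jar path is a placeholder, and -verbose is just one example of a Streaming argument):

      runners:
        hadoop:
          hadoop_streaming_jar: /path/to/hadoop-streaming.jar
          hadoop_extra_args:
            - '-verbose'
      # on EMR, a jar already on the master node would use a file:// URI, e.g.
      #   hadoop_streaming_jar: file:///home/hadoop/contrib/streaming/hadoop-streaming.jar
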
- libjars (--libjars) : string list
  Default: []

  List of paths of JARs to be passed to Hadoop with the -libjars switch.

  ~ and environment variables within paths will be resolved based on the local environment.

  Changed in version 0.6.7: Deprecated --libjar in favor of --libjars.

  Note: mrjob does not yet support libjars on Google Cloud Dataproc.

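  A sketch of setting libjars in mrjob.conf (the jar paths are placeholders; note that ~ and environment variables are resolved on the local machine):

      runners:
        hadoop:
          libjars:
            - ~/jars/custom-input-format.jar
            - $JAR_DIR/extra-lib.jar
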
- label (--label) : string
  Default: script's module name, or no_script

  Alternate label for the job.

- owner (--owner) : string
  Default: getpass.getuser(), or no_user if that fails

  Who is running this job (if different from the current user).

- check_input_paths (--check-input-paths, --no-check-input-paths) : boolean
  Default: True

  Option to skip the input path check. With --no-check-input-paths, input paths to the runner will be passed straight through, without checking if they exist.

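  The sketch below combines the three options above (label, owner, and check_input_paths); all values are placeholders. Skipping the input path check can be useful, for example, when input paths won't exist until the job actually runs:

      runners:
        hadoop:
          label: nightly_report      # instead of the script's module name
          owner: data-team           # instead of getpass.getuser()
          check_input_paths: false   # pass input paths through without checking that they exist
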
- spark_args (--spark-args) : string list
  Default: []

  Extra arguments to pass to spark-submit.

  Warning: Don't use this to set --master or --deploy-mode. On the Hadoop runner, you can change these with spark_master and spark_deploy_mode. Other runners don't allow you to set these because they can only handle the defaults.

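  For example, a sketch that passes executor settings through spark_args (the values are placeholders; --master and --deploy-mode are deliberately left out, per the warning above):

      runners:
        hadoop:
          spark_args:
            - --executor-memory
            - 4g
            - --num-executors
            - '10'
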
Options available to hadoop runner only

- hadoop_bin (--hadoop-bin) : command
  Default: (automatic)

  Name/path of your hadoop binary (may include arguments).

  mrjob tries its best to find hadoop, checking all of the following places for an executable file named hadoop:

  - $HADOOP_PREFIX/bin
  - $HADOOP_HOME/bin
  - $HADOOP_INSTALL/bin
  - $HADOOP_INSTALL/hadoop/bin
  - $PATH
  - $HADOOP_*_HOME/bin (in alphabetical order by environment variable name)

  If all else fails, we just use hadoop and hope for the best.

  Changed in version 0.6.8: Setting this to an empty value (--hadoop-bin '') means to search for the Hadoop binary (previously, an empty value effectively disabled use of the hadoop command).

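  To bypass the search, point mrjob at a specific binary; a sketch with a placeholder path (the --config argument illustrates that the value may include arguments):

      runners:
        hadoop:
          hadoop_bin: /opt/hadoop/bin/hadoop --config /opt/hadoop/conf
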
- hadoop_log_dirs (--hadoop-log-dir) : path list
  Default: (automatic)

  Where to look for Hadoop logs (to find counters and the probable cause of job failure). These can be (local) paths or URIs (hdfs:///...).

  If this is not set, mrjob will try its best to find the logs, searching in:

  - $HADOOP_LOG_DIR
  - $YARN_LOG_DIR (on YARN only)
  - hdfs:///tmp/hadoop-yarn/staging (on YARN only)
  - <job output dir>/_logs (usually this is on HDFS)
  - $HADOOP_PREFIX/logs
  - $HADOOP_HOME/logs
  - $HADOOP_INSTALL/logs
  - $HADOOP_MAPRED_HOME/logs
  - <dir containing hadoop bin>/logs (see hadoop_bin), unless the hadoop binary is in /bin, /usr/bin, or /usr/local/bin
  - $HADOOP_*_HOME/logs (in alphabetical order by environment variable name)
  - /var/log/hadoop-yarn (on YARN only)
  - /mnt/var/log/hadoop-yarn (on YARN only)
  - /var/log/hadoop
  - /mnt/var/log/hadoop

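  If your logs live somewhere unusual, you can list the locations explicitly; a sketch with placeholder paths (both local paths and hdfs:/// URIs are accepted):

      runners:
        hadoop:
          hadoop_log_dirs:
            - /var/log/my-hadoop
            - hdfs:///tmp/hadoop-yarn/staging
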
- hadoop_tmp_dir (--hadoop-tmp-dir) : path
  Default: tmp/mrjob

  Scratch space on HDFS. This path does not need to be fully qualified with hdfs:// URIs because it's understood that it has to be on HDFS.

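  For example (the path is a placeholder; no hdfs:// prefix is needed):

      runners:
        hadoop:
          hadoop_tmp_dir: tmp/mrjob-scratch
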
- spark_deploy_mode (--spark-deploy-mode) : string
  Default: 'client'

  Deploy mode (client or cluster) to pass to the --deploy-mode argument of spark-submit.

  New in version 0.6.6.

- spark_master (--spark-master) : string
  Default: 'yarn'

  Name or URL to pass to the --master argument of spark-submit (e.g. spark://host:port, yarn).

  Note that archives (see upload_archives) only work when this is set to yarn.

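  A sketch that runs Spark steps on YARN in cluster deploy mode, combining this option with spark_deploy_mode above (any other spark-submit settings would go through spark_args):

      runners:
        hadoop:
          spark_master: yarn
          spark_deploy_mode: cluster
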
- spark_submit_bin (--spark-submit-bin) : command
  Default: (automatic)

  Name/path of your spark-submit binary (may include arguments).

  mrjob tries its best to find spark-submit, checking all of the following places for an executable file named spark-submit:

  - $SPARK_HOME/bin
  - $PATH
  - your pyspark installation's bin/ directory
  - /usr/lib/spark/bin
  - /usr/local/spark/bin
  - /usr/local/lib/spark/bin

  If all else fails, we just use spark-submit and hope for the best.

  Changed in version 0.6.8: Searches for spark-submit in the pyspark installation.

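  As with hadoop_bin, you can skip the search by pointing mrjob at a specific binary; a sketch with a placeholder path:

      runners:
        hadoop:
          spark_submit_bin: /usr/local/spark/bin/spark-submit
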