Config file format and location¶
We look for mrjob.conf in these locations:

- The location specified by MRJOB_CONF
- ~/.mrjob.conf
- /etc/mrjob.conf

You can specify one or more configuration files with the --conf-path
flag. See Options available to all runners for more information.
The point of mrjob.conf is to let you set up things you want every job to
have access to, so that you don't have to think about it. For example:

- libraries and source code you want to be available for your jobs
- where temp directories and logs should go
- security credentials

mrjob.conf is just a YAML- or JSON-encoded dictionary containing default
values to pass in to the constructors of the various runner classes.
Here's a minimal mrjob.conf:
runners:
  emr:
    cmdenv:
      TZ: America/Los_Angeles
Now whenever you run mr_your_script.py -r emr, EMRJobRunner will
automatically set TZ to America/Los_Angeles in your job's environment
when it runs on EMR.
If you don't have the yaml module installed, you can use JSON in your
mrjob.conf instead (JSON is a subset of YAML, so it'll still work once
you install yaml). Here's how you'd render the above example in JSON:
{
    "runners": {
        "emr": {
            "cmdenv": {
                "TZ": "America/Los_Angeles"
            }
        }
    }
}
Precedence and combining options¶
Options specified on the command line take precedence over mrjob.conf.
Usually this means simply overriding the option in mrjob.conf. However,
we know that cmdenv contains environment variables, so we do the right
thing. For example, if your mrjob.conf contained:
runners:
  emr:
    cmdenv:
      PATH: /usr/local/bin
      TZ: America/Los_Angeles
and you ran your job as:
mr_your_script.py -r emr --cmdenv TZ=Europe/Paris --cmdenv PATH=/usr/sbin
We'd automatically handle the PATH variable, and your job's environment
would be:
{'TZ': 'Europe/Paris', 'PATH': '/usr/sbin:/usr/local/bin'}
What's going on here is that cmdenv is associated with combine_envs().
Each option is associated with a combiner function that merges values in
an appropriate way.
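The merge rule behind combine_envs() can be sketched in plain Python (an illustrative sketch of the behavior, not mrjob's actual implementation):

```python
def combine_envs(*env_dicts):
    """Merge environment dicts left to right; later values win, except
    *PATH variables, which are joined with ':' (later value first)."""
    result = {}
    for env in env_dicts:
        for key, value in env.items():
            if key.endswith('PATH') and key in result:
                # prepend the later value instead of overwriting
                result[key] = value + ':' + result[key]
            else:
                result[key] = value
    return result

conf_env = {'PATH': '/usr/local/bin', 'TZ': 'America/Los_Angeles'}
cmdline_env = {'TZ': 'Europe/Paris', 'PATH': '/usr/sbin'}
print(combine_envs(conf_env, cmdline_env))
# {'PATH': '/usr/sbin:/usr/local/bin', 'TZ': 'Europe/Paris'}
```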
Combiner functions can also do useful things like expanding environment variables and globs in paths. For example, you could set:
runners:
  local:
    upload_files: &upload_files
      - $DATA_DIR/*.db
  hadoop:
    upload_files: *upload_files
  emr:
    upload_files: *upload_files
and every time you ran a job, every .db file in $DATA_DIR would
automatically be loaded into your job's current working directory.
Also, if you specified additional files to upload with --file, those
files would be uploaded in addition to the .db files, rather than
instead of them.
See Configuration quick reference for the entire dizzying array of configurable options.
Option data types¶
The same option may be specified multiple times and be one of several
data types. For example, the AWS region may be specified in mrjob.conf,
in the arguments to EMRJobRunner, and on the command line. These are the
rules used to determine what value to use at runtime.

A value specified "later" means one specified at a higher priority. For
example, a value in mrjob.conf is specified "earlier" than a value
passed on the command line.

When there are multiple values, they are combined with a combiner
function. The combiner function for each data type is listed in its
description.
Simple data types¶
When these are specified more than once, the last non-None value is used.

- String: a simple, unchanged string. Combined with combine_values().
- Command: a string to be parsed with shlex.split(), or a list of
  command + arguments. Combined with combine_cmds().
- Path: a local path with ~ and environment variables (e.g. $TMPDIR)
  resolved. Combined with combine_paths().
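A quick sketch of how these three types behave, using only the standard library (the values are hypothetical; shlex.split() and os.path.expanduser()/expandvars() are the calls the descriptions above refer to):

```python
import os
import shlex

# Simple types: the last non-None value wins.
values = ['/tmp/a', None, '/tmp/b', None]
last = [v for v in values if v is not None][-1]
print(last)  # '/tmp/b'

# Command: a string value is parsed into an argument list...
print(shlex.split('grep -v "foo bar" input.txt'))
# ['grep', '-v', 'foo bar', 'input.txt']
# ...while a list value is taken as command + arguments, unchanged.

# Path: ~ and environment variables are resolved.
os.environ['TMPDIR'] = '/tmp'  # hypothetical value, for the demo
print(os.path.expandvars(os.path.expanduser('$TMPDIR/mrjob')))  # '/tmp/mrjob'
```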
List data types¶
The values of these options are specified as lists. When specified more than once, the lists are concatenated together.
- String list: a list of strings. Combined with combine_lists().
- Path list: a list of paths. Combined with combine_path_lists().
Strings and non-sequence data types (e.g. numbers) are treated as single-item lists.
For example,
runners:
  emr:
    setup: /run/some/command with args
is equivalent to:
runners:
  emr:
    setup:
      - /run/some/command with args
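That coercion-then-concatenation rule can be sketched as follows (illustrative only, not mrjob's actual combine_lists() code):

```python
def combine_lists(*values):
    """Concatenate values; strings and other non-sequence types count
    as single-item lists, and None values are skipped."""
    result = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, (list, tuple)):
            result.extend(v)
        else:
            result.append(v)  # a bare string or number becomes one item
    return result

print(combine_lists('/run/some/command with args', ['/another/command']))
# ['/run/some/command with args', '/another/command']
```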
Dict data types¶
The values of these options are specified as dictionaries. When specified more than once, each has custom behavior described below.
- Plain dict: values specified later override values specified earlier.
  Combined with combine_dicts().
- JobConf dict: (new in version 0.6.6) like a plain dict, except that
  non-string values are converted into a format that Java understands.
  For example, the boolean value true here:

  jobconf:
    mapreduce.output.fileoutputformat.compress: true

  gets passed through to Hadoop in Java format (true), not Python format
  (True). Keys whose values are None are not passed to Hadoop at all.
Warning

Prior to version 0.6.6, use "true" and "false" for boolean jobconf
values in config files, not true and false.
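The conversion to Java-friendly values can be sketched like this (an illustrative sketch; mrjob handles this internally as of 0.6.6):

```python
def to_java_value(value):
    """Render a Python value the way Java/Hadoop expects it."""
    if isinstance(value, bool):
        return 'true' if value else 'false'  # not Python's True/False
    return str(value)

jobconf = {
    'mapreduce.output.fileoutputformat.compress': True,
    'mapreduce.job.name': None,  # None: not passed to Hadoop at all
}
hadoop_args = {k: to_java_value(v) for k, v in jobconf.items() if v is not None}
print(hadoop_args)
# {'mapreduce.output.fileoutputformat.compress': 'true'}
```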
- Environment variable dict: values specified later override values
  specified earlier, except for those with keys ending in PATH, whose
  values are concatenated and separated by a colon (:) rather than
  overwritten. The later value comes first. For example, this config:

  runners:
    emr:
      cmdenv:
        PATH: /usr/bin

  when run with this command:

  python my_job.py --cmdenv PATH=/usr/local/bin

  will result in the following value of PATH in cmdenv:

  /usr/local/bin:/usr/bin

  The function that handles this is combine_envs().

  The one exception to this behavior is the local runner, which uses
  the local system separator (; on Windows, still : everywhere else)
  instead of always using :. In local mode, the function that combines
  config values is combine_local_envs().
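The difference between the two combiners comes down to the separator, which a short sketch can show via os.pathsep (':' on POSIX, ';' on Windows; the helper name here is made up for illustration):

```python
import os

def combine_path_vars(earlier, later, sep=':'):
    """Join two *PATH values; the later (higher-priority) value comes first."""
    return later + sep + earlier

# combine_envs() always joins with ':'
print(combine_path_vars('/usr/bin', '/usr/local/bin'))  # /usr/local/bin:/usr/bin

# combine_local_envs() uses the local system separator instead
print(combine_path_vars('/usr/bin', '/usr/local/bin', sep=os.pathsep))
```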
Using multiple config files¶
If you have several standard configurations, you may want to have several
config files "inherit" from a base config file. For example, you may have
one set of AWS credentials, but two code bases and default instance
sizes. To accomplish this, use the include option:
~/.mrjob.very-large.conf:

include: ~/.mrjob.base.conf
runners:
  emr:
    num_core_instances: 20
    core_instance_type: m1.xlarge

~/.mrjob.very-small.conf:

include: $HOME/.mrjob.base.conf
runners:
  emr:
    num_core_instances: 2
    core_instance_type: m1.small

~/.mrjob.base.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    region: us-west-1
Options that are lists, commands, dictionaries, etc. combine the same way they do between the config files and the command line (with combiner functions).
You can use $ENVIRONMENT_VARIABLES and ~/file_in_your_home_dir inside
include.

You can inherit from multiple config files by passing include a list
instead of a string. Files on the right will have precedence over files
on the left. To continue the above examples, this config:
~/.mrjob.everything.conf:

include:
  - ~/.mrjob.very-small.conf
  - ~/.mrjob.very-large.conf
will be equivalent to this one:
~/.mrjob.everything-2.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    core_instance_type: m1.xlarge
    num_core_instances: 20
    region: us-west-1
In this case, ~/.mrjob.very-large.conf has taken precedence over
~/.mrjob.very-small.conf.
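The way included files stack can be sketched as a recursive dict merge, loading the base config first and the including file on top (illustrative only; real configs also go through the per-option combiner functions described above):

```python
def merge_configs(earlier, later):
    """Later dict values win; nested dicts are merged recursively."""
    result = dict(earlier)
    for key, value in later.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_configs(result[key], value)
        else:
            result[key] = value
    return result

base = {'runners': {'emr': {'region': 'us-west-1',
                            'core_instance_type': 'm1.small'}}}
very_large = {'runners': {'emr': {'core_instance_type': 'm1.xlarge',
                                  'num_core_instances': 20}}}
print(merge_configs(base, very_large))
# {'runners': {'emr': {'region': 'us-west-1',
#                      'core_instance_type': 'm1.xlarge',
#                      'num_core_instances': 20}}}
```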
Relative includes¶
Relative include: paths are relative to the real (symlink-resolved) path
of the including conf file.
For example, you could do this:
~/.mrjob/base.conf:

runners:
  ...

~/.mrjob/default.conf:

include: base.conf
You could then load your configs via a symlink ~/.mrjob.conf to
~/.mrjob/default.conf, and ~/.mrjob/base.conf would still be included
(even though it's not in the same directory as the symlink).
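Resolving a relative include against the including file's real path can be sketched with os.path (the helper and the file names are hypothetical):

```python
import os

def resolve_include(including_conf, include_path):
    """Resolve an include: path against the real (symlink-resolved)
    location of the conf file that contains it."""
    include_path = os.path.expandvars(os.path.expanduser(include_path))
    if os.path.isabs(include_path):
        return include_path
    real_dir = os.path.dirname(os.path.realpath(including_conf))
    return os.path.join(real_dir, include_path)

# If ~/.mrjob.conf is a symlink to ~/.mrjob/default.conf, 'base.conf'
# resolves inside ~/.mrjob/, not next to the symlink.
print(resolve_include('/home/someone/.mrjob/default.conf', 'base.conf'))
```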
Clearing configs¶
Sometimes you just want to override a list-type config (e.g. setup) or a
*PATH environment variable, rather than having mrjob cleverly
concatenate it with previous configs. You can do this in YAML config
files by tagging the values you want to take precedence with the !clear
tag.
For example:
~/.mrjob.base.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    cmdenv:
      PATH: /this/nice/path
      PYTHONPATH: /here/be/serpents
      USER: dave
    setup:
      - /run/this/command

~/.mrjob.conf:

include: ~/.mrjob.base.conf
runners:
  emr:
    cmdenv:
      PATH: !clear /this/even/better/path/yay
      PYTHONPATH: !clear
    setup: !clear
      - /run/this/other/command
is equivalent to:
runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    cmdenv:
      PATH: /this/even/better/path/yay
      USER: dave
    setup:
      - /run/this/other/command
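The semantics of !clear can be sketched with a sentinel wrapper: a cleared value replaces the earlier one outright instead of being combined (an illustrative sketch; mrjob implements this via the YAML tag):

```python
class ClearedValue:
    """Marks a value that should replace, not combine with, earlier configs."""
    def __init__(self, value=None):
        self.value = value

def combine_with_clear(earlier, later):
    if isinstance(later, ClearedValue):
        return later.value          # drop the earlier value entirely
    if earlier is None:
        return later
    if later is None:
        return earlier
    return earlier + later          # list-style concatenation otherwise

setup_base = ['/run/this/command']
setup_override = ClearedValue(['/run/this/other/command'])
print(combine_with_clear(setup_base, setup_override))
# ['/run/this/other/command']
```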
If you specify multiple config files (e.g.
-c ~/.mrjob.base.conf -c ~/.mrjob.conf), a !clear in a later file will
override earlier files. include: is really just another way to prepend
to the list of config files to load.
If you find it more readable, you may put the !clear tag before the key
you want to clear. For example,
runners:
  emr:
    !clear setup:
      - /run/this/other/command
is equivalent to:
runners:
  emr:
    setup: !clear
      - /run/this/other/command
!clear tags in lists are ignored. You cannot currently clear an entire
set of configs (e.g. runners: emr: !clear ... does not work).