Job Environment Setup Cookbook¶
Many jobs have significant external dependencies, both libraries and other source code.
Combining shell syntax with Hadoop’s DistributedCache notation, mrjob’s setup option provides a powerful, dynamic alternative to pre-installing your Hadoop dependencies on every node.
The mrjob.conf examples below are for the hadoop runner, but they work equally well with the emr runner. Also, if you are using EMR, take a look at the EMR Bootstrapping Cookbook.
Uploading your source tree¶
This relies on a feature that was added in 0.5.8. See below for how to do it in earlier versions.
mrjob can automatically tarball your source directory and include it in your job’s working directory. We can use setup scripts to upload the directory and then add it to PYTHONPATH.
Run your job with:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'
The / before the # tells mrjob that your-src-code is a directory. You may optionally include a / after the # as well.
If every job you run is going to want to use your-src-code, you can do this in your mrjob.conf:

runners:
  hadoop:
    setup:
    - export PYTHONPATH=$PYTHONPATH:your-src-code/#
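You can sketch the effect of that setup command locally. Here mymodule.py and its contents are hypothetical stand-ins for your source tree; in a real job, mrjob uploads and unpacks the directory for you:

```shell
# Hypothetical stand-in for your source tree.
mkdir -p your-src-code
echo 'GREETING = "hello from your-src-code"' > your-src-code/mymodule.py

# This is what the setup command does in each task's environment:
export PYTHONPATH=$PYTHONPATH:your-src-code

# Any Python task can now import modules from the uploaded directory:
python3 -c 'import mymodule; print(mymodule.GREETING)'
```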
Uploading your source tree as an archive¶
If you’re using an earlier version of mrjob, you’ll have to build the tarball yourself:
tar -C your-src-code -f your-src-code.tar.gz -z -c .
Then, run your job with:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/'
The / after the # (without one before it) is what tells mrjob that your-src-code.tar.gz is an archive that Hadoop should unpack.
To do the same thing in mrjob.conf:

runners:
  hadoop:
    setup:
    - export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/
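You can verify locally that the tar command above round-trips; the directory and file names here are made-up examples:

```shell
# Hypothetical stand-in for your source tree.
mkdir -p your-src-code
echo 'x = 1' > your-src-code/mymodule.py

# Same flags as above: -C enters the dir, -f names the output,
# -z gzips, -c creates; "." archives the dir's contents.
tar -C your-src-code -f your-src-code.tar.gz -z -c .

# Hadoop would unpack the archive into the working dir; simulate that:
mkdir -p unpacked
tar -x -z -f your-src-code.tar.gz -C unpacked
cat unpacked/mymodule.py
```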
Running a makefile inside your source dir¶
--setup 'cd your-src-dir.tar.gz#/' --setup 'make'
or, in mrjob.conf:
runners:
  hadoop:
    setup:
    - cd your-src-dir.tar.gz#/
    - make
If Hadoop runs multiple tasks on the same node, your source dir will be shared between them. This is not a problem; mrjob automatically adds locking around setup commands to ensure that multiple copies of your setup script don’t run simultaneously.
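A minimal local sketch of what those two setup commands accomplish; the directory name and the one-target Makefile are made-up examples:

```shell
# Hypothetical source dir containing a trivial Makefile.
# (printf emits a real tab, which make's recipe lines require.)
mkdir -p your-src-dir
printf 'all:\n\techo built > built.txt\n' > your-src-dir/Makefile

# Equivalent of the two setup commands, run in the task's working dir:
( cd your-src-dir && make -s )

cat your-src-dir/built.txt
```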
Making data files available to your job¶
Best practice for one or a few files is to use passthrough options; see add_file_arg().
If you’re a setup purist, you can also do something like this:
--setup 'true your-file#desired-name'
since true has no effect and ignores its arguments.
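You can confirm locally that true swallows its arguments and always succeeds, so the only effect of that setup command is the file upload triggered by the # notation:

```shell
# true ignores its arguments and always exits 0.
true your-file desired-name
echo "exit status: $?"
```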
Using a virtualenv¶
What if you can’t install the libraries you need on your Hadoop cluster?
You could do something like this in your mrjob.conf:

runners:
  hadoop:
    setup:
    - virtualenv venv
    - . venv/bin/activate
    - pip install mr3po
However, now the locking feature that protects make becomes a liability; each task on the same node has its own virtualenv, but one task has to finish setting up before the next can start.
The solution is to share the virtualenv between all tasks on the same machine, something like this:
runners:
  hadoop:
    setup:
    - VENV=/tmp/$mapreduce_job_id
    - if [ ! -e $VENV ]; then virtualenv $VENV; fi
    - . $VENV/bin/activate
    - pip install mr3po
With Hadoop 1, you’d want to use $mapred_job_id instead of $mapreduce_job_id.
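The if [ ! -e ... ] guard is what makes this setup idempotent: under mrjob’s locking, tasks run their setup one at a time, so the first task on a node creates the virtualenv and later tasks see that it exists and skip straight to activation. A local sketch of the guard pattern, using mkdir as a fast stand-in for virtualenv (names here are made up):

```shell
VENV=/tmp/demo_venv_$$    # $$ stands in for the per-job id in the real config

create_if_missing() {
    # Same guard as the setup script: only the first caller creates it.
    if [ ! -e "$VENV" ]; then
        mkdir "$VENV"     # stand-in for: virtualenv $VENV
        echo "created"
    fi
}

create_if_missing   # first task on the node: creates the shared dir
create_if_missing   # later tasks: guard is false, nothing happens
```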
Other ways to use pip to install Python packages¶
If you have a lot of dependencies, best practice is to make a pip requirements file and use the -r switch:
--setup 'pip install -r path/to/requirements.txt#'
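A requirements file is just a list of package specifiers, one per line; the package names and version pin below are hypothetical examples:

```shell
# Hypothetical requirements.txt; mrjob uploads it via the # notation
# and each task pip-installs from the local copy.
cat > requirements.txt <<'EOF'
mr3po
simplejson>=3.0
EOF
```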
Note that pip can also install from tarballs (which is useful for custom-built packages):
--setup 'pip install $MY_PYTHON_PKGS/*.tar.gz#'