Job Environment Setup Cookbook
==============================

Many jobs have significant external dependencies, both libraries and other
source code. Combining shell syntax with Hadoop's DistributedCache notation,
mrjob's :mrjob-opt:`setup` option provides a powerful, dynamic alternative to
pre-installing your Hadoop dependencies on every node.

All our :file:`mrjob.conf` examples below are for the ``hadoop`` runner, but
these work equally well with the ``emr`` runner. Also, if you are using EMR,
take a look at the :doc:`emr-bootstrap-cookbook`.

.. note::

   Setup scripts don't work with :doc:`spark`; try :mrjob-opt:`py_files`
   instead.

.. _cookbook-src-tree-pythonpath:

Uploading your source tree
--------------------------

.. note::

   If you're using mrjob 0.6.4 or later, check out
   :ref:`uploading-modules-and-packages` first.

mrjob can automatically tarball your source directory and include it in your
job's working directory. We can use setup scripts to upload the directory and
then add it to :envvar:`PYTHONPATH`. Run your job with:

.. code-block:: sh

    --setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'

The ``/`` before the ``#`` tells mrjob that :file:`your-src-code` is a
directory. You may optionally include a ``/`` after the ``#`` as well
(e.g. ``export PYTHONPATH=$PYTHONPATH:your-src-code/#/your-lib``).

If every job you run is going to want to use :file:`your-src-code`, you can
do this in your :file:`mrjob.conf`:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        - export PYTHONPATH=$PYTHONPATH:your-src-code/#

Uploading your source tree as an archive
----------------------------------------

Prior to mrjob 0.5.8, you had to archive directories yourself before
uploading them:

.. code-block:: sh

    tar -C your-src-code -f your-src-code.tar.gz -z -c .

Then, run your job with:

.. code-block:: sh

    --setup 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/'

The ``/`` after the ``#`` (without one before it) is what tells mrjob that
``your-src-code.tar.gz`` is an archive that Hadoop should unpack.

To do the same thing in :file:`mrjob.conf`:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        - export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/

Running a makefile inside your source dir
-----------------------------------------

.. code-block:: sh

    --setup 'cd your-src-dir.tar.gz#/' --setup 'make'

or, in :file:`mrjob.conf`:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        - cd your-src-dir.tar.gz#/
        - make

If Hadoop runs multiple tasks on the same node, your source dir will be
shared between them. This is not a problem; mrjob automatically adds locking
around setup commands to ensure that multiple copies of your setup script
don't run simultaneously.

Making data files available to your job
---------------------------------------

Best practice for one or a few files is to use passthrough options; see
:py:meth:`~mrjob.job.MRJob.add_passthru_arg`.

You can also use :mrjob-opt:`upload_files` to upload file(s) into a task's
working directory (or :mrjob-opt:`upload_archives` for tarballs and other
archives).

If you're a :mrjob-opt:`setup` purist, you can also do something like this:

.. code-block:: sh

    --setup 'true your-file#desired-name'

since :command:`true` has no effect and ignores its arguments.
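For instance, a minimal :file:`mrjob.conf` sketch using
:mrjob-opt:`upload_files` might look like this
(:file:`data/lookup-table.sqlite3` is a hypothetical data file):

.. code-block:: yaml

    runners:
      hadoop:
        # copy these files into each task's working directory
        upload_files:
        - data/lookup-table.sqlite3

Your job could then open :file:`lookup-table.sqlite3` by its base name,
since uploaded files land directly in the task's working directory.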
.. _using-a-virtualenv:

Using a virtualenv
------------------

What if you can't install the libraries you need on your Hadoop cluster? You
could do something like this in your :file:`mrjob.conf`:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        - virtualenv venv
        - . venv/bin/activate
        - pip install mr3po

However, now the locking feature that protects :command:`make` becomes a
liability; each task on the same node has its own virtualenv, but one task
has to finish setting up before the next can start.

The solution is to share the virtualenv between all tasks on the same
machine, something like this:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        - VENV=/tmp/$mapreduce_job_id
        - if [ ! -e $VENV ]; then virtualenv $VENV; fi
        - . $VENV/bin/activate
        - pip install mr3po

With Hadoop 1, you'd want to use ``$mapred_job_id`` instead of
``$mapreduce_job_id``.

Other ways to use pip to install Python packages
------------------------------------------------

If you have a lot of dependencies, best practice is to make a
`pip requirements file <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`_
and use the ``-r`` switch:

.. code-block:: sh

    --setup 'pip install -r path/to/requirements.txt#'

Note that :command:`pip` can also install from tarballs (which is useful for
custom-built packages):

.. code-block:: sh

    --setup 'pip install $MY_PYTHON_PKGS/*.tar.gz#'
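Putting the last two recipes together, here is a sketch of a
:file:`mrjob.conf` that installs a (hypothetical)
:file:`path/to/requirements.txt` into a virtualenv shared by all tasks on
a node:

.. code-block:: yaml

    runners:
      hadoop:
        setup:
        # one shared virtualenv per job per node ($mapred_job_id on Hadoop 1)
        - VENV=/tmp/$mapreduce_job_id
        - if [ ! -e $VENV ]; then virtualenv $VENV; fi
        - . $VENV/bin/activate
        # the trailing # tells mrjob to upload requirements.txt with the job
        - pip install -r path/to/requirements.txt#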