EMR Bootstrapping Cookbook

Bootstrapping lets you run commands to customize EMR machines when the cluster is created.

When to use bootstrap, and when to use setup

You can use bootstrap and setup together.

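Generally, use bootstrap for things that are part of your general production environment, and setup for things that are specific to a particular job. This makes things work as expected if you are Pooling Clusters, because bootstrap commands run once, when the cluster is created, while a pooled cluster may go on to run several different jobs.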

All these examples use bootstrap. Not saying it’s a good idea, but all these examples will work with setup as well (yes, Hadoop tasks on EMR apparently have access to sudo).
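As a rough illustration of the split described above, a single config can use both options. This is only a sketch: the ujson install is borrowed from the pip example in this cookbook, and MY_JOB_MODE is a made-up environment variable standing in for whatever your job actually needs.

```yaml
runners:
  emr:
    # general environment: runs when the cluster is created
    bootstrap:
    - sudo pip-2.7 install ujson
    # job-specific: MY_JOB_MODE is a hypothetical variable, not an mrjob setting
    setup:
    - export MY_JOB_MODE=test
```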

Installing Python packages with pip

The only tricky thing is making sure you install packages for the correct version of Python.

Figure out which version of Python you’ll be running on EMR (see python_bin for defaults).

  • If it’s Python 2.6, use pip
  • If it’s Python 2.7, use pip-2.7
  • If it’s Python 3, use pip-3.4

For example, to install ujson on Python 2.7:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install ujson

See PyPI for the full list of available Python packages.
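The same pattern should carry over to the other Python versions listed above; only the pip command changes. A minimal sketch for Python 3, assuming pip-3.4 is present on the AMI you are using:

```yaml
runners:
  emr:
    bootstrap:
    # Python 3: pip-3.4 instead of pip-2.7
    - sudo pip-3.4 install ujson
```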

You can also install packages from a requirements file. The trailing # tells mrjob to upload the local file to the cluster and substitute its path there:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install -r /local/path/of/requirements.txt#

Or a tarball:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install /local/path/of/tarball.tar.gz#

Note

If for some reason you must run on AMI version 2.4.2 or earlier (protip: don’t do that), see below for how to get pip working.

Warning

If you’re trying to run jobs on AMI version 3.0.0 (protip: don’t do that either) pip appears not to work due to out-of-date SSL certificate information.

Installing system packages

EMR gives you access to a variety of different Amazon Machine Images, or AMIs for short (see image_version).

3.x and 4.x AMIs

Starting with 3.0.0, EMR AMIs use Amazon Linux, which uses yum to install packages. For example, to install NumPy:

runners:
  emr:
    bootstrap:
    - sudo yum install -y python27-numpy

(Don’t forget the -y!)

Amazon Linux currently has few packages for Python 3 libraries; if you’re on Python 3, just use pip.

Here are the package lists for all the various versions of Amazon Linux used by EMR:

Note

The package lists gloss over Python versions; wherever you see a package named python-<lib name>, you’ll want to install python26-<lib name> or python27-<lib name> instead.
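To make the naming rule concrete: if the package list shows python-numpy, the bootstrap line for Python 2.6 would look roughly like the sketch below. It assumes the python26-numpy package exists for the Amazon Linux version behind your AMI, so check the package list first.

```yaml
runners:
  emr:
    bootstrap:
    # python-numpy in the package list becomes python26-numpy for Python 2.6
    - sudo yum install -y python26-numpy
```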

2.x AMIs

The 2.x AMIs are based on a version of Debian that is so old it has been “archived,” which makes their package installer, apt-get, no longer work out-of-the-box.

If you must use the 2.x AMIs, you can get apt-get working again by fixing /etc/apt/sources.list and running apt-get update. For example, to install pip for Python 2.6:

runners:
  emr:
    bootstrap:
    - echo "deb http://archive.debian.org/debian/ squeeze main contrib non-free" | sudo tee /etc/apt/sources.list > /dev/null
    - sudo apt-get update
    - sudo apt-get install -y python-pip

Note

pip-2.7 is already installed by default on AMI version 2.4.3 and later.

See the full list of Squeeze packages for all the (very old versions of) software you can install.

Installing Python from source

If you really must use a version of Python that’s not available on EMR (e.g. Python 3.5 or a very specific patch version), you can download and compile Python from source.

Note

This adds an extra 5 to 10 minutes before the cluster can run your job.

Here’s how you download and install a Python tarball:

runners:
  emr:
    bootstrap:
    - wget -S -T 10 -t 5 https://www.python.org/ftp/python/x.y.z/Python-x.y.z.tgz
    - tar xfz Python-x.y.z.tgz
    - cd Python-x.y.z; ./configure && make && sudo make install; cd ..
    bootstrap_python: false
    python_bin: /usr/local/bin/python

(Replace x.y.z with a specific version of Python.)

Python 3.4+ comes with pip by default, but earlier versions do not, so you’ll want to tack on get-pip.py:

runners:
  emr:
    bootstrap:
    ...
    - wget -S -T 10 -t 5 https://bootstrap.pypa.io/get-pip.py
    - sudo /usr/local/bin/python get-pip.py

Also, pip will be installed in /usr/local/bin, which is not in the path for sudo, so use its full path:

runners:
  emr:
    bootstrap:
    ...
    - sudo /usr/local/bin/pip install ...
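Putting the three fragments above together, a complete config might look like the following sketch. x.y.z is still a placeholder for the Python version you choose, ujson is only an example package, and the get-pip.py lines should be needed only for Python versions that do not bundle pip.

```yaml
runners:
  emr:
    bootstrap:
    # download, build, and install Python (replace x.y.z with a real version)
    - wget -S -T 10 -t 5 https://www.python.org/ftp/python/x.y.z/Python-x.y.z.tgz
    - tar xfz Python-x.y.z.tgz
    - cd Python-x.y.z; ./configure && make && sudo make install; cd ..
    # only for Python versions before 3.4, which do not come with pip
    - wget -S -T 10 -t 5 https://bootstrap.pypa.io/get-pip.py
    - sudo /usr/local/bin/python get-pip.py
    # /usr/local/bin is not in sudo's path, so call pip by its full path
    - sudo /usr/local/bin/pip install ujson
    bootstrap_python: false
    python_bin: /usr/local/bin/python
```

Keeping bootstrap_python: false and python_bin at the same level as bootstrap matters here: they are runner options, not bootstrap commands.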