EMR Bootstrapping Cookbook

Bootstrapping allows you to run commands to customize EMR machines, at the time the cluster is created.

When to use bootstrap, and when to use setup

You can use bootstrap and setup together.

Generally, you want to use bootstrap for things that are part of your general production environment, and setup for things that are specific to your particular job. This makes things work as expected if you are using Cluster Pooling.

EMR will generally not allow you to use sudo in setup commands. See Job Environment Setup Cookbook for how to install libraries, etc. without using sudo.

Installing Python packages with pip

The only tricky thing is making sure you install packages for the correct version of Python.

Figure out which version of Python you’ll be running on EMR (see python_bin for defaults).

  • If it’s Python 2, use pip-2.7 (just plain pip also works on AMI 4.3.0 and later)
  • If it’s Python 3, use pip-3.6 on AMI 5.20.0+, and pip-3.4 for earlier AMIs

For example, to install ujson on Python 2:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install ujson

See PyPI for a the full list of available Python packages.

You can also install packages from a requirements file:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install -r /local/path/of/requirements.txt#

Or a tarball:

runners:
  emr:
    bootstrap:
    - sudo pip-2.7 install /local/path/of/tarball.tar.gz#

Warning

If you’re trying to run jobs on AMI version 3.0.0 (protip: don’t do that) pip appears not to work due to out-of-date SSL certificate information.

Installing PyPy

First, download the version of PyPy you want to use from Portable PyPy Distributions for Linux.

Then instruct EMR to un-tar it and link to the binary in /usr/bin. For example:

runners:
  emr:
    bootstrap:
    - sudo tar xvfj /local/path/to/pypy-7.1.1-linux_x86_64-portable.tar.bz2# -C /opt
    - sudo ln -s /opt/pypy-7.1.1-linux_x86_64-portable/bin/pypy /usr/bin/pypy

Installing System Packages

EMR gives you access to a variety of different Amazon Machine Images, or AMIs for short (see image_version).

3.x and later AMIs

Starting with 3.0.0, EMR AMIs use Amazon Linux, which uses yum to install packages. For example, to install NumPy:

runners:
  emr:
    bootstrap:
    - sudo yum install -y python-numpy

(Don’t forget the -y!)

Amazon Linux’s Python packages generally only work for Python 2. If you’re on Python 3, just use pip.

The most recent list of Amazon linux packages can be found here (click on “Packages List” in the left sidebar).

2.x AMIs

Probably not worth the trouble. The 2.x AMIs are based on a version of Debian that is so old it has been “archived,” which makes their package installer, apt-get, no longer work out-of-the-box. Moreover, Python system packages work for Python 2.6, not 2.7.

Instead, just use pip-2.7 to install Python libraries.