EMR Bootstrapping Cookbook
Bootstrapping lets you run commands to customize EMR machines when the cluster is created.
When to use bootstrap, and when to use setup
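Generally, you want to use bootstrap for things that are part of your general production environment, and setup for things that are specific to your particular job. This keeps things working as expected if you are Pooling Clusters, because a pooled cluster is bootstrapped once, when it is created, and may then be reused by later jobs, whereas setup commands run as part of each job.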
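As a rough illustration of that split, here is a minimal sketch of a config that uses both options. The package and the archive name (your-src-code.tar.gz) are placeholders, and pip-2.7 assumes the job runs on Python 2.7:

```yaml
runners:
  emr:
    # bootstrap: runs once per machine, when the cluster is created
    bootstrap:
    - sudo pip-2.7 install ujson
    # setup: runs as part of each job
    # (your-src-code.tar.gz is a hypothetical local archive)
    setup:
    - export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/
```

Anything in bootstrap is shared by every job that lands on the cluster, so job-specific code is better kept in setup.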
Installing Python packages with pip
The only tricky thing is making sure you install packages for the correct version of Python.
Figure out which version of Python you’ll be running on EMR (see python_bin for defaults).
- If it’s Python 2.6, use pip
- If it’s Python 2.7, use pip-2.7
- If it’s Python 3, use pip-3.4
For example, to install ujson on Python 2.7:
    runners:
      emr:
        bootstrap:
        - sudo pip-2.7 install ujson
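The same pattern should carry over to the other Python versions listed above; only the pip command changes. A sketch for Python 3, assuming the job itself is also configured to run under Python 3 (see python_bin):

```yaml
runners:
  emr:
    bootstrap:
    # pip-3.4 is the Python 3 counterpart of pip-2.7 on these AMIs
    - sudo pip-3.4 install ujson
```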
See PyPI for the full list of available Python packages.
You can also install packages from a requirements file:
    runners:
      emr:
        bootstrap:
        - sudo pip-2.7 install -r /local/path/of/requirements.txt#
Or a tarball:
    runners:
      emr:
        bootstrap:
        - sudo pip-2.7 install /local/path/of/tarball.tar.gz#
If for some reason you must run on AMI version 2.4.2 or earlier (protip: don’t do that), see below for how to get pip working.
If you’re trying to run jobs on AMI version 3.0.0 (protip: don’t do that either) pip appears not to work due to out-of-date SSL certificate information.
Installing System Packages
EMR gives you access to a variety of different Amazon Machine Images, or AMIs for short (see image_version).
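Because the package manager and the pip command both depend on the AMI, it can help to pin the AMI explicitly rather than rely on the default. A minimal sketch; the version number here is only an example value, not a recommendation:

```yaml
runners:
  emr:
    image_version: 3.11.0  # example value only; use the AMI version you have tested
```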
3.x and 4.x AMIs
Starting with 3.0.0, EMR AMIs use Amazon Linux, which uses yum to install packages. For example, to install NumPy:
    runners:
      emr:
        bootstrap:
        - sudo yum install -y python27-numpy
(Don’t forget the -y; otherwise yum stops to ask for confirmation.)
Amazon Linux currently has few packages for Python 3 libraries; if you’re on Python 3, just use pip.
Here are the package lists for all the various versions of Amazon Linux used by EMR:
The package lists gloss over Python versions; wherever you see a package python-<lib name>, you’ll want to install python27-<lib name> instead.
2.x AMIs
The 2.x AMIs are based on a version of Debian that is so old it has been “archived,” which makes their package installer, apt-get, no longer work out-of-the-box.
If you must use the 2.x AMIs, you can get apt-get working again by fixing /etc/apt/sources.list and running apt-get update. For example, to install pip for Python:
    runners:
      emr:
        bootstrap:
        # pipe through "sudo tee": with "sudo echo ... > file", the redirect runs without root
        - echo "deb http://archive.debian.org/debian/ squeeze main contrib non-free" | sudo tee /etc/apt/sources.list
        - sudo apt-get update
        - sudo apt-get install -y python-pip
pip-2.7 is already installed by default on AMI version 2.4.3 and later.
See the full list of Squeeze packages for all the (very old versions of) software you can install.
Installing Python from source
If you really must use a version of Python that’s not available on EMR (e.g. Python 3.5 or a very specific patch version), you can download and compile Python from source.
This adds an extra 5 to 10 minutes before the cluster can run your job.
Here’s how you download and install a Python tarball:
    runners:
      emr:
        bootstrap:
        - wget -S -T 10 -t 5 https://www.python.org/ftp/python/x.y.z/Python-x.y.z.tgz
        - tar xfz Python-x.y.z.tgz
        - cd Python-x.y.z; ./configure && make && sudo make install; cd ..
        bootstrap_python: false
        python_bin: /usr/local/bin/python
(Replace x.y.z with a specific version of Python.)
Python 3.4+ comes with pip by default, but earlier versions do not, so you’ll want to tack on get-pip.py:
    runners:
      emr:
        bootstrap:
        ...
        - wget -S -T 10 -t 5 https://bootstrap.pypa.io/get-pip.py
        - sudo /usr/local/bin/python get-pip.py
Also, pip will be installed in /usr/local/bin, which is not in the path for sudo, so use its full path:
    runners:
      emr:
        bootstrap:
        ...
        - sudo /usr/local/bin/pip install ...
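Put together, the fragments in this section might look something like the following. This is only a sketch: x.y.z is still a placeholder you must fill in, ujson stands in for whatever package you actually need, and the get-pip.py steps are only needed for Python versions that do not bundle pip.

```yaml
runners:
  emr:
    bootstrap:
    # download and build Python (replace x.y.z with a real version)
    - wget -S -T 10 -t 5 https://www.python.org/ftp/python/x.y.z/Python-x.y.z.tgz
    - tar xfz Python-x.y.z.tgz
    - cd Python-x.y.z; ./configure && make && sudo make install; cd ..
    # only needed when the chosen Python does not come with pip
    - wget -S -T 10 -t 5 https://bootstrap.pypa.io/get-pip.py
    - sudo /usr/local/bin/python get-pip.py
    # /usr/local/bin is not in sudo's path, so call pip by its full path
    - sudo /usr/local/bin/pip install ujson
    bootstrap_python: false
    python_bin: /usr/local/bin/python
```

The order matters: bootstrap commands run in sequence, so Python has to be built before get-pip.py runs, and pip has to exist before any package is installed.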