mrjob.setup - job environment setup

Utilities for setting up the environment jobs run in by uploading files and running setup scripts.

The general idea is to use Hadoop DistributedCache-like syntax to find and parse expressions like /path/to/file#name_in_working_dir into “path dictionaries” like {'type': 'file', 'path': '/path/to/file', 'name': 'name_in_working_dir'}.

You can then pass these into a WorkingDirManager to keep track of which files need to be uploaded, catch name collisions, and assign names to unnamed paths (e.g. /path/to/file#). Note that WorkingDirManager.name() can take a path dictionary as keyword arguments.

If you need to upload files from the local filesystem to a place where Hadoop can see them (HDFS or S3), we provide UploadDirManager.

Path dictionaries are meant to be immutable; all state is handled by manager classes.
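
For orientation, here is a minimal sketch of that workflow, reusing the PYTHONPATH example from parse_setup_cmd() below (the path is hypothetical, and the auto-assigned name in the comment is typical, not guaranteed):

>>> from mrjob.setup import parse_setup_cmd, WorkingDirManager
>>> tokens = parse_setup_cmd('export PYTHONPATH=$PYTHONPATH:foo.egg#')
>>> wd = WorkingDirManager()
>>> for t in tokens:
...     if isinstance(t, dict):   # parsed hash paths; plain strings are left as-is
...         wd.add(**t)
...
>>> wd.name('file', 'foo.egg')    # lazily auto-assigned; typically 'foo.egg'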

class mrjob.setup.UploadDirManager(prefix)

Represents a directory on HDFS or S3 where we want to upload local files for consumption by Hadoop.

UploadDirManager tries to give files the same name as their filename in the path (for ease of debugging), but handles collisions gracefully.

UploadDirManager assumes URIs do not need to be uploaded and thus does not store them. uri() maps URIs to themselves.

add(path)
Add a path. If path hasn’t been added before, assign it a name.
If path is a URI, don’t add it; just return the URI.
Returns: the URI assigned to the path

path_to_uri()

Get a map from path to URI for all paths that were added, so we can figure out which files we need to upload.

uri(path)

Get the URI for the given path. If path is a URI, just return it.
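
A hedged sketch of how these methods fit together (the prefix and paths are hypothetical, and the exact URIs shown are assumptions based on the prefix):

>>> from mrjob.setup import UploadDirManager
>>> um = UploadDirManager('hdfs:///tmp/mrjob/files/')
>>> uri = um.add('/home/me/wordcount.py')  # e.g. 'hdfs:///tmp/mrjob/files/wordcount.py'
>>> um.add('s3://bucket/lib.py')           # already a URI: returned as-is, not tracked
's3://bucket/lib.py'
>>> um.path_to_uri()                       # only the local path needs uploading
{'/home/me/wordcount.py': 'hdfs:///tmp/mrjob/files/wordcount.py'}
>>> um.uri('s3://bucket/lib.py')           # URIs map to themselves
's3://bucket/lib.py'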

class mrjob.setup.WorkingDirManager(archive_file_suffix='')

Represents the working directory of Hadoop/Spark tasks (or bootstrap commands in the cloud).

To support Hadoop’s distributed cache, paths can be for ordinary files, or for archives (which are automatically uncompressed into a directory by Hadoop).

When adding a file, you may optionally assign it a name; if you don’t, we’ll lazily assign it a name as needed. Name collisions are not allowed, so being lazy makes it easier to avoid unintended collisions.

If you wish, you may assign multiple names to the same file, or add a path as both a file and an archive (though not mapped to the same name).

add(type, path, name=None)

Add a path as either a file or an archive, optionally assigning it a name.

Parameters:
  • type – either 'archive' or 'file'
  • path – path/URI to add
  • name – optional name that this path must be assigned, or None to assign this file a name later.

If type is 'archive', we’ll also add path as an auto-named archive_file. This reserves space in the working dir in case we need to copy the archive into the working dir and un-archive it ourselves.
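
An illustrative sketch of adding files and archives (the paths are hypothetical; the comments describe typical behavior, not guaranteed output):

>>> from mrjob.setup import WorkingDirManager
>>> wd = WorkingDirManager()
>>> wd.add('file', '/home/me/lib/helpers.py')               # name assigned lazily
>>> wd.add('file', '/home/me/lib/helpers.py', name='h.py')  # second name, same path
>>> wd.add('archive', '/home/me/data.tar.gz', name='data')  # unpacked into a dir named 'data'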

name(type, path, name=None)

Get the name for a path previously added to this WorkingDirManager, assigning one as needed.

This is primarily for getting the name of auto-named files. If the file was added with an assigned name, you must include it (and we’ll just return name).

We won’t ever give an auto-name that’s the same as an assigned name (even for the same path and type).

Parameters:
  • type – either 'archive' or 'file'
  • path – path/URI
  • name – known name of the file

name_to_path(type=None)

Get a map from name (in the setup directory) to path for all known files/archives, so we can build -file and -archive options to Hadoop (or fake them in a bootstrap script).

Parameters: type – either 'archive' or 'file'

paths(type=None)

Get a set of all paths tracked by this WorkingDirManager.
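
Continuing the sketch above, a hedged example of querying names and building Hadoop-style path#name arguments (the variable names are illustrative):

>>> wd.name('file', '/home/me/lib/helpers.py', 'h.py')   # assigned names are echoed back
'h.py'
>>> auto = wd.name('file', '/home/me/lib/helpers.py')    # auto-name, typically 'helpers.py'
>>> hadoop_files = ['%s#%s' % (path, name)
...                 for name, path in sorted(wd.name_to_path('file').items())]
>>> wd.paths()                                           # set of all tracked paths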

mrjob.setup.name_uniquely(path, names_taken=(), proposed_name=None, unhide=False, strip_ext=False, suffix='')

Come up with a unique name for path.

Parameters:
  • names_taken – a dictionary or set of names not to use.
  • proposed_name – name to use if it is not taken. If this is not set, we propose a name based on the filename.
  • unhide – make sure final name doesn’t start with periods or underscores
  • strip_ext – if we propose a name, it shouldn’t have a file extension
  • suffix – if set to a string, add this to the end of any filename we propose. Should include the '.'.

If the proposed name is taken, we add a number to the end of the filename, keeping the extension the same. For example:

>>> name_uniquely('foo.txt', {'foo.txt'})
'foo-1.txt'
>>> name_uniquely('bar.tar.gz', {'bar'}, strip_ext=True)
'bar-1'
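
As a further illustrative example (not part of the original doctests): a proposed_name that is not already taken should be returned unchanged:

>>> name_uniquely('/tmp/foo.txt', {'foo.txt'}, proposed_name='data.txt')
'data.txt'
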
mrjob.setup.parse_legacy_hash_path(type, path, must_name=None)

Parse hash paths from old setup/bootstrap options.

This is similar to parsing hash paths out of shell commands (see parse_setup_cmd()) except that we pass in path type explicitly, and we don’t always require the # character.

Parameters:
  • type – Type of the path ('archive' or 'file')
  • path – Path to parse, possibly with a #
  • must_name – If set, use path’s filename as its name if there is no '#' in path, and raise an exception if there is just a '#' with no name. Set must_name to the name of the relevant option so we can print a useful error message. This is intended for options like upload_files that merely upload a file without tracking it.
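
A hedged sketch (hypothetical paths; the returned dictionary follows the path-dictionary format described at the top of this page):

>>> from mrjob.setup import parse_legacy_hash_path
>>> d = parse_legacy_hash_path('file', '/home/me/data.db#db')
>>> (d['type'], d['path'], d['name'])
('file', '/home/me/data.db', 'db')
>>> parse_legacy_hash_path('file', '/home/me/data.db', must_name='upload_files')['name']
'data.db'
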
mrjob.setup.parse_setup_cmd(cmd)

Parse a setup/bootstrap command, finding and pulling out Hadoop Distributed Cache-style paths (“hash paths”).

Parameters: cmd (string) – shell command to parse
Returns: a list containing dictionaries (parsed hash paths) and strings (parts of the original command, left unparsed)

Hash paths look like path#name, where path is either a local path or a URI pointing to something we want to upload to Hadoop/EMR, and name is the name we want it to have when we upload it; name is optional (no name means to pick a unique one).

If name is followed by a trailing slash, that indicates path is an archive (e.g. a tarball), and should be unarchived into a directory on the remote system. The trailing slash will also be kept as part of the original command.

If path is followed by a trailing slash, that indicates path is a directory and should be tarballed and later unarchived into a directory on the remote system. The trailing slash will also be kept as part of the original command. You may optionally include a slash after name as well (this will only result in a single slash in the final command).

Parsed hash paths are dictionaries with the keys path, name, and type (either 'file', 'archive', or 'dir').

Most of the time, this function will just do what you expect. Rules for finding hash paths:

  • we only look for hash paths outside of quoted strings
  • path may not contain quotes or whitespace
  • path may not contain : or = unless it is a URI (starts with <scheme>://); this allows you to do stuff like export PYTHONPATH=$PYTHONPATH:foo.egg#.
  • name may not contain whitespace or any of the following characters: '":;><|=/#, so you can do stuff like sudo dpkg -i fooify.deb#; fooify bar

If you really want to include forbidden characters, you may use backslash escape sequences in path and name. (We can’t guarantee Hadoop/EMR will accept them though!). Also, remember that shell syntax allows you to concatenate strings like""this.

Environment variables and ~ (home dir) in path will be resolved (use backslash escapes to stop this). We don’t resolve name because it doesn’t make sense. Environment variables and ~ elsewhere in the command are considered to be part of the script and will be resolved on the remote system.
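
A hedged sketch using the dpkg example above (how the unparsed pieces of the command are split into strings is an implementation detail, so only the parsed hash path is shown):

>>> from mrjob.setup import parse_setup_cmd
>>> tokens = parse_setup_cmd('sudo dpkg -i fooify.deb#; fooify bar')
>>> [t['path'] for t in tokens if isinstance(t, dict)]
['fooify.deb']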