mrjob.util - general utility functions

Utility functions for MRJob

mrjob.util.cmd_line(args)

Build a command line that works in a shell.
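As a rough illustration of what "works in a shell" means here, the sketch below quotes each argument with the standard library's shlex.quote() and joins the results with spaces. The name cmd_line_sketch is hypothetical, and this is an assumption about the behavior, not mrjob's actual implementation.

```python
import shlex


def cmd_line_sketch(args):
    """Join args into one shell-safe string.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    # shlex.quote() wraps any argument containing spaces or shell
    # metacharacters in single quotes and leaves safe arguments alone.
    return ' '.join(shlex.quote(str(arg)) for arg in args)


quoted = cmd_line_sketch(['grep', '-e', 'foo bar', 'my file.txt'])
print(quoted)
```

Splitting the result with shlex.split() should give back the original argument list, which is a handy way to check that the quoting round-trips.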

mrjob.util.expand_path(path)

Resolve ~ (home dir) and environment variables in path.

If path is None, return None.
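The behavior described above can be approximated with os.path.expandvars() and os.path.expanduser(). In this minimal sketch, the function name expand_path_sketch and the environment variable SKETCH_DATA_DIR are made up for the example, and the order of the two expansions is an assumption.

```python
import os


def expand_path_sketch(path):
    """Expand ~ and environment variables in path; pass None through.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    if path is None:
        return None
    # Assumption: variables are expanded first, then the home directory.
    return os.path.expanduser(os.path.expandvars(path))


# SKETCH_DATA_DIR is an invented variable used only for this example.
os.environ['SKETCH_DATA_DIR'] = os.path.join('data', 'raw')
expanded = expand_path_sketch(os.path.join('$SKETCH_DATA_DIR', 'input.txt'))
print(expanded)
```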

mrjob.util.file_ext(filename)

Return the file extension, including the leading dot (.).

>>> file_ext('foo.tar.gz')
'.tar.gz'

mrjob.util.log_to_null(name=None)

Set up a null handler for the logger with the given name, to suppress “no handlers could be found” warnings.
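For readers unfamiliar with null handlers, the sketch below shows the general technique with the standard library's logging.NullHandler. The function name log_to_null_sketch and the logger name 'sketch.library' are hypothetical, and returning the logger is a convenience added for the example.

```python
import logging


def log_to_null_sketch(name=None):
    """Attach a do-nothing handler to the named logger.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    logger = logging.getLogger(name)
    # NullHandler discards every record, so the logging module no longer
    # complains that the logger has no handlers.
    logger.addHandler(logging.NullHandler())
    return logger  # returned only so the example can inspect it


logger = log_to_null_sketch('sketch.library')
logger.warning('this message is silently discarded by the null handler')
```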

mrjob.util.log_to_stream(name=None, stream=None, format=None, level=None, debug=False)

Set up logging.

Parameters:
  • name (str) – name of the logger, or None for the root logger
  • stream (file object) – stream to log to (default is sys.stderr)
  • format (str) – log message format (default is ‘%(message)s’)
  • level – log level to use
  • debug (bool) – quick way of setting the log level: if true, use logging.DEBUG, otherwise use logging.INFO

mrjob.util.parse_and_save_options(option_parser, args)

Return a map from option name (dest) to a list of the arguments in args that correspond to that dest.

This won’t modify option_parser.

Deprecated since version 0.6.0.

mrjob.util.random_identifier()

Return a random 16-digit hex string.
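One way to produce such a string is to draw 64 random bits and format them as zero-padded hexadecimal. This is a sketch of the idea under that assumption; random_identifier_sketch is a hypothetical name, and mrjob may generate its identifiers differently.

```python
import random


def random_identifier_sketch():
    """Return 16 lowercase hex digits (64 random bits).

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    # '016x' pads with leading zeros so the result is always 16 characters.
    return format(random.getrandbits(64), '016x')


identifier = random_identifier_sketch()
print(identifier)
```

The random module is not suitable for security-sensitive tokens; this kind of identifier is only meant to make names unlikely to collide.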

mrjob.util.read_file(path, fileobj=None, yields_lines=True, cleanup=None)

Yields lines from a file, possibly decompressing it based on file extension.

Currently we handle compressed files with the extensions .gz and .bz2.

Parameters:
  • path (string) – file path. Need not be a path on the local filesystem (URIs are okay) as long as you specify fileobj too.
  • fileobj – file object to read from. Need not be seekable. If this is omitted, we open(path).
  • yields_lines – Does iterating over fileobj yield lines (like file objects are supposed to)? If not, set this to False (useful for objects that correspond to objects on cluster filesystems)
  • cleanup – Optional callback to call with no arguments when EOF is reached or an exception is thrown.

Deprecated since version 0.6.0.

mrjob.util.read_input(path, stdin=None)

Stream input the way Hadoop would.

  • Resolve globs (foo_*.gz).
  • Decompress .gz and .bz2 files.
  • If path is '-', read from stdin.
  • If path is a directory, recursively read its contents.

You can redefine stdin for ease of testing. stdin can actually be any iterable that yields lines (e.g. a list).

Deprecated since version 0.6.0.

mrjob.util.safeeval(expr, globals=None, locals=None)

Like eval, but with nearly everything in the environment blanked out, so that it’s difficult to cause mischief.

globals and locals are optional dictionaries mapping names to values for those names (just like in eval()).

mrjob.util.save_current_environment(*args, **kwds)

Context manager that saves os.environ and loads it back again after execution.
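The save-and-restore pattern can be sketched with contextlib.contextmanager. The name save_environment_sketch and the variable SKETCH_TEMP_VAR are invented for this example, and restoring in a finally clause is an assumption about the intended behavior.

```python
import os
from contextlib import contextmanager


@contextmanager
def save_environment_sketch():
    """Snapshot os.environ and restore it when the block exits.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    original = os.environ.copy()
    try:
        yield
    finally:
        # Restore even if the body raised an exception.
        os.environ.clear()
        os.environ.update(original)


# SKETCH_TEMP_VAR is an invented variable used only for this example.
os.environ.pop('SKETCH_TEMP_VAR', None)
with save_environment_sketch():
    os.environ['SKETCH_TEMP_VAR'] = '1'
    inside = os.environ.get('SKETCH_TEMP_VAR')
outside = os.environ.get('SKETCH_TEMP_VAR')
print(inside, outside)
```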

mrjob.util.save_cwd(*args, **kwds)

Context manager that saves the current working directory, and chdir’s back to it after execution.
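A comparable sketch for the working directory follows. The name save_cwd_sketch is hypothetical, and the example changes into the system temporary directory only to have somewhere else to go.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def save_cwd_sketch():
    """Remember the current directory and chdir back on exit.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    original = os.getcwd()
    try:
        yield
    finally:
        os.chdir(original)


before = os.getcwd()
with save_cwd_sketch():
    os.chdir(tempfile.gettempdir())  # wander off somewhere else
after = os.getcwd()
print(before == after)
```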

mrjob.util.shlex_split(s)

Wrapper around shlex.split() that converts s to str on Python versions earlier than 2.7.3, the release in which shlex.split() gained unicode support.

mrjob.util.strip_microseconds(delta)

Return the given datetime.timedelta, without microseconds.

Useful for printing datetime.timedelta objects.
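To show why this helps with printing, the sketch below rebuilds a timedelta from its days and seconds fields only. The name strip_microseconds_sketch is hypothetical; it illustrates the idea and is not necessarily how mrjob implements it.

```python
from datetime import timedelta


def strip_microseconds_sketch(delta):
    """Return a copy of delta with the microseconds dropped.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    # timedelta stores days, seconds and microseconds separately,
    # so leaving out the last field truncates to whole seconds.
    return timedelta(days=delta.days, seconds=delta.seconds)


elapsed = timedelta(hours=1, minutes=2, seconds=3, microseconds=456789)
print(elapsed)                             # 1:02:03.456789
print(strip_microseconds_sketch(elapsed))  # 1:02:03
```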

mrjob.util.to_lines(chunks)

Take in data as a sequence of chunks of bytes, and yield it one line at a time.

Only breaks lines on \n (not \r), and does not add a trailing newline.

For efficiency, passes through anything with a readline() attribute.
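The three rules above (split only on \n, keep a final partial line as-is, pass file-like objects through) can be sketched as follows. The names to_lines_sketch and _split_chunks are hypothetical, and the simple buffering strategy is an assumption chosen for clarity over speed.

```python
def to_lines_sketch(chunks):
    """Re-split an iterable of byte chunks into lines.

    Hypothetical stand-in for illustration; not mrjob's own code.
    """
    # Anything with readline() already yields lines, so pass it through.
    if hasattr(chunks, 'readline'):
        return chunks
    return _split_chunks(chunks)


def _split_chunks(chunks):
    pending = b''  # bytes seen so far that do not yet end in b'\n'
    for chunk in chunks:
        pending += chunk
        while True:
            newline_at = pending.find(b'\n')
            if newline_at == -1:
                break
            yield pending[:newline_at + 1]
            pending = pending[newline_at + 1:]
    if pending:
        # Final partial line: emitted without adding a trailing newline.
        yield pending


lines = list(to_lines_sketch([b'one\ntw', b'o\r\nthr', b'ee']))
print(lines)
```

Note that the \r in the second line is kept as ordinary data, since only \n ends a line.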

mrjob.util.unarchive(archive_path, dest)

Extract the contents of a tar or zip file at archive_path into the directory dest.

Parameters:
  • archive_path (str) – path to archive file
  • dest (str) – path to directory where archive will be extracted

dest will be created if it doesn’t already exist.

tar files can be gzip compressed, bzip2 compressed, or uncompressed. Files within zip files can be deflated or stored.

mrjob.util.unique(items)

Yield items from items in order, skipping duplicates.
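Order-preserving de-duplication is commonly written as a generator that tracks what it has already seen. The sketch below assumes the items are hashable; unique_sketch is a hypothetical name and not mrjob's own code.

```python
def unique_sketch(items):
    """Yield each distinct item once, in first-seen order.

    Hypothetical stand-in for illustration; assumes hashable items.
    """
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


deduplicated = list(unique_sketch([3, 1, 3, 2, 1]))
print(deduplicated)  # [3, 1, 2]
```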

mrjob.util.which(cmd, path=None)

Like the UNIX which command: search in path for the executable named cmd. path defaults to PATH. Returns None if no such executable found.

This is basically shutil.which() (which was introduced in Python 3.3) without the mode argument. Best practice is to always specify path as a keyword argument.
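Since the function is described as close to shutil.which(), the standard library call can illustrate the keyword-argument style recommended above. This sketch uses shutil.which() directly and assumes the running Python interpreter is an executable file in its own directory; the missing command name is made up.

```python
import os
import shutil
import sys

# Search only the directory that holds the running interpreter.
python_dir = os.path.dirname(sys.executable)
python_name = os.path.basename(sys.executable)

# Passing path as a keyword argument avoids confusing it with mode,
# which is the second positional parameter of shutil.which().
found = shutil.which(python_name, path=python_dir)
missing = shutil.which('no-such-command-for-this-sketch', path=python_dir)

print(found)
print(missing)  # None
```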

mrjob.util.zip_dir(dir, out_path, filter=None, prefix='')

Compress the given dir into a zip file at out_path.

If we encounter symlinks, include the actual file, not the symlink.

Parameters:
  • dir (str) – directory to zip up
  • out_path (str) – where to write the zip file to
  • filter – if defined, a function that takes paths (relative to dir) and returns True if we should keep them
  • prefix (str) – subdirectory inside the zip file to put everything into (e.g. 'mrjob')
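The parameters above can be tied together with a small sketch built on the standard zipfile module. The name zip_dir_sketch is hypothetical, the use of os.walk(followlinks=True) and ZIP_DEFLATED are assumptions, and the file names in the usage are invented for the example.

```python
import os
import tempfile
import zipfile


def zip_dir_sketch(dir, out_path, filter=None, prefix=''):
    """Zip the files under dir into out_path.

    Hypothetical stand-in for illustration; not mrjob's own code.
    Parameter names shadow builtins only to mirror the documented signature.
    """
    with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        # followlinks=True descends into symlinked directories; write()
        # stores the contents of symlinked files, not the links themselves.
        for dirpath, _dirnames, filenames in os.walk(dir, followlinks=True):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                rel_path = os.path.relpath(full_path, dir)
                if filter is not None and not filter(rel_path):
                    continue
                archive.write(full_path, os.path.join(prefix, rel_path))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'src')
    os.makedirs(os.path.join(src, 'sub'))
    for rel in ('keep.py', os.path.join('sub', 'skip.pyc')):
        with open(os.path.join(src, rel), 'w') as f:
            f.write('example\n')

    out_path = os.path.join(tmp, 'out.zip')  # kept outside src on purpose
    zip_dir_sketch(src, out_path,
                   filter=lambda p: not p.endswith('.pyc'), prefix='mrjob')
    with zipfile.ZipFile(out_path) as archive:
        names = sorted(archive.namelist())

print(names)  # ['mrjob/keep.py']
```

Writing the zip file outside the directory being compressed avoids the archive trying to include itself.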