mrjob.cat - auto-decompress files based on extension

This module emulates the way Hadoop handles input files, decompressing compressed files based on their file extension.

This module also functions as a cat substitute that can handle compressed files. It is used by local mode and can function without the rest of the mrjob library.

mrjob.cat.bunzip2_stream(fileobj, bufsize=1024)

Decompress bzip2-compressed data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
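A minimal sketch of how such a streaming decompressor can be built on the standard library's bz2.BZ2Decompressor (the body below is illustrative, not mrjob's actual implementation):

```python
import bz2
import io

def bunzip2_stream(fileobj, bufsize=1024):
    # Feed bufsize-byte reads through an incremental bz2 decompressor
    decomp = bz2.BZ2Decompressor()
    while True:
        data = fileobj.read(bufsize)
        if not data:
            break
        chunk = decomp.decompress(data)
        if chunk:  # the decompressor may buffer input and return b''
            yield chunk
```

Because decompression is incremental, memory use stays bounded by bufsize plus the decompressor's internal buffer, regardless of how large the file is.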

mrjob.cat.decompress(readable, path, bufsize=1024)

Take readable, which supports the .read() method, corresponding to the given path, and return an iterator that yields chunks of bytes, possibly decompressing based on path.

If readable appears to be a fileobj, pass it through as-is.

Unlike open_input(), this can deal with objects that don't support the full fileobj interface (e.g. boto3's StreamingBody).
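The extension-based dispatch can be sketched with the standard library's incremental decompressors; this is a simplified illustration, not mrjob's actual code:

```python
import bz2
import gzip
import io
import zlib

def decompress(readable, path, bufsize=1024):
    # Pick a streaming decompressor based on path's extension
    if path.endswith('.bz2'):
        decomp = bz2.BZ2Decompressor()
    elif path.endswith('.gz'):
        # wbits=32 + MAX_WBITS tells zlib to expect a gzip header
        decomp = zlib.decompressobj(32 + zlib.MAX_WBITS)
    else:
        decomp = None  # no known compression; pass chunks through

    while True:
        data = readable.read(bufsize)
        if not data:
            break
        chunk = decomp.decompress(data) if decomp else data
        if chunk:
            yield chunk
```

Note that only read() is ever called on readable, which is why this style of helper also works with partial fileobjs like boto3's StreamingBody.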

mrjob.cat.gunzip_stream(fileobj, bufsize=1024)

Decompress gzipped data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
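The warning above applies to both streaming decompressors: a yielded chunk can begin or end mid-line. The sketch below shows a gzip streamer built on zlib plus the kind of line-regrouping wrapper that to_lines() provides (both bodies are illustrative; to_lines() itself lives elsewhere in mrjob):

```python
import gzip
import io
import zlib

def gunzip_stream(fileobj, bufsize=1024):
    # wbits=32 + MAX_WBITS lets zlib handle the gzip header transparently
    decomp = zlib.decompressobj(32 + zlib.MAX_WBITS)
    while True:
        data = fileobj.read(bufsize)
        if not data:
            break
        chunk = decomp.decompress(data)
        if chunk:
            yield chunk
    tail = decomp.flush()  # emit any data still buffered at EOF
    if tail:
        yield tail

def to_lines(chunks):
    # Regroup a stream of byte chunks into complete lines
    leftover = b''
    for chunk in chunks:
        leftover += chunk
        lines = leftover.split(b'\n')
        leftover = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line + b'\n'
    if leftover:
        yield leftover

# Usage: decompress a gzipped payload and iterate over whole lines
gz = io.BytesIO(gzip.compress(b'alpha\nbeta\ngamma\n'))
lines = list(to_lines(gunzip_stream(gz, bufsize=4)))
```

With a tiny bufsize like 4, raw chunks split words and lines arbitrarily; the wrapper reassembles them so each yielded item is one complete line.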

mrjob.cat.open_input(path)

Open the given path and return a fileobj or, if it’s a compressed file, a fileobj-like object.

mrjob.cat.to_chunks(readable, bufsize=1024)

Convert readable, which is any object supporting read() (e.g. fileobjs), into a generator of non-empty chunks of bytes.
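This helper is simple enough to sketch in full (again, an illustration rather than mrjob's exact code):

```python
import io

def to_chunks(readable, bufsize=1024):
    # Repeatedly read() until EOF, yielding only non-empty chunks
    while True:
        chunk = readable.read(bufsize)
        if not chunk:
            return
        yield chunk
```

The empty-chunk check matters: read() returning b'' is the EOF signal, and filtering it out guarantees consumers never see a zero-length chunk.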