mrjob.cat - decompress files based on extension

This module emulates the way Hadoop handles input files, decompressing compressed files based on their file extension.

This module also functions as a cat substitute that can handle compressed files. It is used by local mode and can function without the rest of the mrjob library.

mrjob.cat.bunzip2_stream(fileobj, bufsize=1024)

Decompress bzip2-compressed data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
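The chunk-at-a-time behavior can be illustrated with a minimal sketch built on the standard library's bz2.BZ2Decompressor. This is not mrjob's actual implementation, just the general technique: read a fixed-size chunk, feed it to the decompressor, and yield whatever decompressed bytes come out.

```python
import bz2
import io


def bunzip2_stream_sketch(fileobj, bufsize=1024):
    """Yield chunks of decompressed bytes from bzip2 data on the fly.

    A sketch of streaming bzip2 decompression; chunks do NOT
    correspond to lines, matching the warning above.
    """
    decomp = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decomp.decompress(chunk)
        if data:  # a compressed chunk may not yet yield any output
            yield data


# round-trip check: compress, then stream-decompress
payload = b'line 1\nline 2\n'
chunks = list(bunzip2_stream_sketch(io.BytesIO(bz2.compress(payload))))
assert b''.join(chunks) == payload
```

Note that only whole-stream content is reassembled; splitting into lines is left to a wrapper such as to_lines().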

mrjob.cat.decompress(readable, path, bufsize=1024)

Takes a readable which supports the .read() method, corresponding to the given path, and returns an iterator that yields chunks of bytes, possibly decompressing them based on path.

If readable appears to be a fileobj, pass it through as-is.

If readable does not have a read() method, assume that it's a generator that yields chunks of bytes.
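The extension-based dispatch described above can be sketched as follows. The helper names (_decomp_chunks, _chunks) are hypothetical, and the dispatch logic is a simplified assumption about how such a function works, not mrjob's actual code.

```python
import bz2
import gzip
import io
import zlib


def decompress_sketch(readable, path, bufsize=1024):
    """Yield chunks of bytes, decompressing based on path's extension."""
    if not hasattr(readable, 'read'):
        # assume it's already a generator of chunks of bytes
        return readable
    if path.endswith('.bz2'):
        return _decomp_chunks(readable, bz2.BZ2Decompressor(), bufsize)
    if path.endswith('.gz'):
        # wbits=32 + 15 makes zlib auto-detect the gzip header
        return _decomp_chunks(readable, zlib.decompressobj(32 + zlib.MAX_WBITS), bufsize)
    return _chunks(readable, bufsize)  # uncompressed: pass bytes through


def _decomp_chunks(fileobj, decomp, bufsize):
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decomp.decompress(chunk)
        if data:
            yield data


def _chunks(fileobj, bufsize):
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        yield chunk


# demo: the same call handles gzipped and plain input
data = b'line 1\nline 2\n'
assert b''.join(decompress_sketch(io.BytesIO(gzip.compress(data)), 'input.gz')) == data
assert b''.join(decompress_sketch(io.BytesIO(data), 'input.txt')) == data
```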

mrjob.cat.gunzip_stream(fileobj, bufsize=1024)

Decompress gzipped data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
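Streaming gzip decompression can be sketched with zlib.decompressobj, which (with the wbits trick shown) decodes gzip data incrementally without loading the whole file. This is an illustration of the technique, not mrjob's actual implementation.

```python
import gzip
import io
import zlib


def gunzip_stream_sketch(fileobj, bufsize=1024):
    """Yield chunks of decompressed bytes from gzipped data on the fly.

    wbits=32 + MAX_WBITS tells zlib to auto-detect the gzip header,
    so the whole stream never needs to be held in memory at once.
    """
    decomp = zlib.decompressobj(32 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decomp.decompress(chunk)
        if data:
            yield data


# round-trip check with a payload larger than one chunk
payload = b'x' * 5000
assert b''.join(gunzip_stream_sketch(io.BytesIO(gzip.compress(payload)))) == payload
```

As the warning says, the yielded chunks fall on arbitrary byte boundaries, not line boundaries.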

mrjob.cat.to_chunks(readable, bufsize=1024)

Convert readable, which is any object supporting read() (e.g. fileobjs), to a stream of non-empty chunks of bytes.

If readable has an __iter__ method but not a read method, pass through as-is.
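Both behaviors can be captured in a short sketch: read fixed-size chunks from anything with a read() method, but hand back iterables that lack one unchanged. This is an assumption-level illustration, not mrjob's actual code.

```python
import io


def to_chunks_sketch(readable, bufsize=1024):
    """Yield non-empty chunks of bytes from readable.

    If readable has __iter__ but no read() method, pass it
    through as-is.
    """
    if hasattr(readable, '__iter__') and not hasattr(readable, 'read'):
        return readable

    def gen():
        while True:
            chunk = readable.read(bufsize)
            if not chunk:  # empty read() means end of stream
                break
            yield chunk

    return gen()


# fileobjs are chunked; plain generators are passed through untouched
assert list(to_chunks_sketch(io.BytesIO(b'a' * 2000))) == [b'a' * 1024, b'a' * 976]
gen = (c for c in [b'x', b'y'])
assert to_chunks_sketch(gen) is gen
```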