mrjob.cat - auto-decompress files based on extension

Emulates the way Hadoop handles input files, decompressing compressed files based on their file extension.

This module also functions as a cat substitute that can handle compressed files. It is used by local mode and can function without the rest of the mrjob library.

mrjob.cat.bunzip2_stream(fileobj, bufsize=1024)

Decompress bzip2-compressed data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
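To illustrate the chunks-versus-lines distinction, here is a minimal standard-library sketch. It is not mrjob's implementation: bunzip2_chunks() and split_lines() are hypothetical stand-ins for bunzip2_stream() and to_lines(), and the sketch assumes a single bzip2 stream per file.

```python
import bz2
import io


def bunzip2_chunks(fileobj, bufsize=1024):
    """Yield decompressed byte chunks from a bzip2-compressed file object.

    Hypothetical stand-in for illustration; handles one bzip2 stream only.
    """
    decompressor = bz2.BZ2Decompressor()
    while not decompressor.eof:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data


def split_lines(chunks):
    """Regroup arbitrary byte chunks into newline-terminated lines."""
    leftover = b''
    for chunk in chunks:
        pieces = (leftover + chunk).split(b'\n')
        # The last piece is whatever follows the final newline (maybe empty).
        leftover = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if leftover:
        yield leftover


compressed = bz2.compress(b'first line\nsecond line\nthird')
lines = list(split_lines(bunzip2_chunks(io.BytesIO(compressed), bufsize=16)))
```

The chunk generator knows nothing about newlines, so a line may start in one chunk and end in the next; the second generator buffers the partial line until its terminator arrives.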

mrjob.cat.decompress(fileobj, path, bufsize=1024)

Take a fileobj corresponding to the given path and return an iterator that yields chunks of bytes, or, if path doesn’t correspond to a compressed file type, fileobj itself.
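The extension-based dispatch described here can be sketched with the standard library alone. This is an illustration under assumptions, not mrjob's code: decompress_by_extension() and _chunks() are hypothetical names, and the sketch assumes that only the .gz and .bz2 extensions count as compressed.

```python
import bz2
import gzip
import io
import zlib


def _chunks(decompressor, fileobj, bufsize):
    """Feed fileobj through decompressor, yielding non-empty byte chunks."""
    while not decompressor.eof:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data


def decompress_by_extension(fileobj, path, bufsize=1024):
    """Hypothetical helper: choose a decompressor from path's extension."""
    if path.endswith('.gz'):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return _chunks(zlib.decompressobj(16 + zlib.MAX_WBITS), fileobj, bufsize)
    elif path.endswith('.bz2'):
        return _chunks(bz2.BZ2Decompressor(), fileobj, bufsize)
    else:
        # Not a recognized compressed type: hand back the file object itself.
        return fileobj


raw = b'hello\nworld\n'
from_gz = b''.join(decompress_by_extension(io.BytesIO(gzip.compress(raw)), 'input.gz'))
from_bz2 = b''.join(decompress_by_extension(io.BytesIO(bz2.compress(raw)), 'input.bz2'))
```

Note that the decision depends only on the path, never on the file's contents, so a gzipped file with a .txt name would pass through undecompressed in this sketch.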

mrjob.cat.gunzip_stream(fileobj, bufsize=1024)

Decompress gzipped data on the fly.

Parameters:
  • fileobj – object supporting read()
  • bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.

Warning

This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
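As a rough picture of what on-the-fly gzip decompression involves, the following sketch reads bufsize bytes at a time and feeds them to a zlib decompression object. It is an assumption-laden illustration rather than mrjob's implementation, and gunzip_chunks() is a hypothetical name.

```python
import gzip
import io
import zlib


def gunzip_chunks(fileobj, bufsize=1024):
    """Yield decompressed byte chunks from a gzip-compressed file object.

    Hypothetical stand-in for illustration; handles one gzip member only.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything zlib is still holding once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


original = b'alpha\nbeta\ngamma\n'
compressed = gzip.compress(original)
chunks = list(gunzip_chunks(io.BytesIO(compressed), bufsize=8))
```

Only bufsize compressed bytes are held in memory per read, which is the point of streaming; the size and boundaries of the yielded chunks depend on the compressed data and carry no relation to line breaks.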