mrjob.cat - decompress files based on extension
Emulates the way Hadoop handles input files by decompressing compressed files based on their file extension.
This module also functions as a cat substitute that can handle
compressed files. It is used by local mode and can
function without the rest of the mrjob library.
-
mrjob.cat.bunzip2_stream(fileobj, bufsize=1024)
Decompress bzip2-compressed data on the fly.
Parameters:
- fileobj – object supporting read()
- bufsize – number of bytes to read from fileobj at a time.
Warning
This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
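The chunk-at-a-time behaviour described above can be illustrated with a minimal sketch built only on the standard library. This is not mrjob's implementation: the name bunzip2_chunks is hypothetical, and the sketch assumes the input holds a single bzip2 stream.

```python
import bz2
import io


def bunzip2_chunks(fileobj, bufsize=1024):
    """Yield decompressed chunks from a bzip2-compressed file object.

    Hypothetical stand-in for illustration; assumes a single bzip2 stream.
    """
    decompressor = bz2.BZ2Decompressor()
    while True:
        data = fileobj.read(bufsize)
        if not data:
            return
        chunk = decompressor.decompress(data)
        if chunk:
            # chunks are arbitrary byte runs, not lines
            yield chunk
        if decompressor.eof:
            # end of the bzip2 stream; stop instead of feeding more data
            return


original = b"first line\nsecond line\n" * 200
compressed = io.BytesIO(bz2.compress(original))
chunks = list(bunzip2_chunks(compressed, bufsize=64))
restored = b"".join(chunks)
```

Joining the chunks gives back the original bytes, but the boundaries between chunks follow the decompressor's output, not the newlines in the data.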
-
mrjob.cat.decompress(readable, path, bufsize=1024)
Take a readable that supports the .read() method and corresponds to the given path, and return an iterator that yields chunks of bytes, possibly decompressing based on path.
If readable appears to be a fileobj, pass it through as-is.
If readable does not have a read() method, assume that it's a generator that yields chunks of bytes.
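A rough sketch of extension-based dispatch, using only the standard library, may make the idea concrete. The helper names read_chunks and decompress_by_extension and the file names are hypothetical, and the choice of .gz and .bz2 as the recognised extensions is an assumption made for this example, not a statement about mrjob's internals.

```python
import bz2
import gzip
import io
import zlib


def read_chunks(fileobj, bufsize=1024):
    """Yield non-empty chunks read from fileobj until it is exhausted."""
    while True:
        data = fileobj.read(bufsize)
        if not data:
            return
        yield data


def decompress_by_extension(fileobj, path, bufsize=1024):
    """Pick a decompressor from the extension of path (hypothetical helper)."""
    if path.endswith('.gz'):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    elif path.endswith('.bz2'):
        decompressor = bz2.BZ2Decompressor()
    else:
        # unrecognised extension: hand back the raw chunks unchanged
        yield from read_chunks(fileobj, bufsize)
        return
    for data in read_chunks(fileobj, bufsize):
        chunk = decompressor.decompress(data)
        if chunk:
            yield chunk


payload = b"hello from hadoop-style input\n" * 50
gz_result = b"".join(decompress_by_extension(
    io.BytesIO(gzip.compress(payload)), "part-00000.gz"))
bz2_result = b"".join(decompress_by_extension(
    io.BytesIO(bz2.compress(payload)), "part-00001.bz2"))
plain_result = b"".join(decompress_by_extension(
    io.BytesIO(payload), "part-00002.txt"))
```

All three results equal the original payload; only the path suffix decides which decompressor, if any, is applied.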
-
mrjob.cat.gunzip_stream(fileobj, bufsize=1024)
Decompress gzipped data on the fly.
Parameters:
- fileobj – object supporting read()
- bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.
Warning
This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
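To show why the warning matters, here is a small sketch that streams gzip data in chunks and then regroups the chunks into lines. Both gunzip_chunks and split_lines are hypothetical helpers written for this example with the standard library; they are stand-ins, not the mrjob functions themselves.

```python
import gzip
import io
import zlib


def gunzip_chunks(fileobj, bufsize=1024):
    """Yield decompressed chunks from a gzip-compressed file object."""
    # 16 + MAX_WBITS makes zlib parse the gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        data = fileobj.read(bufsize)
        if not data:
            return
        chunk = decompressor.decompress(data)
        if chunk:
            yield chunk


def split_lines(chunks):
    """Regroup arbitrary byte chunks into newline-terminated lines."""
    leftover = b""
    for chunk in chunks:
        leftover += chunk
        # everything before the last newline is complete; keep the rest
        *lines, leftover = leftover.split(b"\n")
        for line in lines:
            yield line + b"\n"
    if leftover:
        # final line without a trailing newline
        yield leftover


text = b"".join(b"record %d\n" % i for i in range(500))
compressed = gzip.compress(text)
raw_chunks = list(gunzip_chunks(io.BytesIO(compressed), bufsize=32))
lines = list(split_lines(gunzip_chunks(io.BytesIO(compressed), bufsize=32)))
```

The raw chunks concatenate to the original text but may end in the middle of a record, while the regrouped output contains exactly one record per item.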
-
mrjob.cat.to_chunks(readable, bufsize=1024)
Convert readable, which is any object supporting read() (e.g. fileobjs), to a stream of non-empty bytes.
If readable has an __iter__ method but not a read method, pass through as-is.
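The two cases described above can be sketched as follows. The names as_chunks and _read_chunks are hypothetical, and the sketch only mirrors the documented behaviour; it is not taken from mrjob's source.

```python
import io


def _read_chunks(fileobj, bufsize):
    """Yield non-empty chunks until read() returns an empty result."""
    while True:
        data = fileobj.read(bufsize)
        if not data:
            return
        yield data


def as_chunks(readable, bufsize=1024):
    """Return an iterator of byte chunks for readable (hypothetical helper)."""
    if hasattr(readable, 'read'):
        # file-like object: read it bufsize bytes at a time
        return _read_chunks(readable, bufsize)
    # no read() method: assume it already yields chunks and pass it through
    return readable


from_file = list(as_chunks(io.BytesIO(b"abcdefghij"), bufsize=4))

existing = iter([b"one", b"two"])
passed_through = as_chunks(existing)
```

Here from_file is [b"abcd", b"efgh", b"ij"], and passed_through is the very same iterator object that was passed in.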