mrjob.cat - decompress files based on extension¶
Emulate the way Hadoop handles input files: decompress compressed files based on their file extension.
This module also functions as a cat substitute that can handle compressed files. It is used by local mode and can function without the rest of the mrjob library.
-
mrjob.cat.bunzip2_stream(fileobj, bufsize=1024)¶
Decompress bzip2ed data on the fly.
Parameters:
- fileobj – object supporting read()
- bufsize – number of bytes to read from fileobj at a time
Warning
This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
-
mrjob.cat.decompress(readable, path, bufsize=1024)¶
Take a readable which supports the read() method, corresponding to the given path, and return an iterator that yields chunks of bytes, possibly decompressing based on path.
If readable appears to be a fileobj, pass it through as-is.
If readable does not have a read() method, assume that it's a generator that yields chunks of bytes.
-
mrjob.cat.gunzip_stream(fileobj, bufsize=1024)¶
Decompress gzipped data on the fly.
Parameters:
- fileobj – object supporting read()
- bufsize – number of bytes to read from fileobj at a time. The default is the same as in gzip.
Warning
This yields decompressed chunks; it does not split on lines. To get lines, wrap this in to_lines().
-
mrjob.cat.to_chunks(readable, bufsize=1024)¶
Convert readable, which is any object supporting read() (e.g. fileobjs), to a stream of non-empty bytes.
If readable has an __iter__ method but not a read() method, pass it through as-is.