mrjob.protocol - input and output
Protocols translate raw bytes into key, value pairs.
Typically, protocols encode a key and value into bytes, and join them together with a tab character.
However, protocols with Value in their name ignore keys and simply read/write values (with key read in as None), allowing you to read and write data in arbitrary formats.
For more information, see Protocols and Writing custom protocols.
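As a rough illustration of the value-only behavior described above, here is a minimal sketch of a protocol-shaped class. The class name `LineValueSketch` is hypothetical, and the `read`/`write` method names are an assumption based on mrjob's custom-protocol convention; this is not mrjob's own implementation.

```python
class LineValueSketch:
    """Hypothetical value-only protocol; not part of mrjob."""

    def read(self, line):
        # Ignore any notion of a key: the whole line becomes the value.
        return None, line

    def write(self, key, value):
        # The key is discarded on output; only the value is written.
        return value


protocol = LineValueSketch()
key, value = protocol.read(b'any\tbytes, even with tabs')
line = protocol.write('ignored', value)
```

Because the key is never written, the tab inside the value needs no special handling.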
Strings
class mrjob.protocol.RawValueProtocol
Just output value (a str), and discard key (key is read in as None).
This is the default protocol used by jobs to read input.
This is an alias for BytesValueProtocol on Python 2 and TextValueProtocol on Python 3.
class mrjob.protocol.BytesValueProtocol
Read line (without trailing newline) directly into value (key is always None). Output value (bytes) directly, discarding key.
This is the default protocol used by jobs to read input on Python 2.
Warning
Typical usage on Python 2 is to have your mapper parse (byte) strings out of your input files, and then include them in the output to the reducer. Since this output is then (by default) JSON-encoded, encoding will fail if the bytestrings are not UTF-8 decodable. If this is an issue, consider using TextValueProtocol instead.
class mrjob.protocol.TextValueProtocol
Attempt to UTF-8 decode line (without trailing newline) into value, falling back to latin-1 (key is always None). Output value UTF-8 encoded, discarding key.
This is the default protocol used by jobs to read input on Python 3.
This is a good solution for reading text files which are mostly ASCII but may have some other bytes of unknown encoding (e.g. logs).
If you wish to enforce a particular encoding, use BytesValueProtocol instead:

```python
from mrjob.job import MRJob
from mrjob.protocol import BytesValueProtocol


class MREncodingEnforcer(MRJob):

    INPUT_PROTOCOL = BytesValueProtocol

    def mapper(self, _, value):
        # raises UnicodeDecodeError if the line is not valid UTF-8
        value = value.decode('utf_8')
        ...
```
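The UTF-8-then-latin-1 fallback that TextValueProtocol is described as using can be sketched with the standard library alone. The function name `decode_text` is hypothetical, and this is only an approximation of the described behavior, not mrjob's code.

```python
def decode_text(line):
    """Decode bytes as UTF-8, falling back to latin-1 on failure."""
    try:
        return line.decode('utf_8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return line.decode('latin_1')


ascii_line = decode_text(b'GET /index.html')
utf8_line = decode_text('caf\u00e9'.encode('utf_8'))
latin1_line = decode_text(b'caf\xe9')  # not valid UTF-8
```

Since latin-1 decoding always succeeds, the fallback never raises, but bytes in some other encoding may decode to the wrong characters.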
class mrjob.protocol.RawProtocol
Output key (str) and value (str), separated by a tab character.
This is an alias for BytesProtocol on Python 2 and TextProtocol on Python 3.
class mrjob.protocol.BytesProtocol
Encode (key, value) (bytestrings) as key and value separated by a tab.
If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.
When reading from a line with multiple tabs, we break on the first one.
Your key should probably not be None or have tab characters in it, but we don’t check.
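The tab-handling rules above (split on the first tab, no tab when either side is None, value read as None when the line has no tab) can be sketched as follows. The function names `read_line` and `write_pair` are hypothetical, and the code only mirrors the documented behavior under that assumption.

```python
def read_line(line):
    """Split a bytes line into (key, value) on the first tab."""
    key, tab, value = line.partition(b'\t')
    if not tab:
        # No tab in the line: the whole line is the key, value is None.
        return key, None
    return key, value


def write_pair(key, value):
    """Join key and value with a tab, skipping whichever is None."""
    return b'\t'.join(x for x in (key, value) if x is not None)


multi_tab = read_line(b'a\tb\tc')
no_tab = read_line(b'a')
no_value = write_pair(b'a', None)
```

Note that a key containing a tab would be split in the wrong place on the next read, which is why the text advises against it.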
class mrjob.protocol.TextProtocol
UTF-8 encode key and value (unicode strings) and join them with a tab character. When reading input, we fall back to latin-1 if we can’t UTF-8 decode the line.
If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.
When reading from a line with multiple tabs, we break on the first one.
Your key should probably not be None or have tab characters in it, but we don’t check.
JSON
class mrjob.protocol.JSONProtocol
Encode (key, value) as two JSONs separated by a tab.
This is the default protocol used by jobs to write output and communicate between steps.
This is an alias for the first one of UltraJSONProtocol, RapidJSONProtocol, SimpleJSONProtocol, or StandardJSONProtocol for which the underlying library is available.
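To make the "two JSONs separated by a tab" format concrete, here is a small sketch using Python's built-in json module. The function names `write_json_pair` and `read_json_pair` are hypothetical; mrjob's actual classes may differ in detail.

```python
import json


def write_json_pair(key, value):
    """Encode key and value as JSON and join them with a tab."""
    return (json.dumps(key) + '\t' + json.dumps(value)).encode('utf_8')


def read_json_pair(line):
    """Split on the first tab and decode each side as JSON."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)


encoded = write_json_pair('word', [1, 2])
decoded = read_json_pair(encoded)
tabbed = write_json_pair('a\tb', None)  # tab inside the key is escaped
```

JSON escapes tab characters inside strings, so the only raw tab on the line is the separator.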
class mrjob.protocol.UltraJSONProtocol
Implements JSONProtocol using the ujson library.
Warning
ujson is about five times faster than the standard implementation, but is more willing to encode things that aren’t strictly JSON-encodable, including sets, dictionaries with tuples as keys, UTF-8 encoded bytes, and objects (!). Relying on this behavior won’t stop your job from working, but it can make your job dependent on ujson, rather than just using it as a speedup.
Note
ujson also differs from the standard implementation in that it doesn’t add spaces to its JSONs ({"foo":"bar"} versus {"foo": "bar"}). This probably won’t affect anything but test cases and readability.
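The spacing difference can be reproduced with the built-in json module alone, by passing compact separators. This is only an illustration of the two output styles, not a claim about how ujson works internally. It also suggests one way to keep test cases independent of the JSON library: compare parsed values rather than raw strings.

```python
import json

data = {"foo": "bar"}

# Default separators from the standard library include a space.
spaced = json.dumps(data)

# Compact separators produce the space-free style described above.
compact = json.dumps(data, separators=(',', ':'))

# Both strings parse to the same value, so tests that compare
# parsed output are unaffected by the spacing.
same_value = json.loads(spaced) == json.loads(compact)
```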
class mrjob.protocol.RapidJSONProtocol
Implements JSONProtocol using the rapidjson library.
class mrjob.protocol.SimpleJSONProtocol
Implements JSONProtocol using the simplejson library.
class mrjob.protocol.StandardJSONProtocol
Implements JSONProtocol using Python’s built-in JSON library.
Note
The built-in json library is (appropriately) strict about the JSON standard; it won’t accept dictionaries with non-string keys, sets, or (on Python 3) bytestrings.
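A quick way to see this strictness is to probe the built-in json module directly on Python 3. The helper name `is_json_encodable` is hypothetical. The sketch also shows a related detail of the standard library: integer keys are not rejected outright but are coerced to strings, so they do not survive a round trip unchanged.

```python
import json


def is_json_encodable(obj):
    """Return True if the built-in json module can encode obj."""
    try:
        json.dumps(obj)
    except TypeError:
        return False
    return True


set_ok = is_json_encodable({1, 2})
tuple_key_ok = is_json_encodable({(1, 2): 'x'})
bytes_ok = is_json_encodable(b'abc')  # rejected on Python 3
plain_ok = is_json_encodable({'a': [1, 2.5, None]})

# Integer keys are accepted but come back as strings.
round_tripped = json.loads(json.dumps({1: 'x'}))
```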
class mrjob.protocol.JSONValueProtocol
Encode value as a JSON and discard key (key is read in as None).
This is an alias for the first one of UltraJSONValueProtocol, RapidJSONValueProtocol, SimpleJSONValueProtocol, or StandardJSONValueProtocol for which the underlying library is available.
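The "first one for which the underlying library is available" rule can be sketched as an import fallback. The helper name `first_available` is hypothetical, and the module names are assumed to match the library names given above; how mrjob actually selects its alias may differ.

```python
import importlib


def first_available(module_names):
    """Import and return the first module in the list that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError('none of %r could be imported' % (module_names,))


# The built-in json module comes last, so the lookup always succeeds.
json_lib = first_available(['ujson', 'rapidjson', 'simplejson', 'json'])
round_trip = json_lib.loads(json_lib.dumps({'a': 1}))
```

Putting the standard library last makes the optional libraries a pure speedup rather than a requirement.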
class mrjob.protocol.UltraJSONValueProtocol
Implements JSONValueProtocol using the ujson library.
class mrjob.protocol.RapidJSONValueProtocol
Implements JSONValueProtocol using the rapidjson library.
class mrjob.protocol.SimpleJSONValueProtocol
Implements JSONValueProtocol using the simplejson library.
class mrjob.protocol.StandardJSONValueProtocol
Implements JSONValueProtocol using Python’s built-in JSON library.
Repr
class mrjob.protocol.ReprProtocol
Encode (key, value) as two reprs separated by a tab.
This only works for basic types (we use mrjob.util.safeeval()).
Warning
The repr format changes between different versions of Python (for example, braces for sets in Python 2.7, and different string constants in Python 3). Plan accordingly.
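The repr round trip for basic types can be sketched with the standard library. This example uses `ast.literal_eval` as a stand-in for `mrjob.util.safeeval()`, which is an assumption made for illustration; the two are not claimed to be equivalent. The function names are hypothetical.

```python
import ast


def write_repr_pair(key, value):
    """Encode key and value as reprs joined by a tab."""
    return (repr(key) + '\t' + repr(value)).encode('utf_8')


def read_repr_pair(line):
    """Split on the first tab and safely evaluate each repr."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    # literal_eval only accepts literals of basic types
    # (strings, numbers, tuples, lists, dicts, sets, booleans, None).
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)


pair = ('a\tb', {'n': 1, 'tags': ['x', 'y'], 'point': (1, 2.5)})
encoded = write_repr_pair(*pair)
decoded = read_repr_pair(encoded)
```

Objects whose repr is not a literal (for example, instances of most custom classes) cannot be read back this way, which matches the "basic types only" restriction.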
class mrjob.protocol.ReprValueProtocol
Encode value as a repr and discard key (key is read in as None).
See ReprProtocol for details.
Pickle
class mrjob.protocol.PickleProtocol
Encode (key, value) as two string-escaped pickles separated by a tab.
We string-escape the pickles to avoid having to deal with stray \t and \n characters, which would confuse Hadoop Streaming.
Ugly, but should work for any type.
Warning
Pickling is only backwards-compatible across Python versions. If your job uses this as an output protocol, you should use at least the same version of Python to parse the job’s output. Conversely, if your job uses this as an input protocol, it should run on at least the same version of Python that wrote the input.
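One way to string-escape a pickle so that it contains no raw tab or newline bytes is sketched below. The escaping scheme (latin-1 decode followed by `unicode_escape`) and the function names are assumptions chosen for illustration; mrjob's actual escaping may be implemented differently.

```python
import pickle


def escape_pickle(obj):
    """Pickle obj and escape it to plain ASCII with no raw tab or newline."""
    raw = pickle.dumps(obj)
    # latin-1 maps each byte to one code point; unicode_escape then
    # writes control and non-ASCII characters as backslash escapes.
    return raw.decode('latin_1').encode('unicode_escape')


def unescape_pickle(data):
    """Reverse escape_pickle and load the original object."""
    return pickle.loads(data.decode('unicode_escape').encode('latin_1'))


def write_pickle_pair(key, value):
    return escape_pickle(key) + b'\t' + escape_pickle(value)


def read_pickle_pair(line):
    raw_key, raw_value = line.split(b'\t', 1)
    return unescape_pickle(raw_key), unescape_pickle(raw_value)


pair = ('k\tey', {'a': [1, 2], 'b': None, 'c': {1, 2}})
line = write_pickle_pair(*pair)
restored = read_pickle_pair(line)
```

After escaping, the only raw tab left on the line is the separator, so line-oriented tools can split records safely.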
class mrjob.protocol.PickleValueProtocol
Encode value as a string-escaped pickle and discard key (key is read in as None).
See PickleProtocol for details.