mrjob.protocol - input and output
Protocols translate raw bytes into key, value pairs.
Typically, protocols encode a key and value into bytes, and join them together with a tab character.
However, protocols with Value in their name ignore keys and simply read/write values (with key read in as None), allowing you to read and write data in arbitrary formats.
For more information, see Protocols and Writing custom protocols.
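As a rough illustration of a value-only protocol for an arbitrary format, here is a minimal sketch. The class name CsvValueProtocol is hypothetical, and the read(line) / write(key, value) method names are assumed to follow mrjob's custom-protocol convention; the sketch does not import mrjob.

```python
import csv
import io


class CsvValueProtocol(object):
    """Hypothetical value-only protocol: each line is one CSV row."""

    def read(self, line):
        # line is assumed to be bytes without a trailing newline;
        # a value protocol always reports the key as None
        text = line.decode('utf_8')
        row = next(csv.reader([text]))
        return None, row

    def write(self, key, value):
        # the key is discarded; value is a list of fields
        buf = io.StringIO()
        csv.writer(buf, lineterminator='').writerow(value)
        return buf.getvalue().encode('utf_8')
```

Because the key is dropped on write and read back as None, a protocol like this suits job input and final output rather than communication between steps.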
Strings
class mrjob.protocol.RawValueProtocol
Just output value (a str), and discard key (key is read in as None).
This is the default protocol used by jobs to read input.
This is an alias for BytesValueProtocol on Python 2 and TextValueProtocol on Python 3.
class mrjob.protocol.BytesValueProtocol
Read line (without trailing newline) directly into value (key is always None). Output value (bytes) directly, discarding key.
This is the default protocol used by jobs to read input on Python 2.
Warning
Typical usage on Python 2 is to have your mapper parse (byte) strings out of your input files, and then include them in the output to the reducer. Since this output is then (by default) JSON-encoded, encoding will fail if the bytestrings are not UTF-8 decodable. If this is an issue, consider using TextValueProtocol instead.
class mrjob.protocol.TextValueProtocol
Attempt to UTF-8 decode line (without trailing newline) into value, falling back to latin-1 (key is always None). Output value UTF-8 encoded, discarding key.
This is the default protocol used by jobs to read input on Python 3.
This is a good solution for reading text files which are mostly ASCII but may have some other bytes of unknown encoding (e.g. logs).
If you wish to enforce a particular encoding, use BytesValueProtocol instead:

    from mrjob.job import MRJob
    from mrjob.protocol import BytesValueProtocol

    class MREncodingEnforcer(MRJob):

        INPUT_PROTOCOL = BytesValueProtocol

        def mapper(self, _, value):
            # raises UnicodeDecodeError if the line is not valid UTF-8
            value = value.decode('utf_8')
            ...
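The UTF-8-then-latin-1 fallback described for TextValueProtocol can be sketched in a few lines. This is an illustration of the idea only, not mrjob's code; the function name decode_line is hypothetical.

```python
def decode_line(line):
    """Decode bytes as UTF-8, falling back to latin-1 on failure."""
    try:
        return line.decode('utf_8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail
        return line.decode('latin_1')
```

Since latin-1 accepts any byte sequence, the fallback never raises, but bytes in some other encoding will silently decode to the wrong characters. That is the trade-off against enforcing an encoding yourself.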
class mrjob.protocol.RawProtocol
Output key (str) and value (str), separated by a tab character.
This is an alias for BytesProtocol on Python 2 and TextProtocol on Python 3.
class mrjob.protocol.BytesProtocol
Encode (key, value) (bytestrings) as key and value separated by a tab.
If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.
When reading from a line with multiple tabs, we break on the first one.
Your key should probably not be None or have tab characters in it, but we don’t check.
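The tab-handling rules above are easy to get wrong, so here is a minimal sketch of them on bytestrings. The function names are hypothetical and this is not necessarily how mrjob implements BytesProtocol.

```python
def read_pair(line):
    """Split a line on the first tab; value is None if there is no tab."""
    key, sep, value = line.partition(b'\t')
    if not sep:
        return key, None
    return key, value


def write_pair(key, value):
    """Join key and value with a tab, omitting whichever one is None."""
    return b'\t'.join(x for x in (key, value) if x is not None)
```

Under these rules a None key does not round-trip: writing (None, b'v') produces b'v', which reads back as (b'v', None). A key containing a tab is likewise split in the wrong place on the way back in.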
class mrjob.protocol.TextProtocol
UTF-8 encode key and value (unicode strings) and join them with a tab character. When reading input, we fall back to latin-1 if we can’t UTF-8 decode the line.
If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.
When reading from a line with multiple tabs, we break on the first one.
Your key should probably not be None or have tab characters in it, but we don’t check.
JSON
class mrjob.protocol.JSONProtocol
Encode (key, value) as two JSONs separated by a tab.
This is the default protocol used by jobs to write output and communicate between steps.
This is an alias for the first one of UltraJSONProtocol, RapidJSONProtocol, SimpleJSONProtocol, or StandardJSONProtocol for which the underlying library is available.
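To make the "two JSONs separated by a tab" format concrete, here is a minimal sketch using only Python's built-in json module. The function names are hypothetical, and the UTF-8 byte encoding is an assumption of this sketch rather than something stated above.

```python
import json


def json_write(key, value):
    """Encode key and value as JSON and join them with a tab."""
    # json.dumps escapes tabs inside strings as \t, so the only raw
    # tab in the line is the separator
    return (json.dumps(key) + '\t' + json.dumps(value)).encode('utf_8')


def json_read(line):
    """Split on the first tab and decode each half as JSON."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    return json.loads(raw_key), json.loads(raw_value)
```

Note that JSON has no tuple type, so a tuple key such as ('a', 1) comes back as the list ['a', 1] after a round trip.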
class mrjob.protocol.UltraJSONProtocol
Implements JSONProtocol using the ujson library.
Warning
ujson is about five times faster than the standard implementation, but is more willing to encode things that aren’t strictly JSON-encodable, including sets, dictionaries with tuples as keys, UTF-8 encoded bytes, and objects (!). Relying on this behavior won’t stop your job from working, but it can make your job dependent on ujson, rather than just using it as a speedup.
Note
ujson also differs from the standard implementation in that it doesn’t add spaces to its JSONs ({"foo":"bar"} versus {"foo": "bar"}). This probably won’t affect anything but test cases and readability.
class mrjob.protocol.RapidJSONProtocol
Implements JSONProtocol using the rapidjson library.
class mrjob.protocol.SimpleJSONProtocol
Implements JSONProtocol using the simplejson library.
class mrjob.protocol.StandardJSONProtocol
Implements JSONProtocol using Python’s built-in JSON library.
Note
The built-in json library is (appropriately) strict about the JSON standard; it won’t accept dictionaries with non-string keys, sets, or (on Python 3) bytestrings.
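One way to check whether your keys and values stay within what the built-in library accepts is to try encoding them. The helper below is a hypothetical sketch for Python 3, not part of mrjob.

```python
import json


def is_json_encodable(obj):
    """Return True if the built-in json module can encode obj."""
    try:
        json.dumps(obj)
    except TypeError:
        # raised for sets, tuple dictionary keys, bytes (Python 3), etc.
        return False
    return True
```

For example, is_json_encodable({1, 2}), is_json_encodable({('a', 'b'): 1}), and is_json_encodable(b'raw') are all False on Python 3. One subtlety: the built-in library coerces int, float, bool, and None dictionary keys to strings instead of rejecting them, so such keys encode but come back as strings.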
class mrjob.protocol.JSONValueProtocol
Encode value as a JSON and discard key (key is read in as None).
This is an alias for the first one of UltraJSONValueProtocol, RapidJSONValueProtocol, SimpleJSONValueProtocol, or StandardJSONValueProtocol for which the underlying library is available.
class mrjob.protocol.UltraJSONValueProtocol
Implements JSONValueProtocol using the ujson library.
class mrjob.protocol.RapidJSONValueProtocol
Implements JSONValueProtocol using the rapidjson library.
class mrjob.protocol.SimpleJSONValueProtocol
Implements JSONValueProtocol using the simplejson library.
class mrjob.protocol.StandardJSONValueProtocol
Implements JSONValueProtocol using Python’s built-in JSON library.
Repr
class mrjob.protocol.ReprProtocol
Encode (key, value) as two reprs separated by a tab.
This only works for basic types (we use mrjob.util.safeeval()).
Warning
The repr format changes between different versions of Python (for example, braces for sets in Python 2.7, and different string constants in Python 3). Plan accordingly.
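A minimal sketch of the repr round trip follows. It uses ast.literal_eval from the standard library as a stand-in for mrjob.util.safeeval(), which is an assumption of this example; the function names are hypothetical.

```python
import ast


def repr_write(key, value):
    """Encode key and value as reprs joined by a tab."""
    # repr() escapes tabs inside strings, so the separator stays unique
    return (repr(key) + '\t' + repr(value)).encode('utf_8')


def repr_read(line):
    """Split on the first tab and evaluate each half as a literal."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    # literal_eval only accepts basic literals (str, numbers, tuples,
    # lists, dicts, sets, booleans, None)
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)
```

Unlike JSON, this format preserves tuples and sets, but only for basic types whose repr is a valid literal.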
class mrjob.protocol.ReprValueProtocol
Encode value as a repr and discard key (key is read in as None).
See ReprProtocol for details.
Pickle
class mrjob.protocol.PickleProtocol
Encode (key, value) as two string-escaped pickles separated by a tab.
We string-escape the pickles to avoid having to deal with stray \t and \n characters, which would confuse Hadoop Streaming.
Ugly, but should work for any type.
Warning
Pickling is only backwards-compatible across Python versions. If your job uses this as an output protocol, you should use at least the same version of Python to parse the job’s output. Vice versa for using this as an input protocol.
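To show why escaping matters, here is a sketch of one way to string-escape pickles on Python 3, using the latin_1 and unicode_escape codecs. The choice of codecs and all function names are assumptions of this example; mrjob's actual escaping may differ.

```python
import pickle


def escape_pickle(obj):
    """Pickle obj and escape the bytes so no raw tab or newline remains."""
    # latin_1 maps each byte to one code point; unicode_escape then
    # writes tabs, newlines, backslashes and non-ASCII as escapes
    return pickle.dumps(obj).decode('latin_1').encode('unicode_escape')


def unescape_pickle(data):
    """Reverse escape_pickle() and unpickle the result."""
    return pickle.loads(data.decode('unicode_escape').encode('latin_1'))


def pickle_write(key, value):
    return escape_pickle(key) + b'\t' + escape_pickle(value)


def pickle_read(line):
    # the only raw tab in the line is the separator
    raw_key, raw_value = line.split(b'\t', 1)
    return unescape_pickle(raw_key), unescape_pickle(raw_value)
```

As with any pickle-based format, only unpickle data you trust, since unpickling can execute arbitrary code.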
class mrjob.protocol.PickleValueProtocol
Encode value as a string-escaped pickle and discard key (key is read in as None).
See PickleProtocol for details.