mrjob.protocol - input and output

Protocols translate raw bytes into key, value pairs.

Typically, protocols encode a key and value into bytes, and join them together with a tab character.

However, protocols with Value in their name ignore keys and simply read/write values (with key read in as None), allowing you to read and write data in arbitrary formats.
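As a rough illustration of the two styles, the sketch below defines two hypothetical classes (TabJoinedProtocol and ValueOnlyProtocol are made up for this example and are not part of mrjob). They follow the read(line) / write(key, value) shape that protocols expose, and use only the standard library.

```python
class TabJoinedProtocol(object):
    """Hypothetical keyed protocol: UTF-8 key and value joined by a tab."""

    def read(self, line):
        # split on the first tab only; everything after it is the value
        key, value = line.split(b'\t', 1)
        return key.decode('utf_8'), value.decode('utf_8')

    def write(self, key, value):
        return key.encode('utf_8') + b'\t' + value.encode('utf_8')


class ValueOnlyProtocol(object):
    """Hypothetical "Value" protocol: the whole line is the value."""

    def read(self, line):
        # the data carries no key, so the key is read in as None
        return None, line.decode('utf_8')

    def write(self, key, value):
        # the key is discarded on output
        return value.encode('utf_8')
```

Because the value-only variant never looks for a tab, a line such as b'hello\tworld' comes back whole as the value.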

For more information, see Protocols and Writing custom protocols.

Strings

class mrjob.protocol.RawValueProtocol

Just output value (a str), and discard key (key is read in as None).

This is the default protocol used by jobs to read input.

This is an alias for BytesValueProtocol on Python 2 and TextValueProtocol on Python 3.

class mrjob.protocol.BytesValueProtocol

Read line (without trailing newline) directly into value (key is always None). Output value (bytes) directly, discarding key.

This is the default protocol used by jobs to read input on Python 2.

Warning

Typical usage on Python 2 is to have your mapper parse (byte) strings out of your input files, and then include them in the output to the reducer. Since this output is then (by default) JSON-encoded, encoding will fail if the bytestrings are not UTF-8 decodable. If this is an issue, consider using TextValueProtocol instead.

class mrjob.protocol.TextValueProtocol

Attempt to UTF-8 decode line (without trailing newline) into value, falling back to latin-1 (key is always None). Output value UTF-8 encoded, discarding key.

This is the default protocol used by jobs to read input on Python 3.

This is a good solution for reading text files which are mostly ASCII but may have some other bytes of unknown encoding (e.g. logs).
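The decoding fallback can be sketched with the standard library alone. The helper name decode_line is an assumption made for this example; it is not an mrjob function, and the real implementation may differ.

```python
def decode_line(line):
    """Decode bytes as UTF-8, falling back to latin-1 (hypothetical helper)."""
    try:
        return line.decode('utf_8')
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a code point,
        # so this second attempt cannot raise
        return line.decode('latin_1')
```

The fallback never fails, which is why it suits logs of unknown encoding, but it also means that bytes in some other encoding are silently decoded as latin-1 characters.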

If you wish to enforce a particular encoding, use BytesValueProtocol instead:

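from mrjob.job import MRJob
from mrjob.protocol import BytesValueProtocol


class MREncodingEnforcer(MRJob):

    # read each input line as raw bytes instead of guessing an encoding
    INPUT_PROTOCOL = BytesValueProtocol

    def mapper(self, _, value):
        # raises UnicodeDecodeError if the line is not valid UTF-8
        value = value.decode('utf_8')
        ...

class mrjob.protocol.RawProtocol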

Output key (str) and value (str), separated by a tab character.

This is an alias for BytesProtocol on Python 2 and TextProtocol on Python 3.

class mrjob.protocol.BytesProtocol

Encode (key, value) (bytestrings) as key and value separated by a tab.

If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.

When reading from a line with multiple tabs, we break on the first one.

Your key should probably not be None or have tab characters in it, but we don’t check.
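The tab and None rules above can be sketched as two small functions. The names read_bytes_line and write_bytes_line are hypothetical and exist only for this example; this is an approximation of the described behavior, not mrjob's code.

```python
def read_bytes_line(line):
    """Split a line on its first tab; value is None if there is no tab."""
    key, sep, value = line.partition(b'\t')
    if not sep:
        # no tab in the line: everything is the key
        return key, None
    return key, value


def write_bytes_line(key, value):
    """Join whichever of key and value is not None with a tab."""
    return b'\t'.join(x for x in (key, value) if x is not None)
```

In this sketch, write_bytes_line(None, b'b') produces b'b', which reads back as key b'b' with value None. That asymmetry is one reason to avoid None keys.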

class mrjob.protocol.TextProtocol

UTF-8 encode key and value (unicode strings) and join them with a tab character. When reading input, we fall back to latin-1 if we can’t UTF-8 decode the line.

If key or value is None, don’t include a tab. When decoding a line with no tab in it, value will be None.

When reading from a line with multiple tabs, we break on the first one.

Your key should probably not be None or have tab characters in it, but we don’t check.

JSON

class mrjob.protocol.JSONProtocol

Encode (key, value) as two JSONs separated by a tab.

This is the default protocol used by jobs to write output and communicate between steps.

This is an alias for UltraJSONProtocol if ujson is installed, SimpleJSONProtocol if simplejson is installed and ujson is not, and StandardJSONProtocol if neither is installed.
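A minimal sketch of this kind of fallback, together with the "two JSONs separated by a tab" line format, might look like the following. The names json_impl, write_json_line and read_json_line are assumptions for illustration; mrjob's own selection logic may be organized differently.

```python
import json as std_json

try:
    import ujson as json_impl           # preferred when installed
except ImportError:
    try:
        import simplejson as json_impl  # second choice
    except ImportError:
        json_impl = std_json            # always available


def write_json_line(key, value):
    """Encode key and value as two JSONs joined by a tab."""
    line = json_impl.dumps(key) + '\t' + json_impl.dumps(value)
    return line.encode('utf_8')


def read_json_line(line):
    """Split on the first tab and decode each half as JSON."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    return json_impl.loads(raw_key), json_impl.loads(raw_value)
```

JSON escapes tab characters inside strings as \t, so the only raw tab in an encoded line is the separator.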

class mrjob.protocol.UltraJSONProtocol

Implements JSONProtocol using the ujson library.

Warning

ujson is about five times faster than the standard implementation, but is more willing to encode things that aren’t strictly JSON-encodable, including sets, dictionaries with tuples as keys, UTF-8 encoded bytes, and objects (!). Relying on this behavior won’t stop your job from working, but it can make your job dependent on ujson, rather than just using it as a speedup.

Note

ujson also differs from the standard implementation in that it doesn’t add spaces to its JSONs ({"foo":"bar"} versus {"foo": "bar"}). This probably won’t affect anything but test cases and readability.

class mrjob.protocol.SimpleJSONProtocol

Implements JSONProtocol using the simplejson library.

class mrjob.protocol.StandardJSONProtocol

Implements JSONProtocol using Python’s built-in JSON library.

Note

The built-in json library is (appropriately) strict about the JSON standard; it won’t accept dictionaries with non-string keys, sets, or (on Python 3) bytestrings.
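To check ahead of time whether a value would survive the strict encoder, a small probe like the one below can help. The helper is_json_encodable is a hypothetical name used only for this example.

```python
import json


def is_json_encodable(obj):
    """Return True if the standard library can encode obj as JSON."""
    try:
        json.dumps(obj)
    except TypeError:
        # raised for sets, bytes (on Python 3), tuple dict keys, etc.
        return False
    return True
```

On Python 3 this returns False for a set, for b'abc', and for a dictionary keyed by tuples. Note that int, float, bool and None keys are converted to strings by json.dumps rather than rejected, so they encode but do not round-trip as their original type.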

class mrjob.protocol.JSONValueProtocol

Encode value as a JSON and discard key (key is read in as None).

This is an alias for UltraJSONValueProtocol if ujson is installed, SimpleJSONValueProtocol if simplejson is installed and ujson is not, and StandardJSONValueProtocol if neither is installed.

class mrjob.protocol.UltraJSONValueProtocol

Implements JSONValueProtocol using the ujson library.

class mrjob.protocol.SimpleJSONValueProtocol

Implements JSONValueProtocol using the simplejson library.

class mrjob.protocol.StandardJSONValueProtocol

Implements JSONValueProtocol using Python’s built-in JSON library.

Repr

class mrjob.protocol.ReprProtocol

Encode (key, value) as two reprs separated by a tab.

This only works for basic types (we use mrjob.util.safeeval()).

Warning

The repr format changes between different versions of Python (for example, braces for sets in Python 2.7, and different string constants in Python 3). Plan accordingly.
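The repr round trip can be approximated with the standard library. This sketch swaps in ast.literal_eval as a stand-in for mrjob.util.safeeval(), so it illustrates the idea rather than the actual implementation; write_repr_line and read_repr_line are hypothetical names.

```python
import ast


def write_repr_line(key, value):
    """Encode key and value as two reprs joined by a tab."""
    return ('%r\t%r' % (key, value)).encode('utf_8')


def read_repr_line(line):
    """Split on the first tab and safely evaluate each repr."""
    raw_key, raw_value = line.decode('utf_8').split('\t', 1)
    # literal_eval only accepts literals (numbers, strings, tuples,
    # lists, dicts, sets, booleans, None), never arbitrary expressions
    return ast.literal_eval(raw_key), ast.literal_eval(raw_value)
```

Like the protocol it imitates, this only works for basic types: an object whose repr is not a Python literal cannot be read back.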

class mrjob.protocol.ReprValueProtocol

Encode value as a repr and discard key (key is read in as None).

See ReprProtocol for details.

Pickle

class mrjob.protocol.PickleProtocol

Encode (key, value) as two string-escaped pickles separated by a tab.

We string-escape the pickles to avoid having to deal with stray \t and \n characters, which would confuse Hadoop Streaming.

Ugly, but should work for any type.

Warning

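Pickling is only backwards-compatible across Python versions: a newer Python can read pickles written by an older one, but not necessarily the reverse. If your job uses this as an output protocol, parse the job’s output with at least the same version of Python. Likewise, if you use this as an input protocol, run the job with at least the version of Python that wrote the input.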
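To show why escaping matters, here is a standard-library sketch of an escaped-pickle line format. It is an assumption about one workable scheme (latin-1 plus the unicode_escape codec), not a description of the escaping mrjob actually uses, and all function names are hypothetical.

```python
import pickle


def escape_pickle(obj):
    """Pickle obj, then escape it into ASCII with no raw tabs or newlines."""
    raw = pickle.dumps(obj)
    # latin-1 maps each byte to one code point; unicode_escape then
    # rewrites control and non-ASCII characters as backslash escapes
    return raw.decode('latin_1').encode('unicode_escape')


def unescape_pickle(data):
    """Reverse escape_pickle and load the original object."""
    return pickle.loads(data.decode('unicode_escape').encode('latin_1'))


def write_pickle_line(key, value):
    return escape_pickle(key) + b'\t' + escape_pickle(value)


def read_pickle_line(line):
    raw_key, raw_value = line.split(b'\t', 1)
    return unescape_pickle(raw_key), unescape_pickle(raw_value)
```

Since every tab and newline inside the pickles is escaped, the single raw tab in the line is the key/value separator, which keeps the record on one line for Hadoop Streaming.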

class mrjob.protocol.PickleValueProtocol

Encode value as a string-escaped pickle and discard key (key is read in as None).

See PickleProtocol for details.