Python 2 vs. Python 3
Both to avoid breaking mrjob for Python 2 users and to keep writing jobs simple, jobs read their input as strs by default (even though str means bytes in Python 2 and unicode in Python 3).
If you care about this distinction, you may want to explicitly set INPUT_PROTOCOL to BytesValueProtocol or TextValueProtocol. If your input has a well-defined encoding, you probably want BytesValueProtocol; if it’s a bunch of text that’s mostly ASCII with some stuff that might be UTF-8 (e.g. most log files), you probably want TextValueProtocol. But most of the time it’ll just work.
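To make the trade-off concrete, here is a minimal sketch in plain Python 3 (no mrjob involved) of the two ways to treat a raw input line. The helper names decode_known and decode_lenient are hypothetical, and the UTF-8-then-latin-1 fallback is an assumption chosen for illustration, not a statement of what any mrjob protocol does internally.

```python
def decode_known(raw_line, encoding):
    """Decode a bytes line whose encoding you know; raises on bad data."""
    return raw_line.decode(encoding)


def decode_lenient(raw_line):
    """Try UTF-8 first, then fall back to latin-1, which accepts any byte."""
    try:
        return raw_line.decode('utf_8')
    except UnicodeDecodeError:
        return raw_line.decode('latin_1')


# Input with a well-defined encoding: keep it as bytes and decode it yourself.
utf16_line = 'caf\u00e9'.encode('utf_16_le')
from_known = decode_known(utf16_line, 'utf_16_le')    # 'caf\u00e9'

# Log-like input: mostly ASCII, sometimes UTF-8, sometimes something else.
from_utf8 = decode_lenient(b'GET /caf\xc3\xa9')       # 'GET /caf\u00e9'
from_other = decode_lenient(b'GET /caf\xe9')          # 'GET /caf\u00e9'
```

The strict path fails loudly when the data doesn't match the encoding you expected, which is what you want for well-defined input. The lenient path never fails, which suits messy logs, at the cost of possibly mis-decoding some characters.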
Bytes vs. strings
The following things are bytes in any version of Python (which means you need to use the b'...' literal in Python 3):
The stdin, stdout, and stderr attributes of MRJobs are always bytestreams (so, for example, self.stderr defaults to sys.stderr.buffer in Python 3).
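As a rough illustration of what writing to a bytestream involves, the sketch below encodes a text message before writing it. The write_message helper is hypothetical, and an io.BytesIO stands in for a job's stderr so the result can be inspected; it is not mrjob code.

```python
import io
import sys


def write_message(byte_stream, message):
    """Encode a text message and write it to a byte stream (hypothetical helper)."""
    byte_stream.write(message.encode('utf_8') + b'\n')


# Python 3: sys.stderr is a text stream and sys.stderr.buffer is the byte
# stream underneath it. Python 2: sys.stderr has no .buffer and already
# takes bytes, so fall back to the stream itself.
byte_stderr = getattr(sys.stderr, 'buffer', sys.stderr)

# io.BytesIO stands in for a job's stderr here so the result can be inspected.
fake_stderr = io.BytesIO()
write_message(fake_stderr, 'processed 10 lines')
captured = fake_stderr.getvalue()  # b'processed 10 lines\n'
```

In Python 3, passing a str straight to such a stream raises a TypeError, so encoding first (or using a b'...' literal) is the safe habit.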
Everything else (including file paths, URIs, arguments to commands, and logging messages) is a string; that is, str on Python 3, and either unicode or str on Python 2. As with RawValueProtocol, most of the time it’ll just work even if you don’t think about it.