Python 2 vs. Python 3

Raw protocols

To avoid breaking mrjob for Python 2 users, and to keep jobs simple to write, jobs read their input as strs by default (even though str means bytes in Python 2 and unicode in Python 3).

This works because RawValueProtocol is actually an alias for one of two classes: BytesValueProtocol in Python 2, and TextValueProtocol in Python 3.
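As a rough illustration of the alias idea, here is a minimal sketch. The classes BytesValueSketch, TextValueSketch, and RawValueSketch are hypothetical stand-ins written for this example; they are not mrjob's source, and mrjob's real classes may handle undecodable bytes differently. The sketch assumes UTF-8 input.

```python
import sys


class BytesValueSketch(object):
    """Stand-in: hands each input line to the job as raw bytes."""

    def read(self, line):
        return (None, line)


class TextValueSketch(object):
    """Stand-in: decodes each input line to text (UTF-8 assumed here)."""

    def read(self, line):
        return (None, line.decode('utf_8'))


# Pick the alias once, at import time, based on the running interpreter.
if sys.version_info[0] >= 3:
    RawValueSketch = TextValueSketch
else:
    RawValueSketch = BytesValueSketch

key, value = RawValueSketch().read(b'hello world')
print(type(value).__name__)  # prints "str" on either Python version
```

Either way the job sees a str, which is why the default usually needs no thought.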

If you care about this distinction, you may want to explicitly set INPUT_PROTOCOL to one of these. If your input has a well-defined encoding, you probably want TextValueProtocol, and if it’s a bunch of text that’s mostly ASCII, with like, some stuff that... might be UTF-8? (i.e. most log files), you probably want BytesValueProtocol. But most of the time it’ll just work.
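To make the trade-off concrete, here is a small sketch in plain Python of what treating a line as text versus bytes means. It uses only built-in bytes and str methods, not mrjob's protocol classes (which may treat undecodable bytes more leniently than a strict decode), and the sample lines are invented for illustration.

```python
# A clean UTF-8 line, and a "mostly ASCII" log line with one stray Latin-1 byte.
clean_line = u'caf\u00e9 latte'.encode('utf_8')
messy_line = b'GET /caf\xe9 HTTP/1.1'

# Text route: decoding works when the encoding really is UTF-8...
clean_text = clean_line.decode('utf_8')

# ...but a strict UTF-8 decode of the messy line fails.
try:
    messy_line.decode('utf_8')
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Bytes route: nothing is decoded, so the messy line can still be split.
method, path, version = messy_line.split(b' ')
```

With bytes you give up text conveniences (every constant needs a b'...' prefix in Python 3), but you never have to guess at an encoding.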

Bytes vs. strings

The following things are bytes in any version of Python (which means you need to use the bytes type and/or b'...' constant in Python 3):

The stdin, stdout, and stderr attributes of MRJobs are always bytestreams (so, for example, self.stderr defaults to sys.stderr.buffer in Python 3).
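A minimal sketch of what writing to a bytestream looks like, using only the standard library. The helper byte_stream and the fake_stderr object are hypothetical names for this example; the fake stream stands in for sys.stderr so the snippet has no side effects.

```python
import io


def byte_stream(stream):
    """Return the underlying byte stream of a text stream, if it has one."""
    return getattr(stream, 'buffer', stream)


# Stand-in for sys.stderr: a text wrapper around an in-memory byte buffer.
raw = io.BytesIO()
fake_stderr = io.TextIOWrapper(raw, encoding='utf_8')

err = byte_stream(fake_stderr)
err.write(b'something went wrong\n')  # bytes, not str
```

In Python 3, byte_stream(sys.stderr) would normally give you sys.stderr.buffer; in Python 2, where sys.stderr already accepts bytes, it returns the stream unchanged.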

Everything else (including file paths, URIs, arguments to commands, and logging messages) is a string; that is, a str on Python 3, and either a unicode or an ASCII str on Python 2. As with RawValueProtocol, most of the time it’ll just work even if you don’t think about it.

python_bin

python_bin defaults to python3 in Python 3, and python in Python 2 (except on EMR AMIs prior to 4.3.0, where we use python2.7).
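The rule above can be restated as a short sketch. The function default_python_bin and its parameters are hypothetical, written only to spell out the logic; this is not how mrjob itself computes the default, and it assumes the EMR exception applies only to Python 2.

```python
def default_python_bin(py_major, emr_ami_version=None):
    """Hypothetical helper mirroring the defaults described above."""
    if py_major >= 3:
        return 'python3'
    if emr_ami_version is not None:
        # Compare AMI versions numerically, e.g. '3.11.0' -> (3, 11, 0).
        ami = tuple(int(part) for part in emr_ami_version.split('.'))
        if ami < (4, 3, 0):
            return 'python2.7'
    return 'python'
```

For example, default_python_bin(2, '3.11.0') gives 'python2.7', while default_python_bin(2, '4.3.0') gives 'python'.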

Your Hadoop cluster

Whatever version of Python you use, you’ll have to have a compatible version of Python installed on your Hadoop cluster. mrjob does its best to make this work on Elastic MapReduce (see bootstrap_python), but if you’re running on your own Hadoop cluster, this is up to you.