Python 2 vs. Python 3
Raw protocols
Both to avoid breaking mrjob for Python 2 users and to keep writing jobs simple, jobs read their input as strs by default (even though str means bytes in Python 2 and unicode in Python 3).
The way this works in mrjob is that RawValueProtocol is actually an alias for one of two classes: BytesValueProtocol if you’re in Python 2, and TextValueProtocol if you’re in Python 3.
If you care about this distinction, you may want to explicitly set INPUT_PROTOCOL to one of these. If your input has a well-defined encoding, you probably want TextValueProtocol, and if it’s a bunch of text that’s mostly ASCII, with like, some stuff that... might be UTF-8? (i.e. most log files), you probably want BytesValueProtocol. But most of the time it’ll just work.
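To make the distinction concrete, here is a minimal sketch of the same input line seen as bytes and as text. It uses only the standard library and does not call mrjob; the variable names and sample lines are made up for illustration, and the decoding shown is plain Python behavior, not a description of how mrjob’s protocol classes are implemented.

```python
# Stdlib-only illustration; the sample lines are hypothetical.
raw_line = b'caf\xc3\xa9 latte'   # UTF-8 encoded bytes, as stored on disk

# Bytes view: the line passed along unchanged.
as_bytes = raw_line
# Text view: the line after decoding with a known encoding.
as_text = raw_line.decode('utf_8')

print(len(as_bytes))  # counts bytes: 11
print(len(as_text))   # counts characters: 10

# A mostly-ASCII line with one stray non-UTF-8 byte is still a valid
# bytes object, but a strict UTF-8 decode of it fails.
messy_line = b'GET /caf\xe9 200'
try:
    messy_line.decode('utf_8')
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(decodable)
```

Length checks, slicing, and regular expressions all behave differently depending on which view a mapper receives, which is the main reason to pin INPUT_PROTOCOL when it matters.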
Bytes vs. strings
The following things are bytes in any version of Python (which means you need to use the bytes type and/or b'...' constant in Python 3):
- data read or written by Protocols
- lines yielded by cat_output()
- anything read from cat()
The stdin, stdout, and stderr attributes of MRJobs are always bytestreams (so, for example, self.stderr defaults to sys.stderr.buffer in Python 3).
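In practice, this means text has to be encoded before it is written to one of these streams. The following sketch shows the pattern with the standard library only: io.BytesIO stands in for a byte stream such as sys.stderr.buffer, and write_status is a hypothetical helper, not part of mrjob.

```python
import io


def write_status(stream, message):
    """Encode a text message and write it to a byte stream."""
    stream.write(message.encode('utf_8') + b'\n')


# io.BytesIO stands in for a byte stream such as sys.stderr.buffer.
byte_stream = io.BytesIO()
write_status(byte_stream, 'processed 10 lines')
captured = byte_stream.getvalue()

# Writing an unencoded str to a byte stream raises TypeError in Python 3.
try:
    byte_stream.write('not bytes')
    rejected = False
except TypeError:
    rejected = True
```

Encoding at the point of writing keeps the rest of the code working with ordinary strings.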
Everything else (including file paths, URIs, arguments to commands, and logging messages) is a string; that is, a str on Python 3, and either a unicode or an ASCII str on Python 2. As with RawValueProtocol, most of the time it’ll just work even if you don’t think about it.
python_bin
python_bin defaults to python3 in Python 3, and python in Python 2 (except on EMR AMIs prior to 4.3.0, where we use python2.7).
Your Hadoop cluster
Whatever version of Python you use, you’ll have to have a compatible version of Python installed on your Hadoop cluster. mrjob does its best to make this work on Elastic MapReduce (see bootstrap_python), but if you’re running on your own Hadoop cluster, this is up to you.