For a complete list of changes, see CHANGES.txt
The EMR runner now correctly restarts the SSH tunnel to the job tracker/resource manager when a cluster it tries to run a job on auto-terminates. It also no longer requires a working SSH tunnel to fetch job progress (you still need working SSH; see ec2_key_pair_file).
The emr_applications option has been renamed to applications.
The terminate-idle-clusters utility is now slightly more robust in cases where your S3 temp directory is in a different region from your clusters.
Finally, there are a couple of changes that probably only matter if you’re trying to wrap your Hadoop tasks (mappers, reducers, etc.) in Docker:
- You can set just the python binary for tasks with task_python_bin. This allows you to use a wrapper script in place of Python without perturbing setup scripts.
- Local mode no longer relies on an absolute path to access the mrjob.cat utility it uses to handle compressed input files; copying the job’s working directory into Docker is enough.
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'
mrjob will automatically tarball these directories and pass them to Hadoop as archives.
For multi-step jobs, you can now specify where inter-step output goes
with step_output_dir (
--step-output-dir), which can be useful
job step types now take the jobconf keyword
argument to set Hadoop properties for that step.
--help printout is now better-organized and less verbose.
Made several fixes to pre-filters (commands that pipe into streaming steps):
- you can once again add pre-filters to a single step job by re-defining
- local mode now ignores non-zero return codes from pre-filters (this matters for BSD grep)
- local mode can now run pre-filters on compressed input files
mrjob now respects sh_bin when it needs to wrap a command in
sh before passing it to Hadoop (e.g. to support pipes).
On EMR, mrjob now fetches logs from task nodes when determining probable cause of error, not just core nodes (the ones that run tasks and host HDFS).
Several unused functions in
mrjob.util are now deprecated:
SSHFilesystem.ssh_slave_hosts() has been deprecated.
Option group attributes in
MRJobs have been deprecated,
as has the
mrjob can now add up to 1,000 steps on
pooled clusters on EMR (except on very old AMIs).
mrjob now prints debug messages explaining why your job matched
a particular pooled cluster when running in verbose mode (the
Fixed a bug that caused pooling to fail when there was no need for a master
bootstrap script (e.g. when running with
Log interpretation is much more efficient at determining a job’s probable cause of failure (this works with Spark as well).
The hadoop_streaming_jar option now supports environment variables
mrjob now works in Anaconda3 Jupyter Notebook.
Added several missing command-line switches, including
--no-bootstrap-python on Dataproc. Made a major refactor that should
prevent these kinds of issues in the future.
Fixed a bug that caused mrjob to crash when the ssh binary (see ssh_bin) was missing or not executable.
Fixed a bug that erroneously reported failed or just-started jobs as 100% complete.
Fixed a bug where timestamps were erroneously recognized as URIs.
mrjob now only recognizes strings containing
:// as URIs (see
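As a rough illustration of the rule above, a check along these lines separates URIs from timestamps. The helper name looks_like_uri and the bucket name are hypothetical; this is not mrjob's own function, only a sketch that assumes the presence of :// is the deciding test.

```python
def looks_like_uri(s):
    """Return True only if the string contains '://'."""
    return '://' in s


# A timestamp contains colons but never '://', so it stays a plain string.
timestamp_is_uri = looks_like_uri('2016-03-01T12:30:00')

# A hypothetical S3 path does contain '://', so it counts as a URI.
s3_path_is_uri = looks_like_uri('s3://my-bucket/input/')
```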
The following are deprecated and will be removed in v0.6.0:
Fixed a critical bug that caused Dataproc runner to always crash when determining Hadoop version.
Log interpretation now prioritizes task errors (e.g. a traceback from your Python script) as probable cause of failure, even if they aren’t the most recent error. Log interpretation will now continue to download and parse task logs until it finds a non-empty stderr log.
Log interpretation also strips the “subprocess failed” Java stack trace that appears in task stderr logs from Hadoop 1.
Also slightly improves EMR cluster pooling with updated information on memory and CPU power of various EC2 instance types, and by treating application names (e.g. “Spark”) as case-insensitive.
Pooling and idle cluster self-termination
This release accidentally removed the ami_version option instead of merely deprecating it. If you are upgrading from an earlier version of mrjob, use version 0.5.5 or later.
This release resolves a long-standing EMR API race condition that made it difficult to use cluster pooling and idle cluster self-termination (see max_hours_idle) together. Now if your pooled job unknowingly runs on a cluster that was in the process of shutting down, it will detect that and re-launch the job on a different cluster.
This means pretty much everyone running jobs on EMR should now enable pooling, with a configuration like this:
runners:
  emr:
    max_hours_idle: 1
    pool_clusters: true
You may also run the terminate-idle-clusters script periodically, but (barring any bugs) this shouldn’t be necessary.
Generic EMR option names
|old option name||new option name|
The old option names and command-line switches are now deprecated but will continue to work until v0.6.0. (Exception: ami_version was accidentally removed; if you need it, use 0.5.5 or later.)
num_ec2_instances has simply been deprecated (it’s just num_core_instances plus one).
Log interpretation (counters and probable cause of job failure) on Hadoop is more robust, handling a wider variety of log4j formats and recovering more gracefully from permissions errors. This includes fixing a crash that could happen on Python 3 when attempting to read data from HDFS.
Log interpretation used to be partially broken on EMR AMI 4.3.0 and later due to a permissions issue; this is now fixed.
You can now pass through existing command-line switches to your job;
for example, you can tell a job which runner launched it. See
pass_through_option() for details.
If you don’t do this,
self.options.runner will now always be
in your job (it used to confusingly default to
Stop logging credentials
When mrjob is run in verbose mode (the
-v option), the values of all
runner options are debug-logged to stderr. This has been the case since
the very early days of mrjob.
Unfortunately, this means that if you set your AWS credentials in
mrjob.conf, they get logged as well, creating a surprising potential
security vulnerability. (This doesn’t happen for AWS credentials set through
Starting in this version, the values of aws_secret_access_key
and aws_security_token are shown as
'...' if they are set,
and all but the last four characters of aws_access_key_id are
blanked out as well (e.g.
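The masking described above could be sketched as follows. The function name obscure_credentials is hypothetical, and the exact output format for the access key ID ('...' followed by the last four characters) is an assumption; mrjob's real implementation may differ.

```python
def obscure_credentials(opts):
    """Return a copy of runner options that is safer to debug-log.

    Hypothetical helper: secrets become '...' when set, and only the
    last four characters of the access key ID are kept.
    """
    safe = dict(opts)

    # Fully hide the secret key and the session token, but only if set.
    for name in ('aws_secret_access_key', 'aws_security_token'):
        if safe.get(name):
            safe[name] = '...'

    # Keep just enough of the key ID to tell two keys apart.
    key_id = safe.get('aws_access_key_id')
    if key_id:
        safe['aws_access_key_id'] = '...' + key_id[-4:]

    return safe
```

Options that are unset (for example None) pass through unchanged, so the log still shows which credentials were never configured.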
Other improvements and bugfixes
The ssh tunnel to the resource manager on EMR (see ssh_tunnel) now connects to its correct internal IP; this resolves a firewall issue that existed on some VPC setups.
Uploaded files will no longer be given names starting with
since Hadoop’s input processing treats these files as “hidden”.
The EMR idle cluster self-termination script (see max_hours_idle) now only runs on the master node.
The audit-emr-usage command-line tool should no longer constantly trigger throttling warnings.
bootstrap_python no longer bothers trying to install Python 3 on EMR AMI 4.6.0 and later, since it is already installed.
--ssh-bind-ports command-line switch was broken (starting in
0.4.5!), and is now fixed.
The Hadoop runner now tries even harder to find your log files without needing additional configuration (see hadoop_log_dirs).
If your EMR cluster fails during bootstrapping, mrjob can now determine the probable cause of failure.
There are also some minor improvements to SSH tunneling and a handful of small bugfixes; see CHANGES.txt for details.
- fetching counters
- finding probable cause of errors
- running Java JARs as steps
Added the emr_applications option, which helps you configure 4.x AMIs.
Fixed an EMR bug (introduced in v0.5.0) where we were waiting for steps to complete in the wrong order (in a multi-step job, we wouldn’t register that the first step had finished until the last one had).
Fixed a bug in SSH tunneling (introduced in v0.5.0) that made connections to the job tracker/resource manager on EMR time out when running on a 2.x AMI inside a VPC (Virtual Private Cloud).
Fixed a bug (introduced in v0.4.6) that kept mrjob from interpreting
(home directory) in includes in
It is now again possible to run tool modules deprecated in v0.5.0 directly (e.g. python -m mrjob.tools.emr.create_job_flow). This is still a deprecated feature; it’s recommended that you use the appropriate mrjob subcommand instead (e.g. mrjob create-cluster).
Fixes a bug in the previous release that broke
SORT_VALUES and any other attempt by the job
to set the partitioner. The
--partitioner switch is now deprecated
(the choice of partitioner is part of your job semantics).
Fixes a bug in the previous release that caused strict_protocols
and check_input_paths to be ignored in
would much prefer you fixed jobs that are using “loose protocols” rather than
strict_protocols: false in your config file, but we didn’t break
this on purpose, we promise!)
mrjob terminate-idle-clusters now correctly handles EMR debugging steps
(see enable_emr_debugging) set up by boto 2.40.0.
Fixed a bug that could result in showing a blank probable cause of error for pre-YARN (Hadoop 1) jobs.
ssh_bind_ports now defaults to a
range object (
Python 2), so that when you run on emr in verbose mode (
-r emr -v), debug
logging devotes one line to the value of
ssh_bind_ports rather than 840.
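To see why a range object keeps verbose logs short, compare its repr with that of the equivalent list. This sketch targets Python 3; the starting port 40001 is an assumption chosen only to give 840 ports, not a value taken from the text above.

```python
# Assumed port range: 840 consecutive ports (starting port is illustrative).
ports = range(40001, 40841)

# A range prints as a single short expression, whatever its length...
range_repr = repr(ports)

# ...whereas the equivalent list spells out every one of its elements.
list_repr = repr(list(ports))

print(len(ports))
print(range_repr)
print(len(list_repr) > len(range_repr))
```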
mrjob now fully supports Python 3.3+ in a way that should be transparent to existing Python 2 users (you don’t have to suddenly start handling
unicode instead of
str). For more information, see Python 2 vs. Python 3.
If you run a job with Python 3, mrjob will automatically install Python 3 on Elastic MapReduce AMIs (see bootstrap_python).
When you run jobs on EMR in Python 2, mrjob attempts to match your minor version of Python as well (either python2.6 or python2.7); see python_bin for details.
If you’re currently running Python 2.7, and
using yum to install python libraries, you’ll
want to use the Python 2.7 version of the package (e.g.
python27-numpy rather than
The mrjob command is now installed with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you install mrjob for multiple versions of Python.
mrjob should now work out of the box on almost any Hadoop setup. If hadoop is in your path, or you set any commonly-used
$HADOOP_* environment variable, mrjob will find the Hadoop binary, the streaming jar, and your logs, without any help on your part (see hadoop_bin, hadoop_log_dirs, hadoop_streaming_jar).
mrjob has been updated to fully support Hadoop 2 (YARN), including many updates to
HadoopFilesystem. Hadoop 1 is still supported, though anything prior to Hadoop 0.20.203 is not (mrjob is actually a few months older than Hadoop 0.20.203, so this used to matter).
3.x and 4.x AMIs
mrjob now fully supports the 3.x and 4.x Elastic MapReduce AMIs, including SSH tunneling to the resource manager, fetching counters and finding probable cause of job failure.
The default ami_version (see image_version) is now
3.11.0. Our plan is to continue updating this to the latest (non-broken) 3.x AMI for each 0.5.x release of mrjob.
The default instance_type is now
m1.small is too small for the 3.x and 4.x AMIs)
You can specify 4.x AMIs with either the new release_label option, or continue using ami_version; both work.
mrjob continues to support 2.x AMIs. However:
2.x AMIs are deprecated by AWS, and based on a very old version of Debian (squeeze), which breaks apt-get and exposes you to security holes.
Please, please switch if you haven’t already.
An edge case that might affect you: EC2 key pairs (i.e. SSH credentials) are region-specific, so if you’ve set up SSH but not explicitly specified a region, you may get an error saying your key pair is invalid. The fix is simply to create new SSH keys for the
us-west-2 (Oregon) region.
- mrjob is much smarter about the way it interacts with S3:
- automatically creates temp bucket in the same region as jobs
- connects to S3 buckets on the endpoint matching their region (no more 307 errors)
- S3Filesystem methods no longer take s3_conn args (passing around a single S3 connection no longer makes sense)
- no longer uses the temp bucket’s location to choose where you run your job
- rm() no longer has special logic for
- ls() recurses “subdirectories” even if you pass it a URI without a trailing slash
The part of mrjob that fetches counters and tells you what probably caused your job to fail was basically unmaintainable and has been totally rewritten. Not only do we now have solid support across Hadoop and EMR AMI versions, but if we missed anything, it should be straightforward to add it.
One casualty of this change was the mrjob fetch-logs command, which means mrjob no longer offers a way to fetch or interpret logs from a past job. We do plan to re-introduce this functionality.
Protocols are now strict by default (they simply raise an exception on
unencodable data). “Loose” protocols can be re-enabled with the
--no-strict-protocols switch; see strict_protocols for
why this is a bad idea.
Protocols will now use the much faster
ujson library, if installed,
to encode and decode JSON. This is especially recommended for simple jobs that
spend a significant fraction of their time encoding and decoding data.
If you’re using EMR, try out
this bootstrap recipe to install
mrjob will fall back to the
simplejson library if
is not installed, and use the built-in
json module if neither is installed.
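The fallback order described above is the usual try/except import pattern. The sketch below shows the general idea, not mrjob's actual code; the names json_impl, encode_pair and decode_pair are made up for illustration, and the tab-separated line layout is an assumption.

```python
# Prefer the fastest available JSON library, falling back step by step.
try:
    import ujson as json_impl
except ImportError:
    try:
        import simplejson as json_impl
    except ImportError:
        import json as json_impl  # always available in the standard library


def encode_pair(key, value):
    """Encode a key/value pair as one tab-separated line of JSON."""
    return '%s\t%s' % (json_impl.dumps(key), json_impl.dumps(value))


def decode_pair(line):
    """Invert encode_pair(); a JSON-encoded key never contains a raw tab."""
    raw_key, raw_value = line.split('\t', 1)
    return json_impl.loads(raw_key), json_impl.loads(raw_value)
```

Because the three libraries format whitespace slightly differently, it is safer to compare decoded values than encoded strings.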
We’ve tried to cut the logging messages that your job prints as it runs down to the basics (either useful info, like where a temp directory is, or something that tells you why you’re waiting). If there are any messages you miss, try running your job with
When a step in your job fails, mrjob no longer prints a useless stacktrace telling you where in the code the runner raised an exception about your step failing. This is thanks to
StepFailedException, which you can also catch and interpret if you’re running jobs programmatically.
Many things that were deprecated in 0.4.6 have been removed:
- functions and methods:
- positional arguments to
mrjob.job.MRJob.mr()(don’t even use
- step_args and name arguments to
mrjob.step.JarStep(use args instead of step_args, and don’t use name at all)
uses_generic_jobconf(), which only existed to support very old versions of Hadoop, were removed without deprecation warnings (sorry!).
To avoid a similar wave of deprecation warnings in the future, the name of every part of mrjob that isn’t meant to be a stable interface provided by the library now starts with an underscore. You can still use these things (or copy them; it’s Open Source), but there’s no guarantee they’ll exist in the next release.
If you want to get ahead of the game, here is a list of things that are deprecated starting in mrjob 0.5.0 (do these after upgrading mrjob):
- base_tmp_dir is now local_tmp_dir
- cleanup options
- emr_job_flow_id is now cluster_id
- emr_job_flow_pool_name is now pool_name
- hdfs_scratch_dir is now hadoop_tmp_dir
- pool_emr_job_flows is now pool_clusters
- s3_scratch_uri is now cloud_tmp_dir
- ssh_tunnel_to_job_tracker is now simply ssh_tunnel
- functions and methods:
- mrjob subcommands
  - mrjob create-job-flow is now mrjob create-cluster
  - mrjob terminate-idle-job-flows is now mrjob terminate-idle-clusters
  - mrjob terminate-job-flow is now mrjob terminate-cluster
- mrjob now requires
boto 2.35.0 or newer (chances are you’re already doing this). Later 0.5.x releases of mrjob may require newer versions of
- visible_to_all_users now defaults to
- new iam_endpoint option
- custom hadoop_streaming_jars are properly uploaded
- JOB cleanup on EMR is temporarily disabled
- mrjob now follows symlinks when ls()ing the local filesystem (beware recursive symlinks!)
- The interpreter option disables bootstrap_mrjob by default (interpreter is meant for non-Python jobs)
- cluster pooling now respects ec2_key_pair
- cluster self-termination (see max_hours_idle) now respects non-streaming jobs
- LocalFilesystem now rejects URIs rather than interpreting them as local paths
- inline runners no longer have a default hadoop_version, instead handling jobconf in a version-agnostic way
- steps_python_bin now defaults to the current Python interpreter.
- minor changes to
include: in conf files can now use relative paths in a meaningful way.
See Relative includes.
List and environment variable options loaded from included config files can
be totally overridden using the
!clear tag. See Clearing configs.
Fixed a bug that kept the
pool_wait_minutes option from being loaded from
This release moves mrjob off the deprecated DescribeJobFlows EMR API call.
AWS again broke older versions of mrjob for at least some new accounts, by returning 400s for the deprecated DescribeJobFlows API call. If you have a newer AWS account (circa July 2015), you must use at least this version of mrjob.
The new API does not provide a way to tell when a job flow (now called a “cluster”) stopped provisioning instances and started bootstrapping, so the clock for our estimates of when we are close to the end of a billing hour now starts at cluster creation time, and the estimates are thus more conservative.
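As a sketch of what such an estimate might look like, the function below measures billing hours from the cluster's creation time. The function name and the assumption of exact 3600-second billing hours are illustrative only; this is not the code mrjob uses.

```python
from datetime import datetime, timedelta

SECONDS_PER_HOUR = 3600.0


def seconds_to_end_of_hour(created_at, now):
    """Estimate the seconds left in the current billing hour.

    Hypothetical helper: the clock starts at cluster creation time,
    since the start of bootstrapping is not available.
    """
    elapsed = (now - created_at).total_seconds()
    return SECONDS_PER_HOUR - (elapsed % SECONDS_PER_HOUR)


created = datetime(2015, 7, 1, 12, 0, 0)
# 50 minutes after creation, 10 minutes (600 seconds) remain.
remaining = seconds_to_end_of_hour(created, created + timedelta(minutes=50))
```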
Related to this change,
no longer considers job flows in the
STARTING state idle; use
report_long_jobs to catch jobs stuck in
terminate_idle_job_flows performs much better
on large numbers of job flows. Formerly, it collected all job flow information
first, but now it terminates idle job flows as soon as it identifies them.
job_flow_pool have not been ported to the
new API and will be removed in v0.5.0.
Added an aws_security_token option to allow you to run mrjob on EMR using temporary AWS credentials.
Added an emr_tags (see tags) option to allow you to tag EMR job flows at creation time.
EMRJobRunner now has a
The hadoop_version option no longer has any effect in EMR. This option only ever did anything on the 1.x AMIs, which mrjob no longer supports.
Added many missing switches to the EMR tools (accessible from the mrjob command). Formerly, you had to use a config file to get at these options.
You can now access the
mrboss tool from the
command line: mrjob boss <args>.
Previous 0.4.x releases have worked with boto as old as 2.2.0, but this one requires at least boto 2.6.0 (which is still more than two years old). In any case, it’s recommended that you just use the latest version of boto.
This branch has a number of additional deprecation warnings, to help prepare you for mrjob v0.5.0. Please heed them; a lot of deprecated things really are going to be completely removed.
mrjob now automatically creates and uses IAM objects as necessary to comply with new requirements from Amazon Web Services.
(You do not need to install the AWS CLI or run
aws emr create-default-roles
as the link above describes; mrjob takes care of this for you.)
The change that AWS made essentially broke all older versions of mrjob for all new accounts. If the first time your AWS account created an Elastic MapReduce cluster was on or after April 6, 2015, you should use at least this version of mrjob.
If you must use an old version of mrjob with a new AWS account, see this thread for a possible workaround.
--iam-job-flow-role has been renamed to
This release also contains many, many bugfixes, one of which probably affects you! See CHANGES.txt for details.
Added a new subcommand,
mrjob collect-emr-active-stats, to collect stats
about active jobflows and instance counts.
--iam-job-flow-role option allows setting of a specific IAM role to run
this job flow.
You can now use
--no-check-input-paths on EMR
as well as Hadoop.
Files larger than 100MB will be uploaded to S3 using multipart upload if you
have the filechunkio module installed. You can change the limit/part size
--s3-upload-part-size option, or disable multipart upload by
setting this option to 0.
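To make the part-size rule concrete, here is a minimal sketch of how a file might be split into parts. The function name plan_upload_parts is hypothetical, one megabyte is assumed to be 1024 * 1024 bytes, and the check for the filechunkio module is left out; this is not mrjob's implementation.

```python
def plan_upload_parts(file_size, part_size_mb=100):
    """Return a list of (offset, length) byte ranges for an upload.

    Hypothetical helper: a part size of 0 disables multipart upload,
    and files no larger than one part are sent in a single piece.
    """
    part_size = int(part_size_mb * 1024 * 1024)

    if part_size <= 0 or file_size <= part_size:
        return [(0, file_size)]

    # The last part holds whatever is left over.
    return [(offset, min(part_size, file_size - offset))
            for offset in range(0, file_size, part_size)]


MB = 1024 * 1024
parts = plan_upload_parts(250 * MB)  # two full parts plus a 50 MB remainder
```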
You can now require protocols to be strict from mrjob.conf; this means unencodable input/output will result in an exception rather than the job quietly incrementing a counter. It is recommended you set this for all runners:
runners:
  emr:
    strict_protocols: true
  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true
You can use
--no-strict-protocols to turn off strict protocols for
a particular job.
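The difference between strict and loose handling can be illustrated with plain Python. This is a simplified sketch of the behavior described above, not mrjob's protocol code; the function name decode_lines and the counter name are made up.

```python
import json
from collections import Counter


def decode_lines(lines, strict=True):
    """Decode JSON lines, either raising or counting on bad input.

    Returns (decoded_values, counters). With strict=False, bad lines
    are skipped and tallied in a counter instead of raising.
    """
    counters = Counter()
    decoded = []

    for line in lines:
        try:
            decoded.append(json.loads(line))
        except ValueError:
            if strict:
                raise  # strict: fail loudly on the first bad line
            counters['Undecodable input'] += 1  # loose: count it and move on

    return decoded, counters
```

In loose mode a job can finish "successfully" after silently dropping data, which is why strict mode is the recommended setting.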
Tests now support pytest and tox.
Support for Python 2.5 has been dropped.
JarSteps, previously experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as Mahout in your MRJobs. For more information, see Jar steps.
Many options for setting up your task’s environment (
--setup-script) have been replaced by a powerful
--setup option. See the Job Environment Setup Cookbook for examples.
Similarly, many options for bootstrapping nodes on EMR (
--bootstrap-script) have been replaced by a single
option. See the EMR Bootstrapping Cookbook.
This release also contains many bugfixes, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.
SORT_VALUES option enables secondary sort,
ensuring that your reducer(s) receive values in sorted order. This allows you
to do things with reducers that would otherwise involve storing all the values
in memory, such as:
- Receiving a grand total before any subtotals, so you can calculate percentages on the fly. See mr_next_word_stats.py for an example.
- Running a window of fixed length over an arbitrary amount of sorted values (e.g. a 24-hour window over timestamped log data).
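The second use case can be sketched in plain Python. Assuming a reducer receives timestamps (in seconds) already sorted, a deque lets it track a 24-hour window while holding only the values inside that window. The function name and the choice of statistic are illustrative, not part of mrjob.

```python
from collections import deque


def max_events_in_window(sorted_timestamps, window_seconds=24 * 60 * 60):
    """Return the largest number of events seen in any sliding window.

    Relies on the input being sorted: only timestamps still inside
    the current window are kept in memory.
    """
    window = deque()
    best = 0

    for ts in sorted_timestamps:
        window.append(ts)
        # Drop events that fell out of the window ending at ts.
        while ts - window[0] >= window_seconds:
            window.popleft()
        best = max(best, len(window))

    return best
```

Without sorted values, the same computation would need to collect and sort every value for the key before starting.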
The max_hours_idle option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2’s full-hour billing model.
runners:
  emr:
    max_hours_idle: 0.25
    pool_emr_job_flows: true
If you enable pooling without max_hours_idle (or
terminate_idle_job_flows), pooled job
flows will stay active forever, costing you money!
You can now use
--no-check-input-paths with the Hadoop runner to
allow jobs to run even if
hadoop fs -ls can’t see their input files
Two bits of straggling deprecated functionality were removed:
- Built-in protocols must be instantiated to be used (formerly they had class methods).
- Old locations for mrjob.conf are no longer supported.
This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see CHANGES.txt).
The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.
The EMR tools can now be accessed more easily via the mrjob command. Learn more here.
Job steps are much richer now:
- You can now use mrjob to run jar steps other than Hadoop Streaming. More info
- You can filter step input with UNIX commands. More info
- In fact, you can use arbitrary UNIX commands as your whole step (mapper/reducer/combiner). More info
If you Ctrl+C from the command line, your job will be terminated if you give it time. If you’re running on EMR, that should prevent most accidental runaway jobs. More info
mrjob v0.4 requires boto 2.2.
We removed all deprecated functionality from v0.2:
- --*-protocol switches
We love contributions, so we wrote some guidelines to help you help us. See you on GitHub!
The pool_wait_minutes (
--pool-wait-minutes) option lets your job
delay itself in case a job flow becomes available. Reference:
Configuration quick reference
JOB_FLOW cleanup options tell mrjob to clean up the job
and/or the job flow on failure (including Ctrl+C). See
CLEANUP_CHOICES for more information.
The EMR instance type/number options have changed to support spot instances:
- slave_instance_type (alias for core_instance_type)
There is also a new ami_version option to change the AMI your job flow uses for its nodes.
For more information, see
report_long_jobs tool alerts on jobs that
have run for more than X hours.
Support for Combiners
You can now use combiners in your job. Like
reducer(), you can redefine
combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (
MRWordFreqCount does this to improve performance.)
combiner_final() are similar to their mapper and reducer equivalents.
You can also add combiners to custom steps by adding keyword arguments to your call to
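To show what a combiner buys you, here is a small plain-Python simulation of a word count. It does not use the MRJob class; the driver function run_with_combiner is hypothetical and only mimics the order in which Hadoop would call the three functions (mapper, then combiner on each mapper's local output, then reducer).

```python
from collections import defaultdict


def mapper(_, line):
    for word in line.split():
        yield word.lower(), 1


def combiner(word, counts):
    # Runs on one mapper's output, shrinking what is sent to reducers.
    yield word, sum(counts)


def reducer(word, counts):
    yield word, sum(counts)


def run_with_combiner(splits):
    """Simulate a job; each element of splits stands in for one mapper task."""
    shuffled = defaultdict(list)

    for lines in splits:
        local = defaultdict(list)
        for line in lines:
            for key, value in mapper(None, line):
                local[key].append(value)
        # Combine locally before anything crosses the "network".
        for key in local:
            for ckey, cvalue in combiner(key, local[key]):
                shuffled[ckey].append(cvalue)

    result = {}
    for key in sorted(shuffled):
        for rkey, rvalue in reducer(key, shuffled[key]):
            result[rkey] = rvalue
    return result
```

In this simulation each reducer call sees at most one partial count per mapper task rather than one value per word occurrence.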
*_init(), *_final() for mappers, reducers, combiners
Custom Option Parsers
It is now possible to define your own option types and actions using a custom
More info: Custom option types
Job Flow Pooling
EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.
More info: Pooling Clusters
SSH Log Fetching
mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.
More info: Configuring SSH credentials
New EMR Tool: fetch_logs
If you want to fetch the counters or error logs for a job after the fact, you can use the new
New EMR Tool: mrboss
If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new
Changes and Deprecations
The search path order for
mrjob.conf has changed. The new order is:
- The location specified by
mrjob.conf in any directory in
mrjob.conf path is deprecated, use this table to fix it:
Old Location New Location
Defining Jobs (MRJob)
Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.
--hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with
hadoop_output_format(). Hadoop formats can no longer be set from
In addition to
--jobconf, you can now set jobconf values with the
JOBCONF attribute or the
jobconf() method. To read jobconf values back, use
mrjob.compat.jobconf_from_env(), which ensures that the correct name is used depending on which version of Hadoop is active.
More info: Hadoop configuration
Protocols can now be anything with a
write() method. Unlike previous versions of mrjob, they can be instance methods rather than class methods. You should use instance methods when defining your own protocols.
DEFAULT_*PROTOCOL are deprecated. Instead, use the
*_PROTOCOL attributes or redefine the
Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.
More info: Protocols
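A custom protocol written with instance methods might look like the sketch below. It assumes the interface is a read() method that turns a line into a (key, value) pair and a write() method that does the reverse; the class name and the tab-separated JSON layout are illustrative, and the sketch works on text strings for simplicity.

```python
import json


class TabSeparatedJSONProtocol(object):
    """Illustrative protocol: JSON key, a tab, then JSON value."""

    def read(self, line):
        # A JSON-encoded key never contains a raw tab, so split on the first.
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        return '%s\t%s' % (json.dumps(key), json.dumps(value))


# Protocols are instantiated before use rather than called as classes.
protocol = TabSeparatedJSONProtocol()
line = protocol.write('apple', {'count': 2})
key, value = protocol.read(line)
```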
All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.
*_bin configuration options (
ssh_bin) take lists instead of strings so you can add arguments (like
['python', '-v']). More info: Configuration quick reference
Cleanup options have been split into
cleanup_on_failure. There are more granular values for both of these options.
Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions. More info: Custom option types
job_name_prefix option is gone (was deprecated).
All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.
Steps with no mapper use cat instead of going through a no-op mapper.
Compressed files can be streamed with the
The default Hadoop version on EMR is now 0.20 (was 0.18).
instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.
Inline Mode
Inline mode now supports the
Local mode now runs 2 mappers and 2 reducers in parallel by default.
There is preliminary support for simulating some jobconf variables. The current list of supported variables is:
boto 2.0+ is now required.
The Debian packaging has been removed from the repository.