What’s New¶
For a complete list of changes, see CHANGES.txt
0.7.4¶
Docker on EMR¶
This release adds support for Docker on EMR, which was released with AMI version 6.0.0. This is enabled by setting docker_image to point at your image.
There is also a docker_mounts option, and, if you want to host your image on a private ECR repo instead of Docker Hub, a docker_client_config option (though with AMIs 6.1.0 and later, you can also auto-authenticate to ECR; see this page).
As a result of adding Docker support, the default image_version on EMR is 6.0.0. Also, on EMR and Dataproc we used to literally bootstrap mrjob by copying it to Python’s root package directory, but as this won’t put mrjob into a Docker image, mrjob is now bootstrapped via py_files, like on every other runner.
Concurrent Steps on EMR clusters¶
This release also supports concurrent steps on EMR clusters, a feature introduced in AMI 5.28.0. The max_concurrent_steps option controls both the concurrency level of a newly launched cluster, and how much concurrency we will accept when joining a pooled cluster.
To prevent steps from the same job attempting to run simultaneously, mrjob will now submit the steps of a multi-step job one at a time (each after the previous one completes) on clusters running AMI 5.28.0 or later. This can be changed with the add_steps_in_batch option.
get_job_steps() is now deprecated, as it can’t fetch steps before they’re submitted.
Cluster Pooling¶
Cluster pooling can now join pooled clusters based on available CPU and memory reported by the YARN resource manager, rather than looking at number and type of instances in the cluster. You can enable this by setting min_available_mb and/or min_available_virtual_cores. For this feature to work, you must enable SSH (the ec2_key_pair and ec2_key_pair_file options).
You can now control the size of your cluster pool with the max_clusters_in_pool option. If a job wants to launch a new cluster in the pool but the pool is already “full,” it will wait and try again until the pool is no longer full or it can join a cluster.
Once a job determines that it is okay to add another cluster to the pool, it will wait a random number of seconds and try again. This way, if several pooled jobs launch simultaneously, they will be likely to stay within the maximum number of clusters rather than all launching their own. The random wait time can be controlled with pool_jitter_seconds.
By default, a job will wait forever to either join an existing cluster or create a new one. You can make jobs give up and raise an exception with the pool_timeout_minutes option.
mrjob will now bypass the pool_wait_minutes option if there is not a matching, active cluster to join. Basically, it won’t wait if there is not a cluster to wait for. As with max_clusters_in_pool, if a job determines there are no clusters to wait for, it will wait a random number of seconds and double-check before launching a new cluster.
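The wait-then-double-check behavior described above can be sketched roughly as follows. This is an illustration of the idea, not mrjob's implementation: count_clusters and launch_cluster are hypothetical callables standing in for EMR API calls, and only max_clusters_in_pool and pool_jitter_seconds correspond to real option names.

```python
import random
import time


def launch_with_jitter(count_clusters, launch_cluster,
                       max_clusters_in_pool, pool_jitter_seconds,
                       sleep=time.sleep, rng=random):
    """Launch a cluster only if the pool still has room after a random wait.

    count_clusters and launch_cluster are hypothetical callables.
    Returns launch_cluster()'s result, or None if the pool is full
    (the caller would then wait and try again).
    """
    if count_clusters() >= max_clusters_in_pool:
        return None  # pool already full

    # wait a random number of seconds so that jobs started at the same
    # moment don't all decide to launch at once
    sleep(rng.uniform(0, pool_jitter_seconds))

    # double-check: another job may have launched a cluster meanwhile
    if count_clusters() >= max_clusters_in_pool:
        return None

    return launch_cluster()
```

Passing sleep and rng as parameters is only a convenience for testing the sketch without real delays.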
Library requirements¶
To support concurrent steps, boto3 must be at least version 1.10.0 and botocore must be at least version 1.13.16. The google-cloud-dataproc library must be no greater than 1.1.0, to maintain compatibility with our code.
0.7.3¶
Made many long-overdue changes to Cluster Pooling, to reduce the potential for throttling by the EMR API. Pooling now puts most information a job needs to tell if it can join a cluster into the cluster name, meaning most non-matching clusters can be filtered out when we call ListClusters. Pooling also no longer needs to list cluster steps. Finally, if pool_wait_minutes is set, and there are multiple clusters we can join, we try them all, rather than just trying the “best” one and then requesting more information from the API.
This update resulted in a few minor changes to pooling. When a job has the choice of multiple clusters, it chooses solely based on CPU capacity, using NormalizedInstanceHours in the cluster summary returned by the ListClusters API call. mrjob version and applications must now match exactly in all cases.
We also re-worked the “locking” mechanism that keeps multiple jobs from joining the same cluster. Formerly, this used S3 (which may only be eventually consistent), and locks had no fixed expiration time. Now, EMR tags are used for locking, locks always expire after one minute, and every job uses the same timing when locking clusters, reducing the potential for race conditions.
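To make the idea of a fixed-expiration lock concrete, here is a minimal sketch. The tag value format (a job key followed by an expiration timestamp) and both function names are assumptions made for illustration; the text above only says that locks live in EMR tags and always expire after one minute.

```python
import time

LOCK_SECONDS = 60  # per the text above, locks always expire after one minute


def make_lock(job_key, now=None):
    """Build a (hypothetical) lock value: job key plus expiration time."""
    now = time.time() if now is None else now
    return '%s %d' % (job_key, now + LOCK_SECONDS)


def lock_is_held(lock_value, now=None):
    """Return True if lock_value is a well-formed, unexpired lock."""
    now = time.time() if now is None else now
    if not lock_value:
        return False
    try:
        _job_key, expiry = lock_value.rsplit(' ', 1)
        return now < float(expiry)
    except ValueError:
        return False  # malformed value; treat the cluster as unlocked
```

Because the expiration time is stored with the lock, a job that crashes while holding a lock cannot block a cluster for longer than LOCK_SECONDS.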
mrjob terminate-idle-clusters no longer attempts to lock clusters before terminating them, so its --max-mins-locked option is deprecated and does nothing.
The Spark harness now emulates counters correctly in local mode.
If you use mapper_raw(), and your setup script has an error, it will be correctly reported, even if your underlying shell is dash and not bash.
0.7.2¶
Spark normally only supports archives if you’re running on YARN. However, mrjob now seamlessly emulates archives on all Spark masters (other than local). This means you can now use --archives or --dirs with mrjob spark-submit, as well as using archives in your --setup script.
As a result of this change, mrjob is somewhat better at recognizing file extensions; it ignores . at the end of filenames, and can now recognize that a file with a name like mrjob-0.7.0.tar.gz is a .tar.gz file, not a .7.0.tar.gz file.
Also, if you don’t specify a name for an archive (e.g. --setup 'cd foo.tar.gz#/') mrjob no longer includes the file extension in the resulting directory name (foo/, not foo.tar.gz/).
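The extension handling described above could be approximated like this. It is a sketch under stated assumptions, not mrjob's code: the list of recognized extensions and both helper names are hypothetical.

```python
import os.path

# a hypothetical list of recognized archive extensions
ARCHIVE_EXTENSIONS = ('.tar.bz2', '.tar.gz', '.tar', '.tgz', '.zip', '.jar')


def archive_extension(path):
    """Return the archive extension of path, or '' if there isn't one."""
    name = os.path.basename(path).rstrip('.')  # ignore trailing dots
    for ext in ARCHIVE_EXTENSIONS:
        if name.endswith(ext) and len(name) > len(ext):
            return ext
    return ''


def default_dir_name(path):
    """Name an unpacked archive after its file name, minus the extension."""
    name = os.path.basename(path).rstrip('.')
    ext = archive_extension(path)
    return name[:-len(ext)] if ext else name
```

Matching against a list of known extensions, rather than splitting on the first dot, is what keeps the version number in mrjob-0.7.0.tar.gz from being mistaken for part of the extension.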
Patched a long-standing security issue on EMR where we were copying the SSH key to the master node when reading logs from other nodes, which are only accessible via the master node. mrjob now correctly uses ssh-add and the SSH agent instead of copying the key. As a result, mrjob now has a ssh_add_bin option.
The extra_cluster_params option now recursively merges dict params into existing ones. For example, you can now do this:
runners:
  emr:
    extra_cluster_params:
      Instances:
        EmrManagedMasterSecurityGroup: sg-foo
without obliterating the rest of the Instances API parameter.
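Recursive merging of dict parameters can be pictured with a small helper like the one below. The function name and the sample values are illustrative assumptions; this shows the general technique, not mrjob's own code.

```python
def merge_params(base, overrides):
    """Return a copy of base with overrides merged in, recursing into dicts."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are dicts: merge them key by key
            merged[key] = merge_params(merged[key], value)
        else:
            # otherwise the override simply replaces the old value
            merged[key] = value
    return merged


# hypothetical API parameters built before extra_cluster_params is applied
api_params = {'Instances': {'Ec2KeyName': 'my-key'}}
extra = {'Instances': {'EmrManagedMasterSecurityGroup': 'sg-foo'}}
merged = merge_params(api_params, extra)
```

With a non-recursive merge, the Instances dict from extra would have replaced the existing one wholesale, dropping Ec2KeyName.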
Python 2 has reached end-of-life, so if you’re using Python 2, the default python_bin is python2.7 rather than python, which now means Python 3 on some systems (for example, 6.x EMR AMIs).
Finally, we ensure that if you’re installing mrjob on Python 3.4, we’ll install a Python 3.4-compatible version of PyYAML.
0.7.1¶
EMR¶
Fixed a bug so that VisibleToAllUsers now defaults to True.
You can use extra_cluster_params to set it to False. For example, you can now do:
--extra-cluster-param VisibleToAllUsers=false
mrjob now logs the runner it invokes, along with its keyword arguments. The contents of archives are now used during bootstrapping to ensure clusters have the same setup.
0.7.0¶
AWS and Google are now optional dependencies¶
Amazon Web Services (EMR/S3) and Google Cloud are now optional dependencies, aws and google respectively. For example, to install mrjob with AWS support, run:
pip install mrjob[aws]
non-Python mrjobs are no longer supported¶
Fully removed support for writing MRJob scripts in other languages and then running them with the mrjob library. (This capability was so little used that chances are you never knew it existed.)
As a result, the interpreter and steps_interpreter options are gone, the mrjob run subcommand is gone, and the MRJobLauncher class has been merged back into MRJob. Also removed mr_wc.rb from mrjob/examples/.
MRSomeJob() means read from sys.argv¶
In prior versions, if you initialized a MRJob subclass with no arguments (MRSomeJob()), that meant the same thing as passing in an empty argument list (MRSomeJob(args=[])). It now means to read args directly from sys.argv[1:].
In practice, it’s rare to see a MRJob subclass initialized this way outside of test cases. Running a MRJob script directly, or initializing it with an argument list, works the same as in previous versions.
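The distinction between "no arguments given" and "an empty argument list" follows a common Python pattern, sketched below with a hypothetical stand-in class (ArgsHolder and its cl_args attribute are not part of mrjob).

```python
import sys


class ArgsHolder(object):
    """Hypothetical stand-in for a job class; not mrjob's actual code."""

    def __init__(self, args=None):
        # args=None (nothing passed) means "read the command line";
        # args=[] still means "no command-line arguments at all"
        if args is None:
            args = sys.argv[1:]
        self.cl_args = list(args)
```

If a test case relied on the old behavior, passing args=[] explicitly should keep it independent of whatever arguments the test runner itself was launched with.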
mrjob/examples/ love¶
The mrjob.examples package has been updated. Some examples that were difficult to test or maintain were removed, and the remainder were tested and fixed if necessary.
mrjob.examples.mr_text_classifier no longer needs you to encode documents in JSON format, and instead operates directly on text files with names like doc_id-cat_id_1-not_cat_id_2-etc.txt. Try it out:
python -m mrjob.examples.mr_text_classifier docs-to-classify/*.txt
miscellaneous tweaks¶
The mrjob audit-emr-usage subcommand no longer attempts to read cluster pool names from clusters launched by mrjob v0.5.x.
Method arguments in filesystem classes (in mrjob.fs) are now consistently named. This probably won’t matter in practice, as runner.fs is always a CompositeFilesystem anyhow.
removed deprecated code¶
Check your deprecation warnings! Everything marked deprecated in mrjob v0.6.x has been removed.
The following runner config options no longer exist: emr_api_params, interpreter, max_hours_idle, mins_to_end_of_hour, steps_interpreter, steps_python_bin, visible_to_all_users.
The following singular switches have been removed in favor of their plural alternative (e.g. --archives): --archive, --dir, --file, --hadoop-arg, --libjar, --py-file, --spark-arg.
The --steps switch is gone. This means --help --steps no longer works; use --help -v to see help for --mapper, etc.
Support for simulating optparse has been removed from MRJob. This includes add_file_option(), add_passthrough_option(), configure_options(), load_options(), pass_through_option(), self.args, self.OPTION_CLASS.
mrjob.job.MRJobRunner.stream_output() and mrjob.job.MRJob.parse_output_line() have been removed.
The constructor for MRJobRunner no longer has a file_upload_args keyword argument.
parse_and_save_options(), read_file(), and read_input() have all been removed from mrjob.util.
CompositeFilesystem no longer takes filesystems as arguments to its constructor; use add_fs(). The useless local_tmp_dir option to the GCSFilesystem constructor and the chunk_size arg to its put() method have been removed.
0.6.12¶
Updated the Dataproc runner’s default image_version to 1.3, as the old default, 1.0, no longer works.
The local and inline runners can now handle file:// URIs as input paths and as files/archives uploaded to the working directory. The local filesystem (available as runner.fs from all runners) can now handle file:// URIs as well.
0.6.11¶
Adds support for parsing Spark logs and driver output to determine why a job failed. This works with the local, Hadoop, EMR, and Spark runners.
The Spark runner no longer needs pyspark in the $PYTHONPATH to launch scripts with spark-submit (it still needs pyspark to use the Spark harness).
On Python 3.7, you can now intermix positional arguments to MRJob with switches, similar to how you could back when mrjob used optparse. For example: mr_your_script.py file1 -v file2.
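Python 3.7 added parse_intermixed_args() to argparse, which is what makes this style of command line possible in general. The sketch below uses a made-up parser to show the effect; whether mrjob relies on exactly this method is an assumption.

```python
import argparse

# a made-up parser, loosely shaped like a job's command line
parser = argparse.ArgumentParser(prog='mr_your_script.py')
parser.add_argument('paths', nargs='*')
parser.add_argument('-v', '--verbose', action='store_true')

# parse_intermixed_args() collects positional arguments from both
# sides of the -v switch
options = parser.parse_intermixed_args(['file1', '-v', 'file2'])
```

Here options.paths holds both file names and options.verbose is set.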
On EMR, the default image_version (AMI) is now 5.27.0.
Restored m4.large as the default instance type for pre-5.13.0 AMIs, as they do not support m5.xlarge. (m5.xlarge is still the default for AMI 5.13.0 and later.)
mrjob can now retry on transient AWS API errors (e.g. throttling) or network errors when making API calls that use pagination (e.g. listing clusters).
The emr_configurations option now supports the !clear tag rather than crashing. You may also override individual configs by setting a config with the same Classification.
This version restores official support for Python 3.4, as it’s the version of Python 3 installed on EMR AMIs prior to 5.20.0. In order to make this work, mrjob drops support for Google Cloud services in Python 3.4, as the recent Google libraries appear to need a later Python version.
0.6.10¶
Adds official support for PyPy (that is, any version of it compatible with Python 2.7/3.5+). If you launch a job in PyPy, python_bin will automatically default to pypy or pypy3 as appropriate.
Note that mrjob does not auto-install PyPy for you on EMR (Amazon Linux does not provide a PyPy package). Installing PyPy yourself at bootstrap time is fairly straightforward; see Installing PyPy.
The Spark harness can now be used on EMR, allowing you to run “classic” MRJobs in Spark, which is often faster. Essentially, you launch jobs in the Spark runner with --spark-submit-bin 'mrjob spark-submit -r emr'; see Running classic MRJobs on Spark on EMR for details.
The Spark runner can now optionally disable internal protocols when running “classic” MRJobs, eliminating the (usually) unnecessary effort of encoding data structures into JSON or other string representations and then decoding them. See skip_internal_protocol for details.
The EMR runner’s default instance type is now m5.xlarge, which works with newer regions and should make it easier to run Spark jobs. The EMR runner also now logs the DNS of the master node as soon as it is available, to make it easier to SSH in.
Finally, mrjob gives a much clearer error message if you attempt to read a YAML mrjob.conf file without PyYAML installed.
0.6.9¶
Drops support for Python 3.4.
Fixes a bug introduced in 0.6.8 that could break archives or directories uploaded into Hadoop or Spark if the name of the unpacked archive didn’t have an archive extension (e.g. .tar.gz).
The Spark runner can now optionally emulate Hadoop’s mapreduce.map.input.file configuration property when running the mapper of the first step of a streaming job if you enable emulate_map_input_file. This means that jobs that depend on jobconf_from_env('mapreduce.map.input.file') will still work.
The Spark runner also now uses the correct argument names when emulating increment_counter(), and logs a warning if spark_tmp_dir doesn’t match spark_master.
mrjob spark-submit can now pass switches to the Spark script/JAR without explicitly separating them out with --.
The local and inline runners now more correctly emulate the mapreduce.map.input.file config property by making it a file:// URL.
Deprecated methods add_file_option() and add_passthrough_option() can now take a type (e.g. int) as their type argument, to better emulate optparse.
0.6.8¶
Nearly full support for Spark¶
This release adds nearly full support for Spark, including mrjob-specific features like setup scripts and passthrough options. See Why use mrjob with Spark? for everything mrjob can do with Spark.
This release adds a SparkMRJobRunner (-r spark), which works with any Spark installation, does not require Hadoop, and can access any filesystem supported by both mrjob and Spark (HDFS, S3, GCS). The Spark runner is now the default for mrjob spark-submit.
What’s not supported? mrjob does not yet support Spark on Google Cloud Dataproc. The Spark runner does not yet parse logs to determine probable cause of failure when your job fails (though it does give you the Spark driver output).
Spark Hadoop Streaming emulation¶
Not only does the Spark runner not need Hadoop to run Spark jobs, it doesn’t need Hadoop to run most Hadoop Streaming jobs, as it knows how to run them directly on Spark. This means if you want to migrate to a non-Hadoop Spark cluster, you can take all your old MRJobs with you. See Running “classic” MRJobs on Spark for details.
The “experimental harness script” mentioned in 0.6.7 is now fully integrated into the Spark runner and is no longer supported as a separate feature.
Local runner support for Spark¶
The local and inline runners can now run Spark scripts locally for testing, analogous to the way they’ve supported Hadoop streaming scripts (except that they do require a local Spark installation). See Other ways to run on Spark.
Other Spark improvements¶
MRJobs are now Spark-serializable without calling sandbox() (there used to be a problematic reference to sys.stdin). This means you can always pass job methods to rdd.flatMap() etc.
setup scripts are no longer a YARN-specific feature, working on all Spark masters (except local[*], which doesn’t give executors a separate working directory).
Likewise, you can now specify a different name for files in the job’s working directory (e.g. --file foo#bar) on all Spark masters.
Note
Uploading archives and directories still only works on YARN for now; Spark considers --archives a YARN-specific feature.
When running on a local Spark cluster, mrjob uses file://... rather than just the path of the file when necessary (e.g. with --py-files).
cat_output() now ignores files and subdirectories starting with "." (used to only be "_"). This allows mrjob to ignore Spark’s checksum files (e.g. .part-00000.crc), and also brings mrjob in closer compliance to the way Hadoop input formats read directories.
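The filtering rule described above can be expressed in a few lines. This is a sketch of the rule only; the helper name is hypothetical, and paths are assumed to be relative to the output directory and separated by "/".

```python
def visible_output_paths(relative_paths):
    """Yield paths with no component starting with '_' or '.'."""
    for path in relative_paths:
        parts = path.split('/')
        # skip the path if the file or any enclosing subdirectory
        # looks hidden or internal
        if not any(part.startswith(('_', '.')) for part in parts):
            yield path
```

For example, given part-00000, .part-00000.crc, and _SUCCESS, only part-00000 would be read.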
spark.yarn.appMasterEnv.* config properties are only set if you’re actually running on YARN.
The values of spark_master and spark_deploy_mode can no longer be overridden with configuration properties (-D spark.master=...). While not exactly a “feature,” this means that mrjob always knows what Spark platform it’s running on.
Filesystems¶
Every runner has an fs
attribute that gives access to all the filesystems
that runner supports.
Added a put() method to all filesystems, which allows uploading a single file (it used to be that each runner had custom logic for uploads).
It also used to be that if you wanted to create a bucket on S3 or GCS, you had to call create_bucket(...) explicitly. Now mkdir() will automatically create buckets as needed.
If you still need to access methods specific to a filesystem, you should do so through fs.<name>, where <name> is the (lowercase) name of the storage service. For example, the Spark runner’s filesystem offers both runner.fs.s3.create_bucket() and runner.fs.gcs.create_bucket().
The old style of implicitly passing through FS-specific methods (runner.fs.create_bucket(...)) is deprecated and going away in v0.7.0.
GCSFilesystem’s constructor had a useless local_tmp_dir argument, which is now deprecated and going away in v0.7.0.
EMR¶
Fixed a bad bug introduced in 0.6.7 that could prevent mrjob from running on EMR with a non-default temp bucket.
You can now set sub-parameters with extra_cluster_params. For example, you can now do:
--extra-cluster-param Instances.EmrManagedMasterSecurityGroup=...
without clobbering the zone or instance group/fleet configs specified in Instances.
Running your job with --subnet '' now un-sets a subnet specified in your config file (used to be ignored).
If you are using cluster pooling with retries (pool_wait_minutes), mrjob now retains information about clusters that is immutable (e.g. AMI version), saving API calls.
Dependency upgrades¶
Bumped the required versions of several Google Cloud Python libraries to be more compatible with current versions of their sub-dependencies (Google libraries pin a fairly narrow range of dependencies). mrjob now requires:
- google-cloud-dataproc at least 0.3.0,
- google-cloud-logging at least 1.9.0, and
- google-cloud-storage at least 1.13.1.
Also dropped support for PyYAML 3.08; now we require at least PyYAML 3.10 (which came out in 2011).
Note
We are aware that the Google libraries’ extensive dependencies can be a nuisance for mrjob users who don’t use Google Cloud. Our tentative plan is to make dependencies specific to a third-party service (including google-cloud-* and boto3) optional starting in v0.7.0.
Other bugfixes¶
Fixed a long-standing bug that would cause the Hadoop runner to hang or raise cryptic errors if hadoop_bin or spark_submit_bin is not executable.
Support files for mrjob.examples (e.g. stop_words.txt for MRMostUsedWord) are now installed along with mrjob.
Setting a *_bin option to an empty value (e.g. --hadoop-bin) now always instructs mrjob to use the default, rather than disabling core features or creating cryptic errors. This affects gcloud_bin, hadoop_bin, sh_bin, and ssh_bin; the various *python_bin options already worked this way.
0.6.7¶
setup commands now work on Spark (at least on YARN).
Added the mrjob spark-submit subcommand, which works as a drop-in replacement for spark-submit but with mrjob runners (e.g. EMR) and mrjob features (e.g. setup, cmdenv).
Fixed a bug that was causing idle timeout scripts to silently fail on 2.x EMR AMIs.
Fixed a bug that broke create_bucket() on us-east-1, preventing new mrjob installations from launching on EMR in that region.
Fixed an ImportError from attempting to import os.SIGKILL on Windows.
The default instance type on EMR is now m4.large.
EMR’s cluster pooling now knows the CPU and memory capacity of c5 and m5 instances, allowing it to join “better” clusters.
Added the plural form of several switches (separate multiple values with commas):
--applications
--archives
--dirs
--files
--libjars
--py-files
Except for --application, the singular version of these switches (--archive, --dir, --file, --libjar, --py-file) is deprecated for consistency with Hadoop and Spark.
sh_bin is now fully qualified by default (/bin/sh -ex, not sh -ex). sh_bin may no longer be empty, and a warning is issued if it has more than one argument, to properly support shell script shebangs (e.g. #!/bin/sh -ex) on Linux.
Runners no longer call MRJobs with --steps; instead the job passes its step description to the runner on instantiation. --steps and steps_python_bin are now deprecated.
The Hadoop and EMR runners can now set SPARK_PYTHON and SPARK_DRIVER_PYTHON to different values if need be (e.g. to match task_python_bin, or to support setup scripts in client mode).
The inline runner no longer attempts to run command substeps.
The inline and local runners no longer silently pretend to run non-streaming steps.
The Hadoop runner no longer has the bootstrap_spark option, which did nothing.
interpreter and steps_interpreter are deprecated, in anticipation of removing support for writing MRJobs in other programming languages.
Runners now issue a warning if they receive options that belong to other runners (e.g. passing image_version to the Hadoop runner).
mrjob create-cluster now supports --emr-action-on-failure.
Updated deprecated escape sequences in mrjob code that would break on Python 3.8.
The --help message for mrjob subcommands now correctly includes the subcommand in usage.
mrjob no longer raises AssertionError, instead raising ValueError.
Added an experimental harness script (in mrjob/spark) to run basic MRJobs on Spark, potentially without Hadoop:
spark-submit mrjob_spark_harness.py module.of.YourMRJob input_path output_dir
Added map_pairs(), reduce_pairs(), and combine_pairs() methods to MRJob, to enable the Spark harness script.
0.6.6¶
Fixes a longstanding bug where boolean jobconf values were passed to Hadoop in Python format (True instead of true). You can now safely do something like this:
runners:
  emr:
    jobconf:
      mapreduce.output.fileoutputformat.compress: true
whereas in prior versions of mrjob, you had to use "true" in quotes.
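The underlying issue is that str(True) is 'True' in Python, while Hadoop expects lowercase true. A conversion along these lines avoids the problem; the helper name is hypothetical and this is only a sketch of the technique, not mrjob's code.

```python
def jobconf_value(value):
    """Convert a Python config value to the string Hadoop expects."""
    # check bool first: bool is a subclass of int in Python
    if isinstance(value, bool):
        return 'true' if value else 'false'
    return str(value)
```

Strings pass through unchanged, so a config that still quotes "true" keeps working.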
Added -D as a synonym for --jobconf, to match Hadoop.
On EMR, if you have SSH set up (see Configuring SSH credentials) mrjob can fetch your history log directly from HDFS, allowing it to more quickly diagnose why your job failed.
Added a --local-tmp-dir switch. If you set local_tmp_dir to an empty string, mrjob will use the system default.
You can now pass multiple arguments to Hadoop with --hadoop-args (for example, --hadoop-args='-fs hdfs://namenode:port'), rather than having to use --hadoop-arg one argument at a time. --hadoop-arg is now deprecated.
Similarly, you can use --spark-args to pass arguments to spark-submit in place of the now-deprecated --spark-arg.
mrjob no longer automatically passes generic arguments (-D and -libjars) to JarSteps, because this confuses some JARs. If you want mrjob to pass generic arguments to a JAR, add GENERIC_ARGS to your JarStep’s args keyword argument, like you would with INPUT and OUTPUT.
The Hadoop runner now has a spark_deploy_mode option.
Fixed the usage: usage: typo in --help messages.
mrjob.job.MRJob.add_file_arg() can now take an explicit type=str (used to cause an error).
The deprecated optparse emulation methods add_file_option() and add_passthrough_option() now support type='str' (used to only accept type='string').
Fixed a permissions error that was breaking inline and local mode on some versions of Windows.
0.6.5¶
This release fixes an issue with self-termination of idle clusters on EMR (see max_mins_idle) where the master node sometimes simply ignored sudo shutdown -h now. The idle self-termination script now logs to bootstrap-actions/mrjob-idle-termination.log.
Note
If you are using Cluster Pooling, it’s highly recommended you upgrade to this version to fix the self-termination issue.
You can now turn off log parsing (on all runners) by setting read_logs to false. This can speed up cases where you don’t care why a job failed (e.g. integration tests) or where you’d rather use the diagnose tool after the fact.
You may specify custom AMIs with the image_id option. To find Amazon Linux AMIs compatible with EMR that you can use as a base for your custom image, use describe_base_emr_images().
The default AMI on EMR is now 5.16.0.
New EMR clusters launched by mrjob will be automatically tagged with __mrjob_label (filename of your mrjob script) and __mrjob_owner (your username), to make it easier to understand your mrjob usage in CloudWatch etc. You can change the value of these tags with the label and owner options.
You may now set the root EBS volume size for EMR clusters directly with ebs_root_volume_gb (you used to have to use instance_groups or instance_fleets).
API clients returned by EMRJobRunner now retry on SSL timeouts. EMR clients returned by mrjob.emr.EMRJobRunner.make_emr_client() won’t retry faster than check_cluster_every, to prevent throttling.
Cluster pooling recovery (relaunching a job when your pooled cluster self-terminates) now works correctly on single-node clusters.
0.6.4¶
This release makes it easy to attach static files to your MRJob with the FILES, DIRS, and ARCHIVES attributes.
In most cases, you no longer need setup scripts to access other python modules or packages from your job because you can use DIRS instead. For more details, see Using other python modules and packages.
For completeness, also added files(), dirs(), and archives() methods.
terminate-idle-clusters now skips termination-protected idle clusters, rather than crashing (this is fixed in 0.5.12, but not previous 0.6.x versions).
Python 3.3 is no longer supported.
mrjob now requires google-cloud-dataproc 0.2.0+ (this library used to be vendored).
0.6.3¶
Read arbitrary file formats¶
You can now pass entire files in any format to your mapper by defining mapper_raw(). See Passing entire files to the mapper for an example.
Google Cloud Dataproc parity¶
mrjob now offers feature parity between Google Cloud Dataproc and Amazon Elastic MapReduce. Support for Spark and libjars will be added in a future release. (There is no plan to introduce Cluster Pooling with Dataproc.)
Specifically, DataprocJobRunner now supports:
- fetching and parsing counters
- parsing logs for probable cause of failure
- job progress messages (% complete)
- Jar steps
- these config options:
  - cloud_part_size_mb (chunked uploading)
  - core_instance_config, master_instance_config, task_instance_config
  - hadoop_streaming_jar
  - network/subnet (running in a VPC)
  - service_account (custom IAM account)
  - service_account_scopes (fine-grained permissions)
  - ssh_tunnel/ssh_tunnel_is_open (resource manager)
Improvements to existing Dataproc features:
- bootstrap scripts run in a temp dir, rather than /
- uses Dataproc’s built-in auto-termination feature, rather than a script
- GCS filesystem:
  - cat() streams data rather than dumping to a temp file
  - exists() no longer swallows all exceptions
To get started, read Getting started with Google Cloud.
Other changes¶
mrjob no longer streams your job output to the command line if you specify output_dir. You can control this with the --cat-output and --no-cat-output switches (--no-output is deprecated).
cloud_upload_part_size has been renamed to cloud_part_size_mb (the old name will work until v0.7.0).
mrjob can now recognize “not a valid JAR” errors from Hadoop and suggest them as probable cause of job failure.
mrjob no longer depends on google-cloud (which implies several other Google libraries). Its current Google library dependencies are google-cloud-logging 1.5.0+ and google-cloud-storage 1.9.0+. Future versions of mrjob will depend on google-cloud-dataproc 0.11.0+ (currently included with mrjob because it hasn’t yet been released).
RetryWrapper now sets __name__ when wrapping methods, making for easier debugging.
0.6.2¶
mrjob is now orders of magnitude quicker at parsing logs, making it practical to diagnose rare errors from very large jobs. However, on some AMIs, it can no longer parse errors without waiting for logs to transfer to S3 (this may be fixed in a future version).
To run jobs on Google Cloud Dataproc, mrjob no longer requires you to install the gcloud util (though if you do have it installed, mrjob can read credentials from its configs). For details, see Dataproc Quickstart.
mrjob no longer requires you to select a Dataproc zone prior to running jobs. Auto zone placement (just set region and let Dataproc pick a zone) is now enabled, with the default being auto zone placement in us-west1. mrjob no longer reads zone and region from gcloud’s compute engine configs.
mrjob’s Dataproc code has been ported from the google-python-api-client library (which is in maintenance mode) to google-cloud-sdk, resulting in some small changes to the GCS filesystem API. See CHANGES.txt for details.
Local mode now has a num_cores option that allows you to control how many tasks it handles simultaneously.
0.6.1¶
Added the diagnose tool (run mrjob diagnose j-CLUSTERID), which determines why a previously run job failed.
Fixed a serious bug that made mrjob unable to properly parse error logs in some cases.
Added the get_job_steps() method to EMRJobRunner.
0.6.0¶
Dropped Python 2.6¶
mrjob now supports Python 2.7 and Python 3.3+. (Some versions of PyPy also work but are not officially supported.)
boto3, not boto¶
mrjob now uses boto3 rather than boto to talk to AWS. This makes it much simpler to pass user-defined data structures directly to the API, enabling a number of features.
At least version 1.4.6 of boto3 is required to run jobs on EMR.
It is now possible to fully configure instances (including EBS volumes). See instance_groups for an example.
mrjob also now supports Instance Fleets, which may be fully configured (including EBS volumes) through the instance_fleets option.
Methods that took or returned boto objects (for example, make_emr_conn()) have been completely removed, as there is no way to make a deprecated shim for them without keeping boto as a dependency. See EMRJobRunner and S3Filesystem for new method names.
Note that boto3 reads temporary credentials from $AWS_SESSION_TOKEN, not $AWS_SECURITY_TOKEN as in boto (see aws_session_token for details).
argparse, not optparse¶
mrjob now uses argparse to parse options, rather than optparse, which has been deprecated since Python 2.7.
argparse has slightly different option-parsing logic. A couple of things you should be aware of:
- everything that starts with - is assumed to be a switch. --hadoop-arg=-verbose works, but --hadoop-arg -verbose does not.
- positional arguments may not be split. mr_wc.py CHANGES.txt LICENSE.txt -r local will work, but mr_wc.py CHANGES.txt -r local LICENSE.txt will not.
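The first point is standard argparse behavior and can be reproduced with a small stand-alone parser. The parser below is a made-up example that borrows the --hadoop-arg switch name; it is not mrjob's own argument parser.

```python
import argparse

# a made-up parser with a single repeatable switch
parser = argparse.ArgumentParser(prog='mr_wc.py')
parser.add_argument('--hadoop-arg', action='append', default=[])

# '=' binds the value to the switch, so a leading '-' is fine
ok = parser.parse_args(['--hadoop-arg=-verbose'])

# without '=', '-verbose' looks like another switch, so argparse
# reports an error (on stderr) and exits
try:
    parser.parse_args(['--hadoop-arg', '-verbose'])
    rejected = False
except SystemExit:
    rejected = True
```

After this runs, ok.hadoop_arg contains -verbose and rejected is True.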
Passthrough options, file options, etc. are now handled with add_file_arg(), add_passthru_arg(), configure_args(), load_args(), and pass_arg_through(). The old methods with “option” in their name are deprecated but still work.
As part of this refactor, OptionStore and its subclasses have been removed; options are now handled by runners directly.
Chunks, not lines¶
mrjob no longer assumes that job output will be line-based. If you run your job programmatically, you should read your job output with cat_output(), which yields bytestrings which don’t necessarily correspond to lines, and run these through parse_output(), which will convert them into key/value pairs.
runner.fs.cat()
also now yields arbitrary bytestrings, not lines. When it
yields from multiple files, it will yield an empty bytestring (b''
)
between the chunks from each file.
read_file()
and read_input()
are
now deprecated because they are line-based. Try
decompress()
, to_chunks()
, and
to_lines()
.
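To make the chunk/line distinction concrete, here is a minimal sketch of regrouping arbitrary bytestring chunks into lines. chunks_to_lines() is a hypothetical helper written for this illustration, not mrjob's to_lines() implementation; it assumes, as described above, that an empty bytestring marks the boundary between files.

```python
def chunks_to_lines(chunks):
    """Regroup arbitrary bytestring chunks into lines.

    Hypothetical helper for illustration only (not mrjob's to_lines()).
    """
    leftover = b''
    for chunk in chunks:
        if chunk == b'':
            # b'' marks the boundary between files: flush any partial line
            # so it is not glued onto the next file's first line
            if leftover:
                yield leftover
                leftover = b''
            continue
        leftover += chunk
        lines = leftover.split(b'\n')
        leftover = lines.pop()  # text after the last newline (may be empty)
        for line in lines:
            yield line + b'\n'
    if leftover:
        yield leftover


# chunks as a filesystem might yield them: split mid-line, with b''
# between the two files
chunks = [b'"a"\t1\n"b', b'"\t2', b'', b'"c"\t3\n']
lines = list(chunks_to_lines(chunks))
print(lines)
```

The second line comes from a file with no trailing newline, which is why the file boundary marker matters: without it, that line would run into the first line of the next file.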
Better local/inline mode¶
The sim runners (inline and local mode) have been completely rewritten, making it possible to fix a number of outstanding issues.
Local mode now runs one mapper/reducer per CPU, using multiprocessing, for faster results.
We only sort by reducer key (not the full line) unless SORT_VALUES is set, exposing bad assumptions sooner.
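The difference between the two sort orders can be seen with a few lines of Python. This is only an illustration of the idea, assuming tab-separated key/value lines; it is not the sim runners' actual sorting code.

```python
# mapper output lines in the order they were emitted (key, tab, value)
lines = [b'b\t2\n', b'a\t9\n', b'a\t1\n', b'b\t0\n']


def reducer_key(line):
    # everything before the first tab
    return line.split(b'\t', 1)[0]


# Sorting by key only: Python's sort is stable, so values for the same
# key stay in emission order (9 before 1).
by_key = sorted(lines, key=reducer_key)

# Sorting by the full line also orders the values, which can hide a
# reducer that wrongly assumes its values arrive sorted.
by_line = sorted(lines)

print(by_key)
print(by_line)
```

A reducer that silently relies on sorted values would pass under the full-line sort and misbehave under the key-only sort, which is the kind of bad assumption the new behavior is meant to surface.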
The step_output_dir option is now supported, making it easier to debug issues in intermediate steps.
Files in tasks’ (e.g. mappers’) working directories are marked user-executable, to better imitate Hadoop Distributed Cache. When possible, we also symlink to a copy of each file/archive in the “cache,” rather than copying them.
If os.symlink() raises an exception, we fall back to copying (this can be an issue in Python 3 on Windows).
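The symlink-then-copy fallback might look roughly like the sketch below. link_or_copy() and the file names are hypothetical, and the set of exceptions caught is an assumption; mrjob's own code may differ.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dest):
    """Symlink dest to src, copying instead if symlinking fails.

    Hypothetical helper for illustration, not mrjob's actual code.
    """
    try:
        os.symlink(src, dest)
    except (OSError, NotImplementedError, AttributeError):
        # e.g. no symlink privilege on Windows
        shutil.copy2(src, dest)


tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'cached_file.txt')
with open(src, 'w') as f:
    f.write('data\n')

dest = os.path.join(tmp_dir, 'task_copy.txt')
link_or_copy(src, dest)

# either way, the task sees the same contents
with open(dest) as f:
    contents = f.read()
print(repr(contents))

shutil.rmtree(tmp_dir)
```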
Tasks are run more like they are in Hadoop; input is passed through stdin, rather than as script arguments. mrjob.cat is no longer executable because local mode no longer needs it.
Cloud runner improvements¶
Much of the common code for the “cloud” runners (Dataproc and EMR) has been merged, so that new features can be rolled out in parallel.
The bootstrap option (for both Dataproc and EMR) can now take archives and directories as well as files, like the setup option has since version 0.5.8.
The extra_cluster_params option allows you to pass arbitrary JSON to the API at cluster create time (in Dataproc and EMR). The old emr_api_params option is deprecated and disabled.
max_hours_idle has been replaced with max_mins_idle (the old option is deprecated but still works). The default is 10 minutes. Due to a bug, smaller numbers of minutes might cause the cluster to terminate before the job runs.
It is no longer possible for mrjob to launch a cluster that sits idle indefinitely (except by setting max_mins_idle to an unreasonably high value). It is still a good idea to run report-long-jobs because mrjob can’t tell if a running job is doing useful work or has stalled.
EMR now bills by the second, not the hour¶
Elastic MapReduce recently stopped billing by the full hour, and now bills by the second. This means that Cluster Pooling is no longer a cost-saving strategy, though developers might find it handy to reduce wait times when testing.
The mins_to_end_of_hour option no longer makes sense, and has been deprecated and disabled.
audit-emr-usage has been updated to use billing by the second when approximating time billed and waste.
Note
Pooling was enabled by default for some development versions of v0.6.0, prior to the billing change. This did not make it into the release; you must still explicitly turn on cluster pooling.
Other EMR changes¶
The default AMI is now 5.8.0. Note that this means you get Spark 2 by default.
Regions are now case-sensitive, and the EU alias for eu-west-1 no longer works.
Pooling no longer adds dummy arguments to the master bootstrap script, instead setting the __mrjob_pool_hash and __mrjob_pool_name tags on the cluster.
mrjob automatically adds the __mrjob_version tag to clusters it creates.
Jobs will not add tags to clusters they join rather than create.
enable_emr_debugging now works on AMI 4.x and later.
AMI 2.4.2 and earlier are no longer supported (no Python 2.7). There is no longer any special logic for the “latest” AMI alias (which the API no longer supports).
The SSH filesystem no longer dumps file contents to memory.
Pooling will only join a cluster with enough running instances to meet its specifications; requested instances no longer count.
Pooling is now aware of EBS (disk) setup.
Pooling won’t join a cluster that has extra instance types that don’t have enough memory or disk space to run your job.
Errors in bootstrapping scripts are no longer dumped as JSON.
visible_to_all_users is deprecated.
Massive purge of deprecated code¶
About a hundred functions, methods, options, and more that were deprecated in v0.5.x have been removed. See CHANGES.txt for details.
0.5.12¶
This release came out after v0.6.3. It was mostly a backport from v0.6.x.
Python 2.6 and 3.3 are no longer supported.
mrjob.parse.parse_s3_uri() handles s3a:// URIs.
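As a rough idea of what handling s3a:// involves, the sketch below splits an S3 URI into bucket and key with the standard library. split_s3_uri() is a hypothetical stand-in, not mrjob's parse_s3_uri(), and the set of accepted schemes is an assumption.

```python
from urllib.parse import urlparse

# assumed set of S3 schemes, for illustration only
S3_SCHEMES = ('s3', 's3a', 's3n')


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key). Hypothetical stand-in."""
    parsed = urlparse(uri)
    if parsed.scheme not in S3_SCHEMES:
        raise ValueError('Invalid S3 URI: %s' % uri)
    # drop the leading slash from the path to get the key
    return parsed.netloc, parsed.path[1:]


bucket, key = split_s3_uri('s3a://walrus/tmp/part-00000')
print(bucket, key)
```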
terminate-idle-clusters now skips termination-protected idle clusters, rather than crashing.
Since Amazon no longer bills by the full hour, the mins_to_end_of_hour option now defaults to 60, effectively disabling it.
When mrjob passes an environment dictionary to subprocesses, it ensures that the keys and values are always strs (this mostly affects Python 2 on Windows).
0.5.11¶
The report-long-jobs utility can now ignore certain clusters based on EMR tags.
This version deals more gracefully with clusters that use instance fleets, preventing crashes that may occur in some rare edge cases.
0.5.10¶
Fixed an issue where bootstrapping mrjob on Dataproc or EMR could stall if mrjob was already installed.
The aws_security_token option has been renamed to aws_session_token. If you want to set it via environment variable, you still have to use $AWS_SECURITY_TOKEN because that’s what boto uses.
Added protocol support for rapidjson; see RapidJSONProtocol and RapidJSONValueProtocol. If available, rapidjson will be used as the default JSON implementation if ujson is not installed.
The master bootstrap script on EMR and Dataproc now has the correct file extension (.sh, not .py).
0.5.9¶
Fixed a bug that prevented setup scripts from working on EMR AMIs 5.2.0 and later. Our workaround should be completely transparent unless you use a custom shell binary; see sh_bin for details.
The EMR runner now correctly re-starts the SSH tunnel to the job tracker/resource manager when a cluster it tries to run a job on auto-terminates. It also no longer requires a working SSH tunnel to fetch job progress (you still need working SSH; see ec2_key_pair_file).
The emr_applications option has been renamed to applications.
The terminate-idle-clusters utility is now slightly more robust in cases where your S3 temp directory is in a different region from your clusters.
Finally, there are a couple of changes that probably only matter if you’re trying to wrap your Hadoop tasks (mappers, reducers, etc.) in Docker:
- You can set just the Python binary for tasks with task_python_bin. This allows you to use a wrapper script in place of Python without perturbing setup scripts.
- Local mode now no longer relies on an absolute path to access the mrjob.cat utility it uses to handle compressed input files; copying the job’s working directory into Docker is enough.
0.5.8¶
You can now pass directories to jobs, either directly with the upload_dirs option, or through setup commands. For example:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'
mrjob will automatically tarball these directories and pass them to Hadoop as archives.
For multi-step jobs, you can now specify where inter-step output goes with step_output_dir (--step-output-dir), which can be useful for debugging.
All job step types now take the jobconf keyword argument to set Hadoop properties for that step.
Jobs’ --help printout is now better-organized and less verbose.
Made several fixes to pre-filters (commands that pipe into streaming steps):
- you can once again add pre-filters to a single step job by re-defining mapper_pre_filter(), combiner_pre_filter(), and/or reducer_pre_filter()
- local mode now ignores non-zero return codes from pre-filters (this matters for BSD grep)
- local mode can now run pre-filters on compressed input files
mrjob now respects sh_bin when it needs to wrap a command in sh before passing it to Hadoop (e.g. to support pipes).
On EMR, mrjob now fetches logs from task nodes when determining probable cause of error, not just core nodes (the ones that run tasks and host HDFS).
Several unused functions in mrjob.util are now deprecated:
- args_for_opt_dest_subset()
- bash_wrap()
- populate_option_groups_with_options()
- scrape_options_and_index_by_dest()
- tar_and_gzip()
bunzip2_stream() and gunzip_stream() have been moved from mrjob.util to mrjob.cat.
SSHFilesystem.ssh_slave_hosts() has been deprecated.
Option group attributes in MRJobs have been deprecated, as has the get_all_option_groups() method.
0.5.7¶
Cluster pooling¶
mrjob can now add up to 1,000 steps on pooled clusters on EMR (except on very old AMIs).
mrjob now prints debug messages explaining why your job matched a particular pooled cluster when running in verbose mode (the -v option).
Fixed a bug that caused pooling to fail when there was no need for a master bootstrap script (e.g. when running with --no-bootstrap-mrjob).
Other improvements¶
Log interpretation is much more efficient at determining a job’s probable cause of failure (this works with Spark as well).
When running custom JARs (see JarStep), mrjob now respects libjars and jobconf.
The hadoop_streaming_jar option now supports environment variables and ~.
The terminate-idle-clusters tool now works with all step types, including Spark. (It’s still recommended that you rely on the max_hours_idle option rather than this tool.)
mrjob now works in Anaconda3 Jupyter Notebook.
Bugfixes¶
Added several missing command-line switches, including --no-bootstrap-python on Dataproc. Made a major refactor that should prevent these kinds of issues in the future.
Fixed a bug that caused mrjob to crash when the ssh binary (see ssh_bin) was missing or not executable.
Fixed a bug that erroneously reported failed or just-started jobs as 100% complete.
Fixed a bug where timestamps were erroneously recognized as URIs.
mrjob now only recognizes strings containing :// as URIs (see is_uri()).
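The rule is simple enough to state in code. looks_like_uri() below is a hypothetical stand-in for illustration, not mrjob's is_uri(); it only encodes the "contains ://" test described above.

```python
def looks_like_uri(s):
    """Return True if s contains '://' (hypothetical stand-in for is_uri())."""
    return '://' in s


uri_result = looks_like_uri('s3://walrus/data/')
timestamp_result = looks_like_uri('2016-10-11T12:00:00')
path_result = looks_like_uri('/home/user/data.txt')
print(uri_result, timestamp_result, path_result)
```

A timestamp contains colons but never ://, so under this rule it is treated as an ordinary string.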
Deprecation¶
The following are deprecated and will be removed in v0.6.0:
- JarStep.INPUT; use mrjob.step.INPUT instead
- JarStep.OUTPUT; use mrjob.step.OUTPUT instead
- non-strict protocols (see strict_protocols)
- the python_archives option (try this instead)
- is_windows_path()
- parse_key_value_list()
- parse_port_range_list()
- scrape_options_into_new_groups()
0.5.6¶
Fixed a critical bug that caused Dataproc runner to always crash when determining Hadoop version.
Log interpretation now prioritizes task errors (e.g. a traceback from your Python script) as probable cause of failure, even if they aren’t the most recent error. Log interpretation will now continue to download and parse task logs until it finds a non-empty stderr log.
Log interpretation also strips the “subprocess failed” Java stack trace that appears in task stderr logs from Hadoop 1.
0.5.5¶
Functionally equivalent to 0.5.4, except that it restores the deprecated ami_version option as an alias for image_version, making it easier to upgrade from earlier versions of mrjob.
Also slightly improves Cluster Pooling on EMR with updated information on memory and CPU power of various EC2 instance types, and by treating application names (e.g. “Spark”) as case-insensitive.
0.5.4¶
Pooling and idle cluster self-termination¶
Warning
This release accidentally removed the ami_version option instead of merely deprecating it. If you are upgrading from an earlier version of mrjob, use version 0.5.5 or later.
This release resolves a long-standing EMR API race condition that made it difficult to use Cluster Pooling and idle cluster self-termination (see max_hours_idle) together. Now if your pooled job unknowingly runs on a cluster that was in the process of shutting down, it will detect that and re-launch the job on a different cluster.
This means pretty much everyone running jobs on EMR should now enable pooling, with a configuration like this:
runners:
  emr:
    max_hours_idle: 1
    pool_clusters: true
You may also run the terminate-idle-clusters script periodically, but (barring any bugs) this shouldn’t be necessary.
Generic EMR option names¶
Many options to the EMR runner have been made more generic, to make it easier to share code with the Dataproc runner (in most cases, the new names are also shorter and easier to remember):
old option name | new option name |
---|---|
ami_version | image_version |
aws_availability_zone | zone |
aws_region | region |
check_emr_status_every | check_cluster_every |
ec2_core_instance_bid_price | core_instance_bid_price |
ec2_core_instance_type | core_instance_type |
ec2_instance_type | instance_type |
ec2_master_instance_bid_price | master_instance_bid_price |
ec2_master_instance_type | master_instance_type |
ec2_slave_instance_type | core_instance_type |
ec2_task_instance_bid_price | task_instance_bid_price |
ec2_task_instance_type | task_instance_type |
emr_tags | tags |
num_ec2_core_instances | num_core_instances |
num_ec2_task_instances | num_task_instances |
s3_log_uri | cloud_log_dir |
s3_sync_wait_time | cloud_fs_sync_secs |
s3_tmp_dir | cloud_tmp_dir |
s3_upload_part_size | cloud_upload_part_size |
The old option names and command-line switches are now deprecated but will continue to work until v0.6.0. (Exception: ami_version was accidentally removed; if you need it, use 0.5.5 or later.)
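If you keep runner options in dictionaries or generated config, the renames are mechanical. The sketch below copies a few rows from the table into a mapping; upgrade_opts() and the sample option values are hypothetical, written only to illustrate the idea.

```python
# old name -> new name, a few rows copied from the table above
RENAMED_OPTS = {
    'ami_version': 'image_version',
    'aws_region': 'region',
    'ec2_instance_type': 'instance_type',
    'ec2_slave_instance_type': 'core_instance_type',
    'emr_tags': 'tags',
    's3_tmp_dir': 'cloud_tmp_dir',
}


def upgrade_opts(opts):
    """Return a copy of opts with deprecated names replaced.

    Hypothetical helper; names not in the mapping are kept as-is.
    """
    return {RENAMED_OPTS.get(name, name): value
            for name, value in opts.items()}


old_opts = {
    'ami_version': '4.8.2',
    'ec2_instance_type': 'm1.medium',
    'pool_clusters': True,
}
new_opts = upgrade_opts(old_opts)
print(new_opts)
```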
num_ec2_instances has simply been deprecated (it’s just num_core_instances plus one).
hadoop_streaming_jar_on_emr has also been deprecated; in its place, you can now pass a file:// URI to hadoop_streaming_jar to reference a path on the master node.
Log interpretation¶
Log interpretation (counters and probable cause of job failure) on Hadoop is more robust, handling a wider variety of log4j formats and recovering more gracefully from permissions errors. This includes fixing a crash that could happen on Python 3 when attempting to read data from HDFS.
Log interpretation used to be partially broken on EMR AMI 4.3.0 and later due to a permissions issue; this is now fixed.
pass_through_option()¶
You can now pass through existing command-line switches to your job; for example, you can tell a job which runner launched it. See pass_through_option() for details.
If you don’t do this, self.options.runner will now always be None in your job (it used to confusingly default to 'inline').
Stop logging credentials¶
When mrjob is run in verbose mode (the -v option), the values of all runner options are debug-logged to stderr. This has been the case since the very early days of mrjob.
Unfortunately, this means that if you set your AWS credentials in mrjob.conf, they get logged as well, creating a surprising potential security vulnerability. (This doesn’t happen for AWS credentials set through environment variables.)
Starting in this version, the values of aws_secret_access_key and aws_security_token are shown as '...' if they are set, and all but the last four characters of aws_access_key_id are blanked out as well (e.g. '...YNDR').
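The masking rules above can be sketched as follows. mask_credentials() is a hypothetical helper and the credential values are made up; this is an illustration of the rules, not mrjob's logging code.

```python
def mask_credentials(opts):
    """Return a copy of opts that is safe to log (hypothetical helper)."""
    masked = dict(opts)
    # secrets are hidden entirely, but only if they are actually set
    for name in ('aws_secret_access_key', 'aws_security_token'):
        if masked.get(name):
            masked[name] = '...'
    # the key ID keeps its last four characters so it can be recognized
    if masked.get('aws_access_key_id'):
        masked['aws_access_key_id'] = '...' + masked['aws_access_key_id'][-4:]
    return masked


opts = {
    'aws_access_key_id': 'AKIAEXAMPLEKEYIDYNDR',  # made-up value
    'aws_secret_access_key': 'not-a-real-secret',  # made-up value
    'aws_security_token': None,
    'region': 'us-west-2',
}
masked = mask_credentials(opts)
print(masked)
```

Unset credentials stay as None, so the log still shows whether a value was configured at all.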
Other improvements and bugfixes¶
The ssh tunnel to the resource manager on EMR (see ssh_tunnel) now connects to its correct internal IP; this resolves a firewall issue that existed on some VPC setups.
Uploaded files will no longer be given names starting with _ or ., since Hadoop’s input processing treats these files as “hidden”.
The EMR idle cluster self-termination script (see max_hours_idle) now only runs on the master node.
The audit-emr-usage command-line tool should no longer constantly trigger throttling warnings.
bootstrap_python no longer bothers trying to install Python 3 on EMR AMI 4.6.0 and later, since it is already installed.
The --ssh-bind-ports command-line switch was broken (starting in 0.4.5!), and is now fixed.
0.5.3¶
This release adds support for custom libjars (such as nicknack), allowing easy access to custom input and output formats. This works on Hadoop and EMR (including on a cluster that’s already running).
In addition, jobs can specify needed libjars by setting the LIBJARS attribute or overriding the libjars() method. For examples, see Input and output formats.
The Hadoop runner now tries even harder to find your log files without needing additional configuration (see hadoop_log_dirs).
The EMR runner now supports Amazon VPC subnets (see subnet), and, on 4.x AMIs, Application Configurations (see emr_configurations).
If your EMR cluster fails during bootstrapping, mrjob can now determine the probable cause of failure.
There are also some minor improvements to SSH tunneling and a handful of small bugfixes; see CHANGES.txt for details.
0.5.2¶
This release adds basic support for Google Cloud Dataproc which is Google’s Hadoop service, roughly analogous to EMR. See Dataproc Quickstart. Some features are not yet implemented:
- fetching counters
- finding probable cause of errors
- running Java JARs as steps
Added the emr_applications option, which helps you configure 4.x AMIs.
Fixed an EMR bug (introduced in v0.5.0) where we were waiting for steps to complete in the wrong order (in a multi-step job, we wouldn’t register that the first step had finished until the last one had).
Fixed a bug in SSH tunneling (introduced in v0.5.0) that made connections to the job tracker/resource manager on EMR time out when running on a 2.x AMI inside a VPC (Virtual Private Cloud).
Fixed a bug (introduced in v0.4.6) that kept mrjob from interpreting ~ (home directory) in includes in mrjob.conf.
It is now again possible to run tool modules deprecated in v0.5.0 directly (e.g. python -m mrjob.tools.emr.create_job_flow). This is still a deprecated feature; it’s recommended that you use the appropriate mrjob subcommand instead (e.g. mrjob create-cluster).
0.5.1¶
Fixes a bug in the previous release that broke SORT_VALUES and any other attempt by the job to set the partitioner. The --partitioner switch is now deprecated (the choice of partitioner is part of your job semantics).
Fixes a bug in the previous release that caused strict_protocols and check_input_paths to be ignored in mrjob.conf. (We would much prefer you fixed jobs that are using “loose protocols” rather than setting strict_protocols: false in your config file, but we didn’t break this on purpose, we promise!)
mrjob terminate-idle-clusters now correctly handles EMR debugging steps (see enable_emr_debugging) set up by boto 2.40.0.
Fixed a bug that could result in showing a blank probable cause of error for pre-YARN (Hadoop 1) jobs.
ssh_bind_ports now defaults to a range object (xrange on Python 2), so that when you run on EMR in verbose mode (-r emr -v), debug logging devotes one line to the value of ssh_bind_ports rather than 840.
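The effect on log size is easy to check with the standard library. The port bounds below are an assumption, chosen only because they give the 840 ports mentioned above, and pprint stands in for whatever formatting the debug logging actually uses.

```python
import pprint

# assumed bounds: 840 consecutive ports
ports_as_list = list(range(40001, 40841))
ports_as_range = range(40001, 40841)

# a pretty-printed list of this size gets one line per port...
list_lines = pprint.pformat(ports_as_list).splitlines()
# ...while a range object prints as a single short expression
range_lines = pprint.pformat(ports_as_range).splitlines()

print(len(list_lines), len(range_lines))
print(range_lines[0])
```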
0.5.0¶
Python versions¶
mrjob now fully supports Python 3.3+ in a way that should be transparent to existing Python 2 users (you don’t have to suddenly start handling unicode instead of str). For more information, see Python 2 vs. Python 3.
If you run a job with Python 3, mrjob will automatically install Python 3 on Elastic MapReduce AMIs (see bootstrap_python).
When you run jobs on EMR in Python 2, mrjob attempts to match your minor version of Python as well (either python2.6 or python2.7); see python_bin for details.
Note
If you’re currently running Python 2.7, and using yum to install Python libraries, you’ll want to use the Python 2.7 version of the package (e.g. python27-numpy rather than python-numpy).
The mrjob command is now installed with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you install mrjob for multiple versions of Python.
Hadoop¶
mrjob should now work out of the box on almost any Hadoop setup. If hadoop is in your path, or you set any commonly-used $HADOOP_* environment variable, mrjob will find the Hadoop binary, the streaming jar, and your logs, without any help on your part (see hadoop_bin, hadoop_log_dirs, hadoop_streaming_jar).
mrjob has been updated to fully support Hadoop 2 (YARN), including many updates to HadoopFilesystem. Hadoop 1 is still supported, though anything prior to Hadoop 0.20.203 is not (mrjob is actually a few months older than Hadoop 0.20.203, so this used to matter).
3.x and 4.x AMIs¶
mrjob now fully supports the 3.x and 4.x Elastic MapReduce AMIs, including SSH tunneling to the resource manager, fetching counters and finding probable cause of job failure.
The default ami_version (see image_version) is now 3.11.0. Our plan is to continue updating this to the latest (non-broken) 3.x AMI for each 0.5.x release of mrjob.
The default instance_type is now m1.medium (m1.small is too small for the 3.x and 4.x AMIs).
You can specify 4.x AMIs with either the new release_label option, or continue using ami_version; both work.
mrjob continues to support 2.x AMIs. However:
Warning
2.x AMIs are deprecated by AWS, and based on a very old version of Debian (squeeze), which breaks apt-get and exposes you to security holes.
Please, please switch if you haven’t already.
AWS Regions¶
The new default aws_region (see region) is us-west-2 (Oregon). This both matches the default in the EMR console and, according to Amazon, is carbon neutral.
An edge case that might affect you: EC2 key pairs (i.e. SSH credentials) are region-specific, so if you’ve set up SSH but not explicitly specified a region, you may get an error saying your key pair is invalid. The fix is simply to create new SSH keys for the us-west-2 (Oregon) region.
S3¶
mrjob is much smarter about the way it interacts with S3:
- automatically creates temp bucket in the same region as jobs
- connects to S3 buckets on the endpoint matching their region (no more 307 errors)
- EMRJobRunner and S3Filesystem methods no longer take s3_conn args (passing around a single S3 connection no longer makes sense)
- no longer uses the temp bucket’s location to choose where you run your job
- rm() no longer has special logic for *_$folder$ keys
- ls() recurses “subdirectories” even if you pass it a URI without a trailing slash
Log interpretation¶
The part of mrjob that fetches counters and tells you what probably caused your job to fail was basically unmaintainable and has been totally rewritten. Not only do we now have solid support across Hadoop and EMR AMI versions, but if we missed anything, it should be straightforward to add it.
One casualty of this change was the mrjob fetch-logs command, which means mrjob no longer offers a way to fetch or interpret logs from a past job. We do plan to re-introduce this functionality.
Protocols¶
Protocols are now strict by default (they simply raise an exception on unencodable data). “Loose” protocols can be re-enabled with the --no-strict-protocols switch; see strict_protocols for why this is a bad idea.
Protocols will now use the much faster ujson library, if installed, to encode and decode JSON. This is especially recommended for simple jobs that spend a significant fraction of their time encoding and decoding data.
Note
If you’re using EMR, try out this bootstrap recipe to install ujson.
mrjob will fall back to the simplejson library if ujson is not installed, and use the built-in json module if neither is installed.
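The fallback order described above corresponds to a familiar import pattern. The sketch below shows the general idea; the name json_impl is an assumption for illustration, and mrjob's own import logic may be organized differently.

```python
# Pick the fastest JSON implementation that is installed.
try:
    import ujson as json_impl  # fastest, if installed
except ImportError:
    try:
        import simplejson as json_impl
    except ImportError:
        import json as json_impl  # built in, always available

# All three expose dumps() and loads() for simple data like this.
encoded = json_impl.dumps({'word': 'mrjob', 'count': 3})
decoded = json_impl.loads(encoded)
print(json_impl.__name__, decoded)
```

The exact encoded string can differ between implementations (for example in whitespace), so it is safer to compare decoded values than raw output.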
You can now explicitly specify which JSON implementation you wish to use (e.g. StandardJSONProtocol, SimpleJSONProtocol, UltraJSONProtocol).
Status messages¶
We’ve tried to cut the logging messages that your job prints as it runs down to the basics (either useful info, like where a temp directory is, or something that tells you why you’re waiting). If there are any messages you miss, try running your job with -v.
When a step in your job fails, mrjob no longer prints a useless stacktrace telling you where in the code the runner raised an exception about your step failing. This is thanks to StepFailedException, which you can also catch and interpret if you’re running jobs programmatically.
Deprecation¶
Many things that were deprecated in 0.4.6 have been removed:
- options:
  - IF_SUCCESSFUL cleanup option (use ALL)
  - iam_job_flow_role (use iam_instance_profile)
- functions and methods:
  - positional arguments to mrjob.job.MRJob.mr() (don’t even use mr(); use mrjob.step.MRStep)
  - mrjob.job.MRJob.jar() (use mrjob.step.JarStep)
  - step_args and name arguments to mrjob.step.JarStep (use args instead of step_args, and don’t use name at all)
  - mrjob.step.MRJobStep (use mrjob.step.MRStep)
  - mrjob.compat.get_jobconf_value() (use jobconf_from_env())
  - mrjob.job.MRJob.parse_counters()
  - mrjob.job.MRJob.parse_output()
  - mrjob.conf.combine_cmd_lists()
  - mrjob.fs.s3.S3Filesystem.get_s3_folder_keys()
mrjob.compat functions supports_combiners_in_hadoop_streaming(), supports_new_distributed_cache_options(), and uses_generic_jobconf(), which only existed to support very old versions of Hadoop, were removed without deprecation warnings (sorry!).
To avoid a similar wave of deprecation warnings in the future, the name of every part of mrjob that isn’t meant to be a stable interface provided by the library now starts with an underscore. You can still use these things (or copy them; it’s Open Source), but there’s no guarantee they’ll exist in the next release.
If you want to get ahead of the game, here is a list of things that are deprecated starting in mrjob 0.5.0 (do these after upgrading mrjob):
- options:
  - base_tmp_dir is now local_tmp_dir
  - cleanup options LOCAL_SCRATCH and REMOTE_SCRATCH are now LOCAL_TMP and REMOTE_TMP
  - emr_job_flow_id is now cluster_id
  - emr_job_flow_pool_name is now pool_name
  - hdfs_scratch_dir is now hadoop_tmp_dir
  - pool_emr_job_flows is now pool_clusters
  - s3_scratch_uri is now cloud_tmp_dir
  - ssh_tunnel_to_job_tracker is now simply ssh_tunnel
- functions and methods:
  - mrjob.job.MRJob.is_mapper_or_reducer() is now is_task()
  - Filesystem method path_exists() is now simply exists()
  - Filesystem method path_join() is now simply join()
  - Use runner.fs explicitly when accessing filesystem methods (e.g. runner.fs.ls(), not runner.ls())
- mrjob subcommands:
  - mrjob create-job-flow is now mrjob create-cluster
  - mrjob terminate-idle-job-flows is now mrjob terminate-idle-clusters
  - mrjob terminate-job-flow is now mrjob terminate-cluster
Other changes¶
- mrjob now requires boto 2.35.0 or newer (chances are you’re already doing this). Later 0.5.x releases of mrjob may require newer versions of boto.
- visible_to_all_users now defaults to True
- HadoopFilesystem.rm() uses -skipTrash
- new iam_endpoint option
- custom hadoop_streaming_jars are properly uploaded
- JOB cleanup on EMR is temporarily disabled
- mrjob now follows symlinks when ls()ing the local filesystem (beware recursive symlinks!)
- The interpreter option disables bootstrap_mrjob by default (interpreter is meant for non-Python jobs)
- Cluster Pooling now respects ec2_key_pair
- cluster self-termination (see max_hours_idle) now respects non-streaming jobs
- LocalFilesystem now rejects URIs rather than interpreting them as local paths
- local and inline runners no longer have a default hadoop_version, instead handling jobconf in a version-agnostic way
- steps_python_bin now defaults to the current Python interpreter.
- minor changes to mrjob.util:
  - file_ext() takes filename, not path
  - gunzip_stream() now yields chunks of bytes, not lines
  - moved random_identifier() method here from mrjob.aws
  - buffer_iterator_to_line_iterator() is now named to_lines(), and no longer appends a trailing newline to data.
0.4.6¶
include: in conf files can now use relative paths in a meaningful way. See Relative includes.
List and environment variable options loaded from included config files can be totally overridden using the !clear tag. See Clearing configs.
Options that take lists (e.g. setup) now treat scalar values as single-item lists. See this example.
Fixed a bug that kept the pool_wait_minutes option from being loaded from config files.
0.4.5¶
This release moves mrjob off the deprecated DescribeJobFlows EMR API call.
Warning
AWS again broke older versions of mrjob for at least some new accounts, by returning 400s for the deprecated DescribeJobFlows API call. If you have a newer AWS account (circa July 2015), you must use at least this version of mrjob.
The new API does not provide a way to tell when a job flow (now called a “cluster”) stopped provisioning instances and started bootstrapping, so the clock for our estimates of when we are close to the end of a billing hour now starts at cluster creation time, making those estimates more conservative.
Related to this change, terminate_idle_job_flows no longer considers job flows in the STARTING state idle; use report_long_jobs to catch jobs stuck in this state.
terminate_idle_job_flows performs much better on large numbers of job flows. Formerly, it collected all job flow information first, but now it terminates idle job flows as soon as it identifies them.
collect_emr_stats and job_flow_pool have not been ported to the new API and will be removed in v0.5.0.
Added an aws_security_token option to allow you to run mrjob on EMR using temporary AWS credentials.
Added an emr_tags (see tags) option to allow you to tag EMR job flows at creation time.
EMRJobRunner now has a get_ami_version() method.
The hadoop_version option no longer has any effect in EMR. This option only ever did anything on the 1.x AMIs, which mrjob no longer supports.
Added many missing switches to the EMR tools (accessible from the mrjob command). Formerly, you had to use a config file to get at these options.
You can now access the mrboss tool from the command line: mrjob boss <args>.
Previous 0.4.x releases have worked with boto as old as 2.2.0, but this one requires at least boto 2.6.0 (which is still more than two years old). In any case, it’s recommended that you just use the latest version of boto.
This branch has a number of additional deprecation warnings, to help prepare you for mrjob v0.5.0. Please heed them; a lot of deprecated things really are going to be completely removed.
0.4.4¶
mrjob now automatically creates and uses IAM objects as necessary to comply with new requirements from Amazon Web Services.
(You do not need to install the AWS CLI or run aws emr create-default-roles as the link above describes; mrjob takes care of this for you.)
Warning
The change that AWS made essentially broke all older versions of mrjob for all new accounts. If the first time your AWS account created an Elastic MapReduce cluster was on or after April 6, 2015, you should use at least this version of mrjob.
If you must use an old version of mrjob with a new AWS account, see this thread for a possible workaround.
--iam-job-flow-role has been renamed to --iam-instance-profile.
New --iam-service-role option.
0.4.3¶
This release also contains many, many bugfixes, one of which probably affects you! See CHANGES.txt for details.
Added a new subcommand, mrjob collect-emr-active-stats, to collect stats about active job flows and instance counts.
The new --iam-job-flow-role option allows you to set a specific IAM role to run the job flow under.
You can now use --check-input-paths and --no-check-input-paths on EMR as well as Hadoop.
Files larger than 100MB will be uploaded to S3 using multipart upload if you have the filechunkio module installed. You can change the limit/part size with the --s3-upload-part-size option, or disable multipart upload by setting this option to 0.
You can now require protocols to be strict from mrjob.conf; this means unencodable input/output will result in an exception rather than the job quietly incrementing a counter. It is recommended you set this for all runners:
runners:
  emr:
    strict_protocols: true
  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true
You can use --no-strict-protocols to turn off strict protocols for a particular job.
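The difference between strict and loose behavior can be illustrated with a small encoder. encode_value() and the counter name are hypothetical, written for this sketch only; they are not mrjob's protocol classes or its actual counter names.

```python
import json
from collections import Counter

counters = Counter()


def encode_value(value, strict=True):
    """Encode value as JSON bytes (hypothetical helper for illustration).

    In strict mode, unencodable data raises an exception. Otherwise the
    error is swallowed and a counter is incremented instead.
    """
    try:
        return json.dumps(value).encode('utf_8')
    except TypeError:
        if strict:
            raise
        counters['Unencodable output'] += 1  # made-up counter name
        return None


# sets are not JSON-encodable
loose_result = encode_value({1, 2, 3}, strict=False)
print(loose_result, dict(counters))

try:
    encode_value({1, 2, 3}, strict=True)
    strict_raised = False
except TypeError:
    strict_raised = True
print(strict_raised)
```

In the loose case the bad record simply disappears from the output, which is why failing loudly is the safer choice.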
Tests now support pytest and tox.
Support for Python 2.5 has been dropped.
0.4.2¶
JarSteps, previously experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as Mahout in your MRJobs. For more information, see Jar steps.
Many options for setting up your task’s environment (--python-archive, --setup-cmd and --setup-script) have been replaced by a powerful --setup option. See the Job Environment Setup Cookbook for examples.
Similarly, many options for bootstrapping nodes on EMR (--bootstrap-cmd, --bootstrap-file, --bootstrap-python-package and --bootstrap-script) have been replaced by a single --bootstrap option. See the EMR Bootstrapping Cookbook.
This release also contains many bugfixes, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.
0.4.1¶
The SORT_VALUES option enables secondary sort, ensuring that your reducer(s) receive values in sorted order. This allows you to do things with reducers that would otherwise involve storing all the values in memory, such as:
- Receiving a grand total before any subtotals, so you can calculate percentages on the fly. See mr_next_word_stats.py for an example.
- Running a window of fixed length over an arbitrary amount of sorted values (e.g. a 24-hour window over timestamped log data).
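The first use case above can be sketched in plain Python. This is a simulation under stated assumptions rather than an mrjob job: `sorted()` stands in for the sort that Hadoop performs, and the `'A: total'` / `'B: subtotal'` tags are a hypothetical encoding chosen only so that the total sorts ahead of every subtotal.

```python
def reducer(key, values):
    """Turn counts into percentages in one pass over sorted values.

    Relies on the grand total arriving before any subtotal, so no
    values need to be buffered in memory.
    """
    values = iter(values)
    tag, _, total = next(values)
    if tag != 'A: total':
        raise ValueError('expected the grand total to sort first')
    for tag, next_word, count in values:
        yield key, (next_word, 100.0 * count / total)


# Values for one key, in the arbitrary order a mapper might emit them.
emitted = [
    ('B: subtotal', 'world', 3),
    ('A: total', None, 4),
    ('B: subtotal', 'there', 1),
]

# With secondary sort, the reducer sees them in sorted order.
results = list(reducer('hello', sorted(emitted)))
```

Without sorted values, the reducer would have to hold every subtotal until it found the total; with them, it can emit each percentage as soon as the subtotal arrives.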
The max_hours_idle option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2’s full-hour billing model.
For development (not production), we now recommend always using job flow pooling, with max_hours_idle enabled. Update your mrjob.conf like this:
runners:
  emr:
    max_hours_idle: 0.25
    pool_emr_job_flows: true
Warning
If you enable pooling without max_hours_idle (or cronning terminate_idle_job_flows), pooled job flows will stay active forever, costing you money!
You can now use --no-check-input-paths with the Hadoop runner to allow jobs to run even if hadoop fs -ls can’t see their input files (see check_input_paths).
Two bits of straggling deprecated functionality were removed:
- Built-in protocols must be instantiated to be used (formerly they had class methods).
- Old locations for mrjob.conf are no longer supported.
This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see CHANGES.txt).
0.4.0¶
The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.
The EMR tools can now be accessed more easily via the mrjob command. Learn more here.
Job steps are much richer now:
- You can now use mrjob to run jar steps other than Hadoop Streaming. More info
- You can filter step input with UNIX commands. More info
- In fact, you can use arbitrary UNIX commands as your whole step (mapper/reducer/combiner). More info
If you Ctrl+C from the command line, your job will be terminated if you give it time. If you’re running on EMR, that should prevent most accidental runaway jobs. More info
mrjob v0.4 requires boto 2.2.
We removed all deprecated functionality from v0.2:
- --hadoop-*-format
- --*-protocol switches
- MRJob.DEFAULT_*_PROTOCOL
- MRJob.get_default_opts()
- MRJob.protocols()
- PROTOCOL_DICT
- IF_SUCCESSFUL
- DEFAULT_CLEANUP
- S3Filesystem.get_s3_folder_keys()
We love contributions, so we wrote some guidelines to help you help us. See you on Github!
0.3.5¶
The pool_wait_minutes (--pool-wait-minutes) option lets your job delay itself in case a job flow becomes available. Reference: Configuration quick reference
The JOB and JOB_FLOW cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See CLEANUP_CHOICES for more information.
0.3.3¶
You can now include one config file from another.
0.3.2¶
The EMR instance type/number options have changed to support spot instances:
- core_instance_bid_price
- core_instance_type
- master_instance_bid_price
- master_instance_type
- slave_instance_type (alias for core_instance_type)
- task_instance_bid_price
- task_instance_type
There is also a new ami_version option to change the AMI your job flow uses for its nodes.
For more information, see mrjob.emr.EMRJobRunner.__init__().
The new report_long_jobs tool alerts on jobs that have run for more than X hours.
0.3¶
Features¶
Support for Combiners
You can now use combiners in your job. Like mapper() and reducer(), you can redefine combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (MRWordFreqCount does this to improve performance.) combiner_init() and combiner_final() are similar to their mapper and reducer equivalents.
You can also add combiners to custom steps by adding keyword arguments to your call to steps().
More info: One-step jobs, Multi-step jobs
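To show what a combiner buys you, here is a small simulation in plain Python. The names mapper(), combiner(), and reducer() mirror the methods described above, but the surrounding driver (`group` and `run_task`) is hypothetical and is not how mrjob or Hadoop actually schedule work.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    for word in line.split():
        yield word.lower(), 1


def combiner(word, counts):
    # Runs on one map task's output, before anything is shuffled.
    yield word, sum(counts)


def reducer(word, counts):
    yield word, sum(counts)


def group(pairs):
    """Group (key, value) pairs by key, like the sort/shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def run_task(lines):
    """One simulated map task: run the mapper, then combine its output."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return [out for word, counts in group(mapped)
            for out in combiner(word, counts)]


# Two simulated map tasks, each with its own input split.
splits = [['a b a', 'a c'], ['b b a']]
combined = [pair for split in splits for pair in run_task(split)]
result = dict(out for word, counts in group(combined)
              for out in reducer(word, counts))
```

The mappers emit eight (word, 1) pairs in total, but only five pairs leave the map tasks after combining. The reducer's output is unchanged, which is the property a combiner has to preserve.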
*_init(), *_final() for mappers, reducers, combiners
Mappers, reducers, and combiners have *_init() and *_final() methods that are run before and after the input is run through the main function (e.g. mapper_init() and mapper_final()).
More info: One-step jobs, Multi-step jobs
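The call order can be illustrated with a plain-Python stand-in. `WordCounter` is not an MRJob subclass and `run_mapper` is a hypothetical driver; only the method names mapper_init(), mapper(), and mapper_final() are taken from the text.

```python
class WordCounter:
    """Plain-Python stand-in for a job class; not an MRJob subclass."""

    def mapper_init(self):
        # Runs once, before any input line is seen.
        self.counts = {}

    def mapper(self, _, line):
        # Accumulates in memory and emits nothing per line
        # (no yield, so this method returns None).
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1

    def mapper_final(self):
        # Runs once, after the last input line.
        for word, count in sorted(self.counts.items()):
            yield word, count


def run_mapper(job, lines):
    """Hypothetical driver showing the order in which the hooks run."""
    job.mapper_init()
    for line in lines:
        # Tolerate a mapper that returns None instead of yielding.
        yield from job.mapper(None, line) or ()
    yield from job.mapper_final()


output = list(run_mapper(WordCounter(), ['a b', 'b c b']))
```

Holding state between mapper_init() and mapper_final() trades memory for fewer emitted records, so it suits cases where the number of distinct keys per task is small.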
Custom Option Parsers
It is now possible to define your own option types and actions using a custom OptionParser subclass.
Job Flow Pooling
EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.
More info: Cluster Pooling
SSH Log Fetching
mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.
More info: Configuring SSH credentials
New EMR Tool: fetch_logs
If you want to fetch the counters or error logs for a job after the fact, you can use the new fetch_logs tool.
More info: mrjob.tools.emr.fetch_logs
New EMR Tool: mrboss
If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new mrboss tool.
More info: mrjob.tools.emr.mrboss
Changes and Deprecations¶
Configuration
The search path order for mrjob.conf has changed. The new order is:
- The location specified by MRJOB_CONF
- ~/.mrjob.conf
- ~/.mrjob (deprecated)
- mrjob.conf in any directory in PYTHONPATH (deprecated)
- /etc/mrjob.conf
If your mrjob.conf path is deprecated, use this table to fix it:
| Old Location | New Location |
| ~/.mrjob | ~/.mrjob.conf |
| somewhere in PYTHONPATH | Specify in MRJOB_CONF |
More info: mrjob.conf
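The search order for mrjob.conf described in this section can be expressed as a short plain-Python sketch. It is an illustration, not mrjob's own lookup code: the function names are hypothetical, and the assumption that an unreadable or missing candidate simply falls through to the next one is mine.

```python
import os


def candidate_conf_paths(environ, home):
    """Yield (path, is_deprecated) pairs in the documented search order."""
    if environ.get('MRJOB_CONF'):
        yield environ['MRJOB_CONF'], False
    yield os.path.join(home, '.mrjob.conf'), False
    yield os.path.join(home, '.mrjob'), True
    for directory in environ.get('PYTHONPATH', '').split(os.pathsep):
        if directory:
            yield os.path.join(directory, 'mrjob.conf'), True
    yield '/etc/mrjob.conf', False


def find_conf(environ, home, exists=os.path.exists):
    """Return the first (path, is_deprecated) that exists, or None."""
    for path, deprecated in candidate_conf_paths(environ, home):
        if exists(path):
            return path, deprecated
    return None


# Example environment; the paths here are made up for illustration.
env = {'MRJOB_CONF': '/tmp/my.conf',
       'PYTHONPATH': os.pathsep.join(['/a', '/b'])}
paths = [path for path, _ in candidate_conf_paths(env, home='/home/me')]
```

Returning the deprecation flag alongside the path lets a caller warn the user to move the file, for example to ~/.mrjob.conf or a location named by MRJOB_CONF.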
Defining Jobs (MRJob)
Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.
The --hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with HADOOP_INPUT_FORMAT/HADOOP_OUTPUT_FORMAT or hadoop_input_format()/hadoop_output_format(). Hadoop formats can no longer be set from mrjob.conf.
In addition to --jobconf, you can now set jobconf values with the JOBCONF attribute or the jobconf() method. To read jobconf values back, use mrjob.compat.jobconf_from_env(), which ensures that the correct name is used depending on which version of Hadoop is active.
You can now set the Hadoop partitioner class with --partitioner, the PARTITIONER attribute, or the partitioner() method.
More info: Hadoop configuration
Protocols
Protocols can now be anything with a read() and write() method. Unlike previous versions of mrjob, they can be instance methods rather than class methods. You should use instance methods when defining your own protocols.
The --*protocol switches and DEFAULT_*PROTOCOL are deprecated. Instead, use the *_PROTOCOL attributes or redefine the *_protocol() methods.
Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.
More info: Protocols
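A duck-typed protocol of the kind described here might look like the following minimal sketch. The class name is hypothetical, and the signatures are assumptions that the text does not spell out: read() takes one line and returns a (key, value) pair, and write() takes a key and a value and returns one line. The sketch works on text strings for simplicity.

```python
import json


class TabSeparatedJSONProtocol:
    """Duck-typed protocol: one "key<TAB>value" line per record."""

    def read(self, line):
        # Split on the first tab only, so values may contain tabs.
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        return '%s\t%s' % (json.dumps(key), json.dumps(value))


protocol = TabSeparatedJSONProtocol()
line = protocol.write('word', {'count': 3})
key, value = protocol.read(line)
```

Because read() and write() are instance methods, a protocol object can carry its own state, such as a cache of recently decoded keys.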
Running Jobs
All Modes
All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.
All *_bin configuration options (hadoop_bin, python_bin, and ssh_bin) take lists instead of strings so you can add arguments (like ['python', '-v']). More info: Configuration quick reference
Cleanup options have been split into cleanup and cleanup_on_failure. There are more granular values for both of these options.
Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions.
The job_name_prefix option is gone (was deprecated).
All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.
Steps with no mapper use cat instead of going through a no-op mapper.
Compressed files can be streamed with the cat() method.
EMR Mode
The default Hadoop version on EMR is now 0.20 (was 0.18).
The instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.
Inline Mode
Inline mode now supports the cmdenv option.
Local Mode
Local mode now runs 2 mappers and 2 reducers in parallel by default.
There is preliminary support for simulating some jobconf variables. The current list of supported variables is:
mapreduce.job.cache.archives
mapreduce.job.cache.files
mapreduce.job.cache.local.archives
mapreduce.job.cache.local.files
mapreduce.job.id
mapreduce.job.local.dir
mapreduce.map.input.file
mapreduce.map.input.length
mapreduce.map.input.start
mapreduce.task.attempt.id
mapreduce.task.id
mapreduce.task.ismap
mapreduce.task.output.dir
mapreduce.task.partition
Other Stuff
boto 2.0+ is now required.
The Debian packaging has been removed from the repository.