What's New ========== For a complete list of changes, see `CHANGES.txt `_ .. _v0.7.4: 0.7.4 ----- Docker on EMR ^^^^^^^^^^^^^ This release adds support for Docker on EMR, which `was released with AMI version 6.0.0 `__. This is enabled by setting :mrjob-opt:`docker_image` to point at your image. There is also a :mrjob-opt:`docker_mounts` option, and, if you want to host your image on a private ECR repo instead of Docker Hub, a :mrjob-opt:`docker_client_config` option (though with AMIs 6.1.0 and later, you can also auto-authenticate to ECR; see `this page `__). As a result of adding Docker support, the default :mrjob-opt:`image_version` on EMR is 6.0.0. Also, on EMR and Dataproc we used to literally bootstrap mrjob by copying it to Python's root package directory, but as this won't put mrjob into a Docker image, mrjob is now bootstrapped via :mrjob-opt:`py_files`, like on every other runner. Concurrent Steps on EMR clusters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This release also supports concurrent steps on EMR clusters, a feature introduced in AMI 5.28.0. The :mrjob-opt:`max_concurrent_steps` option controls both the concurrency level of a newly launched cluster, and how much concurrency we will accept when joining a pooled cluster. To prevent steps from the same job attempting to run simultaneously, mrjob will now submit the steps of a multi-step job one at a time (after the previous one completes) on clusters running AMI 5.28.0 or later. This can be changed with the :mrjob-opt:`add_steps_in_batch` option. :py:meth:`~mrjob.emr.EMRJobRunner.get_job_steps` is now deprecated, as it can't fetch steps before they're submitted. Cluster Pooling ^^^^^^^^^^^^^^^ Cluster pooling can now join pooled clusters based on available CPU and memory reported by the YARN resource manager, rather than looking at the number and type of instances in the cluster. You can enable this by setting :mrjob-opt:`min_available_mb` and/or :mrjob-opt:`min_available_virtual_cores`. For this feature to work, you must enable SSH (the :mrjob-opt:`ec2_key_pair` and :mrjob-opt:`ec2_key_pair_file` options). You can now control the size of your cluster pool with the :mrjob-opt:`max_clusters_in_pool` option. If a job wants to launch a new cluster in the pool but the pool is already "full," it will wait and try again until the pool is no longer full or it can join a cluster. Once a job determines that it is okay to add another cluster to the pool, it will wait a random number of seconds and try again. This way, if several pooled jobs launch simultaneously, they are likely to stay within the maximum number of clusters rather than all launching their own. The random wait time can be controlled with :mrjob-opt:`pool_jitter_seconds`. By default, a job will wait forever to either join an existing cluster or create a new one. You can make jobs give up and raise an exception with the :mrjob-opt:`pool_timeout_minutes` option. mrjob will now bypass the :mrjob-opt:`pool_wait_minutes` option if there is not a matching, active cluster to join. Basically, it won't wait if there is not a cluster to wait for. As with :mrjob-opt:`max_clusters_in_pool`, if a job determines there are no clusters to wait for, it will wait a random number of seconds and double-check before launching a new cluster. Library requirements ^^^^^^^^^^^^^^^^^^^^ To support concurrent steps, ``boto3`` must be at least version 1.10.0 and ``botocore`` must be at least version 1.13.16. The ``google-cloud-dataproc`` library must be no greater than 1.1.0, to maintain compatibility with our code.
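For illustration, here is a minimal :file:`mrjob.conf` sketch combining several of the options introduced in this release. The image name and the numeric limits are hypothetical placeholders, not recommended values:

.. code-block:: yaml

    runners:
      emr:
        docker_image: your-org/your-image:latest  # hypothetical image on Docker Hub
        max_concurrent_steps: 2                   # hypothetical
        pool_clusters: true
        min_available_mb: 4096                    # hypothetical
        max_clusters_in_pool: 5                   # hypothetical
        pool_timeout_minutes: 30                  # hypothetical

Remember that joining clusters by available memory/CPU (:mrjob-opt:`min_available_mb`, :mrjob-opt:`min_available_virtual_cores`) requires working SSH (:mrjob-opt:`ec2_key_pair` and :mrjob-opt:`ec2_key_pair_file`), as described above.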
.. _v0.7.3: 0.7.3 ----- Made many long-overdue changes to :ref:`cluster-pooling`, to reduce the potential for throttling by the EMR API. Pooling now puts most information a job needs to tell if it can join a cluster into the cluster name, meaning most non-matching clusters can be filtered out when we call ``ListClusters``. Pooling also no longer needs to list cluster steps. Finally, if :mrjob-opt:`pool_wait_minutes` is set, and there are multiple clusters we can join, we try them all, rather than just trying the "best" one and then requesting more information from the API. This update resulted in a few minor changes to pooling. When a job has the choice of multiple clusters, it chooses based solely on CPU capacity, using ``NormalizedInstanceHours`` in the cluster summary returned by the ``ListClusters`` API call. The mrjob version and :mrjob-opt:`applications` must now match exactly in all cases. We also re-worked the "locking" mechanism that keeps multiple jobs from joining the same cluster. Formerly, this used S3 (which may only be eventually consistent), and locks had no fixed expiration time. Now, EMR tags are used for locking, locks always expire after one minute, and every job uses the same timing when locking clusters, reducing the potential for race conditions. :command:`mrjob terminate-idle-clusters` no longer attempts to lock clusters before terminating them, so its ``--max-mins-locked`` option is deprecated and does nothing. The Spark harness now emulates counters correctly in local mode. If you use :py:meth:`~mrjob.job.MRJob.mapper_raw`, and your :mrjob-opt:`setup` script has an error, it will be correctly reported, even if your underlying shell is :command:`dash` and not :command:`bash`. .. _v0.7.2: 0.7.2 ----- Spark normally only supports archives if you're running on YARN. However, mrjob now seamlessly emulates archives on all Spark masters (other than ``local``). This means you can now use ``--archives`` or ``--dirs`` with ``mrjob spark-submit``, as well as using archives in your ``--setup`` script. As a result of this change, mrjob is somewhat better at recognizing file extensions; it ignores ``.`` at the end of filenames, and can now recognize that a file with a name like ``mrjob-0.7.0.tar.gz`` is a ``.tar.gz`` file, not a ``.7.0.tar.gz`` file. Also, if you don't specify a name for an archive (e.g. ``--setup 'cd foo.tar.gz#/'``), mrjob no longer includes the file extension in the resulting directory name (``foo/``, not ``foo.tar.gz/``). Patched a long-standing security issue on EMR where we were copying the SSH key to the master node when reading logs from other nodes, which are only accessible via the master node. mrjob now correctly uses :command:`ssh-add` and the SSH agent instead of copying the key. As a result, mrjob now has a :mrjob-opt:`ssh_add_bin` option. The :mrjob-opt:`extra_cluster_params` option now recursively merges dict params into existing ones. For example, you can now do this: .. code-block:: yaml runners: emr: extra_cluster_params: Instances: EmrManagedMasterSecurityGroup: sg-foo without obliterating the rest of the ``Instances`` API parameter. Python 2 has reached end-of-life, so if you're using Python 2, the default :mrjob-opt:`python_bin` is ``python2.7`` rather than ``python``, which now means Python 3 on some systems (for example, 6.x EMR AMIs). Finally, we ensure that if you're installing mrjob on Python 3.4, we'll install a Python 3.4-compatible version of PyYAML. ..
_v0.7.1: 0.7.1 ----- EMR ^^^ Fixed a bug so that `VisibleToAllUsers` now defaults to `True`. You can use :mrjob-opt:`extra_cluster_params` to set it to `False`. For example, you can now do: .. code-block:: sh --extra-cluster-param VisibleToAllUsers=false Added logging to show the invoked runner and its keyword arguments. Contents of archives are now used during bootstrapping to ensure clusters have the same setup. .. _v0.7.0: 0.7.0 ----- AWS and Google are now optional dependencies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Amazon Web Services (EMR/S3) and Google Cloud are now optional dependencies, ``aws`` and ``google`` respectively. For example, to install mrjob with AWS support, run: .. code-block:: sh pip install mrjob[aws] non-Python mrjobs are no longer supported ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Fully removed support for writing MRJob scripts in other languages and then running them with the mrjob library. (This capability was so little used that chances are you never knew it existed.) As a result, the `interpreter` and `steps_interpreter` options are gone, the :command:`mrjob run` subcommand is gone, and the `MRJobLauncher` class has been merged back into `MRJob`. Also removed ``mr_wc.rb`` from ``mrjob/examples/``. MRSomeJob() means read from sys.argv ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In prior versions, if you initialized a :py:class:`~mrjob.job.MRJob` subclass with no arguments (``MRSomeJob()``), that meant the same thing as passing in an empty argument list (``MRSomeJob(args=[])``). It now means to read *args* directly from ``sys.argv[1:]``. In practice, it's rare to see an ``MRJob`` subclass initialized this way outside of test cases. Running an ``MRJob`` script directly, or initializing it with an argument list, works the same as in previous versions. mrjob/examples/ love ^^^^^^^^^^^^^^^^^^^^ The `mrjob.examples package `__ has been updated. Some examples that were difficult to test or maintain were removed, and the remainder were tested and fixed if necessary. :py:mod:`mrjob.examples.mr_text_classifier` no longer needs you to encode documents in JSON format, and instead operates directly on text files with names like ``doc_id-cat_id_1-not_cat_id_2-etc.txt``. Try it out: .. code-block:: sh python -m mrjob.examples.mr_text_classifier docs-to-classify/*.txt miscellaneous tweaks ^^^^^^^^^^^^^^^^^^^^ The :command:`mrjob audit-emr-usage` subcommand no longer attempts to read cluster pool names from clusters launched by mrjob v0.5.x. Method arguments in filesystem classes (in ``mrjob.fs``) are now consistently named. This probably won't matter in practice, as ``runner.fs`` is always a :py:class:`~mrjob.fs.composite.CompositeFilesystem` anyhow. removed deprecated code ^^^^^^^^^^^^^^^^^^^^^^^ Check your deprecation warnings! Everything marked deprecated in mrjob v0.6.x has been removed. The following runner config options no longer exist: `emr_api_params`, `interpreter`, `max_hours_idle`, `mins_to_end_of_hour`, `steps_interpreter`, `steps_python_bin`, `visible_to_all_users`. The following singular switches have been removed in favor of their plural alternative (e.g. :command:`--archives`): :command:`--archive`, :command:`--dir`, :command:`--file`, :command:`--hadoop-arg`, :command:`--libjar`, :command:`--py-file`, :command:`--spark-arg`. The :command:`--steps` switch is gone. This means :command:`--help --steps` no longer works; use :command:`--help -v` to see help for :command:`--mapper`, etc.
Support for simulating :mod:`optparse` has been removed from :py:class:`~mrjob.job.MRJob`. This includes ``add_file_option()``, ``add_passthrough_option()``, ``configure_options()``, ``load_options()``, ``pass_through_option()``, ``self.args``, ``self.OPTION_CLASS``. :py:meth:`mrjob.job.MRJobRunner.stream_output` and :py:meth:`mrjob.job.MRJob.parse_output_line` have been removed. The constructor for :py:class:`~mrjob.job.runner.MRJobRunner` no longer has a *file_upload_args* keyword argument. ``parse_and_save_options()``, ``read_file()``, and ``read_input()`` have all been removed from :py:mod:`mrjob.util`. :py:class:`~mrjob.fs.composite.CompositeFilesystem` no longer takes filesystems as arguments to its constructor; use :py:meth:`~mrjob.fs.composite.CompositeFilesystem.add_fs`. The useless *local_tmp_dir* option to the :py:class:`~mrjob.fs.gcs.GCSFilesystem` constructor and the *chunk_size* arg to its :py:meth:`~mrjob.fs.gcs.GCSFilesystem.put` method have been removed. .. _v0.6.12: 0.6.12 ------ Updated the Dataproc runner's default :mrjob-opt:`image_version` to ``1.3``, as the old default, ``1.0``, no longer works. The local and inline runners can now handle ``file://`` URIs as input paths and as files/archives uploaded to the working directory. The local filesystem (available as ``runner.fs`` from all runners) can now handle ``file://`` URIs as well. .. _v0.6.11: 0.6.11 ------ Adds support for parsing Spark logs and driver output to determine why a job failed. This works with the local, Hadoop, EMR, and Spark runners. The Spark runner no longer needs :py:mod:`pyspark` in the ``$PYTHONPATH`` to launch scripts with :command:`spark-submit` (it still needs :py:mod:`pyspark` to use the Spark harness). On Python 3.7, you can now intermix positional arguments to :py:class:`~mrjob.job.MRJob` with switches, similar to how you could back when mrjob used :py:mod:`optparse`. For example: :command:`mr_your_script.py file1 -v file2`. On EMR, the default :mrjob-opt:`image_version` (AMI) is now 5.27.0. Restored ``m4.large`` as the default instance type for pre-5.13.0 AMIs, as they do not support ``m5.xlarge``. (``m5.xlarge`` is still the default for AMI 5.13.0 and later.) mrjob can now retry on transient AWS API errors (e.g. throttling) or network errors when making API calls that use pagination (e.g. listing clusters). The :mrjob-opt:`emr_configurations` opt now supports the ``!clear`` tag rather than crashing. You may also override individual configs by setting a config with the same ``Classification``. This version restores official support for Python 3.4, as it's the version of Python 3 installed on EMR AMIs prior to 5.20.0. In order to make this work, mrjob drops support for Google Cloud services in Python 3.4, as the recent Google libraries appear to need a later Python version. .. _v0.6.10: 0.6.10 ------ Adds official support for PyPy (that is, any version compatible with Python 2.7/3.5+). If you launch a job in PyPy, :mrjob-opt:`python_bin` will automatically default to ``pypy`` or ``pypy3`` as appropriate. Note that mrjob does not auto-install PyPy for you on EMR (Amazon Linux does not provide a PyPy package). Installing PyPy yourself at bootstrap time is fairly straightforward; see :ref:`installing-pypy-on-emr`. The Spark harness can now be used on EMR, allowing you to run "classic" MRJobs in Spark, which is often faster. Essentially, you launch jobs in the Spark runner with ``--spark-submit-bin 'mrjob spark-submit -r emr'``; see :ref:`mrjobs-on-spark-on-emr` for details.
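A hedged :file:`mrjob.conf` sketch of that setup, assuming the :mrjob-opt:`spark_submit_bin` option accepts the command as a string (the value below is taken from the switch example above):

.. code-block:: yaml

    runners:
      spark:
        spark_submit_bin: mrjob spark-submit -r emr

With something like this in place, you would launch your job with ``-r spark`` as usual, and the Spark work itself would be submitted to EMR.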
The Spark runner can now optionally disable internal protocols when running "classic" MRJobs, eliminating the (usually) unnecessary effort of encoding data structures into JSON or other string representations and then decoding them. See :mrjob-opt:`skip_internal_protocol` for details. The EMR runner's default instance type is now ``m5.xlarge``, which works in newer regions and should make it easier to run Spark jobs. The EMR runner also now logs the DNS of the master node as soon as it is available, to make it easier to SSH in. Finally, mrjob gives a much clearer error message if you attempt to read a YAML mrjob.conf file without :mod:`PyYAML` installed. .. _v0.6.9: 0.6.9 ----- Drops support for Python 3.4. Fixes a bug introduced in :ref:`v0.6.8` that could break archives or directories uploaded into Hadoop or Spark if the name of the unpacked archive didn't have an archive extension (e.g. ``.tar.gz``). The Spark runner can now optionally emulate Hadoop's ``mapreduce.map.input.file`` configuration property when running the mapper of the first step of a streaming job if you enable :mrjob-opt:`emulate_map_input_file`. This means that jobs that depend on :py:func:`jobconf_from_env('mapreduce.map.input.file') ` will still work. The Spark runner also now uses the correct argument names when emulating :py:meth:`~mrjob.job.MRJob.increment_counter`, and logs a warning if :mrjob-opt:`spark_tmp_dir` doesn't match :mrjob-opt:`spark_master`. :ref:`mrjob spark-submit ` can now pass switches to the Spark script/JAR without explicitly separating them out with ``--``. The local and inline runners now more correctly emulate the `mapreduce.map.input.file` config property by making it a ``file://`` URL. Deprecated methods :py:meth:`~mrjob.job.MRJob.add_file_option` and :py:meth:`~mrjob.job.MRJob.add_passthrough_option` can now take a type (e.g. ``int``) as their ``type`` argument, to better emulate :mod:`optparse`. .. _v0.6.8: 0.6.8 ----- Nearly full support for Spark ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This release adds nearly full support for Spark, including mrjob-specific features like :mrjob-opt:`setup` scripts and :ref:`passthrough options `. See :ref:`why-mrjob-with-spark` for everything mrjob can do with Spark. This release adds a :py:class:`~mrjob.spark.runner.SparkMRJobRunner` (``-r spark``), which works with any Spark installation, does not require Hadoop, and can access any filesystem supported by both mrjob and Spark (HDFS, S3, GCS). The Spark runner is now the default for :ref:`mrjob spark-submit `. What's *not* supported? mrjob does not yet support Spark on Google Cloud Dataproc. The Spark runner does not yet parse logs to determine probable cause of failure when your job fails (though it does give you the Spark driver output). Spark Hadoop Streaming emulation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Not only does the Spark runner not need Hadoop to run Spark jobs, it doesn't need Hadoop to run most *Hadoop Streaming* jobs, as it knows how to run them directly on Spark. This means if you want to migrate to a non-Hadoop Spark cluster, you can take all your old :py:class:`~mrjob.job.MRJob`\s with you. See :ref:`classic-mrjobs-on-spark` for details. The "experimental harness script" mentioned in :ref:`v0.6.7` is now fully integrated into the Spark runner and is no longer supported as a separate feature.
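As a rough sketch of such a migration (the master URL and temp directory below are hypothetical placeholders, not tested values), pointing the Spark runner at a non-Hadoop cluster might look like:

.. code-block:: yaml

    runners:
      spark:
        spark_master: spark://your-spark-master:7077  # hypothetical standalone master
        spark_tmp_dir: /tmp/mrjob                     # hypothetical; see the spark_tmp_dir docs

You would then run your existing streaming-style :py:class:`~mrjob.job.MRJob`\s with ``-r spark`` as described in :ref:`classic-mrjobs-on-spark`.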
Local runner support for Spark ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``local`` and ``inline`` runners can now run Spark scripts locally for testing, analogous to the way they've supported Hadoop streaming scripts (except that they *do* require a local Spark installation). See :ref:`other-ways-to-run-on-spark`. Other Spark improvements ^^^^^^^^^^^^^^^^^^^^^^^^ :py:class:`~mrjob.job.MRJob`\s are now Spark-serializable without calling :py:meth:`~mrjob.job.MRJob.sandbox` (there used to be a problematic reference to ``sys.stdin``). This means you can always pass job methods to ``rdd.flatMap()`` etc. :mrjob-opt:`setup` scripts are no longer a YARN-specific feature, working on all Spark masters (except ``local[*]``, which doesn't give executors a separate working directory). Likewise, you can now specify a different name for files in the job's working directory (e.g. ``--file foo#bar``) on all Spark masters. .. note:: Uploading archives and directories still only works on YARN for now; Spark considers ``--archives`` a YARN-specific feature. When running on a local Spark cluster, mrjob uses ``file://...`` rather than just the path of the file when necessary (e.g. with ``--py-files``). :py:meth:`~mrjob.runner.MRJobRunner.cat_output` now ignores files and subdirectories starting with ``"."`` (used to only be ``"_"``). This allows mrjob to ignore Spark's checksum files (e.g. ``.part-00000.crc``), and also brings mrjob into closer compliance with the way Hadoop input formats read directories. ``spark.yarn.appMasterEnv.*`` config properties are only set if you're actually running on YARN. The values of :mrjob-opt:`spark_master` and :mrjob-opt:`spark_deploy_mode` can no longer be overridden with configuration properties (``-D spark.master=...``). While not exactly a "feature," this means that mrjob always knows what Spark platform it's running on. Filesystems ^^^^^^^^^^^ Every runner has an ``fs`` attribute that gives access to all the filesystems that runner supports. Added a :py:meth:`~mrjob.fs.base.Filesystem.put` method to all filesystems, which allows uploading a single file (it used to be that each runner had custom logic for uploads). It also used to be that if you wanted to create a bucket on S3 or GCS, you had to call ``create_bucket(...)`` explicitly. Now :py:meth:`~mrjob.fs.base.Filesystem.mkdir` will automatically create buckets as needed. If you still need to access methods specific to a filesystem, you should do so through ``fs.<name>``, where ``<name>`` is the (lowercase) name of the storage service. For example, the Spark runner's filesystem offers both ``runner.fs.s3.create_bucket()`` and ``runner.fs.gcs.create_bucket()``. The old style of implicitly passing through FS-specific methods (``runner.fs.create_bucket(...)``) is deprecated and going away in v0.7.0. :py:class:`~mrjob.fs.gcs.GCSFilesystem`\'s constructor had a useless ``local_tmp_dir`` argument, which is now deprecated and going away in v0.7.0. EMR ^^^ Fixed a bad bug introduced in :ref:`v0.6.7` that could prevent mrjob from running on EMR with a non-default temp bucket. You can now set sub-parameters with :mrjob-opt:`extra_cluster_params`. For example, you can now do: .. code-block:: sh --extra-cluster-param Instances.EmrManagedMasterSecurityGroup=... without clobbering the zone or instance group/fleet configs specified in ``Instances``. Running your job with ``--subnet ''`` now un-sets a :mrjob-opt:`subnet` specified in your config file (used to be ignored).
If you are using cluster pooling with retries (:mrjob-opt:`pool_wait_minutes`), mrjob now retains cluster information that is immutable (e.g. AMI version), saving API calls. Dependency upgrades ^^^^^^^^^^^^^^^^^^^ Bumped the required versions of several Google Cloud Python libraries to be more compatible with current versions of their sub-dependencies (Google libraries pin a fairly narrow range of dependencies). :py:mod:`mrjob` now requires: * :py:mod:`google-cloud-dataproc` at least 0.3.0, * :py:mod:`google-cloud-logging` at least 1.9.0, and * :py:mod:`google-cloud-storage` at least 1.13.1. Also dropped support for :py:mod:`PyYAML` 3.08; now we require at least :py:mod:`PyYAML` 3.10 (which came out in 2011). .. note:: We are aware that the Google libraries' extensive dependencies can be a nuisance for mrjob users who don't use Google Cloud. Our tentative plan is to make dependencies specific to a third-party service (including :py:mod:`google-cloud-*` and :py:mod:`boto3`) optional starting in v0.7.0. Other bugfixes ^^^^^^^^^^^^^^ Fixed a long-standing bug that would cause the Hadoop runner to hang or raise cryptic errors if :mrjob-opt:`hadoop_bin` or :mrjob-opt:`spark_submit_bin` is not executable. Support files for ``mrjob.examples`` (e.g. ``stop_words.txt`` for :py:class:`~mrjob.examples.mr_most_used_word.MRMostUsedWord`) are now installed along with :py:mod:`mrjob`. Setting a `*_bin` option to an empty value (e.g. ``--hadoop-bin ''``) now always instructs mrjob to use the default, rather than disabling core features or creating cryptic errors. This affects :mrjob-opt:`gcloud_bin`, :mrjob-opt:`hadoop_bin`, :mrjob-opt:`sh_bin`, and :mrjob-opt:`ssh_bin`; the various `*python_bin` options already worked this way. .. _v0.6.7: 0.6.7 ----- :mrjob-opt:`setup` commands now work on Spark (at least on YARN). Added the :ref:`mrjob spark-submit ` subcommand, which works as a drop-in replacement for :command:`spark-submit` but with mrjob runners (e.g. EMR) and mrjob features (e.g. :mrjob-opt:`setup`, :mrjob-opt:`cmdenv`). Fixed a bug that was causing idle timeout scripts to silently fail on 2.x EMR AMIs. Fixed a bug that broke :py:meth:`~mrjob.fs.s3.S3Filesystem.create_bucket` on ``us-east-1``, preventing new mrjob installations from launching on EMR in that region. Fixed an :py:class:`ImportError` from attempting to import :py:data:`os.SIGKILL` on Windows. The default instance type on EMR is now ``m4.large``. EMR's cluster pooling now knows the CPU and memory capacity of ``c5`` and ``m5`` instances, allowing it to join "better" clusters. Added the plural form of several switches (separate multiple values with commas): * ``--applications`` * ``--archives`` * ``--dirs`` * ``--files`` * ``--libjars`` * ``--py-files`` Except for ``--application``, the singular version of these switches (``--archive``, ``--dir``, ``--file``, ``--libjar``, ``--py-file``) is deprecated for consistency with Hadoop and Spark. :mrjob-opt:`sh_bin` is now fully qualified by default (``/bin/sh -ex``, not ``sh -ex``). :mrjob-opt:`sh_bin` may no longer be empty, and a warning is issued if it has more than one argument, to properly support shell script shebangs (e.g. ``#!/bin/sh -ex``) on Linux. Runners no longer call :py:class:`~mrjob.job.MRJob`\s with ``--steps``; instead the job passes its step description to the runner on instantiation. ``--steps`` and `steps_python_bin` are now deprecated. The Hadoop and EMR runners can now set ``SPARK_PYTHON`` and ``SPARK_DRIVER_PYTHON`` to different values if need be (e.g.
to match :mrjob-opt:`task_python_bin`, or to support :mrjob-opt:`setup` scripts in client mode). The inline runner no longer attempts to run command substeps. The inline and local runners no longer silently pretend to run non-streaming steps. The Hadoop runner no longer has the :mrjob-opt:`bootstrap_spark` option, which did nothing. `interpreter` and `steps_interpreter` are deprecated, in anticipation of removing support for writing MRJobs in other programming languages. Runners now issue a warning if they receive options that belong to other runners (e.g. passing :mrjob-opt:`image_version` to the Hadoop runner). :command:`mrjob create-cluster` now supports ``--emr-action-on-failure``. Updated deprecated escape sequences in mrjob code that would break on Python 3.8. The ``--help`` message for mrjob subcommands now correctly includes the subcommand in ``usage``. mrjob no longer raises :py:class:`AssertionError`, instead raising :py:class:`ValueError`. Added an experimental harness script (in ``mrjob/spark``) to run basic MRJobs on Spark, potentially without Hadoop: .. code-block:: sh spark-submit mrjob_spark_harness.py module.of.YourMRJob input_path output_dir Added :py:meth:`~mrjob.job.MRJob.map_pairs`, :py:meth:`~mrjob.job.MRJob.reduce_pairs`, and :py:meth:`~mrjob.job.MRJob.combine_pairs` methods to :py:class:`~mrjob.job.MRJob`, to enable the Spark harness script. .. _v0.6.6: 0.6.6 ----- Fixes a longstanding bug where boolean :mrjob-opt:`jobconf` values were passed to Hadoop in Python format (``True`` instead of ``true``). You can now safely do something like this: .. code-block:: yaml runners: emr: jobconf: mapreduce.output.fileoutputformat.compress: true whereas in prior versions of mrjob, you had to use ``"true"`` in quotes. Added ``-D`` as a synonym for ``--jobconf``, to match Hadoop. On EMR, if you have SSH set up (see :ref:`ssh-tunneling`), mrjob can fetch your history log directly from HDFS, allowing it to more quickly diagnose why your job failed. Added a ``--local-tmp-dir`` switch. If you set :mrjob-opt:`local_tmp_dir` to the empty string, mrjob will use the system default. You can now pass multiple arguments to Hadoop with ``--hadoop-args`` (for example, ``--hadoop-args='-fs hdfs://namenode:port'``), rather than having to use ``--hadoop-arg`` one argument at a time. ``--hadoop-arg`` is now deprecated. Similarly, you can use ``--spark-args`` to pass arguments to ``spark-submit`` in place of the now-deprecated ``--spark-arg``. mrjob no longer automatically passes generic arguments (``-D`` and ``-libjars``) to :py:class:`~mrjob.step.JarStep`\s, because this confuses some JARs. If you want mrjob to pass generic arguments to a JAR, add :py:data:`~mrjob.step.GENERIC_ARGS` to your :py:class:`~mrjob.step.JarStep`\'s *args* keyword argument, like you would with :py:data:`~mrjob.step.INPUT` and :py:data:`~mrjob.step.OUTPUT`. The Hadoop runner now has a :mrjob-opt:`spark_deploy_mode` option. Fixed the ``usage: usage:`` typo in ``--help`` messages. :py:meth:`mrjob.job.MRJob.add_file_arg` can now take an explicit ``type=str`` (used to cause an error). The deprecated ``optparse`` emulation methods :py:meth:`~mrjob.job.MRJob.add_file_option` and :py:meth:`~mrjob.job.MRJob.add_passthrough_option` now support ``type='str'`` (used to only accept ``type='string'``). Fixed a permissions error that was breaking ``inline`` and ``local`` mode on some versions of Windows. ..
_v0.6.5: 0.6.5 ----- This release fixes an issue with self-termination of idle clusters on EMR (see :mrjob-opt:`max_mins_idle`) where the master node sometimes simply ignored ``sudo shutdown -h now``. The idle self-termination script now logs to ``bootstrap-actions/mrjob-idle-termination.log``. .. note:: If you are using :ref:`cluster-pooling`, it's highly recommended you upgrade to this version to fix the self-termination issue. You can now turn off log parsing (on all runners) by setting :mrjob-opt:`read_logs` to false. This can speed up cases where you don't care why a job failed (e.g. integration tests) or where you'd rather use the :ref:`diagnose-tool` tool after the fact. You may specify custom AMIs with the :mrjob-opt:`image_id` option. To find Amazon Linux AMIs compatible with EMR that you can use as a base for your custom image, use :py:func:`~mrjob.ami.describe_base_emr_images`. The default AMI on EMR is now 5.16.0. New EMR clusters launched by mrjob will be automatically tagged with ``__mrjob_label`` (filename of your mrjob script) and ``__mrjob_owner`` (your username), to make it easier to understand your mrjob usage in `CloudWatch `_ etc. You can change the value of these tags with the :mrjob-opt:`label` and :mrjob-opt:`owner` options. You may now set the root EBS volume size for EMR clusters directly with :mrjob-opt:`ebs_root_volume_gb` (you used to have to use :mrjob-opt:`instance_groups` or :mrjob-opt:`instance_fleets`). API clients returned by :py:class:`~mrjob.emr.EMRJobRunner` now retry on SSL timeouts. EMR clients returned by :py:meth:`mrjob.emr.EMRJobRunner.make_emr_client` won't retry faster than :mrjob-opt:`check_cluster_every`, to prevent throttling. Cluster pooling recovery (relaunching a job when your pooled cluster self-terminates) now works correctly on single-node clusters. .. _v0.6.4: 0.6.4 ----- This release makes it easy to attach static files to your :py:class:`~mrjob.job.MRJob` with the :py:attr:`~mrjob.job.MRJob.FILES`, :py:attr:`~mrjob.job.MRJob.DIRS`, and :py:attr:`~mrjob.job.MRJob.ARCHIVES` attributes. In most cases, you no longer need :mrjob-opt:`setup` scripts to access other Python modules or packages from your job because you can use :py:attr:`~mrjob.job.MRJob.DIRS` instead. For more details, see :ref:`uploading-modules-and-packages`. For completeness, also added :py:meth:`~mrjob.job.MRJob.files`, :py:meth:`~mrjob.job.MRJob.dirs`, and :py:meth:`~mrjob.job.MRJob.archives` methods. :ref:`terminate-idle-clusters` now skips termination-protected idle clusters, rather than crashing (this was also fixed in :ref:`v0.5.12`, but not in previous 0.6.x versions). Python 3.3 is no longer supported. mrjob now requires :mod:`google-cloud-dataproc` 0.2.0+ (this library used to be vendored). .. _v0.6.3: 0.6.3 ----- Read arbitrary file formats ^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can now pass entire files in any format to your mapper by defining :py:meth:`~mrjob.job.MRJob.mapper_raw`. See :ref:`raw-input` for an example. Google Cloud Dataproc parity ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ mrjob now offers feature parity between Google Cloud Dataproc and Amazon Elastic MapReduce. Support for :doc:`guides/spark` and :mrjob-opt:`libjars` will be added in a future release. (There is no plan to introduce :ref:`cluster-pooling` with Dataproc.)
Specifically, :py:class:`~mrjob.dataproc.DataprocJobRunner` now supports: * fetching and parsing counters * parsing logs for probable cause of failure * job progress messages (% complete) * :ref:`non-hadoop-streaming-jar-steps` * these config options: * :mrjob-opt:`cloud_part_size_mb` (chunked uploading) * :mrjob-opt:`core_instance_config`, :mrjob-opt:`master_instance_config`, :mrjob-opt:`task_instance_config` * :mrjob-opt:`hadoop_streaming_jar` * :mrjob-opt:`network`/:mrjob-opt:`subnet` (running in a VPC) * :mrjob-opt:`service_account` (custom IAM account) * :mrjob-opt:`service_account_scopes` (fine-grained permissions) * :mrjob-opt:`ssh_tunnel`/:mrjob-opt:`ssh_tunnel_is_open` (resource manager) Improvements to existing Dataproc features: * :mrjob-opt:`bootstrap` scripts run in a temp dir, rather than ``/`` * uses Dataproc's built-in auto-termination feature, rather than a script * GCS filesystem: * :py:meth:`~mrjob.fs.gcs.GCSFilesystem.cat` streams data rather than dumping to a temp file * :py:meth:`~mrjob.fs.gcs.GCSFilesystem.exists` no longer swallows all exceptions To get started, read :ref:`google-setup`. Other changes ^^^^^^^^^^^^^ mrjob no longer streams your job output to the command line if you specify :mrjob-opt:`output_dir`. You can control this with the :command:`--cat-output` and :command:`--no-cat-output` switches (:command:`--no-output` is deprecated). `cloud_upload_part_size` has been renamed to :mrjob-opt:`cloud_part_size_mb` (the old name will work until v0.7.0). mrjob can now recognize "not a valid JAR" errors from Hadoop and suggest them as the probable cause of job failure. mrjob no longer depends on :mod:`google-cloud` (which implies several other Google libraries). Its current Google library dependencies are :mod:`google-cloud-logging` 1.5.0+ and :mod:`google-cloud-storage` 1.9.0+. Future versions of mrjob will depend on :mod:`google-cloud-dataproc` 0.11.0+ (currently included with mrjob because it hasn't yet been released). :py:class:`~mrjob.retry.RetryWrapper` now sets ``__name__`` when wrapping methods, making for easier debugging. .. _v0.6.2: 0.6.2 ----- mrjob is now orders of magnitude quicker at parsing logs, making it practical to diagnose rare errors from very large jobs. However, on some AMIs, it can no longer parse errors without waiting for logs to transfer to S3 (this may be fixed in a future version). To run jobs on Google Cloud Dataproc, mrjob no longer requires you to install the :command:`gcloud` util (though if you do have it installed, mrjob can read credentials from its configs). For details, see :doc:`guides/dataproc-quickstart`. mrjob no longer requires you to select a Dataproc :mrjob-opt:`zone` prior to running jobs. Auto zone placement (just set :mrjob-opt:`region` and let Dataproc pick a zone) is now enabled, with the default being auto zone placement in ``us-west1``. mrjob no longer reads zone and region from :command:`gcloud`\'s compute engine configs. mrjob's Dataproc code has been ported from the ``google-python-api-client`` library (which is in maintenance mode) to ``google-cloud-sdk``, resulting in some small changes to the GCS filesystem API. See `CHANGES.txt `_ for details. Local mode now has a :mrjob-opt:`num_cores` option that allows you to control how many tasks it handles simultaneously. .. _v0.6.1: 0.6.1 ----- Added the :ref:`diagnose-tool` tool (run :command:`mrjob diagnose j-CLUSTERID`), which determines why a previously run job failed. Fixed a serious bug that made mrjob unable to properly parse error logs in some cases.
Added the :py:meth:`~mrjob.emr.EMRJobRunner.get_job_steps` method to :py:class:`~mrjob.emr.EMRJobRunner`. .. _v0.6.0: 0.6.0 ----- Dropped Python 2.6 ^^^^^^^^^^^^^^^^^^ mrjob now supports Python 2.7 and Python 3.3+. (Some versions of PyPy also work but are not officially supported.) boto3, not boto ^^^^^^^^^^^^^^^ mrjob now uses :py:mod:`boto3` rather than :py:mod:`boto` to talk to AWS. This makes it much simpler to pass user-defined data structures directly to the API, enabling a number of features. At least version 1.4.6 of :py:mod:`boto3` is required to run jobs on EMR. It is now possible to fully configure instances (including EBS volumes). See :mrjob-opt:`instance_groups` for an example. mrjob also now supports Instance Fleets, which may be fully configured (including EBS volumes) through the :mrjob-opt:`instance_fleets` option. Methods that took or returned :py:mod:`boto` objects (for example, ``make_emr_conn()``) have been completely removed, as there was no way to make a deprecated shim for them without keeping :py:mod:`boto` as a dependency. See :py:class:`~mrjob.emr.EMRJobRunner` and :py:class:`~mrjob.fs.s3.S3Filesystem` for new method names. Note that :py:mod:`boto3` reads temporary credentials from :envvar:`$AWS_SESSION_TOKEN`, not :envvar:`$AWS_SECURITY_TOKEN` as in :py:mod:`boto` (see :mrjob-opt:`aws_session_token` for details). argparse, not optparse ^^^^^^^^^^^^^^^^^^^^^^ mrjob now uses :py:mod:`argparse` to parse options, rather than :py:mod:`optparse`, which has been deprecated since Python 2.7. :py:mod:`argparse` has slightly different option-parsing logic. A couple of things you should be aware of: * everything that starts with ``-`` is assumed to be a switch. ``--hadoop-arg=-verbose`` works, but ``--hadoop-arg -verbose`` does not. * positional arguments may not be split. ``mr_wc.py CHANGES.txt LICENSE.txt -r local`` will work, but ``mr_wc.py CHANGES.txt -r local LICENSE.txt`` will not. Passthrough options, file options, etc. are now handled with :py:meth:`~mrjob.job.MRJob.add_file_arg`, :py:meth:`~mrjob.job.MRJob.add_passthru_arg`, :py:meth:`~mrjob.job.MRJob.configure_args`, :py:meth:`~mrjob.job.MRJob.load_args`, and :py:meth:`~mrjob.job.MRJob.pass_arg_through`. The old methods with "option" in their name are deprecated but still work. As part of this refactor, `OptionStore` and its subclasses have been removed; options are now handled by runners directly. Chunks, not lines ^^^^^^^^^^^^^^^^^ mrjob no longer assumes that job output will be line-based. If you :ref:`run your job programmatically `, you should read your job output with :py:meth:`~mrjob.runner.MRJobRunner.cat_output`, which yields bytestrings which don't necessarily correspond to lines, and run these through :py:meth:`~mrjob.job.MRJob.parse_output`, which will convert them into key/value pairs. ``runner.fs.cat()`` also now yields arbitrary bytestrings, not lines. When it yields from multiple files, it will yield an empty bytestring (``b''``) between the chunks from each file. :py:func:`~mrjob.util.read_file` and :py:func:`~mrjob.util.read_input` are now deprecated because they are line-based. Try :py:func:`~mrjob.cat.decompress`, :py:func:`~mrjob.cat.to_chunks`, and :py:func:`~mrjob.util.to_lines`. Better local/inline mode ^^^^^^^^^^^^^^^^^^^^^^^^ The sim runners (``inline`` and ``local`` mode) have been completely rewritten, making it possible to fix a number of outstanding issues. Local mode now runs one mapper/reducer per CPU, using :py:mod:`multiprocessing`, for faster results.
We only sort by reducer key (not the full line) unless :py:attr:`~mrjob.job.SORT_VALUES` is set, exposing bad assumptions sooner. The :mrjob-opt:`step_output_dir` option is now supported, making it easier to debug issues in intermediate steps. Files in tasks' (e.g. mappers') working directories are marked user-executable, to better imitate the Hadoop Distributed Cache. When possible, we also symlink to a copy of each file/archive in the "cache," rather than copying them. If :py:func:`os.symlink` raises an exception, we fall back to copying (this can be an issue in Python 3 on Windows). Tasks are run more like they are in Hadoop; input is passed through stdin, rather than as script arguments. :py:mod:`mrjob.cat` is no longer executable because local mode no longer needs it. Cloud runner improvements ^^^^^^^^^^^^^^^^^^^^^^^^^ Much of the common code for the "cloud" runners (Dataproc and EMR) has been merged, so that new features can be rolled out in parallel. The :mrjob-opt:`bootstrap` option (for both Dataproc and EMR) can now take archives and directories as well as files, like the :mrjob-opt:`setup` option has since version :ref:`v0.5.8`. The :mrjob-opt:`extra_cluster_params` option allows you to pass arbitrary JSON to the API at cluster create time (in Dataproc and EMR). The old `emr_api_params` option is deprecated and disabled. `max_hours_idle` has been replaced with :mrjob-opt:`max_mins_idle` (the old option is deprecated but still works). The default is 10 minutes. Due to a bug, smaller numbers of minutes might cause the cluster to terminate before the job runs. It is no longer possible for mrjob to launch a cluster that sits idle indefinitely (except by setting :mrjob-opt:`max_mins_idle` to an unreasonably high value). It is still a good idea to run :ref:`report-long-jobs` because mrjob can't tell if a running job is doing useful work or has stalled. EMR now bills by the second, not the hour ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Elastic MapReduce recently stopped billing by the full hour, and now bills by the second. This means that :ref:`cluster-pooling` is no longer a cost-saving strategy, though developers might find it handy to reduce wait times when testing. The `mins_to_end_of_hour` option no longer makes sense, and has been deprecated and disabled. :ref:`audit-emr-usage` has been updated to use billing by the second when approximating time billed and waste. .. note:: Pooling was enabled by default for some development versions of v0.6.0, prior to the billing change. This did not make it into the release; you must still explicitly turn on :ref:`cluster pooling `. Other EMR changes ^^^^^^^^^^^^^^^^^ The default AMI is now 5.8.0. Note that this means you get Spark 2 by default. Regions are now case-sensitive, and the ``EU`` alias for ``eu-west-1`` no longer works. Pooling no longer adds dummy arguments to the master bootstrap script, instead setting the ``__mrjob_pool_hash`` and ``__mrjob_pool_name`` tags on the cluster. mrjob automatically adds the ``__mrjob_version`` tag to clusters it creates. Jobs will not add tags to clusters they join (as opposed to clusters they create). :mrjob-opt:`enable_emr_debugging` now works on AMI 4.x and later. AMI 2.4.2 and earlier are no longer supported (no Python 2.7). There is no longer any special logic for the "latest" AMI alias (which the API no longer supports). The SSH filesystem no longer dumps file contents to memory. Pooling will only join a cluster with enough *running* instances to meet its specifications; *requested* instances no longer count.
Pooling is now aware of EBS (disk) setup. Pooling won't join a cluster that has extra instance types that don't have enough memory or disk space to run your job. Errors in bootstrapping scripts are no longer dumped as JSON. `visible_to_all_users` is deprecated. Massive purge of deprecated code ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ About a hundred functions, methods, options, and more that were deprecated in v0.5.x have been removed. See `CHANGES.txt `_ for details. .. _v0.5.12: 0.5.12 ------ `This release came out after v0.6.3. It was mostly a backport from v0.6.x.` Python 2.6 and 3.3 are no longer supported. :py:func:`mrjob.parse.parse_s3_uri` handles ``s3a://`` URIs. :ref:`terminate-idle-clusters` now skips termination-protected idle clusters, rather than crashing. Since `Amazon no longer bills by the full hour `__, the `mins_to_end_of_hour` option now defaults to 60, effectively disabling it. When mrjob passes an environment dictionary to subprocesses, it ensures that the keys and values are always :py:class:`str`\s (this mostly affects Python 2 on Windows). .. _v0.5.11: 0.5.11 ------ The :ref:`report-long-jobs` utility can now ignore certain clusters based on EMR tags. This version deals more gracefully with clusters that use instance fleets, preventing crashes that may occur in some rare edge cases. .. _v0.5.10: 0.5.10 ------ Fixed an issue where bootstrapping mrjob on Dataproc or EMR could stall if mrjob was already installed. The `aws_security_token` option has been renamed to :mrjob-opt:`aws_session_token`. If you want to set it via an environment variable, you still have to use :envvar:`$AWS_SECURITY_TOKEN` because that's what boto uses. Added protocol support for :py:mod:`rapidjson`; see :py:class:`~mrjob.protocol.RapidJSONProtocol` and :py:class:`~mrjob.protocol.RapidJSONValueProtocol`. If available, :py:mod:`rapidjson` will be used as the default JSON implementation if :py:mod:`ujson` is not installed. The master bootstrap script on EMR and Dataproc now has the correct file extension (``.sh``, not ``.py``). .. _v0.5.9: 0.5.9 ----- Fixed a bug that prevented :mrjob-opt:`setup` scripts from working on EMR AMIs 5.2.0 and later. Our workaround should be completely transparent unless you use a custom shell binary; see :mrjob-opt:`sh_bin` for details. The EMR runner now correctly re-starts the SSH tunnel to the job tracker/resource manager when a cluster it tries to run a job on auto-terminates. It also no longer requires a working SSH tunnel to fetch job progress (you still need working SSH; see :mrjob-opt:`ec2_key_pair_file`). The `emr_applications` option has been renamed to :mrjob-opt:`applications`. The :ref:`terminate-idle-clusters` utility is now slightly more robust in cases where your S3 temp directory is in a different region from your clusters. Finally, there are a couple of changes that probably only matter if you're trying to wrap your Hadoop tasks (mappers, reducers, etc.) in :command:`docker`: * You can set *just* the python binary for tasks with :mrjob-opt:`task_python_bin`. This allows you to use a wrapper script in place of Python without perturbing :mrjob-opt:`setup` scripts. * Local mode now no longer relies on an absolute path to access the :py:mod:`mrjob.cat` utility it uses to handle compressed input files; copying the job's working directory into Docker is enough. .. _v0.5.8: 0.5.8 ----- You can now pass directories to jobs, either directly with the :mrjob-opt:`upload_dirs` option, or through :mrjob-opt:`setup` commands. For example: ..
code-block:: sh --setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#' mrjob will automatically tarball these directories and pass them to Hadoop as archives. For multi-step jobs, you can now specify where inter-step output goes with :mrjob-opt:`step_output_dir` (``--step-output-dir``), which can be useful for debugging. All :py:mod:`job step types ` now take the *jobconf* keyword argument to set Hadoop properties for that step. Jobs' ``--help`` printout is now better-organized and less verbose. Made several fixes to pre-filters (commands that pipe into streaming steps): * you can once again add pre-filters to a single-step job by re-defining :py:meth:`~mrjob.job.MRJob.mapper_pre_filter`, :py:meth:`~mrjob.job.MRJob.combiner_pre_filter`, and/or :py:meth:`~mrjob.job.MRJob.reducer_pre_filter` * local mode now ignores non-zero return codes from pre-filters (this matters for BSD grep) * local mode can now run pre-filters on compressed input files mrjob now respects :mrjob-opt:`sh_bin` when it needs to wrap a command in ``sh`` before passing it to Hadoop (e.g. to support pipes). On EMR, mrjob now fetches logs from task nodes when determining probable cause of error, not just core nodes (the ones that run tasks and host HDFS). Several unused functions in :py:mod:`mrjob.util` are now deprecated: * :py:func:`~mrjob.util.args_for_opt_dest_subset` * :py:func:`~mrjob.util.bash_wrap` * :py:func:`~mrjob.util.populate_option_groups_with_options` * :py:func:`~mrjob.util.scrape_options_and_index_by_dest` * :py:func:`~mrjob.util.tar_and_gzip` :py:func:`~mrjob.cat.bunzip2_stream` and :py:func:`~mrjob.cat.gunzip_stream` have been moved from :py:mod:`mrjob.util` to :py:mod:`mrjob.cat`. :py:meth:`SSHFilesystem.ssh_slave_hosts() ` has been deprecated. Option group attributes in :py:class:`~mrjob.job.MRJob`\s have been deprecated, as has the :py:meth:`~mrjob.job.MRJob.get_all_option_groups` method. .. _v0.5.7: 0.5.7 ----- Spark and related changes ^^^^^^^^^^^^^^^^^^^^^^^^^ mrjob now supports running Spark jobs on your own Hadoop cluster or Elastic MapReduce. mrjob provides significant benefits over Spark's built-in Python support; see :ref:`why-mrjob-with-spark` for details. Added the :mrjob-opt:`py_files` option, to put `.zip` or `.egg` files in your job's ``PYTHONPATH``. This is based on a Spark feature, but it works with streaming jobs as well. mrjob is now bootstrapped (see :mrjob-opt:`bootstrap_mrjob`) as a `.zip` file rather than a tarball. If, for some reason, the bootstrapped mrjob library won't compile, you'll get much cleaner error messages. The default AMI version on EMR (see :mrjob-opt:`image_version`) has been bumped from 3.11.0 to 4.8.2, as 3.11.0's Spark support is spotty. On EMR, mrjob now defaults to the cheapest instance type that will work (see :mrjob-opt:`instance_type`). In most cases, this is ``m1.medium``, but it needs to be ``m1.large`` for Spark worker nodes. Cluster pooling ^^^^^^^^^^^^^^^ mrjob can now add up to 1,000 steps on :ref:`pooled clusters ` on EMR (except on very old AMIs). mrjob now prints debug messages explaining why your job matched a particular pooled cluster when running in verbose mode (the ``-v`` option). Fixed a bug that caused pooling to fail when there was no need for a master bootstrap script (e.g. when running with ``--no-bootstrap-mrjob``). Other improvements ^^^^^^^^^^^^^^^^^^ Log interpretation is much more efficient at determining a job's probable cause of failure (this works with Spark as well).
When running custom JARs (see :py:class:`~mrjob.step.JarStep`), mrjob now respects :mrjob-opt:`libjars` and :mrjob-opt:`jobconf`. The :mrjob-opt:`hadoop_streaming_jar` option now supports environment variables and ``~``. The :ref:`terminate-idle-clusters` tool now works with all step types, including Spark. (It's still recommended that you rely on the `max_hours_idle` option rather than this tool.) mrjob now works in Anaconda3 Jupyter Notebook. Bugfixes ^^^^^^^^ Added several missing command-line switches, including ``--no-bootstrap-python`` on Dataproc. Made a major refactor that should prevent these kinds of issues in the future. Fixed a bug that caused mrjob to crash when the ssh binary (see :mrjob-opt:`ssh_bin`) was missing or not executable. Fixed a bug that erroneously reported failed or just-started jobs as 100% complete. Fixed a bug where timestamps were erroneously recognized as URIs. mrjob now only recognizes strings containing ``://`` as URIs (see :py:func:`~mrjob.parse.is_uri`). Deprecation ^^^^^^^^^^^ The following are deprecated and will be removed in v0.6.0: * :py:class:`~mrjob.step.JarStep`.``INPUT``; use :py:data:`mrjob.step.INPUT` instead * :py:class:`~mrjob.step.JarStep`.``OUTPUT``; use :py:data:`mrjob.step.OUTPUT` instead * non-strict protocols (see `strict_protocols`) * the *python_archives* option (try :ref:`this ` instead) * :py:func:`~mrjob.parse.is_windows_path` * :py:func:`~mrjob.parse.parse_key_value_list` * :py:func:`~mrjob.parse.parse_port_range_list` * :py:func:`~mrjob.util.scrape_options_into_new_groups` .. _v0.5.6: 0.5.6 ----- Fixed a critical bug that caused the Dataproc runner to always crash when determining the Hadoop version. Log interpretation now prioritizes task errors (e.g. a traceback from your Python script) as the probable cause of failure, even if they aren't the most recent error. Log interpretation will now continue to download and parse task logs until it finds a non-empty stderr log. Log interpretation also strips the "subprocess failed" Java stack trace that appears in task stderr logs from Hadoop 1. .. _v0.5.5: 0.5.5 ----- Functionally equivalent to :ref:`v0.5.4`, except that it restores the deprecated *ami_version* option as an alias for :mrjob-opt:`image_version`, making it easier to upgrade from earlier versions of mrjob. Also slightly improves :ref:`cluster-pooling` on EMR with updated information on memory and CPU power of various EC2 instance types, and by treating application names (e.g. "Spark") as case-insensitive. .. _v0.5.4: 0.5.4 ----- Pooling and idle cluster self-termination ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. warning:: This release accidentally removed the *ami_version* option instead of merely deprecating it. If you are upgrading from an earlier version of mrjob, use version :ref:`v0.5.5` or later. This release resolves a long-standing EMR API race condition that made it difficult to use :ref:`cluster-pooling` and idle cluster self-termination (see `max_hours_idle`) together. Now, if your pooled job unknowingly runs on a cluster that was in the process of shutting down, it will detect that and re-launch the job on a different cluster. This means pretty much *everyone* running jobs on EMR should now enable pooling, with a configuration like this: .. code-block:: yaml runners: emr: max_hours_idle: 1 pool_clusters: true You may *also* run the :ref:`terminate-idle-clusters` script periodically, but (barring any bugs) this shouldn't be necessary. ..
_generic-emr-option-names: Generic EMR option names ^^^^^^^^^^^^^^^^^^^^^^^^ Many options to the :doc:`EMR runner ` have been made more generic, to make it easier to share code with the :doc:`Dataproc runner ` (in most cases, the new names are also shorter and easier to remember): =============================== ====================================== old option name new option name =============================== ====================================== *ami_version* :mrjob-opt:`image_version` *aws_availability_zone* :mrjob-opt:`zone` *aws_region* :mrjob-opt:`region` *check_emr_status_every* :mrjob-opt:`check_cluster_every` *ec2_core_instance_bid_price* :mrjob-opt:`core_instance_bid_price` *ec2_core_instance_type* :mrjob-opt:`core_instance_type` *ec2_instance_type* :mrjob-opt:`instance_type` *ec2_master_instance_bid_price* :mrjob-opt:`master_instance_bid_price` *ec2_master_instance_type* :mrjob-opt:`master_instance_type` *ec2_slave_instance_type* :mrjob-opt:`core_instance_type` *ec2_task_instance_bid_price* :mrjob-opt:`task_instance_bid_price` *ec2_task_instance_type* :mrjob-opt:`task_instance_type` *emr_tags* :mrjob-opt:`tags` *num_ec2_core_instances* :mrjob-opt:`num_core_instances` *num_ec2_task_instances* :mrjob-opt:`num_task_instances` *s3_log_uri* :mrjob-opt:`cloud_log_dir` *s3_sync_wait_time* :mrjob-opt:`cloud_fs_sync_secs` *s3_tmp_dir* :mrjob-opt:`cloud_tmp_dir` *s3_upload_part_size* *cloud_upload_part_size* =============================== ====================================== The old option names and command-line switches are now deprecated but will continue to work until v0.6.0. (Exception: *ami_version* was accidentally removed; if you need it, use :ref:`v0.5.5` or later.) `num_ec2_instances` has simply been deprecated (it's just :mrjob-opt:`num_core_instances` plus one). :mrjob-opt:`hadoop_streaming_jar_on_emr` has also been deprecated; in its place, you can now pass a ``file://`` URI to :mrjob-opt:`hadoop_streaming_jar` to reference a path on the master node. Log interpretation ^^^^^^^^^^^^^^^^^^ Log interpretation (counters and probable cause of job failure) on Hadoop is more robust, handling a wider variety of log4j formats and recovering more gracefully from permissions errors. This includes fixing a crash that could happen on Python 3 when attempting to read data from HDFS. Log interpretation used to be partially broken on EMR AMI 4.3.0 and later due to a permissions issue; this is now fixed. pass_through_option() ^^^^^^^^^^^^^^^^^^^^^ You can now pass through *existing* command-line switches to your job; for example, you can tell a job which runner launched it. See :py:meth:`~mrjob.job.MRJob.pass_through_option` for details. If you *don't* do this, ``self.options.runner`` will now always be ``None`` in your job (it used to confusingly default to ``'inline'``). Stop logging credentials ^^^^^^^^^^^^^^^^^^^^^^^^ When mrjob is run in verbose mode (the ``-v`` option), the values of all runner options are debug-logged to stderr. This has been the case since the very early days of mrjob. Unfortunately, this means that if you set your AWS credentials in :file:`mrjob.conf`, they get logged as well, creating a surprising potential security vulnerability. (This doesn't happen for AWS credentials set through environment variables.) Starting in this version, the values of :mrjob-opt:`aws_secret_access_key` and `aws_security_token` are shown as ``'...'`` if they are set, and all but the last four characters of :mrjob-opt:`aws_access_key_id` are blanked out as well (e.g. ``'...YNDR'``).
Other improvements and bugfixes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The SSH tunnel to the resource manager on EMR (see :mrjob-opt:`ssh_tunnel`) now connects to its correct *internal* IP; this resolves a firewall issue that existed on some VPC setups. Uploaded files will no longer be given names starting with ``_`` or ``.``, since Hadoop's input processing treats these files as "hidden". The EMR idle cluster self-termination script (see `max_hours_idle`) now only runs on the master node. The :ref:`audit-emr-usage` command-line tool should no longer constantly trigger throttling warnings. :mrjob-opt:`bootstrap_python` no longer bothers trying to install Python 3 on EMR AMI 4.6.0 and later, since it is already installed. The ``--ssh-bind-ports`` command-line switch was broken (starting in :ref:`v0.4.5`!), and is now fixed. .. _v0.5.3: 0.5.3 ----- This release adds support for custom :mrjob-opt:`libjars` (such as `nicknack `__), allowing easy access to custom input and output formats. This works on Hadoop and EMR (including on a cluster that's already running). In addition, jobs can specify needed libjars by setting the :py:attr:`~mrjob.job.MRJob.LIBJARS` attribute or overriding the :py:meth:`~mrjob.job.MRJob.libjars` method. For examples, see :ref:`input-and-output-formats`. The Hadoop runner now tries *even harder* to find your log files without needing additional configuration (see :mrjob-opt:`hadoop_log_dirs`). The EMR runner now supports Amazon VPC subnets (see :mrjob-opt:`subnet`), and, on 4.x AMIs, Application Configurations (see :mrjob-opt:`emr_configurations`). If your EMR cluster fails during bootstrapping, mrjob can now determine the probable cause of failure. There are also some minor improvements to SSH tunneling and a handful of small bugfixes; see `CHANGES.txt `_ for details. .. _v0.5.2: 0.5.2 ----- This release adds basic support for `Google Cloud Dataproc `_, which is Google's Hadoop service, roughly analogous to EMR. See :doc:`guides/dataproc-quickstart`. Some features are not yet implemented: * fetching counters * finding probable cause of errors * running Java JARs as steps Added the `emr_applications` option, which helps you configure 4.x AMIs. Fixed an EMR bug (introduced in v0.5.0) where we were waiting for steps to complete in the wrong order (in a multi-step job, we wouldn't register that the first step had finished until the last one had). Fixed a bug in SSH tunneling (introduced in v0.5.0) that made connections to the job tracker/resource manager on EMR time out when running on a 2.x AMI inside a VPC (Virtual Private Cloud). Fixed a bug (introduced in v0.4.6) that kept mrjob from interpreting ``~`` (home directory) in includes in :file:`mrjob.conf`. It is now again possible to run tool modules deprecated in v0.5.0 directly (e.g. :command:`python -m mrjob.tools.emr.create_job_flow`). This is still a deprecated feature; it's recommended that you use the appropriate :command:`mrjob` subcommand instead (e.g. :command:`mrjob create-cluster`). .. _v0.5.1: 0.5.1 ----- Fixes a bug in the previous release that broke :py:attr:`~mrjob.job.MRJob.SORT_VALUES` and any other attempt by the job to set the partitioner. The ``--partitioner`` switch is now deprecated (the choice of partitioner is part of your job semantics). Fixes a bug in the previous release that caused `strict_protocols` and :mrjob-opt:`check_input_paths` to be ignored in :file:`mrjob.conf`.
(We would much prefer you fixed jobs that are using "loose protocols" rather than setting ``strict_protocols: false`` in your config file, but we didn't break this on purpose, we promise!)

``mrjob terminate-idle-clusters`` now correctly handles EMR debugging steps (see :mrjob-opt:`enable_emr_debugging`) set up by boto 2.40.0.

Fixed a bug that could result in showing a blank probable cause of error for pre-YARN (Hadoop 1) jobs.

:mrjob-opt:`ssh_bind_ports` now defaults to a ``range`` object (``xrange`` on Python 2), so that when you run on EMR in verbose mode (``-r emr -v``), debug logging devotes one line to the value of ``ssh_bind_ports`` rather than 840.

.. _v0.5.0:

0.5.0
-----

Python versions
^^^^^^^^^^^^^^^

mrjob now fully supports Python 3.3+ in a way that should be transparent to existing Python 2 users (you don't have to suddenly start handling ``unicode`` instead of ``str``). For more information, see :doc:`guides/py2-vs-py3`.

If you run a job with Python 3, mrjob will automatically install Python 3 on Elastic MapReduce AMIs (see :mrjob-opt:`bootstrap_python`).

When you run jobs on EMR in Python 2, mrjob attempts to match your minor version of Python as well (either :command:`python2.6` or :command:`python2.7`); see :mrjob-opt:`python_bin` for details.

.. note::

   If you're currently running Python 2.7, and :ref:`using yum to install python libraries `, you'll want to use the Python 2.7 version of the package (e.g. ``python27-numpy`` rather than ``python-numpy``).

The :command:`mrjob` command is now installed with Python-version-specific aliases (e.g. :command:`mrjob-3`, :command:`mrjob-3.4`), in case you install mrjob for multiple versions of Python.

Hadoop
^^^^^^

mrjob should now work out of the box on almost any Hadoop setup. If :command:`hadoop` is in your path, or you set any commonly-used :envvar:`$HADOOP_*` environment variable, mrjob will find the Hadoop binary, the streaming jar, and your logs, without any help on your part (see :mrjob-opt:`hadoop_bin`, :mrjob-opt:`hadoop_log_dirs`, :mrjob-opt:`hadoop_streaming_jar`).

mrjob has been updated to fully support Hadoop 2 (YARN), including many updates to :py:class:`~mrjob.fs.hadoop.HadoopFilesystem`. Hadoop 1 is still supported, though anything prior to Hadoop 0.20.203 is not (mrjob is actually a few months older than Hadoop 0.20.203, so this used to matter).

3.x and 4.x AMIs
^^^^^^^^^^^^^^^^

mrjob now fully supports the 3.x and 4.x Elastic MapReduce AMIs, including SSH tunneling to the resource manager, fetching counters, and finding probable cause of job failure.

The default `ami_version` (see :mrjob-opt:`image_version`) is now ``3.11.0``. Our plan is to continue updating this to the latest (non-broken) 3.x AMI for each 0.5.x release of mrjob.

The default :mrjob-opt:`instance_type` is now ``m1.medium`` (``m1.small`` is too small for the 3.x and 4.x AMIs).

You can specify 4.x AMIs with either the new :mrjob-opt:`release_label` option, or continue using `ami_version`; both work.

mrjob continues to support 2.x AMIs. However:

.. warning::

   2.x AMIs are deprecated by AWS, and based on a very old version of Debian (squeeze), which breaks :command:`apt-get` and exposes you to security holes. Please, please switch if you haven't already.

AWS Regions
^^^^^^^^^^^

The new default `aws_region` (see :mrjob-opt:`region`) is ``us-west-2`` (Oregon). This both matches the default in the EMR console and, according to Amazon, is `carbon neutral `__.

An edge case that might affect you: EC2 key pairs (i.e.
SSH credentials) are region-specific, so if you've set up SSH but not explicitly specified a region, you may get an error saying your key pair is invalid. The fix is simply to :ref:`create new SSH keys ` for the ``us-west-2`` (Oregon) region.

S3
^^^

mrjob is much smarter about the way it interacts with S3:

- automatically creates temp bucket in the same region as jobs
- connects to S3 buckets on the endpoint matching their region (no more 307 errors)
- :py:class:`~mrjob.emr.EMRJobRunner` and :py:class:`~mrjob.fs.s3.S3Filesystem` methods no longer take ``s3_conn`` args (passing around a single S3 connection no longer makes sense)
- no longer uses the temp bucket's location to choose where you run your job
- :py:meth:`~mrjob.fs.s3.S3Filesystem.rm` no longer has special logic for ``*_$folder$`` keys
- :py:meth:`~mrjob.fs.s3.S3Filesystem.ls` recurses "subdirectories" even if you pass it a URI without a trailing slash

Log interpretation
^^^^^^^^^^^^^^^^^^

The part of mrjob that fetches counters and tells you what probably caused your job to fail was basically unmaintainable and has been totally rewritten. Not only do we now have solid support across Hadoop and EMR AMI versions, but if we missed anything, it should be straightforward to add it.

One casualty of this change was the :command:`mrjob fetch-logs` command, which means mrjob no longer offers a way to fetch or interpret logs from a *past* job. We do plan to re-introduce this functionality.

Protocols
^^^^^^^^^

Protocols are now strict by default (they simply raise an exception on unencodable data). "Loose" protocols can be re-enabled with the ``--no-strict-protocols`` switch; see `strict_protocols` for why this is a bad idea.

Protocols will now use the much faster :py:mod:`ujson` library, if installed, to encode and decode JSON. This is especially recommended for simple jobs that spend a significant fraction of their time encoding and decoding data.

.. note::

   If you're using EMR, try out :ref:`this bootstrap recipe ` to install :py:mod:`ujson`.

mrjob will fall back to the :py:mod:`simplejson` library if :py:mod:`ujson` is not installed, and use the built-in ``json`` module if neither is installed.

You can now explicitly specify which JSON implementation you wish to use (e.g. :py:class:`~mrjob.protocol.StandardJSONProtocol`, :py:class:`~mrjob.protocol.SimpleJSONProtocol`, :py:class:`~mrjob.protocol.UltraJSONProtocol`).

Status messages
^^^^^^^^^^^^^^^

We've tried to cut the logging messages that your job prints as it runs down to the basics (either useful info, like where a temp directory is, or something that tells you why you're waiting). If there are any messages you miss, try running your job with ``-v``.

When a step in your job fails, mrjob no longer prints a useless stacktrace telling you where in the code the runner raised an exception about your step failing. This is thanks to :py:class:`~mrjob.step.StepFailedException`, which you can also catch and interpret if you're :ref:`running jobs programmatically `.

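If you do run jobs programmatically, catching the new exception looks something like this (a minimal sketch; the job module and input path here are hypothetical):

.. code-block:: python

    from mrjob.step import StepFailedException

    from mr_word_count import MRWordCount  # hypothetical job module

    job = MRWordCount(['-r', 'emr', 'input.txt'])
    with job.make_runner() as runner:
        try:
            runner.run()
        except StepFailedException as ex:
            # ex describes which step failed; no runner stacktrace needed
            print('step failed: %s' % ex)
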
.. _v0.5.0-deprecation:

Deprecation
^^^^^^^^^^^

Many things that were deprecated in 0.4.6 have been removed:

- options:

  - :py:data:`~mrjob.runner.IF_SUCCESSFUL` :mrjob-opt:`cleanup` option (use :py:data:`~mrjob.runner.ALL`)
  - *iam_job_flow_role* (use :mrjob-opt:`iam_instance_profile`)

- functions and methods:

  - positional arguments to :py:meth:`mrjob.job.MRJob.mr()` (don't even use :py:meth:`~mrjob.job.MRJob.mr()`; use :py:class:`mrjob.step.MRStep`)
  - ``mrjob.job.MRJob.jar()`` (use :py:class:`mrjob.step.JarStep`)
  - *step_args* and *name* arguments to :py:class:`mrjob.step.JarStep` (use *args* instead of *step_args*, and don't use *name* at all)
  - :py:class:`mrjob.step.MRJobStep` (use :py:class:`mrjob.step.MRStep`)
  - :py:func:`mrjob.compat.get_jobconf_value` (use :py:func:`~mrjob.compat.jobconf_from_env`)
  - :py:meth:`mrjob.job.MRJob.parse_counters`
  - :py:meth:`mrjob.job.MRJob.parse_output`
  - :py:func:`mrjob.conf.combine_cmd_lists`
  - :py:meth:`mrjob.fs.s3.S3Filesystem.get_s3_folder_keys`

:py:mod:`mrjob.compat` functions :py:func:`~mrjob.compat.supports_combiners_in_hadoop_streaming`, :py:func:`~mrjob.compat.supports_new_distributed_cache_options`, and :py:func:`~mrjob.compat.uses_generic_jobconf`, which only existed to support very old versions of Hadoop, were removed without deprecation warnings (sorry!).

To avoid a similar wave of deprecation warnings in the future, the name of every part of mrjob that isn't meant to be a stable interface provided by the library now starts with an underscore. You can still use these things (or copy them; it's Open Source), but there's no guarantee they'll exist in the next release.

If you want to get ahead of the game, here is a list of things that are deprecated starting in mrjob 0.5.0 (do these *after* upgrading mrjob):

- options:

  - *base_tmp_dir* is now :mrjob-opt:`local_tmp_dir`
  - :mrjob-opt:`cleanup` options :py:data:`~mrjob.runner.LOCAL_SCRATCH` and :py:data:`~mrjob.runner.REMOTE_SCRATCH` are now :py:data:`~mrjob.runner.LOCAL_TMP` and :py:data:`~mrjob.runner.REMOTE_TMP`
  - *emr_job_flow_id* is now :mrjob-opt:`cluster_id`
  - *emr_job_flow_pool_name* is now :mrjob-opt:`pool_name`
  - *hdfs_scratch_dir* is now :mrjob-opt:`hadoop_tmp_dir`
  - *pool_emr_job_flows* is now :mrjob-opt:`pool_clusters`
  - *s3_scratch_uri* is now :mrjob-opt:`cloud_tmp_dir`
  - *ssh_tunnel_to_job_tracker* is now simply :mrjob-opt:`ssh_tunnel`

- functions and methods:

  - :py:meth:`mrjob.job.MRJob.is_mapper_or_reducer` is now :py:meth:`~mrjob.job.MRJob.is_task`
  - :py:class:`~mrjob.fs.base.Filesystem` method ``path_exists()`` is now simply :py:meth:`~mrjob.fs.base.Filesystem.exists`
  - :py:class:`~mrjob.fs.base.Filesystem` method ``path_join()`` is now simply :py:meth:`~mrjob.fs.base.Filesystem.join`
  - Use ``runner.fs`` explicitly when accessing filesystem methods (e.g. ``runner.fs.ls()``, not ``runner.ls()``)

- :command:`mrjob` subcommands:

  - :command:`mrjob create-job-flow` is now :command:`mrjob create-cluster`
  - :command:`mrjob terminate-idle-job-flows` is now :command:`mrjob terminate-idle-clusters`
  - :command:`mrjob terminate-job-flow` is now :command:`mrjob terminate-cluster`

Other changes
^^^^^^^^^^^^^

- mrjob now requires ``boto`` 2.35.0 or newer (chances are you're already doing this). Later 0.5.x releases of mrjob may require newer versions of ``boto``.
- `visible_to_all_users` now defaults to ``True``
- ``HadoopFilesystem.rm()`` uses ``-skipTrash``
- new :mrjob-opt:`iam_endpoint` option
- custom :mrjob-opt:`hadoop_streaming_jar`\ s are properly uploaded
- :py:data:`~mrjob.runner.JOB` :mrjob-opt:`cleanup` on EMR is temporarily disabled
- mrjob now follows symlinks when :py:meth:`~mrjob.fs.local.LocalFilesystem.ls`\ ing the local filesystem (beware recursive symlinks!)
- The `interpreter` option disables :mrjob-opt:`bootstrap_mrjob` by default (`interpreter` is meant for non-Python jobs)
- :ref:`cluster-pooling` now respects :mrjob-opt:`ec2_key_pair`
- cluster self-termination (see `max_hours_idle`) now respects non-streaming jobs
- :py:class:`~mrjob.fs.local.LocalFilesystem` now rejects URIs rather than interpreting them as local paths
- ``local`` and ``inline`` runners no longer have a default :mrjob-opt:`hadoop_version`, instead handling :mrjob-opt:`jobconf` in a version-agnostic way
- `steps_python_bin` now defaults to the current Python interpreter
- minor changes to :py:mod:`mrjob.util`:

  - :py:func:`~mrjob.util.file_ext` takes a filename, not a path
  - :py:func:`~mrjob.util.gunzip_stream` now yields chunks of bytes, not lines
  - moved :py:func:`~mrjob.util.random_identifier` here from :py:mod:`mrjob.aws`
  - ``buffer_iterator_to_line_iterator()`` is now named :py:func:`~mrjob.util.to_lines`, and no longer appends a trailing newline to data

0.4.6
-----

``include:`` in conf files can now use relative paths in a meaningful way. See :ref:`configs-relative-includes`.

List and environment variable options loaded from included config files can be totally overridden using the ``!clear`` tag. See :ref:`clearing-configs`.

Options that take lists (e.g. :mrjob-opt:`setup`) now treat scalar values as single-item lists. See :ref:`this example `.

Fixed a bug that kept the ``pool_wait_minutes`` option from being loaded from config files.

.. _v0.4.5:

0.4.5
-----

This release moves mrjob off the deprecated `DescribeJobFlows `_ EMR API call.

.. warning::

   AWS *again* broke older versions of mrjob for at least some new accounts, by returning 400s for the deprecated `DescribeJobFlows `_ API call. If you have a newer AWS account (circa July 2015), you must use at least this version of mrjob.

The new API does not provide a way to tell when a job flow (now called a "cluster") stopped provisioning instances and started bootstrapping, so the clock for our estimates of when we are close to the end of a billing hour now starts at cluster creation time, making those estimates more conservative.

Related to this change, :py:mod:`~mrjob.tools.emr.terminate_idle_job_flows` no longer considers job flows in the ``STARTING`` state idle; use :py:mod:`~mrjob.tools.emr.report_long_jobs` to catch jobs stuck in this state.

:py:mod:`~mrjob.tools.emr.terminate_idle_job_flows` performs much better on large numbers of job flows. Formerly, it collected all job flow information first, but now it terminates idle job flows as soon as it identifies them.

:py:mod:`~mrjob.tools.emr.collect_emr_stats` and :py:mod:`~mrjob.tools.emr.job_flow_pool` have *not* been ported to the new API and will be removed in v0.5.0.

Added an `aws_security_token` option to allow you to run mrjob on EMR using temporary AWS credentials.

Added an `emr_tags` option (see :mrjob-opt:`tags`) to allow you to tag EMR job flows at creation time.

:py:class:`~mrjob.emr.EMRJobRunner` now has a :py:meth:`~mrjob.emr.EMRJobRunner.get_ami_version` method.

The :mrjob-opt:`hadoop_version` option no longer has any effect in EMR.
This option only ever did anything on the 1.x AMIs, which mrjob no longer supports.

Added many missing switches to the EMR tools (accessible from the :command:`mrjob` command). Formerly, you had to use a config file to get at these options.

You can now access the :py:mod:`~mrjob.tools.emr.mrboss` tool from the command line: :command:`mrjob boss `.

Previous 0.4.x releases have worked with boto as old as 2.2.0, but this one requires at least boto 2.6.0 (which is still more than two years old). In any case, it's recommended that you just use the latest version of boto.

This branch has a number of additional deprecation warnings, to help prepare you for mrjob v0.5.0. Please heed them; a lot of deprecated things really are going to be completely removed.

0.4.4
-----

mrjob now automatically creates and uses IAM objects as necessary to comply with `new requirements from Amazon Web Services `_. (You do not need to install the AWS CLI or run ``aws emr create-default-roles`` as the link above describes; mrjob takes care of this for you.)

.. warning::

   The change that AWS made essentially broke all older versions of mrjob for all new accounts. If the first time your AWS account created an Elastic MapReduce cluster was on or after April 6, 2015, you should use at least this version of mrjob. If you *must* use an old version of mrjob with a new AWS account, see `this thread `_ for a possible workaround.

``--iam-job-flow-role`` has been renamed to ``--iam-instance-profile``.

New ``--iam-service-role`` option.

0.4.3
-----

This release also contains many, many bugfixes, one of which probably affects you! See `CHANGES.txt `_ for details.

Added a new subcommand, ``mrjob collect-emr-active-stats``, to collect stats about active jobflows and instance counts.

``--iam-job-flow-role`` option allows setting of a specific IAM role to run this job flow.

You can now use ``--check-input-paths`` and ``--no-check-input-paths`` on EMR as well as Hadoop.

Files larger than 100MB will be uploaded to S3 using multipart upload if you have the `filechunkio` module installed. You can change the limit/part size with the ``--s3-upload-part-size`` option, or disable multipart upload by setting this option to 0.

.. _ready-for-strict-protocols:

You can now require protocols to be strict from :ref:`mrjob.conf `; this means unencodable input/output will result in an exception rather than the job quietly incrementing a counter. It is recommended you set this for all runners:

.. code-block:: yaml

    runners:
      emr:
        strict_protocols: true
      hadoop:
        strict_protocols: true
      inline:
        strict_protocols: true
      local:
        strict_protocols: true

You can use ``--no-strict-protocols`` to turn off strict protocols for a particular job.

Tests now support pytest and tox.

Support for Python 2.5 has been dropped.

0.4.2
-----

JarSteps, previously experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as `Mahout `_ in your MRJobs. For more information, see :ref:`non-hadoop-streaming-jar-steps`.

Many options for setting up your task's environment (``--python-archive``, ``--setup-cmd`` and ``--setup-script``) have been replaced by a powerful ``--setup`` option. See the :doc:`guides/setup-cookbook` for examples.

Similarly, many options for bootstrapping nodes on EMR (``--bootstrap-cmd``, ``--bootstrap-file``, ``--bootstrap-python-package`` and ``--bootstrap-script``) have been replaced by a single ``--bootstrap`` option. See the :doc:`guides/emr-bootstrap-cookbook`.

This release also contains many `bugfixes `_, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.

0.4.1
-----

The :py:attr:`~mrjob.job.MRJob.SORT_VALUES` option enables secondary sort, ensuring that your reducer(s) receive values in sorted order. This allows you to do things with reducers that would otherwise involve storing all the values in memory, such as:

* Receiving a grand total before any subtotals, so you can calculate percentages on the fly. See `mr_next_word_stats.py `_ for an example.
* Running a window of fixed length over an arbitrary amount of sorted values (e.g. a 24-hour window over timestamped log data).

The `max_hours_idle` option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2's full-hour billing model.

For development (not production), we now recommend always using :ref:`job flow pooling `, with `max_hours_idle` enabled. Update your :ref:`mrjob.conf ` like this:

.. code-block:: yaml

    runners:
      emr:
        max_hours_idle: 0.25
        pool_emr_job_flows: true

.. warning::

   If you enable pooling *without* `max_hours_idle` (or cronning :py:mod:`~mrjob.tools.emr.terminate_idle_job_flows`), pooled job flows will stay active forever, costing you money!

You can now use :option:`--no-check-input-paths` with the Hadoop runner to allow jobs to run even if ``hadoop fs -ls`` can't see their input files (see :mrjob-opt:`check_input_paths`).

Two bits of straggling deprecated functionality were removed:

* Built-in :ref:`protocols ` must be instantiated to be used (formerly they had class methods).
* Old locations for :ref:`mrjob.conf ` are no longer supported.

This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see `CHANGES.txt `_).

0.4.0
-----

The default runner is now `inline` instead of `local`. This change will speed up debugging for many users. Use `local` if you need to simulate more features of Hadoop.

The EMR tools can now be accessed more easily via the `mrjob` command. Learn more :doc:`here `.

Job steps are much richer now:

* You can now use mrjob to run jar steps other than Hadoop Streaming. :ref:`More info `
* You can filter step input with UNIX commands. :ref:`More info `
* In fact, you can use arbitrary UNIX commands as your whole step (mapper/reducer/combiner). :ref:`More info `

If you Ctrl+C from the command line, your job will be terminated if you give it time. If you're running on EMR, that should prevent most accidental runaway jobs. :ref:`More info `

mrjob v0.4 requires boto 2.2.

We removed all deprecated functionality from v0.2:

* --hadoop-\*-format
* --\*-protocol switches
* MRJob.DEFAULT_*_PROTOCOL
* MRJob.get_default_opts()
* MRJob.protocols()
* PROTOCOL_DICT
* IF_SUCCESSFUL
* DEFAULT_CLEANUP
* S3Filesystem.get_s3_folder_keys()

We love contributions, so we wrote some :doc:`guidelines` to help you help us. See you on Github!

0.3.5
-----

The *pool_wait_minutes* (:option:`--pool-wait-minutes`) option lets your job delay itself in case a job flow becomes available. Reference: :doc:`guides/configs-reference`

The ``JOB`` and ``JOB_FLOW`` cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See :py:data:`~mrjob.options.CLEANUP_CHOICES` for more information.

0.3.3
-----

You can now :ref:`include one config file from another `.

0.3.2
-----

The EMR instance type/number options have changed to support spot instances:

* *core_instance_bid_price*
* *core_instance_type*
* *master_instance_bid_price*
* *master_instance_type*
* *slave_instance_type* (alias for *core_instance_type*)
* *task_instance_bid_price*
* *task_instance_type*

There is also a new *ami_version* option to change the AMI your job flow uses for its nodes.

For more information, see :py:meth:`mrjob.emr.EMRJobRunner.__init__`.

The new :py:mod:`~mrjob.tools.emr.report_long_jobs` tool alerts on jobs that have run for more than X hours.

0.3
---

Features
^^^^^^^^

**Support for Combiners**

You can now use combiners in your job. Like :py:meth:`.mapper()` and :py:meth:`.reducer()`, you can redefine :py:meth:`.combiner()` in your subclass to add a single combiner step to run after your mapper but before your reducer. (:py:class:`MRWordFreqCount` does this to improve performance; see the short example below.) :py:meth:`.combiner_init()` and :py:meth:`.combiner_final()` are similar to their mapper and reducer equivalents.

You can also add combiners to custom steps by adding keyword arguments to your call to :py:meth:`.steps()`.

More info: :ref:`writing-one-step-jobs`, :ref:`writing-multi-step-jobs`

**\*_init(), \*_final() for mappers, reducers, combiners**

Mappers, reducers, and combiners have ``*_init()`` and ``*_final()`` methods that are run before and after the input is run through the main function (e.g. :py:meth:`.mapper_init()` and :py:meth:`.mapper_final()`).

More info: :ref:`writing-one-step-jobs`, :ref:`writing-multi-step-jobs`

**Custom Option Parsers**

It is now possible to define your own option types and actions using a custom :py:class:`OptionParser` subclass.

**Job Flow Pooling**

EMR jobs can pull job flows out of a "pool" of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.

More info: :ref:`cluster-pooling`

**SSH Log Fetching**

mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.

More info: :ref:`ssh-tunneling`

**New EMR Tool: fetch_logs**

If you want to fetch the counters or error logs for a job after the fact, you can use the new ``fetch_logs`` tool.

More info: :py:mod:`mrjob.tools.emr.fetch_logs`

**New EMR Tool: mrboss**

If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new ``mrboss`` tool.

More info: :py:mod:`mrjob.tools.emr.mrboss`

Changes and Deprecations
^^^^^^^^^^^^^^^^^^^^^^^^

**Configuration**

The search path order for ``mrjob.conf`` has changed. The new order is:

* The location specified by :envvar:`MRJOB_CONF`
* :file:`~/.mrjob.conf`
* :file:`~/.mrjob` **(deprecated)**
* :file:`mrjob.conf` in any directory in :envvar:`PYTHONPATH` **(deprecated)**
* :file:`/etc/mrjob.conf`

If your :file:`mrjob.conf` path is deprecated, use this table to fix it:

================================= ===============================
Old Location                      New Location
================================= ===============================
:file:`~/.mrjob`                  :file:`~/.mrjob.conf`
somewhere in :envvar:`PYTHONPATH` Specify in :envvar:`MRJOB_CONF`
================================= ===============================

More info: :py:mod:`mrjob.conf`

**Defining Jobs (MRJob)**

Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.

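As a quick sketch of the combiner support described under **Support for Combiners** above, here is a minimal word-frequency job along the lines of the :py:class:`MRWordFreqCount` example (the class name follows that example; your own job will differ):

.. code-block:: python

    from mrjob.job import MRJob


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                yield word.lower(), 1

        def combiner(self, word, counts):
            # runs on each mapper's output before it crosses the network
            yield word, sum(counts)

        def reducer(self, word, counts):
            yield word, sum(counts)


    if __name__ == '__main__':
        MRWordFreqCount.run()
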
The :option:`--hadoop-*-format` switches are deprecated. Instead, set your job's Hadoop formats with :py:attr:`.HADOOP_INPUT_FORMAT`/:py:attr:`.HADOOP_OUTPUT_FORMAT` or :py:meth:`.hadoop_input_format()`/:py:meth:`.hadoop_output_format()`. Hadoop formats can no longer be set from :file:`mrjob.conf`.

In addition to :option:`--jobconf`, you can now set jobconf values with the :py:attr:`.JOBCONF` attribute or the :py:meth:`.jobconf()` method. To read jobconf values back, use :py:func:`mrjob.compat.jobconf_from_env()`, which ensures that the correct name is used depending on which version of Hadoop is active.

You can now set the Hadoop partitioner class with :option:`--partitioner`, the :py:attr:`.PARTITIONER` attribute, or the :py:meth:`.partitioner()` method.

More info: :ref:`hadoop-config`

**Protocols**

Protocols can now be anything with a ``read()`` and ``write()`` method. Unlike previous versions of mrjob, they can be **instance methods** rather than class methods. You should use instance methods when defining your own protocols.

The :option:`--*protocol` switches and :py:attr:`DEFAULT_*PROTOCOL` are deprecated. Instead, use the :py:attr:`*_PROTOCOL` attributes or redefine the :py:meth:`*_protocol()` methods.

Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.

More info: :ref:`job-protocols`

**Running Jobs**

**All Modes**

All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.

All ``*_bin`` configuration options (``hadoop_bin``, ``python_bin``, and ``ssh_bin``) take lists instead of strings so you can add arguments (like ``['python', '-v']``).

More info: :doc:`guides/configs-reference`

Cleanup options have been split into ``cleanup`` and ``cleanup_on_failure``. There are more granular values for both of these options.

Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions.

The ``job_name_prefix`` option is gone (was deprecated).

All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.

Steps with no mapper use :command:`cat` instead of going through a no-op mapper.

Compressed files can be streamed with the :py:meth:`.cat()` method.

**EMR Mode**

The default Hadoop version on EMR is now 0.20 (was 0.18).

The ``instance_type`` option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.

**Inline Mode**

Inline mode now supports the ``cmdenv`` option.

**Local Mode**

Local mode now runs 2 mappers and 2 reducers in parallel by default.

There is preliminary support for simulating some jobconf variables. The current list of supported variables is:

* ``mapreduce.job.cache.archives``
* ``mapreduce.job.cache.files``
* ``mapreduce.job.cache.local.archives``
* ``mapreduce.job.cache.local.files``
* ``mapreduce.job.id``
* ``mapreduce.job.local.dir``
* ``mapreduce.map.input.file``
* ``mapreduce.map.input.length``
* ``mapreduce.map.input.start``
* ``mapreduce.task.attempt.id``
* ``mapreduce.task.id``
* ``mapreduce.task.ismap``
* ``mapreduce.task.output.dir``
* ``mapreduce.task.partition``

**Other Stuff**

boto 2.0+ is now required.

The Debian packaging has been removed from the repository.