mrjob.cmd: The mrjob command-line utility

The mrjob command provides a number of sub-commands that help you run and monitor jobs.

The mrjob command comes with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you choose to install mrjob for multiple versions of Python. You may also run it as python -m mrjob.cmd <subcommand>.
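
For example, these two invocations are equivalent:

mrjob audit-emr-usage > report
python -m mrjob.cmd audit-emr-usage > report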

audit-emr-usage

Audit EMR usage over the past 2 weeks, sorted by cluster name and user.

Usage:

mrjob audit-emr-usage > report
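
For example, to restrict the audit to the past week rather than as far back as EMR retains data, you might pass --max-days-ago:

mrjob audit-emr-usage --max-days-ago 7 > report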

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
--max-days-ago MAX_DAYS_AGO
                      Max number of days ago to look at jobs. By default, we
                      go back as far as EMR supports (currently about 2
                      months)
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr

boss

Run a command on every node of a cluster, storing each node's stdout and stderr in OUTPUT_DIR.

Usage:

mrjob boss CLUSTER_ID [options] "command string"
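
For example, to check disk usage on every node of a (hypothetical) cluster j-CLUSTERID, collecting each node's output under boss_output/:

mrjob boss j-CLUSTERID -o boss_output "df -h"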

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--ec2-key-pair-file EC2_KEY_PAIR_FILE
                      Path to file containing SSH key for EMR
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
                      Specify an output directory (default: CLUSTER_ID)
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--ssh-bin SSH_BIN     Name/path of ssh binary. Arguments are allowed (e.g.
                      --ssh-bin 'ssh -v')
-v, --verbose         print more messages to stderr

create-cluster

Create a persistent EMR cluster to run jobs in, and print its ID to stdout.

Usage:

mrjob create-cluster
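
A sketch of a typical workflow: create a cluster that terminates itself after an hour of idleness, capture its ID, and point a job at it. Here my_job.py stands in for any MRJob script, and --cluster-id is the EMR runner's switch for running on an existing cluster (it is not listed among the options below):

CLUSTER_ID=$(mrjob create-cluster --max-mins-idle 60)
python my_job.py -r emr --cluster-id "$CLUSTER_ID" input.txt > output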

Options:

--additional-emr-info ADDITIONAL_EMR_INFO
                      A JSON string for selecting additional features on EMR
--applications APPLICATIONS, --application APPLICATIONS
                      Additional applications to run on 4.x and 5.x AMIs,
                      separated by commas (e.g. "Ganglia,Spark")
--bootstrap BOOTSTRAP
                      A shell command to set up libraries etc. before any
                      steps (e.g. "sudo apt-get -qy install python3"). You
                      may interpolate files available via URL or locally
                      with Hadoop Distributed Cache syntax ("sudo yum
                      install -y foo.rpm#")
--bootstrap-action BOOTSTRAP_ACTIONS
                      Raw bootstrap action scripts to run before any of the
                      other bootstrap steps. You can use --bootstrap-action
                      more than once. Local scripts will be automatically
                      uploaded to S3. To add arguments, just use quotes:
                      "foo.sh arg1 arg2"
--bootstrap-mrjob     Automatically zip up the mrjob library and install it
                      when we run the mrjob. This is the default. Use --no-
                      bootstrap-mrjob if you've already installed mrjob on
                      your Hadoop cluster.
--no-bootstrap-mrjob  Don't automatically zip up the mrjob library and
                      install it when we run this job. Use this if you've
                      already installed mrjob on your Hadoop cluster.
--bootstrap-python    Attempt to install a compatible version of Python at
                      bootstrap time. Currently this only does anything for
                      Python 3, for which it is enabled by default.
--no-bootstrap-python
                      Don't automatically try to install a compatible
                      version of Python at bootstrap time.
--bootstrap-spark     Auto-install Spark on the cluster (even if not
                      needed).
--no-bootstrap-spark  Don't auto-install Spark on the cluster.
--cloud-fs-sync-secs CLOUD_FS_SYNC_SECS
                      How long to wait for remote FS to reach eventual
                      consistency. This is typically less than a second but
                      the default is 5.0 to be safe.
--cloud-log-dir CLOUD_LOG_DIR
                      URI on remote FS to write logs into
--cloud-part-size-mb CLOUD_PART_SIZE_MB
                      Upload files to cloud FS in parts no bigger than this
                      many megabytes. Default is 100 MiB. Set to 0 to
                      disable multipart uploading entirely.
--cloud-upload-part-size CLOUD_PART_SIZE_MB
                      Deprecated alias for --cloud-part-size-mb
--cloud-tmp-dir CLOUD_TMP_DIR
                      URI on remote FS to use as our temp directory.
-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--core-instance-bid-price CORE_INSTANCE_BID_PRICE
                      Bid price to specify for core nodes when setting them
                      up as EC2 spot instances (you probably only want to do
                      this for task instances).
--core-instance-type CORE_INSTANCE_TYPE
                      Type of GCE/EC2 core instance(s) to launch
--ebs-root-volume-gb EBS_ROOT_VOLUME_GB
                      Size of root EBS volume, in GiB. Must be an integer.
                      Set to 0 to use the default.
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--ec2-key-pair EC2_KEY_PAIR
                      Name of the SSH key pair you set up for EMR
--emr-action-on-failure EMR_ACTION_ON_FAILURE
                      Action to take when a step fails (e.g.
                      TERMINATE_CLUSTER, CANCEL_AND_WAIT, CONTINUE)
--emr-configuration EMR_CONFIGURATIONS
                      Configuration to use on 4.x AMIs as a JSON-encoded
                      dict; see http://docs.aws.amazon.com/ElasticMapReduce/
                      latest/ReleaseGuide/emr-configure-apps.html for
                      examples
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--enable-emr-debugging
                      Enable storage of Hadoop logs in SimpleDB
--disable-emr-debugging
                      Disable storage of Hadoop logs in SimpleDB (the
                      default)
--extra-cluster-param EXTRA_CLUSTER_PARAMS
                      extra parameter to pass to cloud API when creating a
                      cluster, to access features not currently supported by
                      mrjob. Takes the form <param>=<value>, where value is
                      JSON or a string. Use <param>=null to unset a
                      parameter
-h, --help            show this help message and exit
--iam-endpoint IAM_ENDPOINT
                      Force mrjob to connect to IAM on this endpoint (e.g.
                      iam.us-gov.amazonaws.com)
--iam-instance-profile IAM_INSTANCE_PROFILE
                      EC2 instance profile to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--iam-service-role IAM_SERVICE_ROLE
                      IAM service role to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--image-id IMAGE_ID   ID of custom AWS machine image (AMI) to use
--image-version IMAGE_VERSION
                      version of EMR/Dataproc machine image to run
--instance-fleets INSTANCE_FLEETS
                      detailed JSON list of instance fleets, including EBS
                      configuration. See docs for --instance-fleets at
                      http://docs.aws.amazon.com/cli/latest/reference/emr
                      /create-cluster.html
--instance-groups INSTANCE_GROUPS
                      detailed JSON list of EMR instance configs, including
                      EBS configuration. See docs for --instance-groups at
                      http://docs.aws.amazon.com/cli/latest/reference/emr
                      /create-cluster.html
--instance-type INSTANCE_TYPE
                      Type of GCE/EC2 instance(s) to launch. GCE - e.g.
                      n1-standard-1, n1-highcpu-4, n1-highmem-4 -- see
                      https://cloud.google.com/compute/docs/machine-types
                      EC2 - e.g. m1.medium, c3.xlarge, r3.xlarge -- see
                      http://aws.amazon.com/ec2/instance-types/
--label LABEL         Alternate label for the job, to help us identify it.
--master-instance-bid-price MASTER_INSTANCE_BID_PRICE
                      Bid price to specify for the master node when setting
                      it up as an EC2 spot instance (you probably only want
                      to do this for task instances).
--master-instance-type MASTER_INSTANCE_TYPE
                      Type of GCE/EC2 master instance to launch
--max-mins-idle MAX_MINS_IDLE
                      If we create a cluster, have it automatically
                      terminate itself after it's been idle this many
                      minutes
--num-core-instances NUM_CORE_INSTANCES
                      Total number of core instances to launch
--num-task-instances NUM_TASK_INSTANCES
                      Total number of task instances to launch
--owner OWNER         User who ran the job (default is the current user)
--pool-clusters       Add to an existing cluster or create a new one that
                      does not terminate when the job completes.
--no-pool-clusters    Don't run job on a pooled cluster (the default)
--pool-name POOL_NAME
                      Specify a pool name to join. Default is "default"
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--release-label RELEASE_LABEL
                      Release Label (e.g. "emr-4.0.0"). Overrides --image-
                      version
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--subnet SUBNET       ID of Amazon VPC subnet/URI of Google Compute Engine
                      subnetwork to launch cluster in.
--subnets SUBNET      Like --subnet, but with a comma-separated list, to
                      specify multiple subnets in conjunction with
                      --instance-fleets (EMR only)
--tag TAGS            Metadata tags to apply to the EMR cluster; should take
                      the form KEY=VALUE. You can use --tag multiple times
--task-instance-bid-price TASK_INSTANCE_BID_PRICE
                      Bid price to specify for task nodes when setting them
                      up as EC2 spot instances
--task-instance-type TASK_INSTANCE_TYPE
                      Type of GCE/EC2 task instance(s) to launch
-v, --verbose         print more messages to stderr
--zone ZONE           GCE zone/AWS availability zone to run Dataproc/EMR
                      jobs in.

diagnose

Print probable cause of error for a failed step.

Currently this only works on EMR.

Usage:

mrjob diagnose [opts] j-CLUSTERID
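
For example, to diagnose a particular failed step rather than the most recent one (s-STEPID is a placeholder step ID):

mrjob diagnose --step-id s-STEPID j-CLUSTERID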

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--step-id STEP_ID     ID of a particular failed step to diagnose
-v, --verbose         print more messages to stderr

New in version 0.6.1.

report-long-jobs

Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.

Suggested usage: run this as a daily cron job with the -q option:

0 0 * * * mrjob report-long-jobs -q
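
You can also run it ad hoc; for example, to flag jobs running longer than 6 hours while excluding clusters carrying a (hypothetical) purpose=testing tag:

mrjob report-long-jobs --min-hours 6 -x purpose,testing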

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-x EXCLUDE, --exclude EXCLUDE
                      Exclude clusters that match the specified tags.
                      Specified in the form TAG_KEY,TAG_VALUE.
-h, --help            show this help message and exit
--min-hours MIN_HOURS
                      Minimum number of hours a job can run before we report
                      it. Default: 24.0
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr

s3-tmpwatch

Delete all files under the given URI(s) that are older than a specified time: if a file has not been accessed within that time, it is removed. The time argument is a number with an optional single-character suffix specifying the units: m for minutes, h for hours, d for days. If no suffix is given, the time is in hours.

Suggested usage: run this as a cron job with the -q option:

0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/

Usage:

mrjob s3-tmpwatch [options] <time-untouched> <URIs>
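
Before deleting anything for real, you might do a dry run with -t to log what would be removed; the 72-hour threshold and bucket path here are just examples:

mrjob s3-tmpwatch -t 72h s3://your-bucket/tmp/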

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-t, --test            Don't actually delete any files; just log that we
                      would
-v, --verbose         print more messages to stderr

spark-submit

A drop-in replacement for spark-submit that can use mrjob's runners. For example, you can submit your Spark job to EMR just by adding -r emr.

This also adds a few mrjob features that are not standard with spark-submit, such as --cmdenv, --dirs, and --setup.

New in version 0.6.7.

Changed in version 0.6.8: added the local and spark runners, and made spark the default runner (previously hadoop)

Changed in version 0.7.1: --archives and --dirs are supported on all masters (except local)

Usage:

mrjob spark-submit [-r <runner>] [options] <python file | app jar> [app arguments]
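
For example, to submit a (hypothetical) PySpark script to EMR, passing its input and output paths through as app arguments:

mrjob spark-submit -r emr your_script.py s3://your-bucket/input/ s3://your-bucket/output/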

Options:

All runners:
 -r {emr,hadoop,local,spark}, --runner {emr,hadoop,local,spark}
                       Where to run the job (default: "spark")
 --class MAIN_CLASS    Your application's main class (for Java / Scala apps).
 --name NAME           The name of your application.
 --jars LIBJARS        Comma-separated list of jars to include on the
                       driver and executor classpaths.
 --packages PACKAGES   Comma-separated list of maven coordinates of jars to
                       include on the driver and executor classpaths. Will
                       search the local maven repo, then maven central and
                       any additional remote repositories given by
                       --repositories. The format for the coordinates should
                       be groupId:artifactId:version.
 --exclude-packages EXCLUDE_PACKAGES
                       Comma-separated list of groupId:artifactId, to exclude
                       while resolving the dependencies provided in
                       --packages to avoid dependency conflicts.
 --repositories REPOSITORIES
                       Comma-separated list of additional remote repositories
                       to search for the maven coordinates given with
                       --packages.
 --py-files PY_FILES   Comma-separated list of .zip, .egg, or .py files to
                       place on the PYTHONPATH for Python apps.
 --files UPLOAD_FILES  Comma-separated list of files to be placed in the
                       working directory of each executor. Ignored on
                       local[*] master.
 --archives UPLOAD_ARCHIVES
                       Comma-separated list of archives to be extracted into
                       the working directory of each executor.
 --dirs UPLOAD_DIRS    Comma-separated list of directories to be archived and
                       then extracted into the working directory of each
                       executor.
 --cmdenv CMDENV       Arbitrary environment variable to set inside Spark, in
                       the format NAME=VALUE.
 --conf JOBCONF        Arbitrary Spark configuration property, in the format
                       PROP=VALUE.
 --setup SETUP         A command to run before each Spark executor in the
                       shell ("touch foo"). In cluster mode, runs before the
                       Spark driver as well. You may interpolate files
                       available via URL or on your local filesystem using
                       Hadoop Distributed Cache syntax (". setup.sh#"). To
                       interpolate archives (YARN only), use #/: "cd
                       foo.tar.gz#/; make".
 --properties-file PROPERTIES_FILE
                       Path to a file from which to load extra properties. If
                       not specified, this will look for conf/spark-
                       defaults.conf.
 --driver-memory DRIVER_MEMORY
                       Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
 --driver-java-options DRIVER_JAVA_OPTIONS
                       Extra Java options to pass to the driver.
 --driver-library-path DRIVER_LIBRARY_PATH
                       Extra library path entries to pass to the driver.
 --driver-class-path DRIVER_CLASS_PATH
                       Extra class path entries to pass to the driver. Note
                       that jars added with --jars are automatically included
                       in the classpath.
 --executor-memory EXECUTOR_MEMORY
                       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
 --proxy-user PROXY_USER
                       User to impersonate when submitting the application.
                       This argument does not work with --principal /
                       --keytab.
 -c CONF_PATHS, --conf-path CONF_PATHS
                       Path to alternate mrjob.conf file to read from
 --no-conf             Don't load mrjob.conf even if it's available
 -q, --quiet           Don't print anything to stderr
 -v, --verbose         print more messages to stderr
 -h, --help            show this message and exit
Spark and Hadoop runners only:
--master SPARK_MASTER
 spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local. Defaults to local[*] on the spark runner, yarn on the hadoop runner.
--deploy-mode SPARK_DEPLOY_MODE
 Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”) (Default: client).
Cluster deploy mode only:
--driver-cores DRIVER_CORES
 Number of cores used by the driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise
 If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores TOTAL_EXECUTOR_CORES
 Total cores for all executors.
Spark standalone and YARN only:
--executor-cores EXECUTOR_CORES
 Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
YARN-only:
--queue QUEUE_NAME
 The YARN queue to submit to (Default: “default”).
--num-executors NUM_EXECUTORS
 Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM_EXECUTORS.
--principal PRINCIPAL
 Principal to be used to login to KDC, while running on secure HDFS.
--keytab KEYTAB
 The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.

This also supports the same runner-specific switches as MRJobs (e.g. --hadoop-bin, --region).
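
A sketch combining the mrjob-only switches mentioned above (--cmdenv and --setup); make_venv.sh is a hypothetical setup script, interpolated with the same Hadoop Distributed Cache syntax shown for --setup:

mrjob spark-submit -r emr \
    --cmdenv TZ=UTC \
    --setup '. make_venv.sh#' \
    your_script.py s3://your-bucket/input/ s3://your-bucket/output/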

terminate-cluster

Terminate an existing EMR cluster.

Usage:

mrjob terminate-cluster [options] CLUSTER_ID

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-t, --test            Don't actually terminate the cluster; just log that we
                      would
-v, --verbose         print more messages to stderr

terminate-idle-clusters

Terminate idle EMR clusters that meet the criteria passed in on the command line (or, by default, clusters that have been idle for one hour).

Suggested usage: run this as a cron job with the -q option:

*/30 * * * * mrjob terminate-idle-clusters -q
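
For a one-off cleanup you can preview what would be terminated; for example, to target only pooled clusters idle for at least two hours:

mrjob terminate-idle-clusters --pooled-only --max-mins-idle 120 --dry-run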

Changed in version 0.6.4: Skips termination-protected idle clusters, rather than crashing. (This was also backported to mrjob v0.5.12.)

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--dry-run             Don't actually kill idle jobs; just log that we would
--ec2-endpoint EC2_ENDPOINT
                      Force mrjob to connect to EC2 on this endpoint (e.g.
                      ec2.us-west-1.amazonaws.com). Default is to infer this
                      from region.
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
--max-mins-idle MAX_MINS_IDLE
                      Max number of minutes a cluster can go without
                      bootstrapping, running a step, or having a new step
                      created. This will fire even if there are pending
                      steps which EMR has failed to start. Make sure you set
                      this higher than the amount of time your jobs can take
                      to start instances and bootstrap.
--max-mins-locked MAX_MINS_LOCKED
                      Deprecated, does nothing
--pool-name POOL_NAME
                      Only terminate clusters in the given named pool.
--pooled-only         Only terminate pooled clusters
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--unpooled-only       Only terminate un-pooled clusters
-v, --verbose         print more messages to stderr