The mrjob command

The mrjob command has two purposes:

  1. To provide easy access to EMR tools
  2. To eventually let you run Hadoop Streaming jobs written in languages other than Python

The mrjob command comes with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you choose to install mrjob for multiple versions of Python.
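
For example, if you installed mrjob under Python 3.4, the version-specific alias invokes that installation's copy of the command (the subcommand shown is just one of the tools described below):

mrjob-3.4 audit-emr-usage > report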

EMR tools

audit-emr-usage

Audit EMR usage over the past 2 weeks, sorted by cluster name and user.

Usage:

mrjob audit-emr-usage > report

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
--max-days-ago MAX_DAYS_AGO
                      Max number of days ago to look at jobs. By default, we
                      go back as far as EMR supports (currently about 2
                      months)
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr
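
For example, to limit the audit to the last 7 days and save the report to a file (the number of days and the filename are just example values):

mrjob audit-emr-usage --max-days-ago 7 > emr-usage-report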

boss

Run a command on every node of a cluster. The stdout and stderr from each node are stored in OUTPUT_DIR.

Usage:

mrjob boss CLUSTER_ID [options] "command string"

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-key-pair-file EC2_KEY_PAIR_FILE
                      Path to file containing SSH key for EMR
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
                      Specify an output directory (default: CLUSTER_ID)
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
--ssh-bin SSH_BIN     Name/path of ssh binary. Arguments are allowed (e.g.
                      --ssh-bin 'ssh -v')
-v, --verbose         print more messages to stderr
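
For example, to run uptime on every node of a cluster and collect the output (the cluster ID is a placeholder; the output directory is an arbitrary choice and defaults to the cluster ID):

mrjob boss j-CLUSTERID -o uptime-results "uptime"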

create-cluster

Create a persistent EMR cluster to run jobs in, and print its ID to stdout.

Usage:

mrjob create-cluster

Options:

--additional-emr-info ADDITIONAL_EMR_INFO
                      A JSON string for selecting additional features on EMR
--application APPLICATIONS
                      Additional applications to run on 4.x AMIs (e.g.
                      Ganglia, Mahout, Spark)
--bootstrap BOOTSTRAP
                      A shell command to set up libraries etc. before any
                      steps (e.g. "sudo apt-get -qy install python3"). You
                      may interpolate files available via URL or locally
                      with Hadoop Distributed Cache syntax ("sudo yum
                      install -y foo.rpm#")
--bootstrap-action BOOTSTRAP_ACTIONS
                      Raw bootstrap action scripts to run before any of the
                      other bootstrap steps. You can use --bootstrap-action
                      more than once. Local scripts will be automatically
                      uploaded to S3. To add arguments, just use quotes:
                      "foo.sh arg1 arg2"
--bootstrap-mrjob     Automatically zip up the mrjob library and install it
                      when we run the mrjob. This is the default. Use
                      --no-bootstrap-mrjob if you've already installed mrjob
                      on your Hadoop cluster.
--no-bootstrap-mrjob  Don't automatically zip up the mrjob library and
                      install it when we run this job. Use this if you've
                      already installed mrjob on your Hadoop cluster.
--bootstrap-python    Attempt to install a compatible version of Python at
                      bootstrap time. Currently this only does anything for
                      Python 3, for which it is enabled by default.
--no-bootstrap-python
                      Don't automatically try to install a compatible
                      version of Python at bootstrap time.
--bootstrap-spark     Auto-install Spark on the cluster (even if not
                      needed).
--no-bootstrap-spark  Don't auto-install Spark on the cluster.
--cloud-fs-sync-secs CLOUD_FS_SYNC_SECS
                      How long to wait for remote FS to reach eventual
                      consistency. This is typically less than a second but
                      the default is 5.0 to be safe.
--cloud-log-dir CLOUD_LOG_DIR
                      URI on remote FS to write logs into
--cloud-part-size-mb CLOUD_PART_SIZE_MB
                      Upload files to cloud FS in parts no bigger than this
                      many megabytes. Default is 100 MiB. Set to 0 to
                      disable multipart uploading entirely.
--cloud-upload-part-size CLOUD_PART_SIZE_MB
                      Deprecated alias for --cloud-part-size-mb
--cloud-tmp-dir CLOUD_TMP_DIR
                      URI on remote FS to use as our temp directory.
-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--core-instance-bid-price CORE_INSTANCE_BID_PRICE
                      Bid price to specify for core nodes when setting them
                      up as EC2 spot instances (you probably only want to do
                      this for task instances).
--core-instance-type CORE_INSTANCE_TYPE
                      Type of GCE/EC2 core instance(s) to launch
--ec2-key-pair EC2_KEY_PAIR
                      Name of the SSH key pair you set up for EMR
--emr-api-param EMR_API_PARAMS
                      deprecated. Use --extra-cluster-param instead
--no-emr-api-param EMR_API_PARAMS
                      deprecated. Use --extra-cluster-param instead
--emr-configuration EMR_CONFIGURATIONS
                      Configuration to use on 4.x AMIs as a JSON-encoded
                      dict; see http://docs.aws.amazon.com/ElasticMapReduce/
                      latest/ReleaseGuide/emr-configure-apps.html for
                      examples
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--enable-emr-debugging
                      Enable storage of Hadoop logs in SimpleDB
--disable-emr-debugging
                      Disable storage of Hadoop logs in SimpleDB (the
                      default)
--extra-cluster-param EXTRA_CLUSTER_PARAMS
                      extra parameter to pass to cloud API when creating a
                      cluster, to access features not currently supported by
                      mrjob. Takes the form <param>=<value>, where value is
                      JSON or a string. Use <param>=null to unset a
                      parameter
-h, --help            show this help message and exit
--iam-endpoint IAM_ENDPOINT
                      Force mrjob to connect to IAM on this endpoint (e.g.
                      iam.us-gov.amazonaws.com)
--iam-instance-profile IAM_INSTANCE_PROFILE
                      EC2 instance profile to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--iam-service-role IAM_SERVICE_ROLE
                      IAM service role to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--image-version IMAGE_VERSION
                      EMR/Dataproc machine image to launch clusters with
--instance-fleets INSTANCE_FLEETS
                      detailed JSON list of instance fleets, including EBS
                      configuration. See docs for --instance-fleets at
                      http://docs.aws.amazon.com/cli/latest/reference/emr
                      /create-cluster.html
--instance-groups INSTANCE_GROUPS
                      detailed JSON list of EMR instance configs, including
                      EBS configuration. See docs for --instance-groups at
                      http://docs.aws.amazon.com/cli/latest/reference/emr
                      /create-cluster.html
--instance-type INSTANCE_TYPE
                      Type of GCE/EC2 instance(s) to launch. GCE: e.g.
                      n1-standard-1, n1-highcpu-4, n1-highmem-4 (see
                      https://cloud.google.com/compute/docs/machine-types).
                      EC2: e.g. m1.medium, c3.xlarge, r3.xlarge (see
                      http://aws.amazon.com/ec2/instance-types/)
--label LABEL         Alternate label for the job, to help us identify it.
--master-instance-bid-price MASTER_INSTANCE_BID_PRICE
                      Bid price to specify for the master node when setting
                      it up as an EC2 spot instance (you probably only want
                      to do this for task instances).
--master-instance-type MASTER_INSTANCE_TYPE
                      Type of GCE/EC2 master instance to launch
--max-hours-idle MAX_HOURS_IDLE
                      Please use --max-mins-idle instead
--max-mins-idle MAX_MINS_IDLE
                      If we create a cluster, have it automatically
                      terminate itself after it's been idle this many
                      minutes
--mins-to-end-of-hour MINS_TO_END_OF_HOUR
                      If --max-mins-idle is set, control how close to the
                      end of an hour the cluster can automatically terminate
                      itself (default is 5 minutes)
--num-core-instances NUM_CORE_INSTANCES
                      Total number of core instances to launch
--num-task-instances NUM_TASK_INSTANCES
                      Total number of task instances to launch
--owner OWNER         User who ran the job (default is the current user)
--pool-clusters       Add to an existing cluster or create a new one that
                      does not terminate when the job completes.
--no-pool-clusters    Don't run job on a pooled cluster (the default)
--pool-name POOL_NAME
                      Specify a pool name to join. Default is "default"
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--release-label RELEASE_LABEL
                      Release Label (e.g. "emr-4.0.0"). Overrides
                      --image-version
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
--subnet SUBNET       ID of Amazon VPC subnet/URI of Google Compute Engine
                      subnetwork to launch cluster in.
--subnets SUBNET      Like --subnet, but with a comma-separated list, to
                      specify multiple subnets in conjunction with
                      --instance-fleets (EMR only)
--tag TAGS            Metadata tags to apply to the EMR cluster; should take
                      the form KEY=VALUE. You can use --tag multiple times
--task-instance-bid-price TASK_INSTANCE_BID_PRICE
                      Bid price to specify for task nodes when setting them
                      up as EC2 spot instances
--task-instance-type TASK_INSTANCE_TYPE
                      Type of GCE/EC2 task instance(s) to launch
-v, --verbose         print more messages to stderr
--visible-to-all-users
                      Make your cluster visible to all IAM users on the same
                      AWS account (the default)
--no-visible-to-all-users
                      Hide your cluster from other IAM users on the same AWS
                      account
--zone ZONE           GCE zone/AWS availability zone to run Dataproc/EMR
                      jobs in.
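
As a sketch, the following creates a cluster that shuts itself down after an hour of inactivity and captures its ID for use with the other tools in this section (the instance type and count are example values only):

CLUSTER_ID=$(mrjob create-cluster --max-mins-idle 60 --num-core-instances 4 --core-instance-type m1.medium)

The printed ID can then be passed to boss, diagnose, or terminate-cluster.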

diagnose

Print probable cause of error for a failed step.

Currently this only works on EMR.

Usage:

mrjob diagnose [opts] j-CLUSTERID

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
--step-id STEP_ID     ID of a particular failed step to diagnose
-v, --verbose         print more messages to stderr

New in version 0.6.1.
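
For example, to diagnose a particular failed step on a cluster (both IDs below are placeholders):

mrjob diagnose --step-id s-STEPID j-CLUSTERID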

report-long-jobs

Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.

Suggested usage: run this as a daily cron job with the -q option:

0 0 * * * mrjob report-long-jobs -q

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-x EXCLUDE, --exclude EXCLUDE
                      Exclude clusters that match the specified tags.
                      Specified in the form TAG_KEY,TAG_VALUE.
-h, --help            show this help message and exit
--min-hours MIN_HOURS
                      Minimum number of hours a job can run before we report
                      it. Default: 24.0
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr
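
For example, to report jobs running longer than six hours while skipping clusters carrying a particular tag (the threshold and the tag key/value are hypothetical):

mrjob report-long-jobs --min-hours 6 -x purpose,testing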

s3-tmpwatch

Delete all files in a given URI that are older than a specified time. A file is removed if it has not been accessed for longer than the given time. The time argument is a number with an optional single-character suffix specifying the units: m for minutes, h for hours, d for days. If no suffix is specified, the time is in hours.

Suggested usage: run this as a cron job with the -q option:

0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/

Usage:

mrjob s3-tmpwatch [options] <time-untouched> <URIs>

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
-t, --test            Don't actually delete any files; just log that we
                      would
-v, --verbose         print more messages to stderr
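
Before letting a cron job delete anything, you can do a dry run with -t to log which files would be removed (the bucket path below is a placeholder):

mrjob s3-tmpwatch -t 30d s3://your-bucket/tmp/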

terminate-cluster

Terminate an existing EMR cluster.

Usage:

mrjob terminate-cluster [options] CLUSTER_ID

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
-t, --test            Don't actually terminate the cluster; just log that we
                      would
-v, --verbose         print more messages to stderr
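
For example, to tear down the cluster started in the create-cluster sketch above (the ID is whatever that command printed):

mrjob terminate-cluster $CLUSTER_ID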

terminate-idle-clusters

Terminate idle EMR clusters that meet the criteria passed in on the command line (or, by default, clusters that have been idle for one hour).

Suggested usage: run this as a cron job with the -q option:

*/30 * * * * mrjob terminate-idle-clusters -q

Changed in version 0.6.4: Skips termination-protected idle clusters, rather than crashing. (This was also backported to mrjob v0.5.12.)

Options:

-c CONF_PATHS, --conf-path CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--dry-run             Don't actually kill idle jobs; just log that we would
--emr-endpoint EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-h, --help            show this help message and exit
--max-hours-idle MAX_HOURS_IDLE
                      Please use --max-mins-idle instead.
--max-mins-idle MAX_MINS_IDLE
                      Max number of minutes a cluster can go without
                      bootstrapping, running a step, or having a new step
                      created. This will fire even if there are pending
                      steps which EMR has failed to start. Make sure you set
                      this higher than the amount of time your jobs can take
                      to start instances and bootstrap.
--max-mins-locked MAX_MINS_LOCKED
                      Max number of minutes a cluster can be locked while
                      idle.
--mins-to-end-of-hour MINS_TO_END_OF_HOUR
                      Deprecated, does nothing.
--pool-name POOL_NAME
                      Only terminate clusters in the given named pool.
--pooled-only         Only terminate pooled clusters
-q, --quiet           Don't print anything to stderr
--region REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g.
                      s3-us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct endpoint
                      for each S3 bucket based on its location.
--unpooled-only       Only terminate un-pooled clusters
-v, --verbose         print more messages to stderr
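
For example, to see which clusters would be terminated after two hours of idleness, without actually terminating anything (the threshold is an example value):

mrjob terminate-idle-clusters --dry-run --max-mins-idle 120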

Running jobs

mrjob run (path to script or executable) [options]

Run a job. Takes the same options as invoking a Python job. See Options available to all runners, Hadoop-related options, Dataproc runner options, and EMR runner options. While you can use this command to invoke your jobs, you can just as easily call python my_job.py [options] directly.
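
For example, these two invocations are equivalent (the script name, input file, and choice of the EMR runner via -r are placeholders for your own job):

mrjob run my_job.py -r emr input.txt > output

python my_job.py -r emr input.txt > output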