The mrjob command

The mrjob command has two purposes:

  1. To provide easy access to EMR tools
  2. To eventually let you run Hadoop Streaming jobs written in languages other than Python

The mrjob command comes with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you choose to install mrjob for multiple versions of Python.
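
For example, if mrjob is installed for both Python 2 and Python 3, you can invoke the Python 3 copy explicitly (the exact alias depends on which Python versions you installed mrjob for):

mrjob-3.4 audit-emr-usage > report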

EMR tools

audit-emr-usage

Audit EMR usage over the past 2 weeks, sorted by cluster name and user.

Usage:

mrjob audit-emr-usage > report

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--max-days-ago=MAX_DAYS_AGO
                      Max number of days ago to look at jobs. By default, we
                      go back as far as EMR supports (currently about 2
                      months)
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr
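
For example, to limit the audit to the last 30 days (an illustrative value) instead of going back as far as EMR supports:

mrjob audit-emr-usage --max-days-ago 30 > report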

boss

Run a command on every node of a cluster. Store stdout and stderr from each node in OUTPUT_DIR.

Usage:

mrjob boss CLUSTER_ID [options] "command string"

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--ec2-key-pair-file=EC2_KEY_PAIR_FILE
                      Path to file containing SSH key for EMR
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                      Specify an output directory (default: CLUSTER_ID)
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--ssh-bin=SSH_BIN     Name/path of ssh binary. Arguments are allowed (e.g.
                      --ssh-bin 'ssh -v')
-v, --verbose         print more messages to stderr
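
For example, to check load on every node of a cluster, writing each node's output under uptime_output/ (the cluster ID and output directory name here are placeholders):

mrjob boss j-CLUSTERID -o uptime_output "uptime"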

create-cluster

Create a persistent EMR cluster to run jobs in, and print its ID to stdout.

Warning

Do not run this without mrjob terminate-idle-clusters in your crontab; clusters left idle can quickly become expensive!

Usage:

mrjob create-cluster

Options:

-h, --help            show this help message and exit
--additional-emr-info=ADDITIONAL_EMR_INFO
                      A JSON string for selecting additional features on EMR
--application=APPLICATIONS
                      Additional applications to run on 4.x AMIs (e.g.
                      Ganglia, Mahout, Spark)
--bootstrap=BOOTSTRAP
                      A shell command to set up libraries etc. before any
                      steps (e.g. "sudo apt-get -qy install python3"). You
                      may interpolate files available via URL or locally
                      with Hadoop Distributed Cache syntax ("sudo yum
                      install -y foo.rpm#")
--bootstrap-action=BOOTSTRAP_ACTIONS
                      Raw bootstrap action scripts to run before any of the
                      other bootstrap steps. You can use --bootstrap-action
                      more than once. Local scripts will be automatically
                      uploaded to S3. To add arguments, just use quotes:
                      "foo.sh arg1 arg2"
--bootstrap-mrjob     Automatically zip up the mrjob library and install it
                      when we run the job. This is the default. Use
                      --no-bootstrap-mrjob if you've already installed mrjob
                      on your Hadoop cluster.
--no-bootstrap-mrjob  Don't automatically zip up the mrjob library and
                      install it when we run this job. Use this if you've
                      already installed mrjob on your Hadoop cluster.
--bootstrap-python    Attempt to install a compatible version of Python at
                      bootstrap time. Currently this only does anything for
                      Python 3, for which it is enabled by default.
--no-bootstrap-python
                      Don't automatically try to install a compatible
                      version of Python at bootstrap time.
--bootstrap-spark     Auto-install Spark on the cluster (even if not
                      needed).
--no-bootstrap-spark  Don't auto-install Spark on the cluster.
--cloud-fs-sync-secs=CLOUD_FS_SYNC_SECS
                      How long to wait for remote FS to reach eventual
                      consistency. This is typically less than a second but
                      the default is 5.0 to be safe.
--cloud-log-dir=CLOUD_LOG_DIR
                      URI on remote FS to write logs into
--cloud-tmp-dir=CLOUD_TMP_DIR
                      URI on remote FS to use as our temp directory.
--cloud-upload-part-size=CLOUD_UPLOAD_PART_SIZE
                      Upload files to S3 in parts no bigger than this many
                      megabytes. Default is 100 MiB. Set to 0 to disable
                      multipart uploading entirely.
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--core-instance-bid-price=CORE_INSTANCE_BID_PRICE
                      Bid price to specify for core nodes when setting them
                      up as EC2 spot instances (you probably only want to do
                      this for task instances).
--core-instance-type=CORE_INSTANCE_TYPE
                      Type of GCE/EC2 core instance(s) to launch
--ec2-key-pair=EC2_KEY_PAIR
                      Name of the SSH key pair you set up for EMR
--emr-api-param=EMR_API_PARAMS
                      Additional parameter to pass directly to the EMR API
                      when creating a cluster. Should take the form
                      KEY=VALUE. You can use --emr-api-param multiple times
--no-emr-api-param=EMR_API_PARAMS
                      Parameter to be unset when calling EMR API. You can
                      use --no-emr-api-param multiple times.
--emr-configuration=EMR_CONFIGURATIONS
                      Configuration to use on 4.x AMIs as a JSON-encoded
                      dict; see http://docs.aws.amazon.com/ElasticMapReduce/
                      latest/ReleaseGuide/emr-configure-apps.html for
                      examples
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--enable-emr-debugging
                      Enable storage of Hadoop logs in SimpleDB
--disable-emr-debugging
                      Disable storage of Hadoop logs in SimpleDB (the
                      default)
--iam-endpoint=IAM_ENDPOINT
                      Force mrjob to connect to IAM on this endpoint (e.g.
                      iam.us-gov.amazonaws.com)
--iam-instance-profile=IAM_INSTANCE_PROFILE
                      EC2 instance profile to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--iam-service-role=IAM_SERVICE_ROLE
                      IAM service role to use for the EMR cluster -- see
                      "Configure IAM Roles for Amazon EMR" in AWS docs
--image-version=IMAGE_VERSION
                      EMR/Dataproc machine image to launch clusters with
--instance-type=INSTANCE_TYPE
                      Type of GCE/EC2 instance(s) to launch. GCE: e.g.
                      n1-standard-1, n1-highcpu-4, n1-highmem-4 (see
                      https://cloud.google.com/compute/docs/machine-types).
                      EC2: e.g. m1.medium, c3.xlarge, r3.xlarge (see
                      http://aws.amazon.com/ec2/instance-types/).
--label=LABEL         Alternate label for the job, to help us identify it.
--master-instance-bid-price=MASTER_INSTANCE_BID_PRICE
                      Bid price to specify for the master node when setting
                      it up as an EC2 spot instance (you probably only want
                      to do this for task instances).
--master-instance-type=MASTER_INSTANCE_TYPE
                      Type of GCE/EC2 master instance to launch
--max-hours-idle=MAX_HOURS_IDLE
                      If we create a cluster, have it automatically
                      terminate itself after it's been idle this many hours
--mins-to-end-of-hour=MINS_TO_END_OF_HOUR
                      If --max-hours-idle is set, control how close to the
                      end of an hour the cluster can automatically terminate
                      itself (default is 5 minutes)
--num-core-instances=NUM_CORE_INSTANCES
                      Total number of core instances to launch
--num-task-instances=NUM_TASK_INSTANCES
                      Total number of task instances to launch
--owner=OWNER         User who ran the job (default is the current user)
--pool-clusters       Add to an existing cluster or create a new one that
                      does not terminate when the job completes. WARNING: do
                      not run this without either --max-hours-idle or mrjob
                      terminate-idle-clusters in your crontab; clusters left
                      idle can quickly become expensive!
--no-pool-clusters    Don't run job on a pooled cluster (the default)
--pool-name=POOL_NAME
                      Specify a pool name to join. Default is "default"
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--release-label=RELEASE_LABEL
                      Release Label (e.g. "emr-4.0.0"). Overrides --image-
                      version
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--subnet=SUBNET       ID of Amazon VPC subnet to launch cluster in. If not
                      set or empty string, cluster is launched in the normal
                      AWS cloud
--tag=TAGS            Metadata tags to apply to the EMR cluster; should take
                      the form KEY=VALUE. You can use --tag multiple times
--task-instance-bid-price=TASK_INSTANCE_BID_PRICE
                      Bid price to specify for task nodes when setting them
                      up as EC2 spot instances
--task-instance-type=TASK_INSTANCE_TYPE
                      Type of GCE/EC2 task instance(s) to launch
-v, --verbose         print more messages to stderr
--visible-to-all-users
                      Make your cluster visible to all IAM users on the same
                      AWS account (the default)
--no-visible-to-all-users
                      Hide your cluster from other IAM users on the same AWS
                      account
--zone=ZONE           GCE zone/AWS availability zone to run Dataproc/EMR
                      jobs in.
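
For example, to create a pooled cluster that terminates itself after an hour of inactivity, capturing its ID in a shell variable (the variable name is illustrative), as create-cluster prints the ID to stdout:

CLUSTER_ID=$(mrjob create-cluster --pool-clusters --max-hours-idle 1)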

report-long-jobs

Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.

Suggested usage: run this as a daily cron job with the -q option:

0 0 * * * mrjob report-long-jobs -q

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--min-hours=MIN_HOURS
                      Minimum number of hours a job can run before we report
                      it. Default: 24.0
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-v, --verbose         print more messages to stderr
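
For example, to report jobs that have been running for more than 6 hours (an illustrative threshold) instead of the default 24:

mrjob report-long-jobs --min-hours 6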

s3-tmpwatch

Delete all files in a given URI that are older than a specified time. The time parameter defines the threshold for removing files: if a file has not been accessed within that time, it is removed. The time argument is a number with an optional single-character suffix specifying the units: m for minutes, h for hours, d for days. If no suffix is specified, time is in hours.

Suggested usage: run this as a cron job with the -q option:

0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/

Usage:

mrjob s3-tmpwatch [options] <time-untouched> <URIs>

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-t, --test            Don't actually delete any files; just log that we
                      would
-v, --verbose         print more messages to stderr
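
For example, to preview which files under a temp prefix older than 72 hours would be deleted, without actually removing anything (the bucket name is a placeholder):

mrjob s3-tmpwatch -t 72h s3://your-bucket/tmp/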

terminate-cluster

Terminate an existing EMR cluster.

Usage:

mrjob terminate-cluster [options] j-CLUSTERID

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
-t, --test            Don't actually terminate the cluster; just log that we
                      would
-v, --verbose         print more messages to stderr
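
For example, to terminate a cluster whose ID you saved when running mrjob create-cluster (the shell variable is illustrative):

mrjob terminate-cluster $CLUSTER_ID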

terminate-idle-clusters

Terminate idle EMR clusters that meet the criteria passed in on the command line (or, by default, clusters that have been idle for one hour).

Suggested usage: run this as a cron job with the -q option:

*/30 * * * * mrjob terminate-idle-clusters -q

Options:

-h, --help            show this help message and exit
-c CONF_PATHS, --conf-path=CONF_PATHS
                      Path to alternate mrjob.conf file to read from
--no-conf             Don't load mrjob.conf even if it's available
--dry-run             Don't actually kill idle jobs; just log that we would
--emr-endpoint=EMR_ENDPOINT
                      Force mrjob to connect to EMR on this endpoint (e.g.
                      us-west-1.elasticmapreduce.amazonaws.com). Default is
                      to infer this from region.
--max-hours-idle=MAX_HOURS_IDLE
                      Max number of hours a cluster can go without
                      bootstrapping, running a step, or having a new step
                      created. This will fire even if there are pending
                      steps which EMR has failed to start. Make sure you set
                      this higher than the amount of time your jobs can take
                      to start instances and bootstrap.
--max-mins-locked=MAX_MINS_LOCKED
                      Max number of minutes a cluster can be locked while
                      idle.
--mins-to-end-of-hour=MINS_TO_END_OF_HOUR
                      Terminate clusters that are within this many minutes
                      of the end of a full hour since the job started
                      running AND have no pending steps.
--pool-name=POOL_NAME
                      Only terminate clusters in the given named pool.
--pooled-only         Only terminate pooled clusters
-q, --quiet           Don't print anything to stderr
--region=REGION       GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint=S3_ENDPOINT
                      Force mrjob to connect to S3 on this endpoint (e.g. s3
                      -us-west-1.amazonaws.com). You usually shouldn't set
                      this; by default mrjob will choose the correct
                      endpoint for each S3 bucket based on its location.
--unpooled-only       Only terminate un-pooled clusters
-v, --verbose         print more messages to stderr
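
For example, to see which pooled clusters idle for more than two hours would be terminated, without actually terminating anything (illustrative values):

mrjob terminate-idle-clusters --max-hours-idle 2 --pooled-only --dry-run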

Running jobs

mrjob run (path to script or executable) [options]

Run a job. Takes the same options as invoking a Python job. See Options available to all runners, Hadoop-related options, Dataproc runner options, and EMR runner options. While you can use this command to invoke your jobs, you can just as easily call python my_job.py [options].
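
For example, assuming a job script named my_job.py and an input file input.txt (both placeholders), the following two invocations run the job on EMR in the same way:

mrjob run my_job.py -r emr input.txt > output
python my_job.py -r emr input.txt > output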