mrjob.cmd: The mrjob command-line utility¶
The mrjob command provides a number of sub-commands that help you run and monitor jobs.
The mrjob command comes with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you choose to install mrjob for multiple versions of Python. You may also run it as python -m mrjob.cmd <subcommand>.
audit-emr-usage¶
Audit EMR usage over the past 2 weeks, sorted by cluster name and user.
Usage:
mrjob audit-emr-usage > report

Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-h, --help  show this help message and exit
--max-days-ago MAX_DAYS_AGO  Max number of days ago to look at jobs. By default, we go back as far as EMR supports (currently about 2 months)
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
-v, --verbose  print more messages to stderr
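For example, to audit only the past week and save the report to a file (the report filename is arbitrary):

mrjob audit-emr-usage --max-days-ago 7 > emr-usage-report.txt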
boss¶
Run a command on every node of a cluster. Store stdout and stderr for results in OUTPUT_DIR.
Usage:
mrjob boss CLUSTER_ID [options] "command string"

Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--ec2-key-pair-file EC2_KEY_PAIR_FILE  Path to file containing SSH key for EMR
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-h, --help  show this help message and exit
-o OUTPUT_DIR, --output-dir OUTPUT_DIR  Specify an output directory (default: CLUSTER_ID)
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
--ssh-bin SSH_BIN  Name/path of ssh binary. Arguments are allowed (e.g. --ssh-bin 'ssh -v')
-v, --verbose  print more messages to stderr
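For example, to check disk usage on every node of a cluster and collect each node's output under a local directory (the cluster ID and directory name below are placeholders):

mrjob boss j-1234567890ABC -o boss-output "df -h"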
create-cluster¶
Create a persistent EMR cluster to run jobs in, and print its ID to stdout.
Usage:
mrjob create-cluster

Options:
--additional-emr-info ADDITIONAL_EMR_INFO  A JSON string for selecting additional features on EMR
--applications APPLICATIONS, --application APPLICATIONS  Additional applications to run on 4.x and 5.x AMIs, separated by commas (e.g. "Ganglia,Spark")
--bootstrap BOOTSTRAP  A shell command to set up libraries etc. before any steps (e.g. "sudo apt-get -qy install python3"). You may interpolate files available via URL or locally with Hadoop Distributed Cache syntax ("sudo yum install -y foo.rpm#")
--bootstrap-action BOOTSTRAP_ACTIONS  Raw bootstrap action scripts to run before any of the other bootstrap steps. You can use --bootstrap-action more than once. Local scripts will be automatically uploaded to S3. To add arguments, just use quotes: "foo.sh arg1 arg2"
--bootstrap-mrjob  Automatically zip up the mrjob library and install it when we run the mrjob. This is the default. Use --no-bootstrap-mrjob if you've already installed mrjob on your Hadoop cluster.
--no-bootstrap-mrjob  Don't automatically zip up the mrjob library and install it when we run this job. Use this if you've already installed mrjob on your Hadoop cluster.
--bootstrap-python  Attempt to install a compatible version of Python at bootstrap time. Currently this only does anything for Python 3, for which it is enabled by default.
--no-bootstrap-python  Don't automatically try to install a compatible version of Python at bootstrap time.
--bootstrap-spark  Auto-install Spark on the cluster (even if not needed).
--no-bootstrap-spark  Don't auto-install Spark on the cluster.
--cloud-fs-sync-secs CLOUD_FS_SYNC_SECS  How long to wait for remote FS to reach eventual consistency. This is typically less than a second but the default is 5.0 to be safe.
--cloud-log-dir CLOUD_LOG_DIR  URI on remote FS to write logs into
--cloud-part-size-mb CLOUD_PART_SIZE_MB  Upload files to cloud FS in parts no bigger than this many megabytes. Default is 100 MiB. Set to 0 to disable multipart uploading entirely.
--cloud-upload-part-size CLOUD_PART_SIZE_MB  Deprecated alias for --cloud-part-size-mb
--cloud-tmp-dir CLOUD_TMP_DIR  URI on remote FS to use as our temp directory.
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--core-instance-bid-price CORE_INSTANCE_BID_PRICE  Bid price to specify for core nodes when setting them up as EC2 spot instances (you probably only want to do this for task instances).
--core-instance-type CORE_INSTANCE_TYPE  Type of GCE/EC2 core instance(s) to launch
--ebs-root-volume-gb EBS_ROOT_VOLUME_GB  Size of root EBS volume, in GiB. Must be an integer. Set to 0 to use the default
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--ec2-key-pair EC2_KEY_PAIR  Name of the SSH key pair you set up for EMR
--emr-action-on-failure EMR_ACTION_ON_FAILURE  Action to take when a step fails (e.g. TERMINATE_CLUSTER, CANCEL_AND_WAIT, CONTINUE)
--emr-configuration EMR_CONFIGURATIONS  Configuration to use on 4.x AMIs as a JSON-encoded dict; see http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html for examples
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
--enable-emr-debugging  Enable storage of Hadoop logs in SimpleDB
--disable-emr-debugging  Disable storage of Hadoop logs in SimpleDB (the default)
--extra-cluster-param EXTRA_CLUSTER_PARAMS  extra parameter to pass to cloud API when creating a cluster, to access features not currently supported by mrjob. Takes the form <param>=<value>, where value is JSON or a string. Use <param>=null to unset a parameter
-h, --help  show this help message and exit
--iam-endpoint IAM_ENDPOINT  Force mrjob to connect to IAM on this endpoint (e.g. iam.us-gov.amazonaws.com)
--iam-instance-profile IAM_INSTANCE_PROFILE  EC2 instance profile to use for the EMR cluster -- see "Configure IAM Roles for Amazon EMR" in AWS docs
--iam-service-role IAM_SERVICE_ROLE  IAM service role to use for the EMR cluster -- see "Configure IAM Roles for Amazon EMR" in AWS docs
--image-id IMAGE_ID  ID of custom AWS machine image (AMI) to use
--image-version IMAGE_VERSION  version of EMR/Dataproc machine image to run
--instance-fleets INSTANCE_FLEETS  detailed JSON list of instance fleets, including EBS configuration. See docs for --instance-fleets at http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
--instance-groups INSTANCE_GROUPS  detailed JSON list of EMR instance configs, including EBS configuration. See docs for --instance-groups at http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
--instance-type INSTANCE_TYPE  Type of GCE/EC2 instance(s) to launch. GCE - e.g. n1-standard-1, n1-highcpu-4, n1-highmem-4 -- see https://cloud.google.com/compute/docs/machine-types. EC2 - e.g. m1.medium, c3.xlarge, r3.xlarge -- see http://aws.amazon.com/ec2/instance-types/
--label LABEL  Alternate label for the job, to help us identify it.
--master-instance-bid-price MASTER_INSTANCE_BID_PRICE  Bid price to specify for the master node when setting it up as an EC2 spot instance (you probably only want to do this for task instances).
--master-instance-type MASTER_INSTANCE_TYPE  Type of GCE/EC2 master instance to launch
--max-mins-idle MAX_MINS_IDLE  If we create a cluster, have it automatically terminate itself after it's been idle this many minutes
--num-core-instances NUM_CORE_INSTANCES  Total number of core instances to launch
--num-task-instances NUM_TASK_INSTANCES  Total number of task instances to launch
--owner OWNER  User who ran the job (default is the current user)
--pool-clusters  Add to an existing cluster or create a new one that does not terminate when the job completes.
--no-pool-clusters  Don't run job on a pooled cluster (the default)
--pool-name POOL_NAME  Specify a pool name to join. Default is "default"
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--release-label RELEASE_LABEL  Release Label (e.g. "emr-4.0.0"). Overrides --image-version
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
--subnet SUBNET  ID of Amazon VPC subnet/URI of Google Compute Engine subnetwork to launch cluster in.
--subnets SUBNET  Like --subnet, but with a comma-separated list, to specify multiple subnets in conjunction with --instance-fleets (EMR only)
--tag TAGS  Metadata tags to apply to the EMR cluster; should take the form KEY=VALUE. You can use --tag multiple times
--task-instance-bid-price TASK_INSTANCE_BID_PRICE  Bid price to specify for task nodes when setting them up as EC2 spot instances
--task-instance-type TASK_INSTANCE_TYPE  Type of GCE/EC2 task instance(s) to launch
-v, --verbose  print more messages to stderr
--zone ZONE  GCE zone/AWS availability zone to run Dataproc/EMR jobs in.
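For example, an illustrative invocation that creates a pooled cluster that terminates itself after an hour of idleness, capturing the printed cluster ID in a shell variable (the pool name, instance type, and instance count are placeholders):

CLUSTER_ID=$(mrjob create-cluster --pool-clusters --pool-name nightly --max-mins-idle 60 --core-instance-type c3.xlarge --num-core-instances 4)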
diagnose¶
Print probable cause of error for a failed step.
Currently this only works on EMR.
Usage:
mrjob diagnose [opts] j-CLUSTERID

Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-h, --help  show this help message and exit
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
--step-id STEP_ID  ID of a particular failed step to diagnose
-v, --verbose  print more messages to stderr

New in version 0.6.1.
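For example, to diagnose a failed cluster, optionally narrowing the output to a single step (both IDs below are placeholders):

mrjob diagnose j-1234567890ABC
mrjob diagnose --step-id s-ABCDEFGHIJKLM j-1234567890ABC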
report-long-jobs¶
Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues.
Suggested usage: run this as a daily cron job with the -q option:

0 0 * * * mrjob report-long-jobs

Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-x EXCLUDE, --exclude EXCLUDE  Exclude clusters that match the specified tags. Specified in the form TAG_KEY,TAG_VALUE.
-h, --help  show this help message and exit
--min-hours MIN_HOURS  Minimum number of hours a job can run before we report it. Default: 24.0
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
-v, --verbose  print more messages to stderr
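For example, to report only jobs running longer than 48 hours while skipping clusters carrying a particular tag (the tag key and value are placeholders):

mrjob report-long-jobs --min-hours 48 -x purpose,batch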
s3-tmpwatch¶
Delete all files under a given URI that are older than a specified time. The time parameter defines the threshold for removing files: if a file has not been accessed within that time, it is removed. The time argument is a number with an optional single-character suffix specifying the units: m for minutes, h for hours, d for days. If no suffix is specified, time is in hours.
Suggested usage: run this as a cron job with the -q option:
0 0 * * * mrjob s3-tmpwatch -q 30d s3://your-bucket/tmp/

Usage:

mrjob s3-tmpwatch [options] <time-untouched> <URIs>

Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
-h, --help  show this help message and exit
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
-t, --test  Don't actually delete any files; just log that we would
-v, --verbose  print more messages to stderr
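For example, a test run that only logs which files under a temporary prefix are older than 30 days, followed by the real cleanup (the bucket name is a placeholder):

mrjob s3-tmpwatch -t 30d s3://your-bucket/tmp/
mrjob s3-tmpwatch 30d s3://your-bucket/tmp/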
spark-submit¶
A drop-in replacement for spark-submit that can use mrjob’s runners. For example, you can submit your Spark job to EMR just by adding -r emr.

This also adds a few mrjob features that are not standard with spark-submit, such as --cmdenv, --dirs, and --setup.

New in version 0.6.7.

Changed in version 0.6.8: added the local and spark runners, and made spark the default runner (was hadoop).

Changed in version 0.7.1: --archives and --dirs are supported on all masters (except local).

Usage:

mrjob spark-submit [-r <runner>] [options] <python file | app jar> [app arguments]

Options:
All runners:
-r {emr,hadoop,local,spark}, --runner {emr,hadoop,local,spark}  Where to run the job (default: "spark")
--class MAIN_CLASS  Your application's main class (for Java / Scala apps).
--name NAME  The name of your application.
--jars LIBJARS  Comma-separated list of jars to include on the driver and executor classpaths.
--packages PACKAGES  Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
--exclude-packages EXCLUDE_PACKAGES  Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
--repositories REPOSITORIES  Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
--py-files PY_FILES  Comma-separated list of .zip, .egg, or .py files to be placed on the PYTHONPATH for Python apps.
--files UPLOAD_FILES  Comma-separated list of files to be placed in the working directory of each executor. Ignored on local[*] master.
--archives UPLOAD_ARCHIVES  Comma-separated list of archives to be extracted into the working directory of each executor.
--dirs UPLOAD_DIRS  Comma-separated list of directories to be archived and then extracted into the working directory of each executor.
--cmdenv CMDENV  Arbitrary environment variable to set inside Spark, in the format NAME=VALUE.
--conf JOBCONF  Arbitrary Spark configuration property, in the format PROP=VALUE.
--setup SETUP  A command to run before each Spark executor in the shell ("touch foo"). In cluster mode, runs before the Spark driver as well. You may interpolate files available via URL or on your local filesystem using Hadoop Distributed Cache syntax (". setup.sh#"). To interpolate archives (YARN only), use #/: "cd foo.tar.gz#/; make"
--properties-file PROPERTIES_FILE  Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
--driver-memory DRIVER_MEMORY  Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options DRIVER_JAVA_OPTIONS  Extra Java options to pass to the driver.
--driver-library-path DRIVER_LIBRARY_PATH  Extra library path entries to pass to the driver.
--driver-class-path DRIVER_CLASS_PATH  Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
--executor-memory EXECUTOR_MEMORY  Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user PROXY_USER  User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
-q, --quiet  Don't print anything to stderr
-v, --verbose  print more messages to stderr
-h, --help  show this help message and exit
Spark and Hadoop runners only:
--master SPARK_MASTER  spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local. Defaults to local[*] on the spark runner, yarn on the hadoop runner.
--deploy-mode SPARK_DEPLOY_MODE  Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).

Cluster deploy mode only:
--driver-cores DRIVER_CORES  Number of cores used by the driver (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:
--supervise  If given, restarts the driver on failure.

Spark standalone and Mesos only:
--total-executor-cores TOTAL_EXECUTOR_CORES  Total cores for all executors.

Spark standalone and YARN only:
--executor-cores EXECUTOR_CORES  Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

YARN only:
--queue QUEUE_NAME  The YARN queue to submit to (Default: "default").
--num-executors NUM_EXECUTORS  Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.
--principal PRINCIPAL  Principal to be used to login to KDC, while running on secure HDFS.
--keytab KEYTAB  The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.

This also supports the same runner-specific switches as MRJobs (e.g. --hadoop-bin, --region).
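For example, an illustrative submission of a PySpark script to EMR, setting an environment variable and one Spark property (the script name, variable, property value, and input/output URIs are placeholders):

mrjob spark-submit -r emr --cmdenv TZ=UTC --conf spark.executor.memory=4g my_script.py s3://your-bucket/input/ s3://your-bucket/output/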
terminate-cluster¶
Terminate an existing EMR cluster.
Usage:
mrjob terminate-cluster [options] CLUSTER_ID
Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-h, --help  show this help message and exit
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
-t, --test  Don't actually delete any files; just log that we would
-v, --verbose  print more messages to stderr
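For example, a test run followed by the real termination (the cluster ID is a placeholder):

mrjob terminate-cluster -t j-1234567890ABC
mrjob terminate-cluster j-1234567890ABC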
terminate-idle-clusters¶
Terminate idle EMR clusters that meet the criteria passed in on the command line (or, by default, clusters that have been idle for one hour).
Suggested usage: run this as a cron job with the -q option:

*/30 * * * * mrjob terminate-idle-clusters -q

Changed in version 0.6.4: Skips termination-protected idle clusters, rather than crashing. (This was also backported to mrjob v0.5.12.)
Options:
-c CONF_PATHS, --conf-path CONF_PATHS  Path to alternate mrjob.conf file to read from
--no-conf  Don't load mrjob.conf even if it's available
--dry-run  Don't actually kill idle jobs; just log that we would
--ec2-endpoint EC2_ENDPOINT  Force mrjob to connect to EC2 on this endpoint (e.g. ec2.us-west-1.amazonaws.com). Default is to infer this from region.
--emr-endpoint EMR_ENDPOINT  Force mrjob to connect to EMR on this endpoint (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default is to infer this from region.
-h, --help  show this help message and exit
--max-mins-idle MAX_MINS_IDLE  Max number of minutes a cluster can go without bootstrapping, running a step, or having a new step created. This will fire even if there are pending steps which EMR has failed to start. Make sure you set this higher than the amount of time your jobs can take to start instances and bootstrap.
--max-mins-locked MAX_MINS_LOCKED  Deprecated, does nothing
--pool-name POOL_NAME  Only terminate clusters in the given named pool.
--pooled-only  Only terminate pooled clusters
-q, --quiet  Don't print anything to stderr
--region REGION  GCE/AWS region to run Dataproc/EMR jobs in.
--s3-endpoint S3_ENDPOINT  Force mrjob to connect to S3 on this endpoint (e.g. s3-us-west-1.amazonaws.com). You usually shouldn't set this; by default mrjob will choose the correct endpoint for each S3 bucket based on its location.
--unpooled-only  Only terminate un-pooled clusters
-v, --verbose  print more messages to stderr
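For example, to terminate only pooled clusters that have been idle for more than two hours, after a dry run to see which clusters would be affected:

mrjob terminate-idle-clusters --dry-run --pooled-only --max-mins-idle 120
mrjob terminate-idle-clusters --pooled-only --max-mins-idle 120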