Cloud runner options¶
These options are generally available whenever you run your job on a Hadoop cloud service (AWS Elastic MapReduce or Google Cloud Dataproc).
All options from Options available to all runners and Hadoop-related options are also available on cloud services.
Google credentials¶
See Getting started with Google Cloud for specific instructions about setting these options.
Choosing/creating a cluster to join¶
- cluster_id (--cluster-id) : string Default: automatically create a cluster and use it
The ID of a persistent cluster to run jobs in (on Dataproc, this is the same thing as “cluster name”).
It’s fine for other jobs to be using the cluster; we give our job’s steps a unique ID.
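For example, to submit jobs to an already-running cluster instead of creating a new one, in mrjob.conf (the cluster ID below is a placeholder; use your cluster’s actual ID, or its name on Dataproc):

```yaml
runners:
  emr:
    cluster_id: j-CLUSTERID  # placeholder EMR cluster ID
```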
Job placement¶
- region (--region) : string Default: 'us-west-2' on EMR, 'us-west1' on Dataproc
Geographic region to run jobs in (e.g. us-east-1 on EMR, us-central1 on Dataproc).
If mrjob creates a temporary bucket, it will be created in this region as well.
If you set region, you do not need to set zone; a zone will be chosen for you automatically.
- subnet (--subnet) : string Default: None
Optional subnet(s) to launch your job in.
On Amazon EMR, this is the ID of a VPC subnet to launch the cluster in (e.g. 'subnet-12345678'). This can also be a list of possible subnets if you are using instance_fleets.
On Google Cloud Dataproc, this is the name of a subnetwork (e.g. 'default'). Specifying subnet rather than network ensures that your cluster only has access to one specific geographic region, rather than the entire VPC.
Changed in version 0.6.8: --subnet '' un-sets the subnet on EMR (it used to be ignored).
Changed in version 0.6.3: Works on Google Cloud Dataproc as well as AWS Elastic MapReduce.
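Putting these together, a mrjob.conf fragment pinning both a region and a specific subnet might look like this (the subnet ID is a placeholder):

```yaml
runners:
  emr:
    region: us-east-1
    subnet: subnet-12345678  # placeholder VPC subnet ID
```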
Number and type of instances¶
- instance_type (--instance-type) : string Default: m4.large or m5.xlarge on EMR, n1-standard-1 on Dataproc
Type of instance that runs your Hadoop tasks.
Once you’ve tested a job and want to run it at scale, it’s usually a good idea to use instances larger than the default. For EMR, see Amazon EC2 Instance Types and for Dataproc, see Machine Types.
Note
Many EC2 instance types can only run in a VPC (see subnet).
If you’re running multiple nodes (see num_core_instances), this option doesn’t apply to the master node because it’s just coordinating tasks, not running them. Use master_instance_type instead.
Changed in version 0.6.11: Default on EMR is m5.xlarge on AMI version 5.13.0 and later, m4.large on earlier versions
Changed in version 0.6.10: Default on EMR changed to m5.xlarge
Changed in version 0.6.6: Default on EMR changed to m4.large. Was previously m1.large if running Spark, m1.small if running on the (deprecated) 2.x AMIs, and m1.medium otherwise
- core_instance_type (--core-instance-type) : string Default: value of instance_type
Like instance_type, but only for the core (worker) Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use instance_type.
- num_core_instances (--num-core-instances) : integer Default: 0 on EMR, 2 on Dataproc
Number of core (worker) instances to start up. These run your job and host HDFS. This is in addition to the single master instance.
On Google Cloud Dataproc, this must be at least 2.
- master_instance_type (--master-instance-type) : string Default: (automatic)
Like instance_type, but only for the master Hadoop node. This node hosts the task tracker/resource manager and HDFS, and runs tasks if there are no other nodes.
If you’re running a single node (no num_core_instances or num_task_instances), this defaults to the value of instance_type.
Otherwise, on Dataproc it defaults to n1-standard-1, and on EMR to m1.medium (exception: m1.small on the deprecated 2.x AMIs), which is usually adequate for all but the largest jobs.
- task_instance_type (--task-instance-type) : string Default: value of core_instance_type
Like instance_type, but only for the task (secondary worker) Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use instance_type.
- num_task_instances (--num-task-instances) : integer Default: 0
Number of task (secondary worker) instances to start up. These run your job but do not host HDFS.
You must have at least one core instance (see num_core_instances) to run task instances; otherwise there’s nowhere to host HDFS.
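As a sketch, a mrjob.conf fragment sizing a cluster with these options (the instance types and counts here are illustrative, not recommendations):

```yaml
runners:
  emr:
    instance_type: m5.xlarge        # core and task nodes
    master_instance_type: m5.large  # master only coordinates tasks
    num_core_instances: 4           # run tasks and host HDFS
    num_task_instances: 2           # run tasks only; requires core instances
```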
Cluster software configuration¶
- image_id (--image-id) : string Default: None
ID of a custom machine image.
On EMR, this is complementary with image_version; you can install packages and libraries on your custom AMI, but it’s up to EMR to install Hadoop, create the hadoop user, etc. image_version may not be less than 5.7.0.
You can use describe_base_emr_images() to identify Amazon Linux images that are compatible with EMR.
For more details about how to create a custom AMI that works with EMR, see Best Practices and Considerations.
Note
This is not yet implemented in the Dataproc runner.
New in version 0.6.5.
- image_version (--image-version) : string Default: '6.0.0' on EMR, '1.3' on Dataproc
Machine image version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things.
See the AMI version docs (EMR) or the Dataproc version docs for more details.
You can use this instead of release_label on EMR, even for 4.x+ AMIs; mrjob will just prepend emr- to form the release label.
Changed in version 0.6.12: Default on Dataproc changed from 1.0 to 1.3
Changed in version 0.6.11: Default on EMR is now 5.27.0
Changed in version 0.6.5: Default on EMR is now 5.16.0 (was 5.8.0)
Warning
The 2.x series of AMIs is deprecated by Amazon and not recommended.
Warning
The 1.x series of AMIs is no longer supported because they use Python 2.5.
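For example, to pin a specific EMR release in mrjob.conf (mrjob prepends emr- to form the release label):

```yaml
runners:
  emr:
    image_version: 5.27.0  # becomes release label emr-5.27.0
```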
- bootstrap (--bootstrap) : string list Default: []
A list of lines of shell script to run once on each node in your cluster, at bootstrap time.
This option is complex and powerful. On EMR, the best way to get started is to read the EMR Bootstrapping Cookbook.
Passing expressions like path#name will cause path to be automatically uploaded to the task’s working directory with the filename name, marked as executable, and interpolated into the script by its absolute path on the machine running the script.
path may also be a URI, and ~ and environment variables within path will be resolved based on the local environment. name is optional. For details of parsing, see parse_setup_cmd().
Unlike with setup, archives are not supported (unpack them yourself).
Remember to put sudo before commands requiring root privileges!
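A minimal sketch of bootstrapping extra software on each node (the packages below are arbitrary examples; note the sudo for commands that need root):

```yaml
runners:
  emr:
    bootstrap:
    - sudo yum install -y python3-devel  # example system package
    - sudo pip3 install psutil           # example Python dependency
```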
- bootstrap_python (--bootstrap-python, --no-bootstrap-python) : boolean Default: True on Dataproc, as needed on EMR
Attempt to install a compatible (major) version of Python at bootstrap time, including header files and pip (see Installing Python packages with pip).
The only reason to set this to False is if you want to customize Python/pip installation using bootstrap.
- extra_cluster_params (--extra-cluster-param) : dict Default: {}
An escape hatch that allows you to pass extra parameters to the EMR/Dataproc API at cluster create time, to access API features that mrjob does not yet support.
For EMR, see the API documentation for RunJobFlow for the full list of options.
Option names are strings, and values are data structures. On the command line, pass --extra-cluster-param name=value:
--extra-cluster-param SupportedProducts='["mapr-m3"]' --extra-cluster-param AutoScalingRole=HankPym
value is parsed as JSON if it starts with {, [, or "; otherwise it is treated as a string (this way, malformed JSON raises an error rather than being silently converted to a string). Parameters can be suppressed by setting them to null:
--extra-cluster-param LogUri=null
This also works with Google Dataproc:
--extra-cluster-param labels='{"name": "wrench"}'
In the config file, extra_cluster_params is a dict:
In the config file, extra_cluster_param is a dict:
runners:
  emr:
    extra_cluster_params:
      AutoScalingRole: HankPym
      LogUri: null  # !clear works too
      SupportedProducts:
      - mapr-m3
Changed in version 0.7.2: Dictionaries will be recursively merged into existing parameters. For example:
runners:
  emr:
    extra_cluster_params:
      Instances:
        EmrManagedMasterSecurityGroup: sg-foo
Changed in version 0.6.8: You may use a name with dots in it to set (or unset) nested properties. For example: --extra-cluster-param Instances.EmrManagedMasterSecurityGroup=sg-foo.
Monitoring your job¶
- check_cluster_every (--check-cluster-every) : float Default: 10 seconds on Dataproc, 30 seconds on EMR
How often to check on the status of your job, in seconds.
Changed in version 0.6.5: When the EMR client encounters a transient error, it will wait at least this many seconds before trying again.
- ssh_tunnel (--ssh-tunnel, --no-ssh-tunnel) : boolean Default: False
If True, create an SSH tunnel to the job tracker/resource manager and listen on a randomly chosen port.
On EMR, this requires you to set ec2_key_pair and ec2_key_pair_file. See Configuring SSH credentials for detailed instructions.
On Dataproc, you don’t need to set a key, but you do need to have the gcloud utility installed and set up (make sure you ran gcloud auth login and gcloud config set project <project_id>). See Installing gcloud, gsutil, and other utilities.
Changed in version 0.6.3: Enabled on Google Cloud Dataproc
- ssh_tunnel_is_open (--ssh-tunnel-is-open) : boolean Default: False
If True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner.
Does nothing unless ssh_tunnel is set.
- ssh_bind_ports (--ssh-bind-ports) : list of integers Default: range(40001, 40841)
A list of ports that are safe to listen on.
The main reason to set this is if your firewall blocks the default range of ports, or if you want to pick a single port for consistency.
On the command line, this looks like --ssh-bind-ports 2000[:2001][,2003,2005:2008,etc], where commas separate ranges and colons separate range endpoints.
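A mrjob.conf sketch that opens an SSH tunnel on EMR (the key pair name and file path are placeholders; see Configuring SSH credentials):

```yaml
runners:
  emr:
    ssh_tunnel: true
    ec2_key_pair: EMR                  # placeholder key pair name
    ec2_key_pair_file: ~/.ssh/EMR.pem  # placeholder path to private key
    ssh_bind_ports: [40001]            # optional: pin a single port
```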
Cloud Filesystem¶
- cloud_fs_sync_secs (--cloud-fs-sync-secs) : float Default: 5.0
How long to wait, in seconds, for the cloud filesystem (e.g. S3, GCS) to reach eventual consistency. This is typically less than a second, but the default is 5 seconds to be safe.
- cloud_part_size_mb (--cloud-part-size-mb) : integer Default: 100
Upload files to cloud filesystem in parts no bigger than this many megabytes (technically, mebibytes). Default is 100 MiB.
Set to 0 to disable multipart uploading entirely.
Currently, Amazon requires parts to be between 5 MiB and 5 GiB. mrjob does not enforce these limits.
Changed in version 0.6.3: Enabled on Google Cloud Storage. This used to be called cloud_upload_part_size.
- cloud_tmp_dir (--cloud-tmp-dir) : string Default: (automatic)
Directory on your cloud filesystem to use as temp space (e.g. s3://yourbucket/tmp/, gs://yourbucket/tmp/).
By default, mrjob looks for a bucket belonging to you whose name starts with mrjob- and which matches region. If it can’t find one, it creates one with a random name. This option is then set to tmp/ in this bucket (e.g. s3://mrjob-01234567890abcdef/tmp/).
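For example, to pin temp space to a bucket you manage rather than letting mrjob find or create one (the bucket name is a placeholder):

```yaml
runners:
  emr:
    cloud_tmp_dir: s3://yourbucket/tmp/  # placeholder bucket in your region
```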
Auto-termination¶
- max_mins_idle (--max-mins-idle) : float Default: 10.0
Automatically terminate your cluster after it has been idle at least this many minutes. You cannot turn this off (clusters left idle rack up billing charges).
If your cluster is only running a single job, mrjob will attempt to terminate it as soon as your job finishes. This acts as an additional safeguard, as well as affecting Cluster Pooling on EMR.
Changed in version 0.6.5: EMR’s idle termination script is more robust against sudo shutdown -h now being ignored, and logs the script’s stdout and stderr to /var/log/bootstrap-actions/mrjob-idle-termination.log.
Changed in version 0.6.3: Uses Dataproc’s built-in cluster termination feature rather than a custom script. The API will not allow you to set an idle time of less than 10 minutes.
Changed in version 0.6.2: No matter how small a value you set this to, there is a grace period of 10 minutes between when the idle termination daemon launches and when it may first terminate the cluster, to allow Hadoop to accept your first job.