Cluster Pooling

Clusters on EMR take several minutes to spin up, which can make development painfully slow.

To get around this, mrjob provides cluster pooling. If you set pool_clusters to true, the cluster will stay open after your job completes to accept additional jobs, and will eventually shut itself down after it has been idle for a certain amount of time (by default, ten minutes; see max_mins_idle).
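For example, a minimal mrjob.conf that turns pooling on and keeps idle clusters around for an hour might look like the sketch below (the one-hour timeout is just an illustration; ten minutes is the default):

    runners:
      emr:
        pool_clusters: true
        max_mins_idle: 60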

Note

Pooling is a way to reduce latency, not to save money. Though pooling was originally created to make the most of AWS’s practice of billing by the full hour, that billing practice ended in October 2017.

Pooling is designed so that jobs run with the same version of mrjob and the same (or similar) mrjob.conf can share the same clusters. The options described below affect which cluster a job can join.

Pooled jobs will also only use clusters with the same pool name, so you can use the pool_name option to partition your clusters into separate pools.
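For example, you might give each team its own pool so their jobs never share clusters (the pool name here is just an illustration):

    runners:
      emr:
        pool_clusters: true
        pool_name: data-science   # jobs with this config only share clusters in this pool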

Pooling is flexible about instance type and number of instances. It will attempt to select the cluster with the greatest CPU capacity (based on NormalizedInstanceHours in the cluster summary returned by the ListClusters API call), as long as the cluster’s instances provide at least as much memory and at least as much CPU as your job requests.
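As a rough sketch, assuming the instance_type and num_core_instances options documented elsewhere for the EMR runner, a job requesting hardware like this:

    runners:
      emr:
        pool_clusters: true
        instance_type: m5.xlarge      # illustrative instance type
        num_core_instances: 4         # illustrative instance count

could also join a pooled cluster running larger instances (say, m5.2xlarge), since each of those offers at least as much memory and CPU as the requested type.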

Pooling is also somewhat flexible about EBS volumes (see instance_groups). Each volume must have the same volume type, but larger volumes or volumes with more I/O ops per second are acceptable, as are additional volumes of any type.
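For instance, assuming instance_groups takes the EMR API’s InstanceGroupConfig format (the field names below follow that API, not mrjob itself), a job configured like this:

    runners:
      emr:
        pool_clusters: true
        instance_groups:
        - InstanceRole: MASTER
          InstanceType: m5.xlarge
          InstanceCount: 1
        - InstanceRole: CORE
          InstanceType: m5.xlarge
          InstanceCount: 2
          EbsConfiguration:
            EbsBlockDeviceConfigs:
            - VolumeSpecification:
                VolumeType: gp2
                SizeInGB: 100
              VolumesPerInstance: 1

could join a pooled cluster whose core instances carry larger gp2 volumes, or extra volumes of any type, but not one whose volumes are smaller or of a different type.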

Pooling cannot match configurations with explicitly set ebs_root_volume_gb against clusters that use the default (or vice versa) because the EMR API does not report what the default value is.

If you are using instance_fleets, your jobs will only join other clusters which use instance fleets. The rules are similar, but jobs will only join clusters whose fleets use the same set of instances or a subset; there is no concept of “better” instances.
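As an illustration, assuming instance_fleets takes the EMR API’s InstanceFleetConfig format (field names below follow that API and are not confirmed by this section), a job using fleets like these:

    runners:
      emr:
        pool_clusters: true
        instance_fleets:
        - InstanceFleetType: MASTER
          TargetOnDemandCapacity: 1
          InstanceTypeConfigs:
          - InstanceType: m5.xlarge
        - InstanceFleetType: CORE
          TargetOnDemandCapacity: 4
          InstanceTypeConfigs:
          - InstanceType: m5.xlarge
          - InstanceType: m5.2xlarge

could join a pooled cluster whose core fleet uses both m5.xlarge and m5.2xlarge, or just one of the two, but not one whose fleet includes an instance type outside that set.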

Pooling uses EMR tags to implement a simple “locking” mechanism that keeps two jobs from joining the same cluster simultaneously. Locks automatically expire after a minute (which is more than long enough for a new step to be submitted to the EMR API and enter the RUNNING state).

You can allow jobs to wait for an available cluster instead of immediately starting a new one by specifying a value for --pool-wait-minutes. mrjob will check for an available cluster every 30 seconds for up to pool_wait_minutes; if none becomes available in that time, it will start a new cluster.
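For example, to let a job wait up to five minutes for a pooled cluster to free up before launching its own:

    runners:
      emr:
        pool_clusters: true
        pool_wait_minutes: 5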