Cluster Pooling

Clusters on EMR take several minutes to spin up, which can make development painfully slow.

To get around this, mrjob provides cluster pooling. If you set pool_clusters to true, the cluster will stay open after your job completes so that it can accept additional jobs, and will eventually shut itself down once it has been idle for a certain amount of time (ten minutes by default; see max_mins_idle).
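For example, a minimal mrjob.conf sketch that turns on pooling might look like this (the idle timeout shown is just the default, made explicit):

    runners:
      emr:
        pool_clusters: true   # keep the cluster alive for reuse
        max_mins_idle: 10     # shut down after 10 idle minutes (the default)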

Note

When using cluster pooling prior to v0.6.0, make sure to set max_hours_idle, or your cluster will never shut down.
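On those older versions, a config sketch might look like the following, assuming pooling is enabled under the same option name in the version you are running (the one-hour timeout is just an example):

    runners:
      emr:
        pool_clusters: true
        max_hours_idle: 1   # required before v0.6.0, or the cluster never shuts down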

Pooling is designed so that jobs run against the same mrjob.conf can share the same clusters. This means that the version of mrjob and the bootstrap configuration must match for two jobs to share a cluster. Several other options also affect which cluster a job can join, as described in the paragraphs below.

Pooled jobs will also only use clusters with the same pool name, so you can use the pool_name option to partition your clusters into separate pools.
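For example, you could give each team or workload its own pool; the pool name below is purely illustrative:

    runners:
      emr:
        pool_clusters: true
        pool_name: nightly-etl   # only jobs using this pool name will share the cluster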

Pooling is flexible about instance type and number of instances; it will attempt to select the most powerful cluster available as long as the cluster’s instances provide at least as much memory and at least as much CPU as your job requests.
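As an illustrative sketch (the instance type and count are just examples, using the instance_type and num_core_instances options), a job configured like this could be matched to a pooled cluster running more powerful instances, say m5.2xlarge, since those provide at least as much memory and CPU per instance:

    runners:
      emr:
        pool_clusters: true
        instance_type: m5.xlarge   # minimum acceptable instance size
        num_core_instances: 2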

Pooling is also somewhat flexible about EBS volumes (see instance_groups). Each volume must have the same volume type, but larger volumes or volumes with more I/O ops per second are acceptable, as are additional volumes of any type.
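As a sketch of what an EBS request via instance_groups can look like (the field names follow the EMR API's InstanceGroups structure; the instance types, volume type, and size are illustrative):

    runners:
      emr:
        instance_groups:
        - InstanceRole: MASTER
          InstanceCount: 1
          InstanceType: m5.xlarge
        - InstanceRole: CORE
          InstanceCount: 2
          InstanceType: m5.xlarge
          EbsConfiguration:
            EbsBlockDeviceConfigs:
            - VolumeSpecification:
                VolumeType: gp2
                SizeInGB: 100   # a pooled cluster with larger gp2 volumes is still acceptable
              VolumesPerInstance: 1

Under the rules above, a pooled cluster whose core instances carry the same volume type but larger or faster volumes, or extra volumes of any type, would still be an acceptable match.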

If you are using instance_fleets, your jobs will only join other clusters which use instance fleets. The rules are similar, but jobs will only join clusters whose fleets use the same set of instances or a subset; there is no concept of “better” instances.
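A sketch of a corresponding instance_fleets configuration (field names follow the EMR API; the instance types and capacities are illustrative). A job requesting the core fleet below could join a pooled cluster whose core fleet uses the same two instance types, or just a subset of them:

    runners:
      emr:
        instance_fleets:
        - InstanceFleetType: MASTER
          TargetOnDemandCapacity: 1
          InstanceTypeConfigs:
          - InstanceType: m5.xlarge
        - InstanceFleetType: CORE
          TargetOnDemandCapacity: 2
          InstanceTypeConfigs:
          - InstanceType: m5.xlarge
          - InstanceType: m5.2xlarge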

mrjob’s pooling won’t add more than 1000 steps to a cluster, as the EMR API won’t show more than this many steps. (For very old AMIs there is a stricter limit of 256 steps).

mrjob also uses an S3-based “locking” mechanism to prevent two jobs from simultaneously joining the same cluster. This is somewhat ugly but works in practice, and avoids mrjob depending on Amazon services other than EMR and S3.

Warning

If S3 eventual consistency takes longer than cloud_fs_sync_secs, then you may encounter race conditions when using pooling, e.g. two jobs claiming the same cluster at the same time, or the idle cluster killer shutting down your job before it has started to run. Regions with read-after-write consistency (i.e. every region except US Standard) should not experience these issues.
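If you run in a region where this is a concern, one mitigation is to give S3 more time to catch up by raising cloud_fs_sync_secs (the 30-second value below is just an example):

    runners:
      emr:
        cloud_fs_sync_secs: 30   # wait longer for S3 writes to become visible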

You can allow jobs to wait for an available cluster instead of immediately starting a new one by specifying a value for --pool-wait-minutes. mrjob will check for an available cluster every 30 seconds for up to pool_wait_minutes; if none is found in that time, it will start a new one.
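The same setting in mrjob.conf would look something like this (the five-minute wait is arbitrary):

    runners:
      emr:
        pool_clusters: true
        pool_wait_minutes: 5   # wait up to 5 minutes for a pooled cluster before starting one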