mrjob.dataproc - run on Dataproc

Job Runner

class mrjob.dataproc.DataprocJobRunner(**kwargs)

Runs an MRJob on Google Cloud Dataproc. Invoked when you run your job with -r dataproc.
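As a rough illustration of the -r dataproc invocation, the sketch below builds the command line for a job script. The helper name dataproc_command, the script name, and the bucket paths are hypothetical and not part of mrjob; only the -r dataproc switch comes from the text above, and --output-dir is mrjob's general option for choosing an output location.

```python
import sys


def dataproc_command(script, inputs, output_dir=None):
    """Build an argv list that runs an MRJob script with the Dataproc runner.

    Hypothetical helper for illustration only; it is not part of mrjob.
    """
    # "-r dataproc" selects the Dataproc runner.
    argv = [sys.executable, script, '-r', 'dataproc']
    if output_dir is not None:
        argv += ['--output-dir', output_dir]
    # Input paths come last; they may be local paths or gs:// URIs.
    argv += list(inputs)
    return argv


cmd = dataproc_command(
    'mr_word_count.py',  # hypothetical job script
    ['gs://my-bucket/input/part-*', 'local_input.txt'],  # hypothetical inputs
    output_dir='gs://my-bucket/output/',  # hypothetical bucket
)
print(' '.join(cmd))
```

With Google Cloud credentials configured, the resulting list could be passed to subprocess.run(cmd, check=True) to launch the job.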

DataprocJobRunner runs your job in a Dataproc cluster, which is essentially a temporary Hadoop cluster.

Input, support, and jar files can be either local or on GCS; use gs://... URLs to refer to files on GCS.
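To make the local-versus-GCS distinction concrete, here is a minimal standard-library sketch that tells gs:// URIs apart from local paths and splits a URI into bucket and object name. The function names looks_like_gcs_uri and split_gcs_uri and the example paths are assumptions made for this illustration; this is not how mrjob itself parses paths.

```python
from urllib.parse import urlparse


def looks_like_gcs_uri(path):
    """Return True if path uses the gs:// scheme."""
    return urlparse(path).scheme == 'gs'


def split_gcs_uri(uri):
    """Split gs://bucket/some/object into ('bucket', 'some/object').

    Note: urlparse treats '?' as the start of a query string, so this
    sketch is not suitable for globs containing '?'.
    """
    parsed = urlparse(uri)
    if parsed.scheme != 'gs' or not parsed.netloc:
        raise ValueError('not a GCS URI: %r' % (uri,))
    return parsed.netloc, parsed.path.lstrip('/')


# Hypothetical mix of local and GCS paths.
for path in ['data/input.txt', 'gs://my-bucket/data/input.txt']:
    kind = 'GCS' if looks_like_gcs_uri(path) else 'local'
    print(kind, path)
```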

This class also provides utilities for talking directly to GCS and Dataproc, so it can be useful to instantiate it without a script:

from mrjob.dataproc import DataprocJobRunner

GCS Utilities

class mrjob.dataproc.GCSFilesystem(local_tmp_dir=None, credentials=None, project_id=None)

Filesystem for Google Cloud Storage (GCS) URIs. Typically you will get one of these via DataprocJobRunner().fs, where it is composed with SSHFilesystem and LocalFilesystem.
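The following sketch shows one way the filesystem obtained from a runner might be used. It relies only on the ls() method that mrjob filesystems expose, and takes the filesystem as a parameter so it can be tried without credentials. The helper names list_paths and print_listing and the bucket name are hypothetical, and the commented usage assumes Google Cloud credentials are already configured.

```python
def list_paths(fs, path_glob):
    """Return the sorted paths matching path_glob.

    fs is any object with an mrjob-style ls(path_glob) method that
    yields matching paths, such as the filesystem on a runner.
    """
    return sorted(fs.ls(path_glob))


def print_listing(fs, path_glob):
    """Print one matching path per line."""
    for path in list_paths(fs, path_glob):
        print(path)


# With credentials configured, this could be used roughly as follows
# (the bucket name is hypothetical):
#
#     from mrjob.dataproc import DataprocJobRunner
#     runner = DataprocJobRunner()
#     print_listing(runner.fs, 'gs://my-bucket/logs/*')
```

Passing the filesystem in, rather than constructing the runner inside the helper, keeps the helper usable with any of the composed filesystems and easy to exercise with a stand-in object.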