mrjob.dataproc - run on Dataproc

Job Runner

class mrjob.dataproc.DataprocJobRunner(**kwargs)

Runs an MRJob on Google Cloud Dataproc. Invoked when you run your job with -r dataproc.

DataprocJobRunner runs your job in a Dataproc cluster, which is basically a temporary Hadoop cluster.

Input, support, and jar files can be either local or on GCS; use gs://... URLs to refer to files on GCS.
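For illustration, a gs:// URI has the familiar bucket-plus-object-path shape, and can be split with nothing but the standard library. This helper is hypothetical, not part of mrjob (which does its own URI handling internally):

    from urllib.parse import urlparse

    def split_gcs_uri(uri):
        """Split a gs://bucket/path URI into (bucket, object_name).

        Hypothetical helper for illustration only.
        """
        parts = urlparse(uri)
        if parts.scheme != 'gs':
            raise ValueError('not a gs:// URI: %r' % uri)
        return parts.netloc, parts.path.lstrip('/')

    print(split_gcs_uri('gs://my-bucket/path/to/input.txt'))
    # → ('my-bucket', 'path/to/input.txt')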

This class has some useful utilities for talking directly to GCS and Dataproc, so you may find it handy to instantiate it without a script:

from mrjob.dataproc import DataprocJobRunner
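Once imported, the runner can be instantiated directly to get at its filesystem. A minimal sketch, assuming mrjob and its Google Cloud dependencies are installed and credentials are configured; conf_paths=[] just skips loading mrjob.conf, and the bucket name is hypothetical:

    from mrjob.dataproc import DataprocJobRunner

    # Instantiate without a job script; skip loading mrjob.conf.
    runner = DataprocJobRunner(conf_paths=[])

    # runner.fs is a filesystem object that understands gs:// URIs.
    # For example, this (network call, not run here) would list objects
    # under a hypothetical bucket:
    # for uri in runner.fs.ls('gs://my-bucket/logs/'):
    #     print(uri)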

GCS Utilities

class mrjob.dataproc.GCSFilesystem(credentials=None, project_id=None, part_size=None, location=None, object_ttl_days=None)

Filesystem for Google Cloud Storage (GCS) URIs

  • credentials – an optional google.auth.credentials.Credentials, used to initialize the storage client
  • project_id – an optional project ID, used to initialize the storage client
  • part_size – Part size for multi-part uploading, in bytes, or None
  • location – Default location to use when creating a bucket
  • object_ttl_days – Default object expiry, in days, for newly created buckets
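The filesystem can also be constructed on its own. A minimal sketch, assuming the google-cloud-storage dependency is installed and default credentials are available; the project ID and bucket name are hypothetical:

    from mrjob.dataproc import GCSFilesystem

    # Construct with an explicit project; other arguments keep their defaults.
    fs = GCSFilesystem(project_id='my-project', location='us-central1',
                       object_ttl_days=28)

    # fs speaks gs:// URIs, e.g. (network calls, shown but not run here):
    # fs.ls('gs://my-bucket/')
    # fs.exists('gs://my-bucket/input.txt')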

Changed in version 0.7.0: removed local_tmp_dir

Changed in version 0.6.8: deprecated local_tmp_dir, added part_size, location, object_ttl_days