mrjob.dataproc - run on Dataproc

Job Runner
class mrjob.dataproc.DataprocJobRunner(**kwargs)

    Runs an MRJob on Google Cloud Dataproc. Invoked when you run your job with -r dataproc.

    DataprocJobRunner runs your job in a Dataproc cluster, which is basically a temporary Hadoop cluster.

    Input, support, and jar files can be either local or on GCS; use gs://... URLs to refer to files on GCS.

    This class has some useful utilities for talking directly to GCS and Dataproc, so you may find it useful to instantiate it without a script:

    from mrjob.dataproc import DataprocJobRunner
    ...
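    A minimal sketch of that pattern is below. The conf_paths=[] argument (to skip loading mrjob.conf) and the bucket URI are assumptions for a self-contained example, not values from the docs; runner.fs is mrjob's filesystem abstraction, whose ls() yields matching URIs.

    from mrjob.dataproc import DataprocJobRunner

    # Instantiate the runner without a script to reuse its GCS/Dataproc plumbing.
    # conf_paths=[] skips loading mrjob.conf (assumed here for self-containment).
    runner = DataprocJobRunner(conf_paths=[])

    # List objects under a (placeholder) GCS prefix via the runner's filesystem.
    for uri in runner.fs.ls('gs://your-bucket/logs/'):
        print(uri)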
GCS Utilities
class mrjob.dataproc.GCSFilesystem(credentials=None, project_id=None, part_size=None, location=None, object_ttl_days=None)

    Filesystem for Google Cloud Storage (GCS) URIs.

    Parameters:
        - credentials – an optional google.auth.credentials.Credentials, used to initialize the storage client
        - project_id – an optional project ID, used to initialize the storage client
        - part_size – part size for multi-part uploading, in bytes, or None
        - location – default location to use when creating a bucket
        - object_ttl_days – default object expiry (in days) for newly created buckets

    Changed in version 0.7.0: removed local_tmp_dir.

    Changed in version 0.6.8: deprecated local_tmp_dir; added part_size, location, object_ttl_days.
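    A minimal usage sketch, assuming application default credentials are available via google.auth; the project ID and bucket URIs are placeholders. ls() and exists() come from mrjob's common filesystem interface.

    from mrjob.dataproc import GCSFilesystem

    # Build a filesystem for gs:// URIs; credentials fall back to google.auth
    # defaults when not passed explicitly (assumed here).
    fs = GCSFilesystem(project_id='your-project-id')

    # Check for a bucket, then list objects under a prefix.
    if fs.exists('gs://your-bucket/'):
        for uri in fs.ls('gs://your-bucket/data/'):
            print(uri)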