Dataproc Quickstart
===================

.. _google-setup:

Getting started with Google Cloud
---------------------------------

Using mrjob with Google Cloud Dataproc is as simple as creating an account,
enabling Google Cloud Dataproc, and creating credentials.

Creating an account
^^^^^^^^^^^^^^^^^^^

* Go to `cloud.google.com <https://cloud.google.com/>`__.
* Click the circle in the upper right, and select your Google account (if you
  don't have one, sign up `here `__). `If you have multiple Google accounts,
  sign out first, and then sign into the account you want to use.`
* Click **Try it Free** in the upper right
* Enter your name and payment information
* Wait a few minutes while your first project is created

Enabling Google Cloud Dataproc
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Go `here `__ (or search for "dataproc" under **APIs & Services > Library**
  in the upper left-hand menu)
* Click **Enable**

.. _google-credentials-setup:

Configuring Google Cloud credentials
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Go `here `__ (or pick **APIs & Services > Credentials** in the upper
  left-hand menu)
* Pick **Create credentials > Service account key**
* Select **Compute engine default service account**
* Click **Create** to download a JSON file

Then you should either install and set up the optional :command:`gcloud`
utility (see below) or point **$GOOGLE_APPLICATION_CREDENTIALS** at the file
you downloaded
(``export GOOGLE_APPLICATION_CREDENTIALS="/path/to/Your Credentials.json"``).

.. _installing-gcloud:

Installing gcloud, gsutil, and other utilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

mrjob does not require you to install the :command:`gcloud` command in order
to run jobs on Google Cloud Dataproc, unless you want to set up an SSH tunnel
to the Hadoop resource manager (see :mrjob-opt:`ssh_tunnel`).

The :command:`gcloud` command can be very useful for monitoring your job. The
:command:`gsutil` utility, packaged with it, is very helpful for dealing with
Google Storage, the cloud filesystem that Google Cloud Dataproc uses.

To install :command:`gcloud` and :command:`gsutil`:

* Follow `these three steps `__ to install the utilities
* Log in with your Google credentials (these will launch a browser):

  * :command:`gcloud auth login`
  * :command:`gcloud auth application-default login`

On some versions of gcloud, you may have to manually configure the project ID
for :mrjob-opt:`ssh_tunnel` to work:

* run :command:`gcloud projects list` to get your project ID
* :command:`gcloud config set project <project ID>`

It's also helpful to set :command:`gcloud`\'s :mrjob-opt:`region` and
:mrjob-opt:`zone` to match mrjob's defaults:

* :command:`gcloud config set compute/region us-west1`
* :command:`gcloud config set compute/zone us-west1-a`
* :command:`gcloud config set dataproc/region us-west1`
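Once your credentials are in place (either via
**$GOOGLE_APPLICATION_CREDENTIALS** or
:command:`gcloud auth application-default login`), you can optionally
sanity-check that Google's client libraries can see them. The snippet below
is just a sketch, not something mrjob requires; it assumes the
``google-auth`` package is available, which it generally will be once
mrjob's Google Cloud dependencies are installed::

    # check whether Application Default Credentials can be found
    import google.auth
    from google.auth.exceptions import DefaultCredentialsError

    try:
        credentials, project_id = google.auth.default()
        print('found credentials for project: %s' % project_id)
    except DefaultCredentialsError:
        print('no credentials found; set $GOOGLE_APPLICATION_CREDENTIALS or'
              ' run "gcloud auth application-default login"')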
.. _running-a-dataproc-job:

Running a Dataproc Job
----------------------

Running a job on Dataproc is just like running it locally or on your own
Hadoop cluster, with the following changes:

* The job and related files are uploaded to GCS before being run
* The job is run on Dataproc (of course)
* Output is written to GCS before mrjob streams it to stdout locally
* The Hadoop version is specified by the `Dataproc version `_

The output of this command should be identical to the output shown in
:doc:`quickstart`, but it should take much longer::

    > python word_count.py -r dataproc README.rst
    "chars" 3654
    "lines" 123
    "words" 417

Sending Output to a Specific Place
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you'd rather have your output go somewhere deterministic on GCS, use
:option:`--output-dir`::

    > python word_count.py -r dataproc README.rst \
    >   --output-dir=gs://my-bucket/wc_out/

.. _picking-dataproc-cluster-config:

Choosing Type and Number of GCE Instances
-----------------------------------------

When you create a cluster on Dataproc, you'll have the option of specifying
the number and type of GCE instances, which are basically virtual machines.
Each instance type has different memory, CPU, I/O, and network
characteristics, and costs a different amount of money. See
`Machine Types `_ and `Pricing `_ for details.

Instances perform one of three roles:

* **Master**: There is always one master instance. It handles scheduling of
  tasks (i.e. mappers and reducers), but does not run them itself.
* **Worker**: You may have one or more worker instances. These run tasks and
  host HDFS.
* **Preemptible Worker**: You may have zero or more of these. These run
  tasks, but do *not* host HDFS. This is mostly useful because your cluster
  can lose task instances without killing your job (see
  `Preemptible VMs `_).

By default, :py:mod:`mrjob` runs a single ``n1-standard-1``, which is a cheap
but not very powerful instance type. This can be quite adequate for testing
your code on a small subset of your data, but otherwise gives little
advantage over running a job locally. To get more performance out of your
job, you can either add more instances, use more powerful instances, or both.

Here are some things to consider when tuning your instance settings:

* Google Cloud bills you for a 10-minute minimum even if your cluster only
  lasts for a few minutes (this is an artifact of the Google Cloud billing
  structure), so for many jobs that you run repeatedly, it is a good strategy
  to pick instance settings that make your job consistently run in a little
  less than 10 minutes.
* Your job will take much longer and may fail if any task (usually a reducer)
  runs out of memory and starts using swap. (You can verify this by using
  :command:`vmstat`.) Restructuring your job is often the best solution, but
  if you can't, consider using a high-memory instance type.
* Larger instance types are usually a better deal if you have the workload to
  justify them. For example, an ``n1-highcpu-8`` costs about 6 times as much
  as an ``n1-standard-1``, but it has about 8 times as much processing power
  (and more memory).

The basic way to control the type and number of instances is with the
:mrjob-opt:`instance_type` and :mrjob-opt:`num_core_instances` options, on
the command line like this::

    --instance-type n1-highcpu-8 --num-core-instances 4

or in :py:mod:`mrjob.conf`, like this::

    runners:
      dataproc:
        instance_type: n1-highcpu-8
        num_core_instances: 4
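If you'd rather configure and launch the job from Python instead of the
command line, a sketch along these lines should work. It assumes a recent
version of mrjob (0.6 or later) and the ``word_count.py`` job from
:doc:`quickstart` (which defines ``MRWordFrequencyCount``), and it uses the
same switches shown above::

    # a sketch: run the quickstart's word count on Dataproc with four
    # n1-highcpu-8 workers, entirely from Python
    from word_count import MRWordFrequencyCount

    job = MRWordFrequencyCount(args=[
        '-r', 'dataproc',
        '--instance-type', 'n1-highcpu-8',
        '--num-core-instances', '4',
        'README.rst',
    ])

    with job.make_runner() as runner:
        runner.run()
        # parse_output()/cat_output() are available in mrjob 0.6+
        for key, value in job.parse_output(runner.cat_output()):
            print(key, value)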
In most cases, your master instance type doesn't need to be larger than
``n1-standard-1``, since it only schedules tasks; *instance_type* only
applies to instances that actually run tasks. (In the example above, you get
one ``n1-standard-1`` master instance and four ``n1-highcpu-8`` worker
instances.) You *will* need a larger master instance if you have a very
large number of input files; in this case, use the
:mrjob-opt:`master_instance_type` option.

If you want to run preemptible instances, use the
:mrjob-opt:`task_instance_type` and :mrjob-opt:`num_task_instances` options.
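As a final sketch, here is how you might combine a bigger master instance
with a couple of preemptible workers from Python. The machine types chosen
here are arbitrary examples, and the dashed switches are assumed to follow
mrjob's usual option-name-to-switch convention; check
:command:`python word_count.py -r dataproc --help` if in doubt::

    # a sketch: larger master plus two preemptible (task) instances;
    # the machine types below are examples, not recommendations
    from word_count import MRWordFrequencyCount

    job = MRWordFrequencyCount(args=[
        '-r', 'dataproc',
        '--master-instance-type', 'n1-highmem-2',   # assumed switch name
        '--task-instance-type', 'n1-highcpu-8',     # assumed switch name
        '--num-task-instances', '2',                # assumed switch name
        'README.rst',
    ])

    with job.make_runner() as runner:
        runner.run()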