Dataproc Quickstart

Getting started with Google Cloud

Using mrjob with Google Cloud Dataproc is as simple as creating an account, enabling Google Cloud Dataproc, and creating credentials.

Creating an account

  • Go to cloud.google.com.
  • Click the circle in the upper right, and select your Google account (if you don’t have one, sign up here). If you have multiple Google accounts, sign out first, and then sign into the account you want to use.
  • Click Try it Free in the upper right
  • Enter your name and payment information
  • Wait a few minutes while your first project is created

Enabling Google Cloud Dataproc

  • Go here (or search for “dataproc” under APIs & Services > Library in the upper left-hand menu)
  • Click Enable

Configuring Google Cloud credentials

  • Go here (or pick APIs & Services > Credentials in the upper left-hand menu)
  • Pick Create credentials > Service account key
  • Select Compute Engine default service account
  • Click Create to download a JSON file.

Then you should either install and set up the optional gcloud utility (see below) or point $GOOGLE_APPLICATION_CREDENTIALS at the file you downloaded (export GOOGLE_APPLICATION_CREDENTIALS="/path/to/Your Credentials.json").
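
If you want to confirm that your credentials can be found, here is a minimal sanity check. It assumes the google-auth Python package is installed (it is pulled in by the Google Cloud client libraries that mrjob’s Dataproc support uses), and is not part of mrjob itself:

# check that Application Default Credentials resolve; project may be
# None if no default project is configured yet
import google.auth

credentials, project = google.auth.default()
print(project)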

Installing gcloud, gsutil, and other utilities

mrjob does not require you to install the gcloud command in order to run jobs on Google Cloud Dataproc, unless you want to set up an SSH tunnel to the Hadoop resource manager (see ssh_tunnel).

The gcloud command can be very useful for monitoring your job. The gsutil utility, packaged with it, is very helpful for dealing with Google Cloud Storage (GCS), the cloud filesystem that Google Cloud Dataproc uses.
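
For example, once installed, gsutil lets you create a bucket, copy files into it, and list its contents (my-bucket below is just a placeholder; bucket names must be globally unique):

> gsutil mb gs://my-bucket
> gsutil cp README.rst gs://my-bucket/
> gsutil ls gs://my-bucket/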

To install gcloud and gsutil:

  • Follow these three steps to install the utilities
  • Log in with your Google credentials (these commands will launch a browser):
      • gcloud auth login
      • gcloud auth application-default login

On some versions of gcloud, you may have to manually configure the project ID for ssh_tunnel to work:

  • Run gcloud projects list to get your project ID
  • Run gcloud config set project <project_id>

It’s also helpful to set gcloud’s region and zone to match mrjob’s defaults:

  • gcloud config set compute/region us-west1
  • gcloud config set compute/zone us-west1-a
  • gcloud config set dataproc/region us-west1

Running a Dataproc Job

Running a job on Dataproc is just like running it locally or on your own Hadoop cluster, with the following changes:

  • The job and related files are uploaded to GCS before being run
  • The job is run on Dataproc (of course)
  • Output is written to GCS before mrjob streams it to stdout locally
  • The Hadoop version is specified by the Dataproc version

The output of this command should be identical to the output shown in Fundamentals, but it will take much longer to appear:

> python word_count.py -r dataproc README.rst
"chars" 3654
"lines" 123
"words" 417

Sending Output to a Specific Place

If you’d rather have your output go somewhere deterministic on GCS, use --output-dir:

> python word_count.py -r dataproc README.rst \
>   --output-dir=gs://my-bucket/wc_out/
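
You can also launch the same job from Python and read the results back programmatically. Here is a minimal sketch, assuming mrjob 0.6+ (where cat_output() and parse_output() are available) and that word_count.py defines the MRWordFrequencyCount class sketched above:

from word_count import MRWordFrequencyCount

job = MRWordFrequencyCount(args=[
    '-r', 'dataproc', '--output-dir', 'gs://my-bucket/wc_out/',
    'README.rst'])

with job.make_runner() as runner:
    runner.run()
    # stream the job's output back from GCS and decode it with the
    # job's output protocol
    for key, value in job.parse_output(runner.cat_output()):
        print(key, value)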

Choosing Type and Number of GCE Instances

When you create a cluster on Dataproc, you’ll have the option of specifying a number and type of GCE instances, which are basically virtual machines. Each instance type has different memory, CPU, I/O and network characteristics, and costs a different amount of money. See Machine Types and Pricing for details.

Instances perform one of three roles:

  • Master: There is always one master instance. It handles scheduling of tasks (i.e. mappers and reducers), but does not run them itself.
  • Worker: You may have one or more worker instances. These run tasks and host HDFS.
  • Preemptible Worker: You may have zero or more of these. These run tasks, but do not host HDFS. This is mostly useful because your cluster can lose task instances without killing your job (see Preemptible VMs).

By default, mrjob runs a single n1-standard-1, which is a cheap but not very powerful instance type. This can be quite adequate for testing your code on a small subset of your data, but otherwise gives little advantage over running a job locally. To get more performance out of your job, you can add more instances, use more powerful instances, or both.

Here are some things to consider when tuning your instance settings:

  • Google Cloud bills you for a 10-minute minimum even if your cluster only lasts a few minutes, so for jobs that you run repeatedly, it is often a good strategy to pick instance settings that make your job consistently finish in a little less than 10 minutes.
  • Your job will take much longer and may fail if any task (usually a reducer) runs out of memory and starts using swap. (You can verify this by using vmstat.) Restructuring your job is often the best solution, but if you can’t, consider using a high-memory instance type.
  • Larger instance types are usually a better deal if you have the workload to justify them. For example, an n1-highcpu-8 costs about 6 times as much as an n1-standard-1, but has about 8 times as much processing power (and more memory), which works out to roughly 25% less cost per unit of compute.

The basic way to control type and number of instances is with the instance_type and num_core_instances options, on the command line like this:

--instance-type n1-highcpu-8 --num-core-instances 4

or in mrjob.conf, like this:

runners:
  dataproc:
    instance_type: n1-highcpu-8
    num_core_instances: 4

In most cases, your master instance type doesn’t need to be larger than n1-standard-1 to schedule tasks; instance_type only applies to instances that actually run tasks. (In this example, there is one n1-standard-1 master instance and four n1-highcpu-8 worker instances.) You will need a larger master instance if you have a very large number of input files; in this case, use the master_instance_type option.

If you want to run preemptible instances, use the task_instance_type and num_task_instances options.
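
For example, to add preemptible workers alongside the core instances from the example above, you might pass something like this on the command line (a sketch; tune the types and counts to your own workload, and remember that preemptible workers can disappear at any time):

--instance-type n1-highcpu-8 --num-core-instances 4 \
  --task-instance-type n1-highcpu-8 --num-task-instances 8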