Elastic MapReduce Quickstart
============================

.. _amazon-setup:

Configuring AWS credentials
---------------------------

Configuring your AWS credentials allows mrjob to run your jobs on Elastic
MapReduce and use S3.

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Go to `Security Credentials
  <https://console.aws.amazon.com/iam/home#security_credential>`__ in the
  login menu (upper right, third from the right), say yes, you want to
  proceed, click on **Access Keys**, and then **Create New Access Key**.
  Make sure to copy the secret access key, as there is no way to recover
  it after creation.

Now you can either set the environment variables :envvar:`AWS_ACCESS_KEY_ID`
and :envvar:`AWS_SECRET_ACCESS_KEY`, or set **aws_access_key_id** and
**aws_secret_access_key** in your ``mrjob.conf`` file like this::

    runners:
      emr:
        aws_access_key_id: <your key ID>
        aws_secret_access_key: <your secret>

.. _ssh-tunneling:

Configuring SSH credentials
---------------------------

Configuring your SSH credentials lets mrjob open an SSH tunnel to your jobs'
master nodes to view live progress, see the job tracker in your browser, and
fetch error logs quickly.

* Go to https://console.aws.amazon.com/ec2/home
* Make sure the **Region** dropdown (upper right, second from the right)
  matches the region you want to run jobs in (usually "Oregon").
* Click on **Key Pairs** (left sidebar, under **Network & Security**)
* Click on **Create Key Pair** (top left).
* Name your key pair ``EMR`` (any name will work, but that's what we're
  using in this example)
* Save :file:`EMR.pem` wherever you like (``~/.ssh`` is a good place)
* Run ``chmod og-rwx /path/to/EMR.pem`` so that ``ssh`` will be happy
* Add the following entries to your :py:mod:`mrjob.conf`::

    runners:
      emr:
        ec2_key_pair: EMR
        ec2_key_pair_file: /path/to/EMR.pem # ~/ and $ENV_VARS allowed here
        ssh_tunnel: true

.. _running-an-emr-job:

Running an EMR Job
------------------

Running a job on EMR is just like running it locally or on your own Hadoop
cluster, with the following changes:

* The job and related files are uploaded to S3 before being run
* The job is run on EMR (of course)
* Output is written to S3 before mrjob streams it to stdout locally
* The Hadoop version is specified by the EMR AMI version

The output of this command should be identical to the output shown in
:doc:`quickstart`, but it should take much longer::

    > python word_count.py -r emr README.rst
    "chars" 3654
    "lines" 123
    "words" 417
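If you don't have the :doc:`quickstart` example handy, a minimal
``word_count.py`` looks roughly like this (a sketch using the standard
:py:class:`~mrjob.job.MRJob` interface; the class name is just for
illustration)::

    from mrjob.job import MRJob


    class MRWordCount(MRJob):
        """Illustrative minimal word-count job (see the quickstart for
        the real example)."""

        def mapper(self, _, line):
            # input is read line by line; the input key is unused
            yield "chars", len(line)
            yield "words", len(line.split())
            yield "lines", 1

        def reducer(self, key, values):
            # sum the per-line counts for each key
            yield key, sum(values)


    if __name__ == '__main__':
        MRWordCount.run()

Note that nothing in the job itself is EMR-specific; the ``-r emr`` switch
alone decides where it runs.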
Sending Output to a Specific Place
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you'd rather have your output go somewhere deterministic on S3, use
:option:`--output-dir`::

    > python word_count.py -r emr README.rst \
    >   --output-dir=s3://my-bucket/wc_out/

There are many other ins and outs of effectively using mrjob with EMR. See
:doc:`emr-advanced` for some of the ins, but the outs are left as an exercise
for the reader. This is a strictly no-outs body of documentation!

.. _picking-emr-cluster-config:

Choosing Type and Number of EC2 Instances
-----------------------------------------

When you create a cluster on EMR, you'll have the option of specifying a
number and type of EC2 instances, which are basically virtual machines. Each
instance type has different memory, CPU, I/O and network characteristics, and
costs a different amount of money. See `Instance Types
<http://aws.amazon.com/ec2/instance-types/>`_ and `Pricing
<http://aws.amazon.com/elasticmapreduce/pricing/>`_ for details.

Instances perform one of three roles:

* **Master**: There is always one master instance. It handles scheduling of
  tasks (i.e. mappers and reducers), but does not run them itself.
* **Core**: You may have one or more core instances. These run tasks and
  host HDFS.
* **Task**: You may have zero or more of these. These run tasks, but do
  *not* host HDFS. This is mostly useful because your cluster can lose task
  instances without killing your job (see :ref:`spot-instances`).

There's a special case where your cluster *only* has a single master
instance, in which case the master instance schedules tasks, runs them, and
hosts HDFS.

By default, :py:mod:`mrjob` runs a single ``m1.medium``, which is a cheap but
not very powerful instance type. This can be quite adequate for testing your
code on a small subset of your data, but otherwise gives little advantage
over running a job locally. To get more performance out of your job, you can
either add more instances, use more powerful instances, or both.

Here are some things to consider when tuning your instance settings:

* Your job will take much longer and may fail if any task (usually a reducer)
  runs out of memory and starts using swap. (You can verify this by running
  :command:`mrjob boss j-CLUSTERID vmstat` and then looking in
  ``j-CLUSTERID/*/stdout``.) Restructuring your job is often the best
  solution, but if you can't, consider using a high-memory instance type.
* Larger instance types are usually a better deal if you have the workload to
  justify them. For example, a ``c1.xlarge`` costs about 6 times as much as an
  ``m1.medium``, but it has about 8 times as much processing power (and more
  memory).

The basic way to control the type and number of instances is with the
*instance_type* and *num_core_instances* options, on the command line like
this::

    --instance-type c1.medium --num-core-instances 4

or in :py:mod:`mrjob.conf`, like this::

    runners:
      emr:
        instance_type: c1.medium
        num_core_instances: 4

In most cases, your master instance type doesn't need to be larger than
``m1.medium`` to schedule tasks, so *instance_type* only applies to the 4
instances that actually run tasks. You *will* need a larger master instance
if you have a very large number of input files; in this case, use the
*master_instance_type* option.

The *num_task_instances* option can be used to run 1 or more task instances
(these run tasks but don't host HDFS). There are also *core_instance_type*
and *task_instance_type* options if you want to set these directly.
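All of the above can also be driven from Python rather than the shell. Here
is a sketch, assuming the classic runner interface
(:py:meth:`~mrjob.job.MRJob.make_runner` with ``stream_output()`` and
``parse_output_line()``; newer mrjob releases may spell the output helpers
differently) and assuming the illustrative ``MRWordCount`` class from earlier
is saved as ``word_count.py``::

    from word_count import MRWordCount

    # pass the same EMR switches shown in the examples above
    mr_job = MRWordCount(args=[
        '-r', 'emr',
        '--instance-type', 'c1.medium',
        '--num-core-instances', '4',
        'README.rst',
    ])

    with mr_job.make_runner() as runner:
        runner.run()
        # stream the results back from S3 as (key, value) pairs
        for line in runner.stream_output():
            key, value = mr_job.parse_output_line(line)
            print(key, value)

This pattern is handy when you want to wrap an EMR run inside a larger
pipeline or a test harness instead of shelling out.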