Troubleshooting
===============

Many things can go wrong in an EMR job, and the system's distributed nature
can make it difficult to find the source of a problem. :py:mod:`mrjob` attempts to
simplify the debugging process by automatically scanning logs for probable
causes of failure.

In addition to looking at S3, :py:mod:`mrjob` can be configured to
also use SSH to fetch error logs directly from the master and worker nodes.
This can speed up debugging significantly (EMR only transfers logs to S3
every five minutes).

Using persistent clusters
--------------------------

When troubleshooting a job, it can be convenient to use a persistent cluster
to avoid having to wait for bootstrapping every run.

First, use the :command:`mrjob create-cluster` to create a
persistent cluster::

    $ mrjob create-cluster
    Using configs in /Users/davidmarin/.mrjob.conf
    Using s3://mrjob-35cdec11663cb1cb/tmp/ as our temp dir on S3
    Creating persistent cluster to run several jobs in...
    Creating temp directory /var/folders/zv/jmtt5bxs6xl3kzt38470hcxm0000gn/T/no_script.davidmarin.20160324.231018.720057
    Copying local files to s3://mrjob-35cdec11663cb1cb/tmp/no_script.davidmarin.20160324.231018.720057/files/...
    j-3BYHP30KB81XE

Now you can use the cluster ID to start the troublesome job::

    $ python mrjob/examples/mr_boom.py README.rst -r emr --cluster-id j-3BYHP30KB81XE
    Using configs in /Users/davidmarin/.mrjob.conf
    Using s3://mrjob-35cdec11663cb1cb/tmp/ as our temp dir on S3
    Creating temp directory /var/folders/zv/jmtt5bxs6xl3kzt38470hcxm0000gn/T/mr_boom.davidmarin.20160324.231045.501027
    Copying local files to s3://mrjob-35cdec11663cb1cb/tmp/mr_boom.davidmarin.20160324.231045.501027/files/...
    Adding our job to existing cluster j-3BYHP30KB81XE
    Waiting for step 1 of 1 (s-SGVW9B5LEXF5) to complete...
      PENDING (cluster is STARTING: Provisioning Amazon EC2 capacity)
      PENDING (cluster is STARTING: Provisioning Amazon EC2 capacity)
      PENDING (cluster is STARTING: Provisioning Amazon EC2 capacity)
      PENDING (cluster is STARTING: Provisioning Amazon EC2 capacity)
      PENDING (cluster is STARTING: Provisioning Amazon EC2 capacity)
      PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
      PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
      PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
      PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
      PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
      Opening ssh tunnel to resource manager...
      Connect to resource manager at: http://localhost:40069/cluster
      RUNNING for 9.2s
      RUNNING for 42.3s
         0.0% complete
      RUNNING for 72.6s
         5.0% complete
      RUNNING for 102.9s
         5.0% complete
      RUNNING for 133.4s
       100.0% complete
      FAILED
    Cluster j-3BYHP30KB81XE is WAITING: Cluster ready after last step failed.
    Attempting to fetch counters from logs...
    Looking for step log in /mnt/var/log/hadoop/steps/s-SGVW9B5LEXF5 on ec2-52-37-112-240.us-west-2.compute.amazonaws.com...
      Parsing step log: ssh://ec2-52-37-112-240.us-west-2.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-SGVW9B5LEXF5/syslog
    Counters: 9
    	Job Counters
    		Data-local map tasks=1
    		Failed map tasks=4
    		Launched map tasks=4
    		Other local map tasks=3
    		Total megabyte-seconds taken by all map tasks=58125312
    		Total time spent by all map tasks (ms)=75684
    		Total time spent by all maps in occupied slots (ms)=227052
    		Total time spent by all reduces in occupied slots (ms)=0
    		Total vcore-seconds taken by all map tasks=75684
    Scanning logs for probable cause of failure...
    Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1458861299388_0001 on ec2-52-37-112-240.us-west-2.compute.amazonaws.com and task/core nodes...
      Parsing task syslog: ssh://ec2-52-37-112-240.us-west-2.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1458861299388_0001/container_1458861299388_0001_01_000005/syslog
      Parsing task stderr: ssh://ec2-52-37-112-240.us-west-2.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1458861299388_0001/container_1458861299388_0001_01_000005/stderr
    Probable cause of failure:

    PipeMapRed failed!
    java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:330)
    	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:543)
    	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:81)
    	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:415)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)

    (from lines 37-50 of ssh://ec2-52-37-112-240.us-west-2.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1458861299388_0001/container_1458861299388_0001_01_000005/syslog)

    caused by:

    Traceback (most recent call last):
      File "mr_boom.py", line 10, in <module>
        MRBoom.run()
      File "/usr/lib/python3.4/dist-packages/mrjob/job.py", line 430, in run
        mr_job.execute()
      File "/usr/lib/python3.4/dist-packages/mrjob/job.py", line 439, in execute
        self.run_mapper(self.options.step_num)
      File "/usr/lib/python3.4/dist-packages/mrjob/job.py", line 499, in run_mapper
        for out_key, out_value in mapper_init() or ():
      File "mr_boom.py", line 7, in mapper_init
        raise Exception('BOOM')
    Exception: BOOM

    (from lines 1-12 of ssh://ec2-52-37-112-240.us-west-2.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1458861299388_0001/container_1458861299388_0001_01_000005/stderr)

    while reading input from s3://mrjob-35cdec11663cb1cb/tmp/mr_boom.davidmarin.20160324.231045.501027/files/README.rst


    Step 1 of 1 failed
    Killing our SSH tunnel (pid 52847)

Now you can fix the bug and try again, without having to wait for a new
cluster to bootstrap.

.. note::

   mrjob *can* fetch logs from persistent jobs even without SSH set up, but
   it has to pause 10 minutes to wait for EMR to transfer logs to S3, which
   defeats the purpose of rapid iteration.