Contributing to mrjob

Contribution guidelines

mrjob is developed using a standard GitHub pull request process. Almost all code is reviewed in pull requests.

The general process for working on mrjob is:

  • Fork the project on GitHub
  • Clone your fork to your local machine
  • Create a feature branch from master and switch to it (e.g. git checkout -b delete_all_the_code)
  • Write code, commit often
  • Write test cases for all changed functionality
  • Submit a pull request against master on GitHub
  • Wait for code review!
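The steps above look roughly like this in the shell (YOUR_USERNAME is a placeholder for your GitHub account; the branch name is just an example):

```shell
# Clone your fork and create a feature branch off master
git clone https://github.com/YOUR_USERNAME/mrjob.git
cd mrjob
git checkout -b delete_all_the_code

# ... write code and tests, committing often ...
git add -A
git commit -m "describe what changed and why"

# Push the branch to your fork, then open a pull request
# against master on GitHub
git push origin delete_all_the_code
```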

It also helps to discuss your ideas on the mailing list first, so we can warn you about possible merge conflicts with ongoing work or suggest where to put your code.

Things that will make your branch more likely to be pulled:

  • Comprehensive, fast test cases
  • A detailed explanation of what the change is and how it works
  • References to relevant issue numbers in the tracker
  • API backward compatibility

If you add a new configuration option, please try to do all of these things:

  • Add command line switches that allow full control over the option
  • Document the option and its switches in the appropriate file under docs
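The idea is that anything settable in a config file should also be fully controllable from the command line, with the switch taking precedence. Here is a minimal, self-contained sketch of that pattern using plain argparse; the option name my_new_opt and the switch --my-new-opt are made up for illustration and are not mrjob's actual internals:

```python
import argparse

# Hypothetical option: it can come from a config file, but a
# command-line switch must allow full control over it as well.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--my-new-opt', dest='my_new_opt', default=None,
    help='value for the (hypothetical) my_new_opt option')

# Pretend the config file set a value for the option
opts = {'my_new_opt': 'value-from-conf'}

args = parser.parse_args(['--my-new-opt', 'value-from-switch'])

# A switch, if given, overrides the config-file value
if args.my_new_opt is not None:
    opts['my_new_opt'] = args.my_new_opt

print(opts['my_new_opt'])
```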

A quick tour through the code

mrjob’s modules can be put in five categories:

  • Reading command line arguments and config files, and invoking machinery accordingly
    • mrjob.conf: Read config files
    • mrjob.launch: Invoke runners based on command line and configs
    • mrjob.options: Define command line options
  • Interacting with Hadoop Streaming
  • Runners and support: submitting the job to various MapReduce environments
    • mrjob.runner: Common functionality across runners
    • mrjob.hadoop: Submit jobs to Hadoop
    • mrjob.step: Define/implement interface between runners and script steps
    • Local
    • Google Cloud Dataproc
    • Amazon Elastic MapReduce
      • mrjob.emr: Submit jobs to EMR
      • mrjob.pool: Utilities for cluster pooling functionality
      • mrjob.retry: Wrapper for S3 and EMR connections to handle recoverable errors
      • mrjob.ssh: Run commands on EMR cluster machines
  • Interacting with different “filesystems”
    • mrjob.fs.base: Common functionality
    • mrjob.fs.composite: Support multiple filesystems; if one fails, “fall through” to another
    • mrjob.fs.gcs: Google Cloud Storage
    • mrjob.fs.hadoop: HDFS
    • mrjob.fs.local: Local filesystem
    • mrjob.fs.s3: S3
    • mrjob.fs.ssh: SSH
  • Utilities
    • mrjob.compat: Transparently handle differences between Hadoop versions
    • mrjob.logs: Log interpretation (counters, probable cause of job failure)
    • mrjob.parse: Parsing utilities for URIs, command line options, etc.
    • mrjob.util: Utilities for dealing with files, command line options, various other things