Contributing to mrjob

Contribution guidelines

mrjob is developed using a standard GitHub pull request process. Almost all code is reviewed in pull requests.

The general process for working on mrjob is:

  • Fork the project on GitHub
  • Clone your fork to your local machine
  • Create a feature branch from master and switch to it (e.g. git checkout -b delete_all_the_code)
  • Write code, commit often
  • Write test cases for all changed functionality
  • Submit a pull request against master on GitHub
  • Wait for code review!
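The steps above look roughly like this in the shell (YOUR_USERNAME is a placeholder for your GitHub account; the branch name is just an example):

```shell
# Clone your fork and create a feature branch off master
git clone https://github.com/YOUR_USERNAME/mrjob.git
cd mrjob
git checkout -b delete_all_the_code

# ... write code and tests, committing often ...
git add -A
git commit -m "describe what changed and why"

# Push the branch to your fork, then open a pull request
# against master on GitHub
git push origin delete_all_the_code
```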

It also helps to discuss your ideas on the mailing list first, so we can warn you about possible merge conflicts with ongoing work or suggest where to put your code.

Things that will make your branch more likely to be pulled:

  • Comprehensive, fast test cases
  • A detailed explanation of what the change is and how it works
  • References to relevant issue numbers in the tracker
  • API backward compatibility

If you add a new configuration option, please try to do all of these things:

  • Add command line switches that allow full control over the option
  • Document the option and its switches in the appropriate file under docs
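The idea is that anything settable in a config file should also be fully controllable from the command line, with the switch taking precedence. Here is a minimal, self-contained sketch of that pattern using plain argparse; the option name my_new_opt and the switch --my-new-opt are made up for illustration and are not mrjob's actual internals:

```python
import argparse

# Hypothetical option: it can come from a config file, but a
# command-line switch must allow full control over it as well.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--my-new-opt', dest='my_new_opt', default=None,
    help='value for the (hypothetical) my_new_opt option')

# Pretend the config file set a value for the option
opts = {'my_new_opt': 'value-from-conf'}

args = parser.parse_args(['--my-new-opt', 'value-from-switch'])

# A switch, if given, overrides the config-file value
if args.my_new_opt is not None:
    opts['my_new_opt'] = args.my_new_opt

print(opts['my_new_opt'])
```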

A quick tour through the code

mrjob’s modules can be put in five categories:

  • Reading command line arguments and config files, and invoking machinery accordingly
    • mrjob.conf: Read config files
    • mrjob.launch: Invoke runners based on command line and configs
    • mrjob.options: Define command line options
  • Interacting with Hadoop Streaming
  • Runners and support: submitting the job to various MapReduce environments
    • mrjob.runner: Common functionality across runners
    • mrjob.hadoop: Submit jobs to Hadoop
    • mrjob.step: Define/implement interface between runners and script steps
    • Local
    • Google Cloud Dataproc
    • Amazon Elastic MapReduce
      • mrjob.emr: Submit jobs to EMR
      • mrjob.pool: Utilities for cluster pooling functionality
      • mrjob.retry: Wrapper for S3 and EMR connections to handle recoverable errors
      • mrjob.ssh: Run commands on EMR cluster machines
  • Interacting with different “filesystems”
    • mrjob.fs.base: Common functionality
    • mrjob.fs.composite: Support multiple filesystems; if one fails, “fall through” to another
    • mrjob.fs.gcs: Google Cloud Storage
    • mrjob.fs.hadoop: HDFS
    • mrjob.fs.local: Local filesystem
    • mrjob.fs.s3: S3
    • mrjob.fs.ssh: SSH
  • Utilities
    • mrjob.compat: Transparently handle differences between Hadoop versions
    • mrjob.logs: Log interpretation (counters, probable cause of job failure)
    • mrjob.parse: Parsing utilities for URIs, command line options, etc.
    • mrjob.util: Utilities for dealing with files, command line options, various other things