Contributing to mrjob
Contribution guidelines
mrjob is developed using a standard GitHub pull request process. Almost all code is reviewed in pull requests.
The general process for working on mrjob is:
- Fork the project on GitHub
- Clone your fork to your local machine
- Create a feature branch from master (e.g. git branch delete_all_the_code)
- Write code, commit often
- Write test cases for all changed functionality
- Submit a pull request against master on GitHub
- Wait for code review!
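The branch-and-commit steps above can be rehearsed locally. A minimal sketch using a throwaway repository in a temp directory (a real contribution would clone your GitHub fork of mrjob instead, and the branch name is just an example):

```shell
# Rehearse the workflow in a throwaway repo (no network needed).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "initial commit"

# Create a feature branch off the default branch
git checkout -q -b delete_all_the_code

# ...write code, commit often...
echo "pass" > change.py
git add change.py
git -c user.email=dev@example.com -c user.name=dev \
    commit -q -m "describe what changed and why"

# Ready to push the branch and open a pull request against master
git log --oneline
```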
It would also help to discuss your ideas on the mailing list so we can warn you of possible merge conflicts with ongoing work or offer suggestions for where to put code.
Things that will make your branch more likely to be pulled:
- Comprehensive, fast test cases
- Detailed explanation of what the change is and how it works
- References to relevant issue numbers in the tracker
- API backward compatibility
If you add a new configuration option, please try to do all of these things:
- Add command line switches that allow full control over the option
- Document the option and its switches in the appropriate file under docs
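The point of "full control" is that a user can set the option in a config file and still override it from the command line. A generic sketch of that precedence using only argparse (the option name and helper are invented for illustration; this is not mrjob's actual internals):

```python
import argparse

# Hypothetical built-in default for a hypothetical option
DEFAULTS = {'image_version': '2.4.2'}


def resolve_option(name, config, cli_args):
    """Resolve an option value with precedence:
    command line switch > config file > built-in default."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--' + name.replace('_', '-'), dest=name)
    args, _ = parser.parse_known_args(cli_args)

    cli_value = getattr(args, name)
    if cli_value is not None:
        return cli_value
    if name in config:
        return config[name]
    return DEFAULTS.get(name)


print(resolve_option('image_version', {}, []))                          # 2.4.2
print(resolve_option('image_version', {'image_version': '3.0'}, []))    # 3.0
print(resolve_option('image_version', {}, ['--image-version', '4.1']))  # 4.1
```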
A quick tour through the code
mrjob’s modules can be put in five categories:

- Reading command line arguments and config files, and invoking machinery accordingly
  - mrjob.conf: Read config files
  - mrjob.launch: Invoke runners based on the command line and configs
  - mrjob.options: Define command line options
- Interacting with Hadoop Streaming
  - mrjob.job: Python interface for writing jobs
  - mrjob.protocol: Define data formats between Python steps
- Runners and support; submitting the job to various MapReduce environments
  - mrjob.runner: Common functionality across runners
  - mrjob.hadoop: Submit jobs to Hadoop
  - mrjob.step: Define/implement the interface between runners and script steps
  - Local:
    - mrjob.inline: Run Python-only jobs in-process
    - mrjob.local: Run Hadoop Streaming-only jobs in subprocesses
  - Google Cloud Dataproc:
    - mrjob.dataproc: Submit jobs to Dataproc
  - Amazon Elastic MapReduce:
    - mrjob.emr: Submit jobs to EMR
    - mrjob.pool: Utilities for cluster pooling functionality
    - mrjob.retry: Wrapper for S3 and EMR connections to handle recoverable errors
- Interacting with different “filesystems”
  - mrjob.fs.base: Common functionality
  - mrjob.fs.composite: Support multiple filesystems; if one fails, “fall through” to another
  - mrjob.fs.gcs: Google Cloud Storage
  - mrjob.fs.hadoop: HDFS
  - mrjob.fs.local: Local filesystem
  - mrjob.fs.s3: S3
  - mrjob.fs.ssh: SSH
- Utilities
  - mrjob.compat: Transparently handle differences between Hadoop versions
  - mrjob.logs: Log interpretation (counters, probable cause of job failure)
  - mrjob.parse: Parsing utilities for URIs, command line options, etc.
  - mrjob.util: Utilities for dealing with files, command line options, and various other things