mrjob.step - represent Job Steps

mrjob.step.INPUT = '<input>'

If passed as an argument to JarStep or SparkScriptStep, it’ll be replaced with the step’s input path(s) (if there are multiple paths, they’ll be joined with commas).

class mrjob.step.JarStep(jar, **kwargs)

Represents running a custom Jar as a step.

Accepts the following keyword arguments:

Parameters:
  • jar – The local path to the Jar. On EMR, this can also be an s3:// URI, or file:// to reference a jar on the local filesystem of your EMR instance(s).
  • args – (optional) A list of arguments to the jar. Use mrjob.step.INPUT and OUTPUT to interpolate input and output paths.
  • jobconf – (optional) A dictionary of Hadoop properties
  • main_class – (optional) The main class to run from the jar. If not specified, Hadoop will use the main class in the jar’s manifest file.

jar can also be passed as a positional argument

See Jar steps for sample usage.
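
For example, a minimal sketch of a job whose only step runs a custom jar (the jar path, main class, and arguments are illustrative placeholders):

from mrjob.job import MRJob
from mrjob.step import INPUT, OUTPUT, JarStep

class MRJarExample(MRJob):

    def steps(self):
        # 'my-filter.jar' and 'com.example.Filter' are hypothetical;
        # INPUT and OUTPUT are replaced with the step's actual
        # input/output paths at run time
        return [
            JarStep(
                jar='my-filter.jar',
                main_class='com.example.Filter',
                args=[INPUT, OUTPUT],
            )
        ]

if __name__ == '__main__':
    MRJarExample.run()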

class mrjob.step.MRStep(**kwargs)

Represents steps handled by the script containing your job.

Used by MRJob.steps. See Multi-step jobs for sample usage.

Accepts the following keyword arguments:

Parameters:
  • mapper – function with same function signature as mapper(), or None for an identity mapper.
  • reducer – function with same function signature as reducer(), or None for no reducer.
  • combiner – function with same function signature as combiner(), or None for no combiner.
  • mapper_init – function with same function signature as mapper_init(), or None for no initial mapper action.
  • mapper_final – function with same function signature as mapper_final(), or None for no final mapper action.
  • reducer_init – function with same function signature as reducer_init(), or None for no initial reducer action.
  • reducer_final – function with same function signature as reducer_final(), or None for no final reducer action.
  • combiner_init – function with same function signature as combiner_init(), or None for no initial combiner action.
  • combiner_final – function with same function signature as combiner_final(), or None for no final combiner action.
  • jobconf – dictionary with custom jobconf arguments to pass to Hadoop.
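
For example, a two-step job might construct its steps from MRStep like this (the word-count logic is illustrative):

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # send (count, word) pairs to a single reducer
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # emit the word with the highest count
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()
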
description(step_num)

Returns a dictionary representation of this step:

{
    'type': 'streaming',
    'mapper': { ... },
    'combiner': { ... },
    'reducer': { ... },
    'jobconf': dictionary of Hadoop configuration properties
}

jobconf is optional, and at least one of mapper, combiner, and reducer must be included; the others may be omitted.

mapper, combiner, and reducer are themselves dictionaries. Each is either handled by the script containing your job definition:

{
    'type': 'script',
    'pre_filter': (optional) cmd to pass input through, as a string
}

or they simply run a command:

{
    'type': 'command',
    'command': command to run, as a string
}
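
For instance, a step constructed as MRStep(mapper=self.mapper, reducer=self.reducer) might describe itself roughly as follows (a sketch; the exact dictionary depends on how the step is defined):

{
    'type': 'streaming',
    'mapper': {'type': 'script'},
    'reducer': {'type': 'script'},
}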

See Format of --steps for examples.

mrjob.step.OUTPUT = '<output>'

If this is passed as an argument to JarStep or SparkScriptStep, it’ll be replaced with the step’s output path.

class mrjob.step.SparkJarStep(jar, main_class, **kwargs)

Represents running a separate Jar through Spark

Accepts the following keyword arguments:

Parameters:
  • jar – The local path to the jar to run. On EMR, this can also be an s3:// URI, or file:// to reference a jar on the local filesystem of your EMR instance(s).
  • main_class – Your application’s main class (e.g. 'org.apache.spark.examples.SparkPi')
  • args – (optional) A list of arguments to the script. Use mrjob.step.INPUT and OUTPUT to interpolate input and output paths.
  • jobconf – (optional) A dictionary of Hadoop properties
  • spark_args – (optional) an array of arguments to pass to spark-submit (e.g. ['--executor-memory', '2G']).

jar and main_class can also be passed as positional arguments
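
For example, a sketch of a job whose only step submits a jar to Spark (the jar path is an illustrative placeholder; the main class is Spark's bundled Pi example):

from mrjob.job import MRJob
from mrjob.step import SparkJarStep

class MRSparkPi(MRJob):

    def steps(self):
        # 'spark-examples.jar' is a hypothetical local path
        return [
            SparkJarStep(
                jar='spark-examples.jar',
                main_class='org.apache.spark.examples.SparkPi',
                args=['10'],
                spark_args=['--executor-memory', '2G'],
            )
        ]

if __name__ == '__main__':
    MRSparkPi.run()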

class mrjob.step.SparkScriptStep(script, **kwargs)

Represents running a separate Python script through Spark

Accepts the following keyword arguments:

Parameters:
  • script – The local path to the Python script to run. On EMR, this can also be an s3:// URI, or file:// to reference a script on the local filesystem of your EMR instance(s).
  • args – (optional) A list of arguments to the script. Use mrjob.step.INPUT and OUTPUT to interpolate input and output paths.
  • jobconf – (optional) A dictionary of Hadoop properties
  • spark_args – (optional) an array of arguments to pass to spark-submit (e.g. ['--executor-memory', '2G']).

script can also be passed as a positional argument
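
For example, a sketch of a job that runs a standalone Spark script (my_spark_script.py is an illustrative placeholder):

from mrjob.job import MRJob
from mrjob.step import INPUT, OUTPUT, SparkScriptStep

class MRSparkScriptExample(MRJob):

    def steps(self):
        # the script path is hypothetical; INPUT and OUTPUT are
        # replaced with the step's actual paths at run time
        return [
            SparkScriptStep(
                script='my_spark_script.py',
                args=[INPUT, OUTPUT],
            )
        ]

if __name__ == '__main__':
    MRSparkScriptExample.run()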

class mrjob.step.SparkStep(spark, **kwargs)

Represents running a Spark step defined in your job.

Accepts the following keyword arguments:

Parameters:
  • spark – function containing your Spark code with same function signature as spark()
  • jobconf – (optional) A dictionary of Hadoop properties
  • spark_args – (optional) an array of arguments to pass to spark-submit (e.g. ['--executor-memory', '2G']).
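
For example, a sketch of a job with a single Spark step defined inline (the word-count logic is illustrative):

from mrjob.job import MRJob
from mrjob.step import SparkStep

class MRSparkWordcount(MRJob):

    def steps(self):
        return [SparkStep(spark=self.spark_wordcount)]

    def spark_wordcount(self, input_path, output_path):
        # import pyspark inside the method so the job can still be
        # instantiated on machines where Spark isn't installed
        from pyspark import SparkContext

        sc = SparkContext(appName='mrjob spark wordcount')
        (sc.textFile(input_path)
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b)
           .saveAsTextFile(output_path))
        sc.stop()

if __name__ == '__main__':
    MRSparkWordcount.run()
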
exception mrjob.step.StepFailedException(reason=None, step_num=None, num_steps=None, step_desc=None)

Exception to throw when a step fails.

This will automatically be caught and converted to an error message by mrjob.job.MRJob.run(), but you may wish to catch it if you run your job programmatically.
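
For example, a sketch of catching it when running a job from another Python program (MRMyJob and my_jobs are hypothetical):

from mrjob.step import StepFailedException

from my_jobs import MRMyJob  # hypothetical module containing your job

job = MRMyJob(args=['-r', 'local', 'input.txt'])
with job.make_runner() as runner:
    try:
        runner.run()
    except StepFailedException as e:
        # e.g. log the failure and continue instead of crashing
        print('step failed: %s' % e)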