Spark runner options

All options from “Options available to all runners” and “Hadoop-related options” are available to the Spark runner.

In addition, the Spark runner shares a number of options with other runners.

Options unique to the Spark runner:

emulate_map_input_file (--emulate-map-input-file, --no-emulate-map-input-file) : boolean

Default: False

Imitate Hadoop by setting $mapreduce_map_input_file to the path of the input file for the current partition. This helps support jobs that rely on jobconf_from_env('mapreduce.map.input.file').

This feature only applies to the mapper of the job’s first step, and is ignored by jobs that set HADOOP_INPUT_FORMAT.

New in version 0.6.9.
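
For illustration, here is a minimal sketch of a job that relies on this option. The job class and input files are hypothetical, but jobconf_from_env() is the real helper from mrjob.compat:

    from mrjob.compat import jobconf_from_env
    from mrjob.job import MRJob


    class MRLinesPerFile(MRJob):
        # Hypothetical job: count lines per input file. On the Spark
        # runner this relies on --emulate-map-input-file, which sets
        # $mapreduce_map_input_file for the first step's mapper.

        def mapper(self, _, line):
            # jobconf_from_env() reads the jobconf variable from the
            # environment ($mapreduce_map_input_file in this case)
            yield jobconf_from_env('mapreduce.map.input.file'), 1

        def reducer(self, path, counts):
            yield path, sum(counts)


    if __name__ == '__main__':
        MRLinesPerFile.run()

You would run it with something like python mr_lines_per_file.py -r spark --emulate-map-input-file input1.txt input2.txt.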

gcs_region (--gcs-region) : string

Default: None

The region to use when creating a temporary bucket on Google Cloud Storage.

Similar in meaning to region, but only used to configure GCS (not S3).

s3_region (--s3-region) : string

Default: None

The region to use when creating a temporary bucket on S3.

Similar in meaning to region, but only used to configure S3 (not GCS).
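
For example, both regions can be set in mrjob.conf (the bucket regions below are illustrative):

    runners:
      spark:
        # only used if mrjob has to create a temporary bucket on GCS
        gcs_region: us-central1
        # only used if mrjob has to create a temporary bucket on S3
        s3_region: us-west-2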

skip_internal_protocol (--skip-internal-protocol, --no-skip-internal-protocol) : boolean

Default: False

Don’t emulate the job’s internal protocol (used for communicating between job steps and tasks in the same step), instead relying on Spark to encode and decode data structures.

This should work for most, but not all, jobs, and should make them run at least somewhat faster. Some things to keep in mind:

  • data will no longer be “normalized” by being converted to and from a string representation. For example, running a tuple through JSONProtocol (the default) implicitly converts it to a list, because JSON has no tuples. With internal protocols skipped, it remains a tuple (see the sketch below).
  • if your job uses SORT_VALUES, keep in mind that your values will need to be comparable, since Spark will compare them directly rather than comparing their internal-protocol-encoded representations. This may also affect sort order.

New in version 0.6.10.
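
The normalization described in the first bullet is easy to see by driving JSONProtocol by hand (a sketch; read() and write() are the protocol's actual interface, though the exact bytes may vary with the JSON library mrjob finds):

    from mrjob.protocol import JSONProtocol

    p = JSONProtocol()

    # write() encodes a key and a value as one tab-separated JSON line
    line = p.write('key', (1, 2))   # e.g. b'"key"\t[1, 2]'

    # read() decodes it again; the tuple comes back as a list,
    # since JSON has no tuple type
    key, value = p.read(line)
    assert value == [1, 2] and isinstance(value, list)

With skip_internal_protocol, values skip this round trip entirely, so a tuple stays a tuple from one step to the next.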

spark_tmp_dir (--spark-tmp-dir) : string

Default: (automatic)

A place to put files where they are visible to Spark executors, similar to cloud_tmp_dir.

If running locally, defaults to a directory inside local_tmp_dir, and if running on a cluster, to tmp/mrjob on HDFS.
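
As with the region options above, this can be set in mrjob.conf; the HDFS path here is illustrative:

    runners:
      spark:
        # must be visible to all Spark executors
        spark_tmp_dir: hdfs:///user/someuser/tmp/mrjob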