What is Spark?

Apache Spark is a general purpose data processing engine which offers a paradigm where data is mostly cached in memory thus speeding up iterative computations.

Spark complements the Hadoop ecosystem, enabling the building of data applications which leverage the vast corpus of libraries and input formats available for Hadoop.

Computations in Spark are defined by building a DAG (Directed Acyclic Graph) of operations (edges) and datasets (vertices). This graph is then translated into an execution plan which will perform the actual computations on a single machine or a cluster of workers.

Where can WarpScript be used in Spark?

WarpScript code execution can be inserted into Spark DAGs in filter, map and flatmap operations.

The warp10-spark2 package defines the following functions (in package io.warp10.spark):

All those functions accept a single parameter which is the WarpScript code to execute. If you want to reference WarpScript code which resides in an external file, use the @file.mc2 syntax for the code.

This code will be called with the elements they apply to pushed onto the stack. The resulting stack will be returned back to the caller, make sure that the number of levels on the stack and the types of the elements they contain are compatible with the expectations of your DAG.

Another way of invoking WarpScript code in a Spark application is via Spark SQL. For this type of usage, please refer to the PySpark documentation.

Running your Spark job

In order for your job to be able to call the WarpScript related functions, you must add the warp10-spark2 jar to the list of jars to include in your job via the --jars options to spark-submit and all the referenced .mc2 files via the --files option.