Calling WarpScript from Pig

The integration of WarpScript in Pig is done via a User Defined Function (UDF) provided by the warp10-pig package. This UDF, called WarpScriptRun, allows the execution of WarpScript code in a FOREACH .. GENERATE construct.

The parameters of this function are the WarpScript code to execute and the elements of the input Pig relation which will be pushed onto the stack (the first element ending on top of the stack). The resulting stack levels will be packed as a Pig tuple to form the resulting relation (again with the first element being the top of the stack).

WarpScript code stored in external files can be called using the @file.mc2 construct in place of actual WarpScript code.

Given that the call to WarpScriptRun returns a Pig tuple, you may want to wrap the call to WarpScriptRun inside a call to FLATTEN to remove a level of nesting.

Defining WarpScriptRun

Before the WarpScriptRun UDF can be used it must be defined in your Pig script. This is done using Pig's DEFINE function:

-- Make the Warp 10 Pig jar known to Pig
REGISTER warp10-pig.jar;

DEFINE WarpScriptRun io.warp10.pig.WarpScriptRun(OPTIONS);

The optional OPTIONS parameter to the WarpScriptRun definition is a String which can contain the following elements separated by whitespaces:

-s xxxModifies the semantics of the stack. xxx can be one of SYNCHRONIZED, to use a single stack and synchronize its use among all accessing threads, PERTHREAD to allocate a stack per thread (the default behavior), or NEW to allocate a new stack whenever WarpScriptRun is used.
--convbagsIndicates that Pig databags should be converted to Vector lists (VLIST) in WarpScript. Vector lists can then be converted to lists using V->.
--noconvbagsIndicates that Pig databags should be pushed onto the stack as is. As they are iterables, they can be processed using FOREACH and their content converted using PIG->.
prop=valueSets a system property prop with value value.

Note that since the WarpScriptRun function will execute WarpScript code, some configuration needs to be done, for example to define the time units WarpScript will use and possibly a set of extensions which should be loaded.

Such a configuration is done by setting warp10.config in Pig (using SET) to point to a Warp 10 configuration file. Such file should also be registered in Pig using REGISTER so it is propagated to the various workers doing the actual processing.

Using WarpScriptRun

Once the function is known to Pig, it can be used to process data in the following way:

A = LOAD ... AS (foo: xxx, bar: yyy, ...)

B = FOREACH A GENERATE FLATTEN(WarpScriptRun('CODE',foo, bar)) AS (....);

In the example above, CODE will be executed on a stack with the value of the field foo on top and the value of field bar below, for each tuple from relation A.

Working with schemas

Pig likes schemas, i.e. the more precisely the fields of a relation are defined, the easier it is for Pig to work. Defining a Pig Schema can be cumbersome, that's why WarpScript offers the PIGSCHEMA function which will push onto the stack a string defining the schema of the stack content.

By using this function when you develop your WarpScript code, you can easily obtain the schemas to put in the AS clauses of your Pig statements.