Calling WarpScript from Pig
The integration of WarpScript in Pig is done via a User Defined Function (UDF) provided by the warp10-pig
package. This UDF, called WarpScriptRun
, allows the execution of WarpScript code in a FOREACH .. GENERATE
construct.
The parameters of this function are the WarpScript code to execute and the elements of the input Pig relation which will be pushed onto the stack (the first element ending on top of the stack). The resulting stack levels will be packed as a Pig tuple to form the resulting relation (again with the first element being the top of the stack).
WarpScript code stored in external files can be called using the @file.mc2
construct in place of actual WarpScript code.
Given that the call to WarpScriptRun
returns a Pig tuple, you may want to wrap the call to WarpScriptRun
inside a call to FLATTEN
to remove a level of nesting.
Defining WarpScriptRun
Before the WarpScriptRun
UDF can be used it must be defined in your Pig script. This is done using Pig's DEFINE
function:
-- Make the Warp 10 Pig jar known to Pig
REGISTER warp10-pig.jar;
DEFINE WarpScriptRun io.warp10.pig.WarpScriptRun(OPTIONS);
The optional OPTIONS
parameter to the WarpScriptRun
definition is a String which can contain the following elements separated by whitespaces:
Option | Description |
---|---|
-s xxx | Modifies the semantics of the stack. xxx can be one of SYNCHRONIZED , to use a single stack and synchronize its use among all accessing threads, PERTHREAD to allocate a stack per thread (the default behavior), or NEW to allocate a new stack whenever WarpScriptRun is used. |
--convbags | Indicates that Pig databags should be converted to Vector lists (VLIST ) in WarpScript. Vector lists can then be converted to lists using V-> . |
--noconvbags | Indicates that Pig databags should be pushed onto the stack as is. As they are iterables, they can be processed using FOREACH and their content converted using PIG-> . |
prop=value | Sets a system property prop with value value . |
Note that since the WarpScriptRun
function will execute WarpScript code, some configuration needs to be done, for example to define the time units WarpScript will use and possibly a set of extensions which should be loaded.
Such a configuration is done by setting warp10.config
in Pig (using SET
) to point to a Warp 10 configuration file. Such file should also be registered in Pig using REGISTER
so it is propagated to the various workers doing the actual processing.
Using WarpScriptRun
Once the function is known to Pig, it can be used to process data in the following way:
A = LOAD ... AS (foo: xxx, bar: yyy, ...)
B = FOREACH A GENERATE FLATTEN(WarpScriptRun('CODE',foo, bar)) AS (....);
In the example above, CODE
will be executed on a stack with the value of the field foo
on top and the value of field bar
below, for each tuple from relation A.
Working with schemas
Pig likes schemas, i.e. the more precisely the fields of a relation are defined, the easier it is for Pig to work. Defining a Pig Schema can be cumbersome, that's why WarpScript offers the PIGSCHEMA
function which will push onto the stack a string defining the schema of the stack content.
By using this function when you develop your WarpScript code, you can easily obtain the schemas to put in the AS
clauses of your Pig statements.