Calling WarpScript from Pig
The integration of WarpScript in Pig is done via a User Defined Function (UDF) provided by the
warp10-pig package. This UDF, called
WarpScriptRun, allows the execution of WarpScript code in a
FOREACH .. GENERATE construct.
The parameters of this function are the WarpScript code to execute and the elements of the input Pig relation which will be pushed onto the stack (the first element ending on top of the stack). The resulting stack levels will be packed as a Pig tuple to form the resulting relation (again with the first element being the top of the stack).
WarpScript code stored in external files can be called using the
@file.mc2 construct in place of actual WarpScript code.
Given that the call to
WarpScriptRun returns a Pig tuple, you may want to wrap the call to
WarpScriptRun inside a call to
FLATTEN to remove a level of nesting.
WarpScriptRun UDF can be used it must be defined in your Pig script. This is done using Pig's
-- Make the Warp 10 Pig jar known to Pig REGISTER warp10-pig.jar; DEFINE WarpScriptRun io.warp10.pig.WarpScriptRun(OPTIONS);
OPTIONS parameter to the
WarpScriptRun definition is a String which can contain the following elements separated by whitespaces:
|Modifies the semantics of the stack. |
|Indicates that Pig databags should be converted to Vector lists (|
|Indicates that Pig databags should be pushed onto the stack as is. As they are iterables, they can be processed using |
|Sets a system property |
Note that since the
WarpScriptRun function will execute WarpScript code, some configuration needs to be done, for example to define the time units WarpScript will use and possibly a set of extensions which should be loaded.
Such a configuration is done by setting
warp10.config in Pig (using
SET) to point to a Warp 10 configuration file. Such file should also be registered in Pig using
REGISTER so it is propagated to the various workers doing the actual processing.
Once the function is known to Pig, it can be used to process data in the following way:
A = LOAD ... AS (foo: xxx, bar: yyy, ...) B = FOREACH A GENERATE FLATTEN(WarpScriptRun('CODE',foo, bar)) AS (....);
In the example above,
CODE will be executed on a stack with the value of the field
foo on top and the value of field
bar below, for each tuple from relation A.
Working with schemas
Pig likes schemas, i.e. the more precisely the fields of a relation are defined, the easier it is for Pig to work. Defining a Pig Schema can be cumbersome, that's why WarpScript offers the
PIGSCHEMA function which will push onto the stack a string defining the schema of the stack content.
By using this function when you develop your WarpScript code, you can easily obtain the schemas to put in the
AS clauses of your Pig statements.