This vignette provides examples of how to use the xform_function transformation to create new data features for PMML models.
Given a xform_wrap object and a transformation expression, xform_function calculates data for a new feature and creates a new xform_wrap object. When PMML is produced with pmml::pmml(), the transformation is inserted into the LocalTransformations node as a DerivedField.
Multiple data fields and functions can be combined to produce a new feature.
The code below uses knitr::kable() to make tables more readable.
Using the iris dataset as an example, let’s construct a new feature by transforming one variable. Load the dataset and show the first few lines:
data(iris)
kable(head(iris,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
Create the iris_box object with xform_wrap:
iris_box <- xform_wrap(iris)
iris_box contains the data and transform information that will be used to produce PMML later. The original data is in iris_box$data. Any new features created with a transformation are added as columns to this data frame.
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
Transform and field information is in iris_box$field_data. The field_data data frame contains information on every field in the dataset, as well as every transform used. The xform_function column contains expressions used in the xform_function transform.
kable(iris_box$field_data)
field | type | dataType | orig_field_name | sampleMin | sampleMax | xformedMin | xformedMax | centers | scales | fieldsMap | transform | default | missingValue | xform_function |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Sepal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Petal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Petal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Species | original | factor | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Now add a new feature, Sepal.Length.Sqrt, using xform_function:
iris_box <- xform_function(iris_box, orig_field_name="Sepal.Length",
                           new_field_name="Sepal.Length.Sqrt",
                           expression="sqrt(Sepal.Length)")
The new feature is calculated and added as a column to the iris_box$data data frame:
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.Sqrt |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.258318 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 2.213594 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.167948 |
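As a quick sanity check, the derived column can be compared against the same calculation done directly in R. This is a minimal sketch, assuming the pmml package is installed; the names box and direct are illustrative and not part of the vignette.

```r
library(pmml)

# Rebuild the wrapper and the derived feature from scratch.
box <- xform_wrap(iris)
box <- xform_function(box,
                      orig_field_name = "Sepal.Length",
                      new_field_name = "Sepal.Length.Sqrt",
                      expression = "sqrt(Sepal.Length)")

# Compute the same feature with a vectorized call in plain R.
direct <- sqrt(iris$Sepal.Length)

# Compare the two, allowing for floating-point tolerance.
all.equal(as.numeric(box$data$Sepal.Length.Sqrt), direct)
```

If the two disagree, the expression string is the first place to look, since it is evaluated against each row of the wrapped data.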
iris_box$field_data now contains a new row with the transformation expression:
kable(iris_box$field_data[6,c(1:3,14)])
field | type | dataType | orig_field_name | xform_function |
---|---|---|---|---|
Sepal.Length.Sqrt | derived | numeric | Sepal.Length | sqrt(Sepal.Length) |
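Positional indices such as [6, c(1:3,14)] can be fragile if the set of columns changes. The sketch below selects the same cells by name instead. It assumes that the row names of field_data are the field names and that the columns are named as in the table above; both are inferred from the printed output rather than from documentation.

```r
library(pmml)

box <- xform_wrap(iris)
box <- xform_function(box,
                      orig_field_name = "Sepal.Length",
                      new_field_name = "Sepal.Length.Sqrt",
                      expression = "sqrt(Sepal.Length)")

# Columns of interest, referenced by name rather than by position.
cols <- c("type", "dataType", "orig_field_name", "xform_function")

# Row selected by the name of the derived field (assumed to be the row name).
derived_row <- box$field_data["Sepal.Length.Sqrt", cols]
derived_row
```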
Construct a linear model for Petal.Width using this new feature, and convert it to PMML:
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
Since the model predicts Petal.Width using a variable based on Sepal.Length, the PMML will contain these two fields in the DataDictionary and MiningSchema:
fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="2">
#> <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#> <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#> <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>
The LocalTransformations node contains Sepal.Length.Sqrt as a derived field:
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Sepal.Length.Sqrt" dataType="double" optype="continuous">
#> <Apply function="sqrt">
#> <FieldRef field="Sepal.Length"/>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
xform_function can also operate on categorical data. In this example, let’s create a numeric feature that equals 1 when Species is setosa, and 0 otherwise:
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box, orig_field_name="Species",
                           new_field_name="Species.Setosa",
                           expression="if (Species == 'setosa') {1} else {0}")
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
Create a linear model and check the LocalTransformations node:
fit <- lm(Petal.Width ~ Species.Setosa, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Species.Setosa" dataType="double" optype="continuous">
#> <Apply function="if">
#> <Apply function="equal">
#> <FieldRef field="Species"/>
#> <Constant dataType="string">setosa</Constant>
#> </Apply>
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">0</Constant>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
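For comparison, the same indicator can be computed in plain R with a vectorized ifelse(). This sketch uses only base R and is not how xform_function works internally; it simply shows what the row-wise if/else expression evaluates to across the whole column. The name species_setosa is illustrative.

```r
# Vectorized equivalent of "if (Species == 'setosa') {1} else {0}".
species_setosa <- ifelse(iris$Species == "setosa", 1, 0)

# First rows are all setosa, so the indicator is 1 for each of them.
head(species_setosa, 3)

# Count how many rows fall into each value of the indicator.
table(species_setosa)
```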
Several fields can be combined to create new features. Let’s make a new field from the ratio of sepal and petal lengths:
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box, orig_field_name="Sepal.Length,Petal.Length",
                           new_field_name="Length.Ratio",
                           expression="Sepal.Length / Petal.Length")
As before, the new field is added as a column to the iris_box$data data frame:
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Ratio |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 3.642857 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 3.500000 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 3.615385 |
Fit a linear model using this new feature, and convert it to PMML:
fit <- lm(Petal.Width ~ Length.Ratio, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
The PMML will contain Sepal.Length and Petal.Length in the DataDictionary and MiningSchema:
fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="3">
#> <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#> <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#> <DataField name="Petal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#> <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Petal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>
The LocalTransformations node contains Length.Ratio as a derived field:
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Length.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#> <FieldRef field="Sepal.Length"/>
#> <FieldRef field="Petal.Length"/>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
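When several input fields are involved, the comma-separated orig_field_name string can be assembled from a character vector instead of being typed by hand. A minimal sketch, assuming the pmml package is installed; input_fields and field_arg are illustrative names.

```r
library(pmml)

# Input fields kept as a character vector for readability.
input_fields <- c("Sepal.Length", "Petal.Length")

# Collapse into the single comma-separated string used in the examples above.
field_arg <- paste(input_fields, collapse = ",")

box <- xform_wrap(iris)
box <- xform_function(box,
                      orig_field_name = field_arg,
                      new_field_name = "Length.Ratio",
                      expression = "Sepal.Length / Petal.Length")
```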
It is possible to pass a feature derived with xform_function to another xform_function call. To do this, the second call to xform_function must list the original data field names (rather than the derived field name) in the orig_field_name argument.
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box, orig_field_name="Sepal.Length,Petal.Length",
                           new_field_name="Length.Ratio",
                           expression="Sepal.Length / Petal.Length")
iris_box <- xform_function(iris_box, orig_field_name="Sepal.Length,Petal.Length,Sepal.Width",
                           new_field_name="Length.R.Times.S.Width",
                           expression="Length.Ratio * Sepal.Width")
kable(iris_box$field_data[6:7,c(1:3,14)])
field | type | dataType | orig_field_name | xform_function |
---|---|---|---|---|
Length.Ratio | derived | numeric | Sepal.Length,Petal.Length | Sepal.Length / Petal.Length |
Length.R.Times.S.Width | derived | numeric | Sepal.Length,Petal.Length,Sepal.Width | Length.Ratio * Sepal.Width |
fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
The PMML will contain Sepal.Length, Petal.Length, and Sepal.Width in the DataDictionary and MiningSchema:
fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="4">
#> <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#> <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#> <DataField name="Petal.Length" optype="continuous" dataType="double"/>
#> <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#> <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Petal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> <MiningField name="Sepal.Width" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>
The LocalTransformations node contains Length.Ratio and Length.R.Times.S.Width as derived fields:
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Length.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#> <FieldRef field="Sepal.Length"/>
#> <FieldRef field="Petal.Length"/>
#> </Apply>
#> </DerivedField>
#> <DerivedField name="Length.R.Times.S.Width" dataType="double" optype="continuous">
#> <Apply function="*">
#> <FieldRef field="Length.Ratio"/>
#> <FieldRef field="Sepal.Width"/>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
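The chained calculation can be reproduced in plain R to see what values the second derived field should hold. This is a base R sketch for illustration only; the variable names are hypothetical.

```r
# First derived feature: ratio of sepal length to petal length.
length_ratio <- iris$Sepal.Length / iris$Petal.Length

# Second derived feature: built from the first one and an original field.
length_r_times_s_width <- length_ratio * iris$Sepal.Width

head(length_r_times_s_width, 3)
```

For the first row this is (5.1 / 1.4) * 3.5, which works out to 12.75.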
The resulting field can be numeric or factor. Note that factors are exported with dataType = "string" and optype = "categorical" in PMML. The following code creates a factor with 3 levels from Sepal.Length:
iris_box <- xform_wrap(iris)

iris_box <- xform_function(wrap_object = iris_box,
                           orig_field_name = "Sepal.Length",
                           new_field_name = "SL_factor",
                           new_field_data_type = "factor",
                           expression = "if(Sepal.Length<5.1) {'level_A'} else if (Sepal.Length>6.6) {'level_B'} else {'level_C'}")
kable(head(iris_box$data, 3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | SL_factor |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | level_C |
4.9 | 3.0 | 1.4 | 0.2 | setosa | level_A |
4.7 | 3.2 | 1.3 | 0.2 | setosa | level_A |
The feature can then be used to create a model as usual:
fit <- lm(Petal.Width ~ SL_factor, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
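To see how the three levels are assigned, the same thresholds can be applied in plain R with nested ifelse() calls. This is a base R sketch of the logic in the expression string, not of the package internals; sl and sl_factor are illustrative names.

```r
sl <- iris$Sepal.Length

# Below 5.1 is level_A, above 6.6 is level_B, everything else is level_C.
sl_factor <- factor(
  ifelse(sl < 5.1, "level_A",
         ifelse(sl > 6.6, "level_B", "level_C"))
)

head(sl_factor, 3)
levels(sl_factor)
```

Note that a value of exactly 5.1 falls into level_C, because the first comparison is strict.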
xform_function
The following R functions and operators are directly supported by xform_function. Their PMML equivalents are listed in the second column:
R | PMML |
---|---|
+ | + |
- | - |
/ | / |
* | * |
^ | pow |
< | lessThan |
<= | lessOrEqual |
> | greaterThan |
>= | greaterOrEqual |
&& | and |
& | and |
\| | or |
\|\| | or |
== | equal |
!= | notEqual |
! | not |
ceiling | ceil |
prod | product |
log | ln |
For these functions, no extra code is required for translation.
The R function prod can be used as long as only numeric arguments are specified. That is, prod can take an na.rm argument, but specifying this in xform_function directly will not produce PMML equivalent to the R expression.
Similarly, the R function log can be used directly as long as the second argument (the base) is not specified.
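One possible way around the base restriction is the change-of-base identity, which needs only the natural logarithm and division, both of which appear in the table above. The sketch below computes a base-2 logarithm this way. It assumes the pmml package is installed; the field name Sepal.Length.Log2 is hypothetical.

```r
library(pmml)

box <- xform_wrap(iris)

# log2(x) written as ln(x) / ln(2), so log() is never given a base argument.
box <- xform_function(box,
                      orig_field_name = "Sepal.Length",
                      new_field_name = "Sepal.Length.Log2",
                      expression = "log(Sepal.Length) / log(2)")

# Compare with R's own base-2 logarithm.
all.equal(as.numeric(box$data$Sepal.Length.Log2), log2(iris$Sepal.Length))
```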
xform_function
There are built-in functions defined in PMML that cannot be directly translated to PMML using xform_function as described above.
In this case, an error will be thrown when R tries to calculate a new feature using the function passed to xform_function, but does not see that function in the environment.
It is still possible to make xform_function work, but the PMML function must be defined in the R environment first.
Let’s use isIn, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. Detailed specification for this function is available on this DMG page.
One way to implement this in R is by using %in%, with the list of values being represented by ...:
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
isIn(1,2,1,4)
#> [1] TRUE
This function can now be passed to xform_function. The following code creates a feature that indicates whether Species is either setosa or versicolor:
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box, orig_field_name="Species",
                           new_field_name="Species.Setosa.or.Versicolor",
                           expression="isIn(Species,'setosa','versicolor')")
The iris_box$data data frame now contains the new feature:
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa.or.Versicolor |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
Create a linear model and view the corresponding PMML for the function:
fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Species.Setosa.or.Versicolor" dataType="double" optype="continuous">
#> <Apply function="isIn">
#> <FieldRef field="Species"/>
#> <Constant dataType="string">setosa</Constant>
#> <Constant dataType="string">versicolor</Constant>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
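Because %in% already returns a logical value, the if/else wrapper in the R definition of isIn is optional. A more compact definition, shown here as an alternative sketch in base R, behaves the same way for a single value of x:

```r
# Compact R stand-in for the PMML isIn function.
isIn <- function(x, ...) x %in% c(...)

isIn(1, 2, 1, 4)
#> [1] TRUE
isIn("virginica", "setosa", "versicolor")
#> [1] FALSE
```

The function name still has to be isIn, since that is the name written into the Apply element of the generated PMML.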
xform_function - another example
As another example, let’s use R’s mean function to create a new feature. PMML has a built-in avg, so we will define an R function with this name.
avg <- function(...) {
  dots <- c(...)
  return(mean(dots))
}
Now use this function to take an average of several other features and combine with another field:
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box, orig_field_name="Sepal.Length,Petal.Length,Sepal.Width",
                           new_field_name="Length.Average.Ratio",
                           expression="avg(Sepal.Length,Petal.Length)/Sepal.Width")
The iris_box$data data frame now contains the new feature:
kable(head(iris_box$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Average.Ratio |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.9285714 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.0500000 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.9375000 |
Create a simple linear model and view the corresponding PMML for the function:
fit <- lm(Petal.Width ~ Length.Average.Ratio, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#> <DerivedField name="Length.Average.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#> <Apply function="avg">
#> <FieldRef field="Sepal.Length"/>
#> <FieldRef field="Petal.Length"/>
#> </Apply>
#> <FieldRef field="Sepal.Width"/>
#> </Apply>
#> </DerivedField>
#> </LocalTransformations>
In the PMML, avg will be recognized as a valid function.
The function function_to_pmml (part of the pmml package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values.
As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression passed to xform_function are always assumed to be field names, and are not substituted with their values. That is, even if x has a value in the R environment, the resulting expression will still use x.
function_to_pmml("1 + 2")
#> <Apply function="+">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> </Apply>
x <- 3
function_to_pmml("foo(bar(x * y))")
#> <Apply function="foo">
#> <Apply function="bar">
#> <Apply function="*">
#> <FieldRef field="x"/>
#> <FieldRef field="y"/>
#> </Apply>
#> </Apply>
#> </Apply>
There are several limitations to parsing expressions in xform_function.
Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in xform_function.
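If a column-level statistic is needed, one possible workaround is to compute it in R beforehand and paste the resulting number into the expression string as a constant. The sketch below centers Sepal.Length this way. It assumes the pmml package is installed, the names sl_mean, expr and Sepal.Length.Centered are hypothetical, and the constant is fixed at the value computed from this dataset.

```r
library(pmml)

# Column-level statistic computed outside xform_function.
sl_mean <- mean(iris$Sepal.Length)

# Embed the number as a literal constant in the row-wise expression.
expr <- paste0("Sepal.Length - ", sl_mean)

box <- xform_wrap(iris)
box <- xform_function(box,
                      orig_field_name = "Sepal.Length",
                      new_field_name = "Sepal.Length.Centered",
                      expression = expr)

# The centered column should average to approximately zero.
mean(box$data$Sepal.Length.Centered)
```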
An expression such as foo(x) is treated as a function foo with argument x. Consequently, passing in an R vector c(1,2,3) will produce PMML where c is a function and 1,2,3 are the arguments:
function_to_pmml("c(1,2,3)")
#> <Apply function="c">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> <Constant dataType="double">3</Constant>
#> </Apply>
We can also see what happens when passing an na.rm argument to prod, as mentioned above:
function_to_pmml("prod(1,2,na.rm=FALSE)") #produces incorrect PMML
#> <Apply function="product">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> <Constant dataType="boolean">FALSE</Constant>
#> </Apply>
function_to_pmml("prod(1,2)") #produces correct PMML
#> <Apply function="product">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> </Apply>
Additionally, passing in a vector to prod produces incorrect PMML:
prod(c(1,2,3))
#> [1] 6
function_to_pmml("prod(c(1,2,3))")
#> <Apply function="product">
#> <Apply function="c">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> <Constant dataType="double">3</Constant>
#> </Apply>
#> </Apply>
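Following the pattern of the prod(1,2) example, one way to avoid the nested c call is to pass the values to prod as separate arguments. A minimal sketch, assuming the pmml package is installed; the variable node is illustrative.

```r
library(pmml)

# Separate arguments instead of a vector, so no c() call reaches the PMML.
node <- function_to_pmml("prod(1,2,3)")
node

# In R, both forms give the same product.
prod(1, 2, 3) == prod(c(1, 2, 3))
#> [1] TRUE
```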
The following are additional examples of PMML produced from R expressions.
Extra parentheses:
function_to_pmml("pmmlT(((1+2))*(x))")
#> <Apply function="pmmlT">
#> <Apply function="*">
#> <Apply function="+">
#> <Constant dataType="double">1</Constant>
#> <Constant dataType="double">2</Constant>
#> </Apply>
#> <FieldRef field="x"/>
#> </Apply>
#> </Apply>
If-else expressions:
function_to_pmml("if(a<2) {x+3} else if (a>4) {4} else {5}")
#> <Apply function="if">
#> <Apply function="lessThan">
#> <FieldRef field="a"/>
#> <Constant dataType="double">2</Constant>
#> </Apply>
#> <Apply function="+">
#> <FieldRef field="x"/>
#> <Constant dataType="double">3</Constant>
#> </Apply>
#> <Apply function="if">
#> <Apply function="greaterThan">
#> <FieldRef field="a"/>
#> <Constant dataType="double">4</Constant>
#> </Apply>
#> <Constant dataType="double">4</Constant>
#> <Constant dataType="double">5</Constant>
#> </Apply>
#> </Apply>