Executing business rules at scale using Rdrools - an interface to Drools

Naren Srinivasan, Dheekshitha PS

7/4/2018

Introduction

Objectives of Rdrools

The Rdrools package aims to accomplish two main objectives:

The advantages of a rule engine

Rule engines allow efficient checking of rules against data for large rule sets (of the order of hundreds or even thousands of rules). Drools (like other rule engines) implements an enhanced version of the Rete algorithm, which efficiently matches facts (data tuples) against conditions (rules). This makes it possible to codify intuition and business context, which can then be used to power intelligent systems.
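
To make the idea of matching facts against conditions concrete, here is a minimal base R sketch that checks every fact against every rule by brute force. It is not the Rete algorithm and does not reflect how Drools is implemented; the facts, the rules list and the matchRules helper are hypothetical and exist only for illustration. A Rete-based engine avoids this repeated work by sharing the evaluation of common conditions across rules.

```r
# Facts: each row of a data frame is one fact (data tuple).
# The values are invented for illustration.
facts <- data.frame(
  id     = 1:4,
  amount = c(120, 9500, 40, 15000),
  risk   = c("Low", "High", "Low", "Low"),
  stringsAsFactors = FALSE
)

# Rules: named conditions, each a function of the facts that returns
# one TRUE/FALSE per fact. (Hypothetical rules, not part of Rdrools.)
rules <- list(
  largeAmount = function(f) f$amount >= 9000,
  highRisk    = function(f) f$risk == "High"
)

# Naive matching: evaluate every rule against every fact.
matchRules <- function(facts, rules) {
  flags <- lapply(rules, function(rule) rule(facts))
  cbind(facts["id"], as.data.frame(flags))
}

matched <- matchRules(facts, rules)
print(matched)
```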

Why Rdrools

Rdrools brings the efficiencies of large-scale production rule systems to data science users. Rule sets can be used alone, or in conjunction with machine learning models, to develop and operationalize intelligent systems. Rdrools allows rules defined through an R interface to be deployed into a production system. As data comes in (periodically or in real time), a pre-defined set of rules can be checked against the data, and actions can be triggered based on the result.

Running rules on Rdrools

Executing rules on a dataset

To give data scientists an intuitive interface for executing rules on datasets, the Rdrools package exposes the executeRulesOnDataset function. As input to this function, rules are defined using the typical language of data science, with verbs such as filter, group by, and aggregate.

For ease of use, the rules can be defined in a CSV file and imported into the R session through the usual read functions. The required format follows a familiar structure using the verbs discussed earlier. We take the iris dataset as an example and define rules on it. The sample rules for the iris dataset are provided in the irisRules data object (for the purpose of this example).

data("iris")
data("irisRules")
sampleRules <- irisRules
rownames(sampleRules) <- seq_len(nrow(sampleRules))
sampleRules[is.na(sampleRules)] <- ""
sampleRules
##                   Filters GroupBy       Column Function Operation
## 1     Species == 'setosa'                                        
## 2                         Species Sepal.Length  average        >=
## 3                                 Sepal.Length  average         <
## 4         Sepal.Width > 3         Sepal.Length  average        >=
## 5       Petal.Width > 0.4 Species Petal.Length  average         <
## 6                                 Petal.Length  compare        >=
## 7 Species == 'versicolor'         Petal.Length  compare        >=
##      Argument
## 1            
## 2         5.9
## 3           5
## 4           5
## 5           5
## 6 Sepal.Width
## 7 Sepal.Width
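
A rules table in the same six-column layout can also be read from CSV text with base R. The sketch below assumes the column names shown above (Filters, GroupBy, Column, Function, Operation, Argument); the three rules and the exampleRules name are illustrative and are not the contents of irisRules.

```r
# CSV text with one rule per line; empty fields mean "not used".
csvLines <- c(
  "Filters,GroupBy,Column,Function,Operation,Argument",
  "Species == 'setosa',,,,,",
  ",Species,Sepal.Length,average,>=,5.9",
  "Sepal.Width > 3,,Sepal.Length,average,>=,5"
)

# Read everything as character so thresholds and column names
# can share the Argument column.
exampleRules <- read.csv(
  text = paste(csvLines, collapse = "\n"),
  colClasses = "character",
  stringsAsFactors = FALSE
)

# Defensive step mirroring the vignette: replace any NA with "".
exampleRules[is.na(exampleRules)] <- ""

print(exampleRules)
```

With a file on disk, the same call would take a path in place of the text argument.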

Through this function, various typical types of rules can be executed with a combination of the verbs described above.

Note: to plot the counts of facts passing or failing each rule, we have defined a function internal to the vignette called ‘plotgraphs’.

Applying a simple filter

The first type of rule applies a simple filter based on a condition on a particular column. This is done by specifying the full condition in the Filters column.

In the case of the iris dataset, we select a specific Species. To illustrate this case, we apply only rule 1.

filterRule <- sampleRules[1,]
filterRule
              Filters GroupBy Column Function Operation Argument
1 Species == 'setosa'                                           
filterRuleOutput <- executeRulesOnDataset(iris, filterRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),(Double.valueOf($conditio"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(filterRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr "Species == 'setosa'"
  .. ..$ GroupBy  : chr ""
  .. ..$ Column   : chr ""
  .. ..$ Function : chr ""
  .. ..$ Operation: chr ""
  .. ..$ Argument : chr ""
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput: list()
  ..$ output            :'data.frame':  150 obs. of  3 variables:
  .. ..$ Group  : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ Indices: int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ IsTrue : chr [1:150] "true" "true" "true" "true" ...

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: is an empty list as there is no grouped aggregation
  • output: has the output data frame with 3 columns:
    • Group: the above rule has no group by and hence the rule is applied row-wise. Group, in this case, represents the row number
    • Indices: the row numbers of the data frame
    • IsTrue: flag indicating whether the data point satisfies the rule. In this case, IsTrue is true if the Species is setosa and false if not
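
As a cross-check, the same condition can be evaluated directly in base R. The sketch below only mimics the shape of the output data frame (Group, Indices, IsTrue) described above; the check object is an illustration and is not produced by Rdrools.

```r
data("iris")

# Base R equivalent of the filter condition Species == 'setosa',
# evaluated row by row.
isSetosa <- iris$Species == "setosa"

# Mimic the three columns of the 'output' data frame; IsTrue is stored
# as lower-case text to match the "true"/"false" values shown above.
check <- data.frame(
  Group   = seq_len(nrow(iris)),
  Indices = seq_len(nrow(iris)),
  IsTrue  = tolower(as.character(isSetosa)),
  stringsAsFactors = FALSE
)

print(table(check$IsTrue))
```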

Plotting graphs of the result obtained

The output can be visualized by plotting the distribution of true and false values in the output. Here, true represents the points which satisfy the rule, i.e., Species == 'setosa', and false represents the points which do not.

anomaliesCountGraph <- plotgraphs(result=filterRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]
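
The plotgraphs helper is internal to the vignette and is not shown here. A rough stand-in for the count plot can be sketched with base graphics; countFlags and plotFlagCounts are hypothetical names, and the flags are recomputed from iris rather than taken from the package output.

```r
data("iris")

# Count how many entries of an IsTrue-style vector are "true" and "false".
countFlags <- function(isTrue) {
  table(factor(isTrue, levels = c("true", "false")))
}

# Draw the counts as a bar chart using base graphics.
plotFlagCounts <- function(isTrue, plotName) {
  barplot(countFlags(isTrue), main = plotName,
          xlab = "IsTrue", ylab = "Count")
}

# Flags for the setosa filter, recomputed directly from iris.
flags  <- ifelse(iris$Species == "setosa", "true", "false")
counts <- countFlags(flags)
print(counts)
```

Assuming the structure printed by str() above, a call such as plotFlagCounts(filterRuleOutput[[1]]$output$IsTrue, "Plot of points distribution") would plot the real rule output.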

Applying a condition on aggregated grouped data

The second type of rule applies a condition to the aggregated value of a metric for different groups. In the case of the iris dataset, we aggregate the Sepal.Length variable for each Species, and identify the Species which have an average Sepal.Length greater than or equal to a threshold value.

To illustrate this case, we apply only rule 2 from the set of sample rules.

groupedAggregationRule <- sampleRules[2,]
groupedAggregationRule
  Filters GroupBy       Column Function Operation Argument
2         Species Sepal.Length  average        >=      5.9
groupedAggregationRuleOutput <- executeRulesOnDataset(iris, groupedAggregationRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                                 from accumulate($condition:HashMap(Species == input.get(\"Sp"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result>=5.9);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(groupedAggregationRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr ""
  .. ..$ GroupBy  : chr "Species"
  .. ..$ Column   : chr "SepalLength"
  .. ..$ Function : chr "average"
  .. ..$ Operation: chr ">="
  .. ..$ Argument : chr "5.9"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput:Classes 'tbl_df', 'tbl' and 'data.frame':  3 obs. of  3 variables:
  .. ..$ Species   : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ Rule1     : chr [1:3] "false" "true" "true"
  .. ..$ Rule1Value: num [1:3] 5.01 5.94 6.59
  ..$ output            :Classes 'tbl_df', 'tbl' and 'data.frame':  3 obs. of  3 variables:
  .. ..$ Group  : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ Indices: chr [1:3] "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,"| __truncated__ "51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,"| __truncated__ "101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128"| __truncated__
  .. ..$ IsTrue : chr [1:3] "false" "true" "true"

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: has each group (group by column) with its corresponding flag (true/false) and aggregated value
  • output: has the output data frame with 3 columns:
    • Group: the above rule has a group by condition and hence the rule is applied to each group. Group, in this case, represents the values of the column on which the group by condition was applied i.e. Species
    • Indices: the row numbers from the dataset present in each group
    • IsTrue: flag indicating whether the group satisfies the rule. In this case, IsTrue is true if the aggregated value of the group is greater than or equal to the threshold value and false if it is not
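
The grouped averages reported in intermediateOutput can be reproduced in base R as a sanity check. This is a sketch of the equivalent computation, not the code Rdrools runs; groupMeans is an illustrative name.

```r
data("iris")

# Average Sepal.Length per Species (the GroupBy, Column and Function parts).
groupMeans <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Apply the Operation and Argument (>= 5.9) to each group's average.
groupMeans$IsTrue <- groupMeans$Sepal.Length >= 5.9

print(groupMeans)
```

Rounded to two decimals, the averages agree with the Rule1Value column above (5.01, 5.94, 6.59), and only setosa fails the rule.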

Plotting graphs of the result obtained

anomalousSetGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups, i.e., the Species, for which the average of Sepal.Length is greater than or equal to 5.9. The Y-axis shows the average Sepal.Length for each Species.

The plot below shows the number of groups which satisfied the rule. As we can see from above, 2 of the 3 groups satisfy the rule, and hence true has a count of 2.

anomaliesCountGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying an aggregation on a column

This type of rule allows the data scientist to aggregate an entire column and compare the result with a threshold value. In the case of the iris dataset, we aggregate the Sepal.Length variable across all cases, and check if it is less than a threshold value.

To illustrate this case, we apply only rule 3 from the set of sample rules.

columnAggregationRule <- sampleRules[3,]
columnAggregationRule
  Filters GroupBy       Column Function Operation Argument
3                 Sepal.Length  average         <        5
columnAggregationRuleOutput <- executeRulesOnDataset(iris, columnAggregationRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),average(Double.valueOf($c"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result<5);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(columnAggregationRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr ""
  .. ..$ GroupBy  : chr ""
  .. ..$ Column   : chr "SepalLength"
  .. ..$ Function : chr "average"
  .. ..$ Operation: chr "<"
  .. ..$ Argument : chr "5"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput: list()
  ..$ output            :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  3 variables:
  .. ..$ Group  : num 1
  .. ..$ Indices: chr "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,"| __truncated__
  .. ..$ IsTrue : chr "false"

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: is an empty list as there is no grouped aggregation
  • output: has the output data frame with 3 columns:
    • Group: the above rule has no group by and no filter, so the rule is applied to the whole column. Group, in this case, represents the whole column
    • Indices: the row numbers of the whole data frame
    • IsTrue: flag indicating whether the rule is satisfied. In this case, IsTrue is true if the aggregated value is less than the threshold value and false if not
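
The same whole-column check takes only a couple of lines of base R. This is an equivalent computation for cross-checking, with illustrative variable names.

```r
data("iris")

# Average of Sepal.Length over all 150 rows (no filter, no group by).
overallMean <- mean(iris$Sepal.Length)

# Operation '<' with Argument 5.
ruleHolds <- overallMean < 5

print(overallMean)
print(ruleHolds)
```

The overall average is about 5.84, so the rule is not satisfied, which matches the "false" flag in the output above.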

Applying a filter with aggregation

In this case, we apply a filter and then, on the filtered data, aggregate a column and compare it to a threshold value. In the case of the iris dataset, we check whether, for cases with Sepal.Width > 3, the average Sepal.Length is greater than or equal to 5.

To illustrate this case, we apply only rule 4 from the set of sample rules.

filterColAggregationRule <- sampleRules[4,]
filterColAggregationRule
          Filters GroupBy       Column Function Operation Argument
4 Sepal.Width > 3         Sepal.Length  average        >=        5
filterColAggregationRuleOutput <- executeRulesOnDataset(iris, filterColAggregationRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),average(Double.valueOf($c"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result>=5);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(filterColAggregationRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr "SepalWidth > 3"
  .. ..$ GroupBy  : chr ""
  .. ..$ Column   : chr "SepalLength"
  .. ..$ Function : chr "average"
  .. ..$ Operation: chr ">="
  .. ..$ Argument : chr "5"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput: list()
  ..$ output            :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  3 variables:
  .. ..$ Group  : num 1
  .. ..$ Indices: chr "1,3,4,5,6,7,8,10,11,12,15,16,17,18,19,20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37,38,40,41,43,44,45,47,4"| __truncated__
  .. ..$ IsTrue : chr "true"

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: is an empty list as there is no grouped aggregation
  • output: has the output data frame with 3 columns:
    • Group: the above rule has no group by and hence the rule is applied to the whole column after filtering the data. Group, in this case, represents the whole column
    • Indices: the row numbers of the filtered data frame on applying the condition
    • IsTrue: flag indicating whether the rule is satisfied. In this case, IsTrue is true if the aggregated value of the filtered data is greater than or equal to 5 and false if it is not
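
A base R sketch of the same filter-then-aggregate logic is shown below as a cross-check; the variable names are illustrative.

```r
data("iris")

# Filter: keep rows with Sepal.Width > 3.
keep         <- iris$Sepal.Width > 3
filteredRows <- which(keep)

# Aggregate the filtered rows, then compare with the threshold (>= 5).
filteredMean <- mean(iris$Sepal.Length[keep])
ruleHolds    <- filteredMean >= 5

print(head(filteredRows))
print(ruleHolds)
```

The first few filtered row numbers (1, 3, 4, 5, 6, ...) line up with the start of the Indices string in the output above.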

Applying a filter with grouped aggregation

We now combine all types of verbs into one rule. In the iris dataset, we check whether, for all cases with Petal.Width greater than a threshold value, each type of Species (which is a group) has an average Petal.Length less than another threshold.

To illustrate this case, we apply only rule 5 from the set of sample rules.

filterGroupByAggrRule <- sampleRules[5,]
filterGroupByAggrRule
            Filters GroupBy       Column Function Operation Argument
5 Petal.Width > 0.4 Species Petal.Length  average         <        5
filterGroupByAggrRuleOutput <- executeRulesOnDataset(iris, filterGroupByAggrRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                                 from accumulate($condition:HashMap(Species == input.get(\"Sp"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result<5);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(filterGroupByAggrRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr "PetalWidth > 0.4"
  .. ..$ GroupBy  : chr "Species"
  .. ..$ Column   : chr "PetalLength"
  .. ..$ Function : chr "average"
  .. ..$ Operation: chr "<"
  .. ..$ Argument : chr "5"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput:Classes 'tbl_df', 'tbl' and 'data.frame':  3 obs. of  3 variables:
  .. ..$ Species   : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ Rule1     : chr [1:3] "true" "true" "false"
  .. ..$ Rule1Value: num [1:3] 1.65 4.26 5.55
  ..$ output            :Classes 'tbl_df', 'tbl' and 'data.frame':  3 obs. of  3 variables:
  .. ..$ Group  : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ Indices: chr [1:3] "24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44" "51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,"| __truncated__ "101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128"| __truncated__
  .. ..$ IsTrue : chr [1:3] "true" "true" "false"

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: has the groups (group by column) and their corresponding flag (true/false)
  • output: has the output data frame with 3 columns:
    • Group: the above rule has a group by and filter. Hence the rule is applied to each group after filtering the data. Group, in this case, represents the grouped by column i.e. Species
    • Indices: the row numbers of the data frame present in each group after filtering
    • IsTrue: flag indicating whether the group satisfies the rule. In this case, IsTrue is true if the aggregated value for the group after filtering is less than the threshold and false if not
anomalousSetGraph<-plotgraphs(result=filterGroupByAggrRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups, i.e., the Species, for which the average of Petal.Length is less than 5. The Y-axis shows the average Petal.Length for each Species.
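
The filter, group by and aggregation steps of rule 5 can be reproduced in base R to confirm the values in intermediateOutput. This is a sketch of the equivalent computation with illustrative names, not the package's own code.

```r
data("iris")

# Filter: keep rows with Petal.Width > 0.4.
filtered <- iris[iris$Petal.Width > 0.4, ]

# Group by Species and average Petal.Length within each group.
groupMeans <- aggregate(Petal.Length ~ Species, data = filtered, FUN = mean)

# Operation '<' with Argument 5.
groupMeans$IsTrue <- groupMeans$Petal.Length < 5

print(groupMeans)
```

Rounded to two decimals, the group averages agree with Rule1Value above (1.65, 4.26, 5.55), and only virginica fails the rule.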

Applying a condition to compare columns

Here we compare values of two columns. In the case of the iris dataset, we compare the Petal.Length with Sepal.Width, and identify the rows which have a Petal.Length greater than or equal to Sepal.Width.

To illustrate this case, we apply only rule 6 from the set of sample rules.

compareColumnsRule <- sampleRules[6,]
compareColumnsRule
  Filters GroupBy       Column Function Operation    Argument
6                 Petal.Length  compare        >= Sepal.Width
compareColumnsRuleOutput <- executeRulesOnDataset(iris, compareColumnsRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),compare(Double.valueOf($c"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result>=SepalWidth);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(compareColumnsRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr ""
  .. ..$ GroupBy  : chr ""
  .. ..$ Column   : chr "PetalLength"
  .. ..$ Function : chr "compare"
  .. ..$ Operation: chr ">="
  .. ..$ Argument : chr "SepalWidth"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput: list()
  ..$ output            :'data.frame':  150 obs. of  3 variables:
  .. ..$ Group  : int [1:150] 51 52 53 54 55 56 57 58 59 60 ...
  .. ..$ Indices: int [1:150] 51 52 53 54 55 56 57 58 59 60 ...
  .. ..$ IsTrue : chr [1:150] "true" "true" "true" "true" ...

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: is an empty list as there is no aggregation
  • output: has the output data frame with 3 columns:
    • Group: the above rule has no group by and filter. Hence the rule is applied to each row. Group, in this case, represents the row
    • Indices: each row number
    • IsTrue: flag indicating whether the data point satisfies the rule. In this case, IsTrue is true if Petal.Length is greater than or equal to Sepal.Width and false if not
anomaliesCountGraph<-plotgraphs(result=compareColumnsRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]
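
The row-wise comparison can be checked with a single vectorised expression in base R. The names below are illustrative and the snippet is independent of Rdrools.

```r
data("iris")

# Row-wise comparison of two columns: Petal.Length >= Sepal.Width.
isTrue <- iris$Petal.Length >= iris$Sepal.Width

# Cross-tabulate the flag by Species to see where the rule holds.
flagBySpecies <- table(iris$Species, isTrue)
print(flagBySpecies)
```

No setosa row satisfies the comparison, which is consistent with the Indices column above listing row 51 onwards first among the "true" rows.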

Applying a filter and comparing columns

Here we compare values of two columns after filtering the dataset. In the case of the iris dataset, we keep only the versicolor Species, compare Petal.Length with Sepal.Width, and identify the rows which have a Petal.Length greater than or equal to Sepal.Width.

To illustrate this case, we apply only rule 7 from the set of sample rules.

compareFilterRule <- sampleRules[7,]
compareFilterRule
                  Filters GroupBy       Column Function Operation
7 Species == 'versicolor'         Petal.Length  compare        >=
     Argument
7 Sepal.Width
compareFilterRuleOutput <- executeRulesOnDataset(iris, compareFilterRule)
List of 1
 $ :List of 20
  ..$ : chr "import java.util.HashMap"
  ..$ : chr "import java.lang.Double"
  ..$ : chr "global java.util.HashMap output"
  ..$ : chr ""
  ..$ : chr "  dialect \"mvel\""
  ..$ : chr "rule \"Rule1\""
  ..$ : chr "       salience 0"
  ..$ : chr "       when"
  ..$ : chr "        input: HashMap()"
  ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),compare(Double.valueOf($c"| __truncated__
  ..$ : chr "then"
  ..$ : chr "output.put('SepalLength',input.get('SepalLength'));"
  ..$ : chr "output.put('SepalWidth',input.get('SepalWidth'));"
  ..$ : chr "output.put('PetalLength',input.get('PetalLength'));"
  ..$ : chr "output.put('PetalWidth',input.get('PetalWidth'));"
  ..$ : chr "output.put('Species',input.get('Species'));"
  ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
  ..$ : chr "output.put(\"Rule1\",result>=SepalWidth);"
  ..$ : chr "output.put('Rule1Value',result);"
  ..$ : chr "end"
str(compareFilterRuleOutput)
List of 1
 $ :List of 3
  ..$ input             :Classes 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  7 variables:
  .. ..$ Filters  : chr "Species == 'versicolor'"
  .. ..$ GroupBy  : chr ""
  .. ..$ Column   : chr "PetalLength"
  .. ..$ Function : chr "compare"
  .. ..$ Operation: chr ">="
  .. ..$ Argument : chr "SepalWidth"
  .. ..$ ruleNum  : int 1
  ..$ intermediateOutput: list()
  ..$ output            :'data.frame':  150 obs. of  3 variables:
  .. ..$ Group  : int [1:150] 51 52 53 54 55 56 57 58 59 60 ...
  .. ..$ Indices: int [1:150] 51 52 53 54 55 56 57 58 59 60 ...
  .. ..$ IsTrue : chr [1:150] "true" "true" "true" "true" ...

The output has three objects:

  • input: has the rule defined by the user in a data frame
  • intermediateOutput: is an empty list as there is no aggregation
  • output: has the output data frame with 3 columns:
    • Group: the above rule has no group by condition. Hence the rule is applied to each row. Group, in this case, represents the row
    • Indices: each row number
    • IsTrue: flag indicating whether the data point satisfies the rule. In this case, IsTrue is true if the Petal.Length of a versicolor row is greater than or equal to its Sepal.Width and false if not
anomaliesCountGraph<-plotgraphs(result=compareFilterRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]
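
For the filtered comparison, a base R sketch restricted to the versicolor rows looks as follows. How Rdrools flags the rows removed by the filter is not shown in this vignette, so the sketch covers only the rows that pass the filter; the names are illustrative.

```r
data("iris")

# Filter: keep only the versicolor rows (original row numbers 51 to 100).
versicolor <- iris[iris$Species == "versicolor", ]

# Compare the two columns on the filtered rows.
isTrue <- versicolor$Petal.Length >= versicolor$Sepal.Width

print(table(isTrue))
```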

Use case

We now consider a more business-specific problem, where such a rule system might be deployed.

Problem statement

Consider the customers of a retail bank, who make transactions against their bank account for different purposes such as shopping, money transfers, etc. In the banking system, there is a huge potential for fraud. Typically, abnormal transaction behavior is a strong indicator of fraud.

We explore how such transactions can be monitored intelligently to detect fraud using Rdrools by applying business rules.

Details of the dataset

The dataset used here provides transaction data for multiple customers of the retail bank (identified by their Account IDs). Every transaction made by a user (account) is recorded along with details such as the date, tender type, amount and resulting balance.

data("transactionData")
transactionData$Date <- lubridate::ymd(transactionData$Date)
transactionData <- transactionData[1:500,]

The structure of the loaded dataset (first 500 rows) is shown below:

'data.frame':   500 obs. of  16 variables:
 $ Account_ID                     : chr  "2266 97472609" "2266 97472609" "2266 97472609" "2266 97472609" ...
 $ Customer_ID                    : chr  "HS 10003669" "HS 10003669" "HS 10003669" "HS 10003669" ...
 $ Month                          : chr  "Trans_Month1" "Trans_Month1" "Trans_Month1" "Trans_Month1" ...
 $ Product_Type                   : chr  "Savings Account" "Savings Account" "Savings Account" "Savings Account" ...
 $ No_of_transactions             : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Account_open_date              : chr  "2016-03-22" "2016-03-22" "2016-03-22" "2016-03-22" ...
 $ Transaction_ID                 : int  81993859 50847914 58383961 31707922 14904755 23169362 26156823 59730045 83134275 8863921 ...
 $ Date                           : Date, format: "2016-10-03" "2016-10-06" ...
 $ account_month                  : chr  "226697472609Trans_Month1" "226697472609Trans_Month1" "226697472609Trans_Month1" "226697472609Trans_Month1" ...
 $ trans_tender_type              : chr  "Overseas Transfer" "Domestic Transfer" "Cash Withdrawal" "Bill Payment" ...
 $ Credit_Debit_Indicator         : chr  "Credit" "Credit" "Debit" "Debit" ...
 $ Total_Transactions             : int  7 7 7 7 7 7 1 7 2 3 ...
 $ Transaction_Amount             : num  1262 141 700 739 600 ...
 $ Balance                        : num  1262 1403 791 52 -512 ...
 $ risk_level                     : chr  "Low" "Low" "Low" "Low" ...
 $ Credit_Card_Monthly_Expenditure: num  0 0 0 0 0 0 0 0 0 0 ...

Defining the rules file

There might be certain cases where we simply want to check the behavior of customers based on a constant benchmark value. These might be cases such as compliance and policy violations, etc.

In our case, we check rules such as the following:

data("transactionRules")
rownames(transactionRules) <- seq_len(nrow(transactionRules))
transactionRules[is.na(transactionRules)] <- ""
transactionRules
                                                   Filters
1                  Credit_Card_Monthly_Expenditure > 28000
2                                                         
3                                                         
4                                      risk_level == 'Low'
5 Date > '2017-05-01' && Credit_Debit_Indicator == 'Debit'
           GroupBy             Column Function Operation Argument
1                                                                
2 Account_ID,Month Transaction_Amount      sum        >=    10000
3                  Total_Transactions      max         >        5
4                  Transaction_Amount  average         <     1400
5       Account_ID Transaction_Amount      sum        >=    40000

One example from the list above, used to flag anomalous transactions, is rule 5, which considers only debit transactions dated after 2017-05-01:

\[\textsf{For an account, the total Transaction_Amount } \\ \textsf{should be greater than or equal to USD 40,000}\]
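
To show what rule 5 computes, the sketch below applies the same filter, grouping and threshold in base R to a handful of made-up transactions. The toy data frame is invented for illustration and is not taken from transactionData; only the column names follow the dataset described above.

```r
# Toy transactions, invented for illustration (not from transactionData).
toy <- data.frame(
  Account_ID             = c("A1", "A1", "A1", "A2", "A2"),
  Date                   = as.Date(c("2017-05-10", "2017-06-02", "2017-04-20",
                                     "2017-05-15", "2017-07-01")),
  Credit_Debit_Indicator = c("Debit", "Debit", "Debit", "Debit", "Credit"),
  Transaction_Amount     = c(25000, 18000, 50000, 12000, 90000),
  stringsAsFactors = FALSE
)

# Filters: Date > '2017-05-01' && Credit_Debit_Indicator == 'Debit'
keep <- toy$Date > as.Date("2017-05-01") &
  toy$Credit_Debit_Indicator == "Debit"

# GroupBy Account_ID, Function sum on Transaction_Amount.
totals <- aggregate(Transaction_Amount ~ Account_ID,
                    data = toy[keep, ], FUN = sum)

# Operation '>=' with Argument 40000.
totals$IsTrue <- totals$Transaction_Amount >= 40000

print(totals)
```

In this toy data, account A1 totals 43000 after filtering (its April debit is excluded) and satisfies the rule, while A2 totals 12000 and does not.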

Executing rules on the dataset

We now take the entire set of rules and execute it on the transaction data as follows:

transactionDataOutput  <- executeRulesOnDataset(transactionData, transactionRules)
## List of 1
##  $ :List of 31
##   ..$ : chr "import java.util.HashMap"
##   ..$ : chr "import java.lang.Double"
##   ..$ : chr "global java.util.HashMap output"
##   ..$ : chr ""
##   ..$ : chr "  dialect \"mvel\""
##   ..$ : chr "rule \"Rule1\""
##   ..$ : chr "       salience 0"
##   ..$ : chr "       when"
##   ..$ : chr "        input: HashMap()"
##   ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),(Double.valueOf($conditio"| __truncated__
##   ..$ : chr "then"
##   ..$ : chr "output.put('AccountID',input.get('AccountID'));"
##   ..$ : chr "output.put('CustomerID',input.get('CustomerID'));"
##   ..$ : chr "output.put('Month',input.get('Month'));"
##   ..$ : chr "output.put('ProductType',input.get('ProductType'));"
##   ..$ : chr "output.put('Nooftransactions',input.get('Nooftransactions'));"
##   ..$ : chr "output.put('Accountopendate',input.get('Accountopendate'));"
##   ..$ : chr "output.put('TransactionID',input.get('TransactionID'));"
##   ..$ : chr "output.put('Date',input.get('Date'));"
##   ..$ : chr "output.put('accountmonth',input.get('accountmonth'));"
##   ..$ : chr "output.put('transtendertype',input.get('transtendertype'));"
##   ..$ : chr "output.put('CreditDebitIndicator',input.get('CreditDebitIndicator'));"
##   ..$ : chr "output.put('TotalTransactions',input.get('TotalTransactions'));"
##   ..$ : chr "output.put('TransactionAmount',input.get('TransactionAmount'));"
##   ..$ : chr "output.put('Balance',input.get('Balance'));"
##   ..$ : chr "output.put('risklevel',input.get('risklevel'));"
##   ..$ : chr "output.put('CreditCardMonthlyExpenditure',input.get('CreditCardMonthlyExpenditure'));"
##   ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
##   ..$ : chr "output.put(\"Rule1\",result);"
##   ..$ : chr "output.put('Rule1Value',result);"
##   ..$ : chr "end"
## List of 1
##  $ :List of 31
##   ..$ : chr "import java.util.HashMap"
##   ..$ : chr "import java.lang.Double"
##   ..$ : chr "global java.util.HashMap output"
##   ..$ : chr ""
##   ..$ : chr "  dialect \"mvel\""
##   ..$ : chr "rule \"Rule2\""
##   ..$ : chr "       salience 0"
##   ..$ : chr "       when"
##   ..$ : chr "        input: HashMap()"
##   ..$ :List of 1
##   .. ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(AccountID==input.get(\"Acco"| __truncated__
##   ..$ : chr "then"
##   ..$ : chr "output.put('AccountID',input.get('AccountID'));"
##   ..$ : chr "output.put('CustomerID',input.get('CustomerID'));"
##   ..$ : chr "output.put('Month',input.get('Month'));"
##   ..$ : chr "output.put('ProductType',input.get('ProductType'));"
##   ..$ : chr "output.put('Nooftransactions',input.get('Nooftransactions'));"
##   ..$ : chr "output.put('Accountopendate',input.get('Accountopendate'));"
##   ..$ : chr "output.put('TransactionID',input.get('TransactionID'));"
##   ..$ : chr "output.put('Date',input.get('Date'));"
##   ..$ : chr "output.put('accountmonth',input.get('accountmonth'));"
##   ..$ : chr "output.put('transtendertype',input.get('transtendertype'));"
##   ..$ : chr "output.put('CreditDebitIndicator',input.get('CreditDebitIndicator'));"
##   ..$ : chr "output.put('TotalTransactions',input.get('TotalTransactions'));"
##   ..$ : chr "output.put('TransactionAmount',input.get('TransactionAmount'));"
##   ..$ : chr "output.put('Balance',input.get('Balance'));"
##   ..$ : chr "output.put('risklevel',input.get('risklevel'));"
##   ..$ : chr "output.put('CreditCardMonthlyExpenditure',input.get('CreditCardMonthlyExpenditure'));"
##   ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
##   ..$ : chr "output.put(\"Rule2\",result>=10000);"
##   ..$ : chr "output.put('Rule2Value',result);"
##   ..$ : chr "end"
## List of 1
##  $ :List of 31
##   ..$ : chr "import java.util.HashMap"
##   ..$ : chr "import java.lang.Double"
##   ..$ : chr "global java.util.HashMap output"
##   ..$ : chr ""
##   ..$ : chr "  dialect \"mvel\""
##   ..$ : chr "rule \"Rule3\""
##   ..$ : chr "       salience 0"
##   ..$ : chr "       when"
##   ..$ : chr "        input: HashMap()"
##   ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),max(Double.valueOf($condi"| __truncated__
##   ..$ : chr "then"
##   ..$ : chr "output.put('AccountID',input.get('AccountID'));"
##   ..$ : chr "output.put('CustomerID',input.get('CustomerID'));"
##   ..$ : chr "output.put('Month',input.get('Month'));"
##   ..$ : chr "output.put('ProductType',input.get('ProductType'));"
##   ..$ : chr "output.put('Nooftransactions',input.get('Nooftransactions'));"
##   ..$ : chr "output.put('Accountopendate',input.get('Accountopendate'));"
##   ..$ : chr "output.put('TransactionID',input.get('TransactionID'));"
##   ..$ : chr "output.put('Date',input.get('Date'));"
##   ..$ : chr "output.put('accountmonth',input.get('accountmonth'));"
##   ..$ : chr "output.put('transtendertype',input.get('transtendertype'));"
##   ..$ : chr "output.put('CreditDebitIndicator',input.get('CreditDebitIndicator'));"
##   ..$ : chr "output.put('TotalTransactions',input.get('TotalTransactions'));"
##   ..$ : chr "output.put('TransactionAmount',input.get('TransactionAmount'));"
##   ..$ : chr "output.put('Balance',input.get('Balance'));"
##   ..$ : chr "output.put('risklevel',input.get('risklevel'));"
##   ..$ : chr "output.put('CreditCardMonthlyExpenditure',input.get('CreditCardMonthlyExpenditure'));"
##   ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
##   ..$ : chr "output.put(\"Rule3\",result>5);"
##   ..$ : chr "output.put('Rule3Value',result);"
##   ..$ : chr "end"
## List of 1
##  $ :List of 31
##   ..$ : chr "import java.util.HashMap"
##   ..$ : chr "import java.lang.Double"
##   ..$ : chr "global java.util.HashMap output"
##   ..$ : chr ""
##   ..$ : chr "  dialect \"mvel\""
##   ..$ : chr "rule \"Rule4\""
##   ..$ : chr "       salience 0"
##   ..$ : chr "       when"
##   ..$ : chr "        input: HashMap()"
##   ..$ : chr "result: Double()\n                               from accumulate($condition:HashMap(),average(Double.valueOf($c"| __truncated__
##   ..$ : chr "then"
##   ..$ : chr "output.put('AccountID',input.get('AccountID'));"
##   ..$ : chr "output.put('CustomerID',input.get('CustomerID'));"
##   ..$ : chr "output.put('Month',input.get('Month'));"
##   ..$ : chr "output.put('ProductType',input.get('ProductType'));"
##   ..$ : chr "output.put('Nooftransactions',input.get('Nooftransactions'));"
##   ..$ : chr "output.put('Accountopendate',input.get('Accountopendate'));"
##   ..$ : chr "output.put('TransactionID',input.get('TransactionID'));"
##   ..$ : chr "output.put('Date',input.get('Date'));"
##   ..$ : chr "output.put('accountmonth',input.get('accountmonth'));"
##   ..$ : chr "output.put('transtendertype',input.get('transtendertype'));"
##   ..$ : chr "output.put('CreditDebitIndicator',input.get('CreditDebitIndicator'));"
##   ..$ : chr "output.put('TotalTransactions',input.get('TotalTransactions'));"
##   ..$ : chr "output.put('TransactionAmount',input.get('TransactionAmount'));"
##   ..$ : chr "output.put('Balance',input.get('Balance'));"
##   ..$ : chr "output.put('risklevel',input.get('risklevel'));"
##   ..$ : chr "output.put('CreditCardMonthlyExpenditure',input.get('CreditCardMonthlyExpenditure'));"
##   ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
##   ..$ : chr "output.put(\"Rule4\",result<1400);"
##   ..$ : chr "output.put('Rule4Value',result);"
##   ..$ : chr "end"
## List of 1
##  $ :List of 31
##   ..$ : chr "import java.util.HashMap"
##   ..$ : chr "import java.lang.Double"
##   ..$ : chr "global java.util.HashMap output"
##   ..$ : chr ""
##   ..$ : chr "  dialect \"mvel\""
##   ..$ : chr "rule \"Rule5\""
##   ..$ : chr "       salience 0"
##   ..$ : chr "       when"
##   ..$ : chr "        input: HashMap()"
##   ..$ : chr "result: Double()\n                                 from accumulate($condition:HashMap(AccountID == input.get(\""| __truncated__
##   ..$ : chr "then"
##   ..$ : chr "output.put('AccountID',input.get('AccountID'));"
##   ..$ : chr "output.put('CustomerID',input.get('CustomerID'));"
##   ..$ : chr "output.put('Month',input.get('Month'));"
##   ..$ : chr "output.put('ProductType',input.get('ProductType'));"
##   ..$ : chr "output.put('Nooftransactions',input.get('Nooftransactions'));"
##   ..$ : chr "output.put('Accountopendate',input.get('Accountopendate'));"
##   ..$ : chr "output.put('TransactionID',input.get('TransactionID'));"
##   ..$ : chr "output.put('Date',input.get('Date'));"
##   ..$ : chr "output.put('accountmonth',input.get('accountmonth'));"
##   ..$ : chr "output.put('transtendertype',input.get('transtendertype'));"
##   ..$ : chr "output.put('CreditDebitIndicator',input.get('CreditDebitIndicator'));"
##   ..$ : chr "output.put('TotalTransactions',input.get('TotalTransactions'));"
##   ..$ : chr "output.put('TransactionAmount',input.get('TransactionAmount'));"
##   ..$ : chr "output.put('Balance',input.get('Balance'));"
##   ..$ : chr "output.put('risklevel',input.get('risklevel'));"
##   ..$ : chr "output.put('CreditCardMonthlyExpenditure',input.get('CreditCardMonthlyExpenditure'));"
##   ..$ : chr "output.put('rowNumber',input.get('rowNumber'));"
##   ..$ : chr "output.put(\"Rule5\",result>=40000);"
##   ..$ : chr "output.put('Rule5Value',result);"
##   ..$ : chr "end"

Viewing results

length(transactionDataOutput)
[1] 5
str(transactionDataOutput[[5]]) #Rule 5 output
List of 3
 $ input             :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of  7 variables:
  ..$ Filters  : chr "Date > '2017-05-01' && CreditDebitIndicator == 'Debit'"
  ..$ GroupBy  : chr "AccountID"
  ..$ Column   : chr "TransactionAmount"
  ..$ Function : chr "sum"
  ..$ Operation: chr ">="
  ..$ Argument : chr "40000"
  ..$ ruleNum  : int 5
 $ intermediateOutput:Classes 'tbl_df', 'tbl' and 'data.frame': 11 obs. of  3 variables:
  ..$ AccountID : chr [1:11] "1300 41463086" "3077 81314800" "3256 22875398" "3335 81433260" ...
  ..$ Rule5     : chr [1:11] "true" "true" "true" "false" ...
  ..$ Rule5Value: num [1:11] 129946 122968 118012 32400 281931 ...
 $ output            :Classes 'tbl_df', 'tbl' and 'data.frame': 11 obs. of  3 variables:
  ..$ Group  : chr [1:11] "1300 41463086" "3077 81314800" "3256 22875398" "3335 81433260" ...
  ..$ Indices: chr [1:11] "78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,1"| __truncated__ "311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338"| __truncated__ "434,435,436" "438,439,440" ...
  ..$ IsTrue : chr [1:11] "true" "true" "true" "false" ...

Let us take the results obtained for Rule 5 to understand the applications of Rdrools. Rule 5 was:

\[\textsf{For a fraudulent/ anomalous account, the sum of Transaction_Amount } \\ \textsf{should be greater than or equal to USD 40,000 for all the debit transactions done after 2017-05-01}\]
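
To make the logic of Rule 5 concrete, the same check can be written in base R: filter, group, aggregate, compare. This is only an illustrative sketch of what the rule computes, not how Rdrools or Drools evaluates it. The `toy` data frame and its values are hypothetical; only the column names, the filter, the `sum` function and the `>= 40000` threshold are taken from the rule definition.

```r
# Hypothetical toy transactions; column names follow the transaction dataset.
toy <- data.frame(
  AccountID = c("A", "A", "A", "B", "B", "C"),
  Date = as.Date(c("2017-05-10", "2017-06-01", "2017-04-15",
                   "2017-05-20", "2017-07-02", "2017-06-11")),
  CreditDebitIndicator = c("Debit", "Debit", "Debit",
                           "Debit", "Credit", "Debit"),
  TransactionAmount = c(25000, 20000, 90000, 30000, 50000, 41000),
  stringsAsFactors = FALSE
)

# Filters: debit transactions dated after 2017-05-01
kept <- toy[toy$Date > as.Date("2017-05-01") &
              toy$CreditDebitIndicator == "Debit", ]

# GroupBy AccountID, Function sum on TransactionAmount
sums <- aggregate(TransactionAmount ~ AccountID, data = kept, FUN = sum)
names(sums)[2] <- "Rule5Value"

# Operation >= with Argument 40000
sums$Rule5 <- sums$Rule5Value >= 40000
print(sums)
```

In this toy data, account A passes on the strength of two post-cutoff debits, B fails because its credit transaction is filtered out, and C passes with a single large debit.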

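The output for each rule is a list of three objects: input, the rule definition as it was supplied; intermediateOutput, the aggregated value (Rule5Value) and the rule result (Rule5) for each group; and output, which lists, for each group, the indices of the rows belonging to it and whether the rule evaluated to true (IsTrue).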
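
Each element of transactionDataOutput is an ordinary R list, so its components can be reached with `$`. The sketch below uses a hand-built stand-in for `transactionDataOutput[[5]]`, assuming the structure printed by `str()` above. The group names and flags are the first four rows of that printout. The first two `Indices` strings are truncated in the printout, so they are shortened here purely for illustration.

```r
# Hypothetical stand-in mirroring the structure of transactionDataOutput[[5]]
rule5 <- list(
  input = data.frame(
    Filters = "Date > '2017-05-01' && CreditDebitIndicator == 'Debit'",
    GroupBy = "AccountID", Column = "TransactionAmount",
    Function = "sum", Operation = ">=", Argument = "40000",
    ruleNum = 5L, stringsAsFactors = FALSE
  ),
  intermediateOutput = data.frame(
    AccountID = c("1300 41463086", "3077 81314800",
                  "3256 22875398", "3335 81433260"),
    Rule5 = c("true", "true", "true", "false"),
    Rule5Value = c(129946, 122968, 118012, 32400),
    stringsAsFactors = FALSE
  ),
  output = data.frame(
    Group = c("1300 41463086", "3077 81314800",
              "3256 22875398", "3335 81433260"),
    # First two entries shortened for illustration (truncated in the source)
    Indices = c("78,79,80", "311,312,313", "434,435,436", "438,439,440"),
    IsTrue = c("true", "true", "true", "false"),
    stringsAsFactors = FALSE
  )
)

names(rule5)

# IsTrue is stored as character, so compare against the string "true"
flagged <- rule5$output[rule5$output$IsTrue == "true", ]

# Turn each comma-separated Indices string into an integer vector
flaggedRows <- lapply(strsplit(flagged$Indices, ","), as.integer)
names(flaggedRows) <- flagged$Group
flaggedRows[["3256 22875398"]]
```

Assuming these indices are row numbers in the input dataset, the resulting vectors could be used to subset transactionData and inspect the transactions behind each flagged account.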

Plotting graphs of the result obtained

The graph below shows the distribution of Account_IDs for which the rule evaluated to true or false. Here, the Account_IDs with a true value can be regarded as anomalous and those with a false value as non-anomalous.

anomaliesCountGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[5]][[5]]

The above graph shows that 4 Account_IDs satisfy the given rule and are therefore anomalous, while 7 Account_IDs are non-anomalous.

anomalousSetGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of groups")
anomalousSetGraph[[5]][[5]]

The above graph gives more information about the anomalous Account_IDs: it shows the sum of Transaction_Amount for each anomalous Account_ID.
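
The numbers behind such a plot can also be read directly from the intermediateOutput component. The following is a minimal sketch, assuming the structure shown by `str()` earlier. The data frame `intermediate` is a hypothetical stand-in holding only the first four rows of that printout, so it is an excerpt and not the full set of 11 accounts.

```r
# Stand-in for transactionDataOutput[[5]]$intermediateOutput (first four rows only)
intermediate <- data.frame(
  AccountID = c("1300 41463086", "3077 81314800",
                "3256 22875398", "3335 81433260"),
  Rule5 = c("true", "true", "true", "false"),
  Rule5Value = c(129946, 122968, 118012, 32400),
  stringsAsFactors = FALSE
)

# Keep the accounts for which the rule fired, largest sum first
anomalous <- intermediate[intermediate$Rule5 == "true",
                          c("AccountID", "Rule5Value")]
anomalous <- anomalous[order(-anomalous$Rule5Value), ]
print(anomalous)
```

A base-graphics call such as `barplot(anomalous$Rule5Value, names.arg = anomalous$AccountID)` would then draw a comparable bar chart without relying on plotgraphs.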

References

Drools Documentation

Rdrools Documentation