## Saturday, September 11, 2010

### Results summarization from MagicPlan

Release 0.6.10 introduces a new, optional, capacity to the MagicPlan. Previously, the file (report) produced by the MagicPlan only displayed individually itemized diffs. That is, each diff appeared, only once, in the output as a detailed description of just that discrete diff. 0.6.10 allows you to instruct the MagicPlan to produce aggregate level summary information as well as the detailed individual diffs.

If you include this:
``````           <property name="withSummary" value="TRUE" />
``````
as a property of the MagicPlan, the output file will have a header that looks like this:
`````` --- vhl summary ---
diff'd 8 rows in 0:00:38.011, found:
!4 row diffs
@7 column diffs
-------------------
--- row diff summary ---
1 row diffs <
3 row diffs >
------------------------
--- column diff summary ---
columns having diffs->(column3, column4, column2)
column3 has 4 diffs
column4 has 2 diffs
column2 has 1 diffs
---------------------------
--- column diffs clustered ---
columnClusters having diffs->(column3, column2.column3.column4, column3.column4)
column3 has 2 diffs
column2.column3.column4 has 1 diffs
column3.column4 has 1 diffs
---------------------------
``````

Above is the output from TestCase 23, which provides functional test coverage for the results summarization feature. The input data that produced this report:
``````lhs:                                      rhs:
column1,column2,column3,column4           column1,column2,column3,column4
----------------------------              1,      0000,   x,      aaaa
2,      1111,   x,      aaaa              ----------------------------
3,      2222,   y,      aaaa              3,      2222,   x,      aaaa
4,      0000,   z,      bbbb              4,      3333,   x,      aaaa
5,      4444,   z,      bbbb              5,      4444,   x,      aaaa
6,      5555,   u,      aaaa              6,      5555,   x,      aaaa
7,      0000,   v,      aaaa              ----------------------------
8,      1111,   x,      aaaa              ----------------------------``````
Note well that the primary key on both the lhs and rhs tables is column1. So DK will use column1 as the diff'ng key, to align the rows.

Dissecting this report, section by section; first, there is the Very High Level (vhl) summary:
`````` --- vhl summary ---
diff'd 8 rows in 0:00:38.011, found:
!4 row diffs
@7 column diffs
-------------------
``````
The first line tells us how many rows were diff'd and how long it took. In this case 8 unique rows were evaluated for diffs. If a row occurs on only one side (is a ROW_DIFF), it counts as 1 row diff'd. In the case where DK is able to match the lhs row with a rhs row, that counts as 1 row diff'd, not 2. So the 8 rows that were diff'd are: the 1 row that appears only on the rhs (1), the 3 rows that appear only on the lhs (2,7,8), and the 4 rows that appear on both sides (3,4,5,6).

0:00:38.011 is an ISO 8601 formatted time specification. It represents 0 hours, 0 minutes, 38 seconds, and 11 milliseconds. The next line, "!4 row diffs", starts with the ! mark, which is the symbol for ROW_DIFF in both the summary and detail sections of the report. The 4 row diffs are the rows of dashed lines in the tables above: 1 on the lhs, and 3 on the rhs. The terminology that DK uses is: "there are 4 rows missing". The final line of the vhl summary, "@7 column diffs", shows that there are a total of 7 individual column (or cell) value diffs. The @ sign is the symbol for COLUMN_DIFF in both the summary and detail sections of the report. The 7 column diffs are: row 3 column3, row 4 column2, row 4 column3, row 4 column4, row 5 column 3, row 5 column4, row 6 column3.

The next section is the row diff summary:
`````` --- row diff summary ---
1 row diffs <
3 row diffs >
------------------------
``````
This breaks down the row diffs according to which side they occur on. The line, "1 row diffs <", tells us that there is 1 row missing from the lhs: row 1. The next line states that there are 3 rows missing from the rhs: row 2, row 7, and row 8.

Next is the column diff summary section:
`````` --- column diff summary ---
columns having diffs->(column3, column4, column2)
column3 has 4 diffs
column4 has 2 diffs
column2 has 1 diffs
---------------------------``````
This is a very straightforward grouping of the COLUMN_DIFFs, grouped according to which column the diff occurs in. column3 has 4 diffs: row 3, row 4, row 5, and row 6. column4 has 2 diffs: row 4, and row 5. column2 has 1 diff: row 4.

Finally, the column diffs clustered section:
`````` --- column diffs clustered ---
columnClusters having diffs->(column3, column2.column3.column4, column3.column4)
column3 has 2 diffs
column2.column3.column4 has 1 diffs
column3.column4 has 1 diffs
--------------------------- ``````
This groups the COLUMN_DIFF columns according to which row the diffs occur in. "Cluster" is another name for "pattern of column names having diffs all in the same row". The first line tells us that there are 3 clusters, and which columns participate in each cluster. The column3 cluster has 2 diffs. That is, there are two rows where the only COLUMN_DIFFs are in column3: row 3 and row 6. The column2.column3.column4 cluster has 1 diff: row 4. Finally, the column3.column4 cluster has 1 diff: row 5. Column diff clusters are useful for spotting patterns of linked or related column diffs, which can be helpful in understanding the origin of diffs.