Monday, September 20, 2010

0.6.12 adds some documentation, fixes minor issues

ID Type Status Priority Summary AllLabels
29 Defect Fixed Medium vhl summary reports wrong number of rows diff'd Type-Defect, Priority-Medium, OpSys-All
30 Defect Fixed Critical TestCase 23 fails on linux Type-Defect, Priority-Critical, OpSys-All
31 Enhancement Fixed High create README file Type-Enhancement, Priority-High, OpSys-All
32 Enhancement Fixed High create QuickStart document Type-Enhancement, Priority-High, OpSys-All
34 Defect Duplicate Medium create User Guide Type-Defect, Priority-Medium
35 Enhancement Fixed High create faq Type-Enhancement, Priority-High, OpSys-All

Thursday, September 16, 2010

TextDiffor in 0.6.11

0.6.11 introduces the TextDiffor, which is useful for diff'ng chunks of text that might have small formatting differences you would like to ignore. A good example is programming language code. For instance, I was recently trying to diff SQL schemas (DDL) using metadata tables. A troublesome schema object to diff was the TEXT definition of stored procedures. One side looked like this:

 DECLARE  
  number1 NUMBER(2);  
  number2 NUMBER(2)  := 17;       -- value default   
  text1  VARCHAR2(12) := 'Hello world';  
  text2  DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  
   INTO number1  
   FROM address  
   WHERE name = 'INU';  
 END;  

while the other side looked like this:

 DECLARE  
  number1  NUMBER(2);  
  number2  NUMBER(2)  := 17;       -- value default   
  text1    VARCHAR2(12) := 'Hello world';  
  text2    DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  INTO number1 FROM address WHERE name = 'INU';  
 END;  

Note that the variable declarations have small alignment differences between the two sides, and that the SQL SELECT statement spans multiple lines in the first case but only 1 line in the second. These two snippets are equivalent PL/SQL programs (they produce the same AST), but they differ textually.

The TextDiffor will, by default, see these two snippets as identical. It applies a very simple text normalization before performing the String comparison:

1) replace all tabs and newlines ([\t\r\n]) with a single space character
2) compress all multi-character whitespace runs to a single space character
3) trim all whitespace from both ends
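
The three steps above can be sketched in a few lines of Python. This is only an illustration of the normalization as described here, not the TextDiffor source; the function names `normalize` and `text_equal` are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Apply the three normalization steps described in the post."""
    text = re.sub(r"[\t\r\n]", " ", text)  # 1) tabs and newlines -> single space
    text = re.sub(r"\s{2,}", " ", text)    # 2) collapse whitespace runs to one space
    return text.strip()                    # 3) trim both ends

def text_equal(lhs: str, rhs: str) -> bool:
    """Compare two chunks of text after normalization (hypothetical helper)."""
    return normalize(lhs) == normalize(rhs)

# The SELECT statement from the two snippets: multiline vs. one line.
multiline = "SELECT street_number\n   INTO number1\n   FROM address\n   WHERE name = 'INU';\n"
one_line = "SELECT street_number  INTO number1 FROM address WHERE name = 'INU';"
print(text_equal(multiline, one_line))
```

Because only whitespace is touched, text that differs in any non-whitespace character still compares as different.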

Saturday, September 11, 2010

Results summarization from MagicPlan

Release 0.6.10 introduces a new, optional capability to the MagicPlan. Previously, the file (report) produced by the MagicPlan displayed only individually itemized diffs. That is, each diff appeared exactly once in the output, as a detailed description of just that discrete diff. 0.6.10 allows you to instruct the MagicPlan to produce aggregate level summary information as well as the detailed individual diffs.

If you include this:
           <property name="withSummary" value="TRUE" />  
as a property of the MagicPlan, the output file will have a header that looks like this:
 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
 -------------------  
 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
 ------------------------  
 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
 ---------------------------  
 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  
 ---------------------------  

Above is the output from TestCase 23, which provides functional test coverage for the results summarization feature. The input data that produced this report:
lhs:                                      rhs:
column1,column2,column3,column4           column1,column2,column3,column4
----------------------------              1,      0000,   x,      aaaa
2,      1111,   x,      aaaa              ----------------------------
3,      2222,   y,      aaaa              3,      2222,   x,      aaaa
4,      0000,   z,      bbbb              4,      3333,   x,      aaaa
5,      4444,   z,      bbbb              5,      4444,   x,      aaaa
6,      5555,   u,      aaaa              6,      5555,   x,      aaaa
7,      0000,   v,      aaaa              ----------------------------
8,      1111,   x,      aaaa              ----------------------------
Note well that the primary key on both the lhs and rhs tables is column1. So DK will use column1 as the diff'ng key, to align the rows.

Let's dissect this report section by section. First, there is the Very High Level (vhl) summary:
 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
-------------------  
The first line tells us how many rows were diff'd and how long it took. In this case 8 unique rows were evaluated for diffs. If a row occurs on only one side (is a ROW_DIFF), it counts as 1 row diff'd. In the case where DK is able to match the lhs row with a rhs row, that counts as 1 row diff'd, not 2. So the 8 rows that were diff'd are: the 1 row that appears only on the rhs (1), the 3 rows that appear only on the lhs (2,7,8), and the 4 rows that appear on both sides (3,4,5,6).

0:00:38.011 is the elapsed time in an ISO 8601 style hours:minutes:seconds.milliseconds format. It represents 0 hours, 0 minutes, 38 seconds, and 11 milliseconds.

The next line, "!4 row diffs", starts with the ! mark, which is the symbol for ROW_DIFF in both the summary and detail sections of the report. The 4 row diffs are the rows of dashed lines in the tables above: 1 on the lhs, and 3 on the rhs. The terminology that DK uses is: "there are 4 rows missing".

The final line of the vhl summary, "@7 column diffs", shows that there are a total of 7 individual column (or cell) value diffs. The @ sign is the symbol for COLUMN_DIFF in both the summary and detail sections of the report. The 7 column diffs are: row 3 column3, row 4 column2, row 4 column3, row 4 column4, row 5 column3, row 5 column4, row 6 column3.
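
To make the counting concrete, here is a minimal Python sketch that reproduces the vhl numbers from the TestCase 23 input data. It is not how DK is implemented; the dict layout and variable names are assumptions chosen for illustration, with column1 used as the key.

```python
# TestCase 23 input data, keyed by column1 (the primary key).
COLUMNS = ("column2", "column3", "column4")
lhs = {
    2: ("1111", "x", "aaaa"),
    3: ("2222", "y", "aaaa"),
    4: ("0000", "z", "bbbb"),
    5: ("4444", "z", "bbbb"),
    6: ("5555", "u", "aaaa"),
    7: ("0000", "v", "aaaa"),
    8: ("1111", "x", "aaaa"),
}
rhs = {
    1: ("0000", "x", "aaaa"),
    3: ("2222", "x", "aaaa"),
    4: ("3333", "x", "aaaa"),
    5: ("4444", "x", "aaaa"),
    6: ("5555", "x", "aaaa"),
}

# Every distinct key counts as one row diff'd, matched or not.
all_keys = sorted(lhs.keys() | rhs.keys())
missing_from_lhs = [k for k in all_keys if k not in lhs]  # reported as "<"
missing_from_rhs = [k for k in all_keys if k not in rhs]  # reported as ">"

# Column diffs only exist for rows present on both sides.
column_diffs = [
    (key, name)
    for key in all_keys
    if key in lhs and key in rhs
    for name, left, right in zip(COLUMNS, lhs[key], rhs[key])
    if left != right
]

print(f"diff'd {len(all_keys)} rows")
print(f"!{len(missing_from_lhs) + len(missing_from_rhs)} row diffs")
print(f"@{len(column_diffs)} column diffs")
```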

The next section is the row diff summary:
 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
------------------------ 
This breaks down the row diffs according to which side they occur on. The line, "1 row diffs <", tells us that there is 1 row missing from the lhs: row 1. The next line states that there are 3 rows missing from the rhs: row 2, row 7, and row 8.

Next is the column diff summary section:
 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
 ---------------------------
This is a straightforward grouping of the COLUMN_DIFFs by the column in which each diff occurs. column3 has 4 diffs: row 3, row 4, row 5, and row 6. column4 has 2 diffs: row 4 and row 5. column2 has 1 diff: row 4.
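
As an illustration, the per-column counts can be derived from the 7 column diffs with a simple counter. This is a sketch, not DK code; the `(row, column)` tuple representation is an assumption.

```python
from collections import Counter

# The 7 column diffs from TestCase 23, as (row key, column name) pairs.
column_diffs = [
    (3, "column3"),
    (4, "column2"), (4, "column3"), (4, "column4"),
    (5, "column3"), (5, "column4"),
    (6, "column3"),
]

# Group by column name, ignoring which row each diff came from.
per_column = Counter(name for _, name in column_diffs)

for name, count in per_column.most_common():
    print(f"{name} has {count} diffs")
```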

Finally, the column diffs clustered section:
 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  
 --------------------------- 
This groups the COLUMN_DIFF columns according to which row the diffs occur in. "Cluster" is another name for "pattern of column names having diffs all in the same row". The first line tells us that there are 3 clusters, and which columns participate in each cluster. The column3 cluster has 2 diffs. That is, there are two rows where the only COLUMN_DIFFs are in column3: row 3 and row 6. The column2.column3.column4 cluster has 1 diff: row 4. Finally, the column3.column4 cluster has 1 diff: row 5. Column diff clusters are useful for spotting patterns of linked or related column diffs, which can be helpful in understanding the origin of diffs.
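
A cluster count can be sketched the same way: first collect the differing columns of each row, then count how often each combination occurs. Again this is only an illustration under assumed data structures, not the DK implementation; joining the sorted column names with "." simply mirrors the naming seen in the report.

```python
from collections import Counter, defaultdict

# The 7 column diffs from TestCase 23, as (row key, column name) pairs.
column_diffs = [
    (3, "column3"),
    (4, "column2"), (4, "column3"), (4, "column4"),
    (5, "column3"), (5, "column4"),
    (6, "column3"),
]

# Collect the set of differing columns for each row.
columns_by_row = defaultdict(list)
for row, name in column_diffs:
    columns_by_row[row].append(name)

# A cluster is the combination of columns that differ within one row.
clusters = Counter(".".join(sorted(names)) for names in columns_by_row.values())

for cluster, count in clusters.most_common():
    print(f"{cluster} has {count} diffs")
```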

Friday, September 10, 2010

diff'ng CLOBs

0.6.10 introduced a new default behavior for CLOB diff'ng. CLOBs usually represent formatted text. When diff'ng formatted text, users typically would like certain incidental aspects of the formatting to be ignored. So by default, CLOBs are now diff'd in a way that is insensitive to both *nix and Windows newlines (\n and \r, in any combination).
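
A newline-insensitive comparison along these lines might look like the sketch below. It assumes, following the summary of issue 28 in the list that follows, that newline characters are replaced with spaces before comparing; collapsing a whole run of \r and \n into one space is my assumption, and the function names are hypothetical. The actual DK behavior may differ in detail.

```python
import re

def normalize_newlines(clob_text: str) -> str:
    """Replace each run of \\r and \\n characters with a single space (assumed behavior)."""
    return re.sub(r"[\r\n]+", " ", clob_text)

def clobs_equal(lhs: str, rhs: str) -> bool:
    """Compare two CLOB values while ignoring newline style (hypothetical helper)."""
    return normalize_newlines(lhs) == normalize_newlines(rhs)

windows_text = "line one\r\nline two"
unix_text = "line one\nline two"
print(clobs_equal(windows_text, unix_text))
```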

0.6.10 fixes the following issues

ID Type Status Priority Summary AllLabels
21 Enhancement Fixed Critical FileSink should be able to produce summaries Type-Enhancement, Priority-Critical, OpSys-All
22 Defect Fixed High displayColumnNames should be validated Type-Defect, Priority-High, OpSys-All
23 Enhancement Fixed High extend TestCaseRunner to test for failures, exceptions Type-Enhancement, Priority-High, OpSys-All
24 Defect Fixed Medium add cluster information to column diff summary Type-Defect, Priority-Medium
25 Enhancement Fixed High add group by (column list) option to Sink summary Type-Enhancement, Priority-High, OpSys-All
28 Defect Fixed High Replace newline characters with spaces in clobs Type-Defect, Priority-High

Wednesday, September 1, 2010

0.6.9 fixes the following issues

ID Type Priority Summary AllLabels
13 Enhancement High ant build target to execute JUnit tests Type-Enhancement, Priority-High, OpSys-All
14 Defect High ant build target to execute TestCases Type-Defect, Priority-High, OpSys-All
15 Enhancement Medium add elapsed diff time to user output from standalone app Type-Enhancement, Priority-Medium, OpSys-All
16 Defect Medium add diff progress indicator to output from standalone app Type-Defect, Priority-Medium, OpSys-All
17 Defect Critical MagicPlan does not accept diffKind parameter Type-Defect, Priority-Critical, OpSys-All
18 Defect Critical maxDiffs property does not work in MagicPlan Type-Defect, Priority-Critical, OpSys-All