Enterprise Diff: September 2010

Monday, September 20, 2010

0.6.12 adds some documentation, fixes minor issues

ID	Type	Status	Priority	Summary	AllLabels
29	Defect	Fixed	Medium	vhl summary reports wrong number of rows diff'd	Type-Defect, Priority-Medium, OpSys-All
30	Defect	Fixed	Critical	TestCase 23 fails on linux	Type-Defect, Priority-Critical, OpSys-All
31	Enhancement	Fixed	High	create README file	Type-Enhancement, Priority-High, OpSys-All
32	Enhancement	Fixed	High	create QuickStart document	Type-Enhancement, Priority-High, OpSys-All
34	Defect	Duplicate	Medium	create User Guide	Type-Defect, Priority-Medium
35	Enhancement	Fixed	High	create faq	Type-Enhancement, Priority-High, OpSys-All

Thursday, September 16, 2010

TextDiffor in 0.6.11

0.6.11 introduces the TextDiffor, which is useful for diff'ng chunks of text that might have small formatting differences that you would like to ignore. A good example of this is programming language code. For instance, I was recently trying to diff SQL schemas (DDL) using metadata tables. A troublesome schema object to diff was the TEXT definition of stored procedures. One side looked like this:

 DECLARE  
  number1 NUMBER(2);  
  number2 NUMBER(2)  := 17;       -- value default   
  text1  VARCHAR2(12) := 'Hello world';  
  text2  DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  
   INTO number1  
   FROM address  
   WHERE name = 'INU';  
 END;

while the other side looked like this:

 DECLARE  
  number1  NUMBER(2);  
  number2  NUMBER(2)  := 17;       -- value default   
  text1    VARCHAR2(12) := 'Hello world';  
  text2    DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  INTO number1 FROM address WHERE name = 'INU';  
 END;

Note that the variable declarations have small alignment differences between the two sides, and that the SQL SELECT statement spans multiple lines in the first case but only one line in the second. These two snippets are identical PL/SQL programs (they produce the same AST), but they differ textually.

The TextDiffor will, by default, see these two snippets as identical. It applies a very simple text normalization before performing the String comparison:

1) replace all tabs and newlines ([\t\r\n]) with a single space character
2) compress all multi-character whitespace runs to a single space character
3) trim all whitespace from both ends
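
The three steps above can be sketched in a few lines. This is only an illustration in Python, not the TextDiffor source; the function names normalize and text_equal are hypothetical, and the sample strings are taken from the SELECT statements shown earlier.

```python
import re


def normalize(text: str) -> str:
    """Apply the three normalization steps listed above, in order."""
    text = re.sub(r"[\t\r\n]", " ", text)  # 1) tabs and newlines become a space
    text = re.sub(r"\s{2,}", " ", text)    # 2) whitespace runs collapse to one space
    return text.strip()                    # 3) trim both ends


def text_equal(lhs: str, rhs: str) -> bool:
    """Compare two chunks of text after normalization (hypothetical helper)."""
    return normalize(lhs) == normalize(rhs)


# The multiline and single-line forms of the SELECT from the example above.
lhs = "SELECT street_number\n   INTO number1\n   FROM address\n   WHERE name = 'INU';"
rhs = "SELECT street_number  INTO number1 FROM address WHERE name = 'INU';  "
print(text_equal(lhs, rhs))
```

Note that this kind of normalization only ignores the amount and kind of whitespace; it does not ignore the presence of whitespace, so "a b" and "ab" still differ.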

Saturday, September 11, 2010

Results summarization from MagicPlan

Release 0.6.10 introduces a new, optional capability to the MagicPlan. Previously, the file (report) produced by the MagicPlan only displayed individually itemized diffs. That is, each diff appeared only once in the output, as a detailed description of just that discrete diff. 0.6.10 allows you to instruct the MagicPlan to produce aggregate level summary information as well as the detailed individual diffs.

If you include this:

           <property name="withSummary" value="TRUE" />

as a property of the MagicPlan, the output file will have a header that looks like this:

 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
 -------------------  
 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
 ------------------------  
 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
 ---------------------------  
 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  
 ---------------------------

Above is the output from TestCase 23, which provides functional test coverage for the results summarization feature. The input data that produced this report:

lhs:                                      rhs:
column1,column2,column3,column4           column1,column2,column3,column4
----------------------------              1,      0000,   x,      aaaa
2,      1111,   x,      aaaa              ----------------------------
3,      2222,   y,      aaaa              3,      2222,   x,      aaaa
4,      0000,   z,      bbbb              4,      3333,   x,      aaaa
5,      4444,   z,      bbbb              5,      4444,   x,      aaaa
6,      5555,   u,      aaaa              6,      5555,   x,      aaaa
7,      0000,   v,      aaaa              ----------------------------
8,      1111,   x,      aaaa              ----------------------------

Note well that the primary key on both the lhs and rhs tables is column1. So DK will use column1 as the diff'ng key, to align the rows.

Let's dissect this report section by section. First, there is the Very High Level (vhl) summary:

 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
 -------------------  

The first line tells us how many rows were diff'd and how long it took. In this case 8 unique rows were evaluated for diffs. If a row occurs on only one side (is a ROW_DIFF), it counts as 1 row diff'd. In the case where DK is able to match the lhs row with a rhs row, that counts as 1 row diff'd, not 2. So the 8 rows that were diff'd are: the 1 row that appears only on the rhs (1), the 3 rows that appear only on the lhs (2,7,8), and the 4 rows that appear on both sides (3,4,5,6).

0:00:38.011 is an ISO 8601 style (h:mm:ss.SSS) elapsed time. It represents 0 hours, 0 minutes, 38 seconds, and 11 milliseconds. The next line, "!4 row diffs", starts with the ! mark, which is the symbol for ROW_DIFF in both the summary and detail sections of the report. The 4 row diffs are the rows of dashed lines in the tables above: 1 on the lhs, and 3 on the rhs. The terminology that DK uses is: "there are 4 rows missing". The final line of the vhl summary, "@7 column diffs", shows that there are a total of 7 individual column (or cell) value diffs. The @ sign is the symbol for COLUMN_DIFF in both the summary and detail sections of the report. The 7 column diffs are: row 3 column3, row 4 column2, row 4 column3, row 4 column4, row 5 column3, row 5 column4, row 6 column3.
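
To make the counting rules concrete, here is a minimal sketch that recomputes the vhl numbers from the TestCase 23 input shown above. It is written in Python purely for illustration and is not how DK is implemented; the names LHS, RHS, and format_elapsed are hypothetical, and the elapsed time is simply taken from the report rather than measured.

```python
# Rows keyed by column1 (the diff key); values are (column2, column3, column4).
LHS = {2: ("1111", "x", "aaaa"), 3: ("2222", "y", "aaaa"), 4: ("0000", "z", "bbbb"),
       5: ("4444", "z", "bbbb"), 6: ("5555", "u", "aaaa"), 7: ("0000", "v", "aaaa"),
       8: ("1111", "x", "aaaa")}
RHS = {1: ("0000", "x", "aaaa"), 3: ("2222", "x", "aaaa"), 4: ("3333", "x", "aaaa"),
       5: ("4444", "x", "aaaa"), 6: ("5555", "x", "aaaa")}


def format_elapsed(millis: int) -> str:
    """Render milliseconds as h:mm:ss.SSS, the shape used in the report."""
    seconds, ms = divmod(millis, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{ms:03d}"


# A matched lhs/rhs pair counts as 1 row diff'd, so count unique keys.
rows_diffd = len(LHS.keys() | RHS.keys())
# A ROW_DIFF is a key present on exactly one side.
row_diffs = len(LHS.keys() ^ RHS.keys())
# A COLUMN_DIFF is a differing cell within a row matched on both sides.
column_diffs = sum(l != r
                   for key in LHS.keys() & RHS.keys()
                   for l, r in zip(LHS[key], RHS[key]))

print(f"diff'd {rows_diffd} rows in {format_elapsed(38011)}, found:")
print(f"!{row_diffs} row diffs")
print(f"@{column_diffs} column diffs")
```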

The next section is the row diff summary:

 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
 ------------------------  

This breaks down the row diffs according to which side they occur on. The line, "1 row diffs <", tells us that there is 1 row missing from the lhs: row 1. The next line states that there are 3 rows missing from the rhs: row 2, row 7, and row 8.
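
The per-side breakdown amounts to two set differences over the diff keys. A small Python sketch, using the column1 values from the TestCase 23 input (the variable names are hypothetical and this is not DK's code):

```python
# column1 values present on each side of the TestCase 23 input.
lhs_keys = {2, 3, 4, 5, 6, 7, 8}
rhs_keys = {1, 3, 4, 5, 6}

# "<" rows exist only on the rhs, i.e. they are missing from the lhs.
missing_from_lhs = sorted(rhs_keys - lhs_keys)
# ">" rows exist only on the lhs, i.e. they are missing from the rhs.
missing_from_rhs = sorted(lhs_keys - rhs_keys)

print(f"{len(missing_from_lhs)} row diffs <")
print(f"{len(missing_from_rhs)} row diffs >")
```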

Next is the column diff summary section:

 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
 ---------------------------

This is a very straightforward grouping of the COLUMN_DIFFs, grouped according to which column the diff occurs in. column3 has 4 diffs: row 3, row 4, row 5, and row 6. column4 has 2 diffs: row 4, and row 5. column2 has 1 diff: row 4.
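
The grouping can be reproduced with a simple counter over the matched rows. This is an illustrative Python sketch only; MATCHED and COLUMNS are hypothetical names, and the data is the four rows of the TestCase 23 input that appear on both sides.

```python
from collections import Counter

COLUMNS = ("column2", "column3", "column4")

# Rows present on both sides, keyed by column1: (lhs values, rhs values).
MATCHED = {
    3: (("2222", "y", "aaaa"), ("2222", "x", "aaaa")),
    4: (("0000", "z", "bbbb"), ("3333", "x", "aaaa")),
    5: (("4444", "z", "bbbb"), ("4444", "x", "aaaa")),
    6: (("5555", "u", "aaaa"), ("5555", "x", "aaaa")),
}

# Count one COLUMN_DIFF per differing cell, grouped by column name.
per_column = Counter(
    name
    for lhs, rhs in MATCHED.values()
    for name, l, r in zip(COLUMNS, lhs, rhs)
    if l != r
)

for name, count in per_column.most_common():
    print(f"{name} has {count} diffs")
```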

Finally, the column diffs clustered section:

 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  
 ---------------------------

This groups the COLUMN_DIFF columns according to which row the diffs occur in. "Cluster" is another name for "pattern of column names having diffs all in the same row". The first line tells us that there are 3 clusters, and which columns participate in each cluster. The column3 cluster has 2 diffs. That is, there are two rows where the only COLUMN_DIFFs are in column3: row 3 and row 6. The column2.column3.column4 cluster has 1 diff: row 4. Finally, the column3.column4 cluster has 1 diff: row 5. Column diff clusters are useful for spotting patterns of linked or related column diffs, which can be helpful in understanding the origin of diffs.
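
One way to picture the clustering: for each matched row, collect the names of the columns that differ, join them into a cluster name, and count how many rows share each name. The Python sketch below does that for the TestCase 23 input; it is an assumption about the logic for illustration, not DK's implementation, and MATCHED and COLUMNS are hypothetical names.

```python
from collections import Counter

COLUMNS = ("column2", "column3", "column4")

# Rows present on both sides, keyed by column1: (lhs values, rhs values).
MATCHED = {
    3: (("2222", "y", "aaaa"), ("2222", "x", "aaaa")),
    4: (("0000", "z", "bbbb"), ("3333", "x", "aaaa")),
    5: (("4444", "z", "bbbb"), ("4444", "x", "aaaa")),
    6: (("5555", "u", "aaaa"), ("5555", "x", "aaaa")),
}

clusters = Counter()
for key in sorted(MATCHED):
    lhs, rhs = MATCHED[key]
    # The cluster name is the dot-joined list of columns that differ in this row.
    differing = [name for name, l, r in zip(COLUMNS, lhs, rhs) if l != r]
    if differing:
        clusters[".".join(differing)] += 1

for name, count in clusters.most_common():
    print(f"{name} has {count} diffs")
```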

Friday, September 10, 2010

diff'ng CLOBs

0.6.10 introduced a new default behavior for CLOB diff'ng. CLOBs usually represent formatted text. When diff'ng formatted text, users typically would like certain incidental aspects of the formatting to be ignored. So by default, CLOBs are now diff'd in a way that is insensitive to both *nix and Windows newlines (\n and \r, in any combination).
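
A minimal sketch of what newline-insensitive comparison can look like, in Python. It assumes that each run of \r and \n characters is treated as a single space before comparing, which is one reading of issue 28 below; the actual DK behavior may differ in detail, and the function name clob_equal is hypothetical.

```python
import re


def clob_equal(lhs: str, rhs: str) -> bool:
    """Compare two CLOB values, ignoring which newline convention each uses."""
    def strip_newlines(text: str) -> str:
        # Assumption: any run of \r and \n characters becomes one space.
        return re.sub(r"[\r\n]+", " ", text)

    return strip_newlines(lhs) == strip_newlines(rhs)


print(clob_equal("line one\r\nline two", "line one\nline two"))
```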

0.6.10 fixes the following issues

ID	Type	Status	Priority	Summary	AllLabels
21	Enhancement	Fixed	Critical	FileSink should be able to produce summaries	Type-Enhancement, Priority-Critical, OpSys-All
22	Defect	Fixed	High	displayColumnNames should be validated	Type-Defect, Priority-High, OpSys-All
23	Enhancement	Fixed	High	extend TestCaseRunner to test for failures, exceptions	Type-Enhancement, Priority-High, OpSys-All
24	Defect	Fixed	Medium	add cluster information to column diff summary	Type-Defect, Priority-Medium
25	Enhancement	Fixed	High	add group by (column list) option to Sink summary	Type-Enhancement, Priority-High, OpSys-All
28	Defect	Fixed	High	Replace newline characters with spaces in clobs	Type-Defect, Priority-High

Wednesday, September 1, 2010

0.6.9 fixes the following issues

ID	Type	Priority	Summary	AllLabels
13	Enhancement	High	ant build target to execute JUnit tests	Type-Enhancement, Priority-High, OpSys-All
14	Defect	High	ant build target to execute TestCases	Type-Defect, Priority-High, OpSys-All
15	Enhancement	Medium	add elapsed diff time to user output from standalone app	Type-Enhancement, Priority-Medium, OpSys-All
16	Defect	Medium	add diff progress indicator to output from standalone app	Type-Defect, Priority-Medium, OpSys-All
17	Defect	Critical	MagicPlan does not accept diffKind parameter	Type-Defect, Priority-Critical, OpSys-All
18	Defect	Critical	maxDiffs property does not work in MagicPlan	Type-Defect, Priority-Critical, OpSys-All