Thursday, September 16, 2010

TextDiffor in 0.6.11

0.6.11 introduces the TextDiffor, which is useful for diffing chunks of text that might have small formatting differences that you would like to ignore. A good example of this is programming language code. For instance, I was recently trying to diff SQL schemas (DDL) using metadata tables. A troublesome schema object to diff was the TEXT definition of stored procedures. One side looked like this:

 DECLARE  
  number1 NUMBER(2);  
  number2 NUMBER(2)  := 17;       -- value default   
  text1  VARCHAR2(12) := 'Hello world';  
  text2  DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  
   INTO number1  
   FROM address  
   WHERE name = 'INU';  
 END;  

while the other side looked like this:

 DECLARE  
  number1  NUMBER(2);  
  number2  NUMBER(2)  := 17;       -- value default   
  text1    VARCHAR2(12) := 'Hello world';  
  text2    DATE     := SYSDATE;    -- current date and time  
 BEGIN  
  SELECT street_number  INTO number1 FROM address WHERE name = 'INU';  
 END;  

Note that the variable declarations have small alignment differences between the two sides, and that the SQL SELECT statement spans multiple lines in the first case but only one line in the second. These two snippets are identical PL/SQL programs (they produce the same AST), but they differ textually.

The TextDiffor will, by default, see these two snippets as identical. It applies a very simple text normalization before performing the String comparison:

1) replace all tabs and newlines ([\t\r\n]) with a single space character
2) compress all multi-character whitespace runs to a single space character
3) trim all whitespace from both ends
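
The three steps above can be sketched in a few lines of Java. This is only an illustration of the normalization as described in this post, not the actual TextDiffor source; the class name `WhitespaceNormalizer` and both method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the three normalization steps.
// It is NOT the real TextDiffor implementation.
final class WhitespaceNormalizer {

    // Step 1: tabs, carriage returns and newlines
    private static final Pattern TABS_AND_NEWLINES = Pattern.compile("[\\t\\r\\n]");
    // Step 2: any run of one or more whitespace characters
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private WhitespaceNormalizer() {
    }

    static String normalize(String text) {
        // 1) replace each tab or newline character with a space
        String result = TABS_AND_NEWLINES.matcher(text).replaceAll(" ");
        // 2) compress every whitespace run to a single space
        result = WHITESPACE_RUN.matcher(result).replaceAll(" ");
        // 3) trim whitespace from both ends
        return result.trim();
    }

    // Two chunks of text are "the same" if their normalized forms are equal.
    static boolean sameIgnoringWhitespace(String lhs, String rhs) {
        return normalize(lhs).equals(normalize(rhs));
    }
}
```

With this sketch, `WhitespaceNormalizer.sameIgnoringWhitespace(...)` would report the multiline and single-line SELECT statements above as equal. Note that a normalization this simple also collapses whitespace inside string literals, so `'Hello  world'` and `'Hello world'` would compare as equal too.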

7 comments:

  1. Have you had a look at google-diff-match-patch? It has a Java version too. Just FYI.

    Cheers!

  2. Thanks for pointing that out, Ashwin. I'll see if it makes sense to plug that in instead of rolling my own. One requirement that has been posed is to be able to ignore comments in stored procedures. My initial thinking is to simply implement this as a list of regexes to be ignored, but perhaps google-diff-match-patch can already handle it.

    thanks,

    Joe
