Sunday, January 16, 2011

Plugging 3rd party jars into DiffKit

With the 0.8.10 release, you can add third-party jars to DiffKit. DiffKit will recognize JDBC drivers found in those jars, as well as custom Sources, Sinks, and Diffors.
The DiffKit distribution includes a directory named "dropin". It’s directly under the DIFFKIT_HOME directory, which is normally the directory that contains diffkit-app.jar after you have unzipped the binary download. The directory is empty in the distribution. To have DiffKit recognize an extra jar, simply copy it into the dropin directory. DiffKit automatically loads all jars within dropin each time it runs. Note well that the only archive type DiffKit recognizes is jar; it does not recognize zip, war, ear, etc.
All jars within dropin are loaded before any of the third-party jars that are embedded within DiffKit. That means you can override JDBC drivers that are part of the DiffKit binary distribution. You might need to do this in the case of HyperSQL DB, where the version of the JDBC driver must match the version of the DB server. You might also want to add a new JDBC driver in the case where the vendor supplies a newer version of the driver than the one that DiffKit embeds, and that newer version is higher performance.
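To make the load order concrete, here is a minimal sketch of how a dropin-style directory could be scanned and placed ahead of embedded jars on a class loader's search path. It is an illustration only, not DiffKit's actual loader: the class name DropinLoaderSketch and the methods findJars and buildLoader are hypothetical, and it assumes both sets of jars are served by the same URLClassLoader, so the order of the URLs decides which copy of a class wins.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; DiffKit's real loading code may differ.
class DropinLoaderSketch {

    // Collects the *.jar files directly inside dropinDir, sorted by name.
    // Other archive types (zip, war, ear) do not match the glob and are ignored.
    static List<Path> findJars(Path dropinDir) throws IOException {
        List<Path> jars = new ArrayList<>();
        if (!Files.isDirectory(dropinDir)) {
            return jars; // a missing dropin directory simply means no extra jars
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dropinDir, "*.jar")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    jars.add(p);
                }
            }
        }
        Collections.sort(jars);
        return jars;
    }

    // Builds a class loader that searches the dropin jars before the embedded jars.
    // URLClassLoader searches its URLs in the order given, so a class present in
    // both places is taken from the dropin jar.
    static URLClassLoader buildLoader(List<Path> dropinJars, List<Path> embeddedJars)
            throws IOException {
        List<URL> urls = new ArrayList<>();
        for (Path p : dropinJars) {
            urls.add(p.toUri().toURL());
        }
        for (Path p : embeddedJars) {
            urls.add(p.toUri().toURL());
        }
        return new URLClassLoader(urls.toArray(new URL[0]),
                DropinLoaderSketch.class.getClassLoader());
    }

    public static void main(String[] args) throws IOException {
        Path dropin = Paths.get(args.length > 0 ? args[0] : "dropin");
        List<Path> jars = findJars(dropin);
        System.out.println("dropin jars found: " + jars);
        try (URLClassLoader loader = buildLoader(jars, new ArrayList<Path>())) {
            System.out.println("search path entries: " + loader.getURLs().length);
        }
    }
}
```

Note that a URLClassLoader asks its parent loader first, so this ordering only overrides classes that the parent cannot already see.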
The principal use for dropins is to let you write custom Source, Sink, and Diffor Java classes and then plug those implementations into DiffKit. Source, Sink, and Diffor are documented in the User Guide. These three key abstractions are represented as Java interfaces within DiffKit, and you can write your own Java classes that implement them. You will need to use a PassthroughPlan to reference the custom classes, because MagicPlan does not recognize them. Custom classes are referenced within the PassthroughPlan in exactly the same way that DiffKit native classes are.
As a trivial example of a custom Sink, the DiffKit source distribution includes class org.diffkit.contrib.DKSimpleCustomSink. org.diffkit.contrib is a package that is not included in the binary distribution. You can plug that custom Sink into your plan with, for instance, this specification:
<bean id="plan" class="org.diffkit.diff.conf.DKPassthroughPlan">
   <property name="lhsSource" ref="lhs.source" />
   <property name="rhsSource" ref="rhs.source" />
   <property name="sink" ref="customSink" />
   <property name="tableComparison" ref="table.comparison" />
</bean>
<bean id="customSink" class="org.diffkit.contrib.DKSimpleCustomSink">
   <constructor-arg index="0" value="./custom_report.diff" />
</bean>
Having DiffKit recognize your custom Sink is simply a matter of compiling org.diffkit.contrib.DKSimpleCustomSink, archiving it into a jar file, and then dropping that jar file into the dropin directory.

Saturday, January 15, 2011

Comparing file tables with DiffKit

In order for DiffKit to diff two Tables (a Table simply being a set of rows), it must know how to align the rows from the left side Table with those on the right. It does that using a key: one or more columns. Before the row sets can be diff’d, they must be sorted by that key. If you are using a DB Source, DiffKit will sort the Tables for you and you don’t need to do anything. If you are using a File Source, DiffKit will not sort the files for you. In the future, I plan to modify DiffKit to do the sorting for you, but in the meantime you must sort the files yourself.
When you perform the sort, you need to ensure that you are sorting using the same comparison function that DiffKit will use internally to compare rows. That’s because comparing rows internally is how DiffKit figures out ROW_DIFFs.
If you are using a MagicPlan to diff File Sources, DiffKit has no data type information about the columns; MagicPlan doesn’t allow it. It’s just a text file, so DiffKit has to assume that all columns are data type String. In that case, DiffKit will use a lexical (String) comparison internally to compare rows, and you must ensure that you have also used a lexical sort when you sort the file. I believe that the default comparison for Unix sort is lexical, though its exact ordering follows the current locale; running it with LC_ALL=C forces plain byte order.
If you are using a PassthroughPlan to diff File Sources, you need to tell DiffKit the type of each column. If in the PassthroughPlan you have told DiffKit that the key column is type String, then DiffKit behaves exactly as in the case of the MagicPlan, and you must use a lexical sort on the file. However, if in the PassthroughPlan you tell DiffKit that the key has a numeric data type, then DiffKit will internally use a numeric comparison on the rows, and you must sort the file using a numeric comparison.
Bottom line:
DiffKit internal comparison ==(must equal)== comparison used to sort File. DiffKit internal comparison is based on data type(s) of the key.
MagicPlan always results in String data type(s) for the key. PassthroughPlan results in whatever column data types you specify for the key.
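The difference between the two comparisons is easy to see on a handful of keys. The following is a small standalone sketch, not DiffKit code; the class name KeySortDemo and its methods are made up for illustration. It sorts the same keys once as Strings and once as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; it does not use any DiffKit API.
class KeySortDemo {

    // Lexical order: keys are compared as Strings (String.compareTo).
    static List<String> sortLexically(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Numeric order: keys are parsed as whole numbers and compared by value.
    static List<String> sortNumerically(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingLong((String k) -> Long.parseLong(k)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "2", "100");
        System.out.println("lexical: " + sortLexically(keys));   // [10, 100, 2, 9]
        System.out.println("numeric: " + sortNumerically(keys)); // [2, 9, 10, 100]
    }
}
```

A file sorted in the first order but compared in the second (or the reverse) would have its rows misaligned. On the command line, plain sort versus sort -n is the corresponding choice.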

Monday, December 27, 2010

Generating db patches with DiffKit

In the 0.8.7 release, DiffKit gained the ability to generate database "patches". These patches are analogous to the patch files produced by traditional *nix diff tools: they can be read by a patching tool in order to edit the RHS so that it is identical to the LHS.
Table 1. DiffKit versus Diffutils
diff tool    patch format    patch tool
DiffKit      SQL DML         any DML applicator
The "patch" files produced by DiffKit contain only INSERT, DELETE, and UPDATE statements. After the user applies those DML statements to the RHS table, using whichever tools and techniques they prefer, the RHS table will have identical contents to the LHS table. The DiffKit application will never directly modify your tables; DiffKit is strictly a read-only application from the perspective of your table data.
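To illustrate the idea (not the SqlPatchSink implementation), here is a minimal sketch that derives such statements from two in-memory tables. Everything in it is an assumption made for the example: the class PatchSketch, the table name T, the columns ID and VAL, and the simplification that each row is one non-null String value keyed by a String id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; table T and columns ID/VAL are invented for illustration.
class PatchSketch {

    // Returns DML that, applied to rhs, makes its contents identical to lhs.
    static List<String> patch(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> dml = new ArrayList<>();
        for (Map.Entry<String, String> row : lhs.entrySet()) {
            String key = row.getKey();
            if (!rhs.containsKey(key)) {
                // Row exists only on the LHS: insert it into the RHS.
                dml.add("INSERT INTO T (ID, VAL) VALUES (" + quote(key) + ", "
                        + quote(row.getValue()) + ");");
            } else if (!row.getValue().equals(rhs.get(key))) {
                // Row exists on both sides with different values: update the RHS.
                dml.add("UPDATE T SET VAL=" + quote(row.getValue())
                        + " WHERE ID=" + quote(key) + ";");
            }
        }
        for (String key : rhs.keySet()) {
            if (!lhs.containsKey(key)) {
                // Row exists only on the RHS: delete it.
                dml.add("DELETE FROM T WHERE ID=" + quote(key) + ";");
            }
        }
        return dml;
    }

    // Wraps a value in single quotes, doubling any embedded single quote.
    static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        Map<String, String> lhs = new LinkedHashMap<>();
        lhs.put("1", "a");
        lhs.put("2", "b");
        lhs.put("4", "d");
        Map<String, String> rhs = new LinkedHashMap<>();
        rhs.put("1", "a");
        rhs.put("4", "x");
        rhs.put("5", "e");
        for (String statement : patch(lhs, rhs)) {
            System.out.println(statement);
        }
    }
}
```

With the sample data in main, the sketch emits one INSERT (key 2), one UPDATE (key 4), and one DELETE (key 5). Nothing is executed against a database, which mirrors the read-only stance described above.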
DB patches are created by using a new Sink implementation: the SqlPatchSink. test26.plan.xml, in the eg/ (examples) folder, demonstrates this:
      <property name="sqlPatchFilePath" value="./test26.sink.patch" />
invoked this way:
java -jar ../diffkit-app.jar -planfiles test26.plan.xml,dbConnectionInfo.xml
produces this output in the patch file:

VALUES ('2', 'xxxx', 2, 'zz2zz');


SET COLUMN2='5555', COLUMN3=4, COLUMN4='zz4zz'

VALUES ('5', 'xxxx', 5, 'zz5zz');


Wednesday, December 15, 2010

Embedding the DiffKit framework in your application

Here is some helpful information for using the DiffKit framework within your Java application.
  • everything you need is inside the binary distribution: diffkit-<release>.zip. In fact, everything you need is within the standalone application: diffkit-app.jar. You do not need the source distribution.
  • unjar diffkit-app.jar. All of the DiffKit API is then in the diffkit-<release>.jar file. That file is (as of this writing) 315KB and you can embed it in your Java application in the same way that you would embed any other jar.
  • all of the diffkit-<release>.jar dependencies are in the lib/ directory that resulted from unjarring diffkit-app.jar. Not all of those jars are hard dependencies; many of them are only loaded if you use certain functionality. In particular, if you are embedding DiffKit in your application and only programming against the core APIs, you do not need to include these jars in your application:
    • groovy-all-<release>.jar — only needed to run the embedded TestCaseRunner.
    • h2-<release>.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the H2 database.
    • db2jcc.jar,db2jcc_license_cu.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the IBM DB2 database.
    • ojdbc14.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the Oracle database.
    • mysql-connector-java-5.1.13-bin.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the MySQL database.
    • jtds-1.2.5.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the SQL Server database.
    • postgresql-9.0-801.jdbc4.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the PostgreSQL database.
    • hsqldb.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the HyperSQL database.
    • org.springframework.*.jar — only needed if you want to configure your use of DiffKit via the Spring framework. If all of your DiffKit configuration is programmatic, then you don’t need the Spring jars.
  • The combination of diffkit-<release>.jar + its core dependencies (excludes Groovy, Spring and all of the JDBC drivers) is 2.6MB.

Thursday, December 2, 2010

DiffKit 0.8.5 released -- fixes several minor bugs


release 0.8.5 (12/2/2010)

fixes Issue 52: DiffKit 0.8.4- does not work with Java 1.5 and

fixes Issue 53: mysql unrecognized dbType _MYSQL_INT_UNSIGNED

fixes Issue 54: DiffKit 0.8.4- does not work with SQLServer 2005

Wednesday, December 1, 2010

Activate tracing debugging in DiffKit

By default DiffKit emits a small amount of helpful information on stdout. But DiffKit is capable of producing very fine-grained tracing and debugging information that covers all aspects of its operation. You might find this helpful for diagnosing a problem with DiffKit, or you might simply want a better understanding of how DiffKit works. If you are a Java programmer and interested in learning about the internals of DiffKit, reading the "logs", configured for the appropriate logging level, is a very productive technique. The DiffKit logs are designed to be highly readable and they try to tell a story.
DiffKit uses the open source Logback Java framework to configure and control all logging. Logback is intended as the successor to the ubiquitous log4j. In DiffKit, logback is configured by editing a file named "logback.xml", which is found in the "conf/" directory under the DiffKit home (DiffKit home is usually the directory where you unzipped the distribution zip file; it usually contains the diffkit-app.jar executable).
   <logger name="org.diffkit">
      <level value="warn" />
   </logger>

   <logger name="user" additivity="false">
      <level value="info" />
      <appender-ref ref="USER" />
   </logger>
DiffKit uses two tiers of logging information: "user" and "system". "User" log messages are intended for regular DiffKit users and represent typical operational information that is presented on your console (standard out, or stdout). The standard output you see when you invoke DiffKit comes from the "user" tier. The "system" tier of messages is targeted at engineers who need to trace or debug DiffKit internals in order to better understand what is going on. Each of these tiers has its own "domain" (entry) in the configuration file. The "system" tier is represented by this entry:
   <logger name="org.diffkit">
      <level value="warn" />
   </logger>
DiffKit adheres to the standard logging "level" conventions used by logback and log4j:
  • trace: a crazy level of detail, only suitable for the deepest debugging.
  • debug: high level of detail, useful for debugging, not suitable for day-to-day operations.
  • info: a normal, conversational, level of information. Includes routine operational messages that help orient and inform the user about normal operating parameters and outcomes.
  • warn: something that needs to be looked into. Represents an abnormal operating condition, but not necessarily fatal.
  • error: something is totally broken, and things are probably not working, but there is some slim chance the program can still stagger on.
  • fatal: game over. (This is a log4j level; logback itself has no separate fatal level, and error is its most severe.)
As you can see from the logback.xml configuration file, DiffKit normally logs info level messages in the "user" tier, but only warn (or worse) level messages in the system tier. If you want to see what’s going on at the system level, you can do this:
   <logger name="org.diffkit">
      <level value="info" />
   </logger>
If you want a lot more information, you can do this:
   <logger name="org.diffkit">
      <level value="debug" />
   </logger>

Monday, November 29, 2010

DiffKit 0.8.4 released -- adds support for HyperSQL 2

This release includes fully tested support for the HyperSQL 2 DB (formerly known as HSQLDB). That's the database that is embedded in OpenOffice.