Sunday, January 16, 2011

Plugging 3rd party jars into DiffKit

With the 0.8.10 release, you can add third-party jars to DiffKit. DiffKit will recognize JDBC drivers found in those jars, as well as custom Sources, Sinks, and Diffors.
The DiffKit distribution includes a directory named "dropin". It sits directly under the DIFFKIT_HOME directory, which is normally the directory that contains diffkit-app.jar after you have unzipped the binary download. In the distribution, dropin is empty. Simply copy any extra jars that you want DiffKit to recognize into the dropin directory. DiffKit will automatically load all jars within dropin each time it runs. Note well that the only archive type that DiffKit recognizes is jar; it does not recognize zip, war, ear, etc.
All jars within dropin are loaded before any of the third-party jars that are embedded within DiffKit. That means you can override JDBC drivers that are part of the DiffKit binary distribution. You might need to do this in the case of HyperSQL DB, where the version of the JDBC driver must match the version of the DB server. You might also want to drop in a JDBC driver when the vendor supplies a newer version than the one that DiffKit embeds and that newer version performs better.
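To make the ordering concrete, here is a minimal sketch of a "dropin first" search path. It is not DiffKit's actual loading code, which isn't shown in this post; the class name DropinClasspathSketch, the method dropinFirst, and the jar file names are all hypothetical and exist only for illustration. It keeps only files ending in .jar and places them ahead of the embedded jars, so a loader that searches the list in order would find the dropin copy of a class first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; this is not DiffKit source code.
public class DropinClasspathSketch {

    // Returns the .jar files found directly inside dropinDir (sorted by path),
    // followed by the embedded jars in the order given.
    public static List<Path> dropinFirst(Path dropinDir, List<Path> embeddedJars)
            throws IOException {
        List<Path> ordered = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dropinDir)) {
            ordered.addAll(entries
                    .filter(Files::isRegularFile)
                    // Only jar archives count; zip, war, ear, etc. are skipped.
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList()));
        }
        ordered.addAll(embeddedJars);
        return ordered;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway dropin directory with one jar and one zip.
        Path dropin = Files.createTempDirectory("dropin");
        Files.createFile(dropin.resolve("vendor-driver.jar"));
        Files.createFile(dropin.resolve("notes.zip"));

        List<Path> embedded =
                Collections.singletonList(Paths.get("lib", "embedded-driver.jar"));

        // The dropin jar is listed first; the zip file does not appear at all.
        for (Path p : dropinFirst(dropin, embedded)) {
            System.out.println(p.getFileName());
        }
    }
}
```

A list like this could be handed to java.net.URLClassLoader, which searches its URLs in the order they are given.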
The principal use for dropins is to allow the user to write custom Source/Sink/Diffor Java classes, and then plug those custom implementations into DiffKit. Source, Sink, and Diffor are documented in the User Guide. Those three key abstractions are represented as Java interfaces within DiffKit. You can write your own Java classes that implement those interfaces. You will need to use a PassthroughPlan to reference the custom classes, because MagicPlan does not recognize them. Custom classes are referenced within the PassthroughPlan in exactly the same way that DiffKit native classes are.
As a trivial example of a custom Sink, the DiffKit source distribution includes class org.diffkit.contrib.DKSimpleCustomSink. org.diffkit.contrib is a package that is not included in the binary distribution. You can plug that custom Sink into your plan with, for instance, this specification:
...
<bean id="plan" class="org.diffkit.diff.conf.DKPassthroughPlan">
   <property name="lhsSource" ref="lhs.source" />
   <property name="rhsSource" ref="rhs.source" />
   <property name="sink" ref="customSink" />
   <property name="tableComparison" ref="table.comparison" />
</bean>
...
<bean id="customSink" class="org.diffkit.contrib.DKSimpleCustomSink">
   <constructor-arg index="0" value="./custom_report.diff" />
</bean>
...
Having DiffKit recognize your custom Sink is simply a matter of compiling and archiving org.diffkit.contrib.DKSimpleCustomSink into a jar file, and then dropping that jar file into the dropin directory.

Saturday, January 15, 2011

Comparing file tables with DiffKit

In order for DiffKit to diff two Tables, a Table simply being a set of rows, it must know how to align the rows from the left side Table with those on the right. It does that using a key: one or more columns. Before the row sets can be diff’d, they must be sorted on that key. If you are using a DB Source, DiffKit will sort the Tables for you and you don’t need to do anything. If you are using a File Source, DiffKit will not sort the files for you. In the future, I plan to modify DiffKit to do the sorting for you, but in the meantime you must sort the files yourself.
When you perform the sort, you need to ensure that you are sorting using the same comparison function that DiffKit will use internally to compare rows. That’s because comparing rows internally is how DiffKit figures out ROW_DIFFs.
If you are using a MagicPlan to diff File Sources, DiffKit has no data type information about the columns; MagicPlan doesn’t allow it. Because the input is just a text file, DiffKit has to assume that every column is of data type String, so it uses a lexical (String) comparison internally to compare rows. You must therefore also use a lexical sort when you sort the file. I believe that the default comparison for Unix sort is lexical, although sort follows the current locale’s collation rules; running it with LC_ALL=C forces a plain byte-order comparison.
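If you would rather do the lexical sort in Java than with Unix sort, something like the following sketch could work. It rests on several assumptions that are mine, not DiffKit's: the file is comma-delimited, the first line is a header, the key is a single column, and DiffKit's String comparison agrees with Java's String.compareTo. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of DiffKit.
public class LexicalFileSortSketch {

    // Sorts the data rows by the String value of one key column,
    // leaving the header line (assumed to be the first line) in place.
    public static List<String> sortRowsByKey(List<String> lines, int keyColumn) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        List<String> rows = new ArrayList<>(lines.subList(1, lines.size()));
        // Assumption: plain comma-delimited rows with no quoted commas.
        rows.sort(Comparator.comparing((String row) -> row.split(",", -1)[keyColumn]));
        out.add(lines.get(0));
        out.addAll(rows);
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "b,beta", "a,alpha", "c,gamma");
        for (String line : sortRowsByKey(lines, 0)) {
            System.out.println(line);
        }
    }
}
```

For a real file, the lines could be read with java.nio.file.Files.readAllLines and written back with Files.write.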
If you are using a PassthroughPlan to diff File Sources, you need to tell DiffKit the data type of each column. If in the PassthroughPlan you have told DiffKit that the key column is type String, then DiffKit behaves exactly as in the case of the MagicPlan, and you must use a lexical sort on the file. However, if in the PassthroughPlan you tell DiffKit that the key has a numeric data type, then DiffKit will internally use a numeric comparison on the rows, and you must sort the file using a numeric comparison.
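To see why the two kinds of sort are not interchangeable, here is a small standalone Java illustration. It does not use any DiffKit classes; the class name and the sample keys are made up. The same four keys come out in a different order depending on whether they are compared as Strings or as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Standalone illustration; not DiffKit code.
public class KeySortOrderDemo {

    // Orders the keys the way a lexical (String) comparison does.
    public static List<String> sortAsStrings(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Orders the same keys by their numeric value.
    public static List<String> sortAsNumbers(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingLong((String k) -> Long.parseLong(k)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "2", "100");
        System.out.println(sortAsStrings(keys)); // prints [10, 100, 2, 9]
        System.out.println(sortAsNumbers(keys)); // prints [2, 9, 10, 100]
    }
}
```

A file sorted one way but compared the other way would have its rows misaligned, which is the mismatch to avoid.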
Bottom line:
DiffKit internal comparison ==(must equal)== comparison used to sort File. DiffKit internal comparison is based on data type(s) of the key.
MagicPlan always results in String data type(s) for the key. PassthroughPlan results in whatever column data types you specify for the key.