Monday, December 27, 2010

Generating db patches with DiffKit

In the 0.8.7 release, DiffKit gained the ability to generate database "patches". These patches are analogous to the patch files produced by traditional *nix diff tools-- they can be read by a patching tool in order to edit the RHS so that it is identical to the LHS.
Table 1. DiffKit versus Diffutils
diff tool patch format patch tool
sql DML
any DML applicator
The "patch" files produced by DiffKit contain only INSERT, DELETE, and UPDATE statements. After the user applies those DML statements to the RHS table, using whichever tools and techniques they prefer, the RHS table will have identical contents to the LHS table. The DiffKit application will never directly modify your tables-- DiffKit is strictly a read-only application from the perspective of your table data.
DB patches are created by using a new Sink implementation: the SqlPatchSink. test26.plan.xml, in the eg/ (examples) folder, dmonstrates this:
      <property name="sqlPatchFilePath" value="./test26.sink.patch" />
invoked this way:
java -jar ../diffkit-app.jar -planfiles test26.plan.xml,dbConnectionInfo.xml
produces this output in the patch file:

VALUES ('2', 'xxxx', 2, 'zz2zz');


SET COLUMN2='5555', COLUMN3=4, COLUMN4='zz4zz'

VALUES ('5', 'xxxx', 5, 'zz5zz');


Wednesday, December 15, 2010

Embedding the DiffKit framework in your application

Here is some helpful information for using the DiffKit framework within your Java application.
  • everything you need is inside the binary distribution: diffkit-<release>.zip. In fact, everything you need is within the standalone application: diffkit-app.jar. You do not need the source distribution.
  • unjar diffkit-app.jar. All of the DiffKit api is then in the diffkit- <release>.jar file. That file is (as of this writing) 315KB and you can embed it in your Java application in the same way that you would embed any other jar.
  • all of the diffkit-<release>.jar dependencies are in the lib/ directory that resulted from unjarring diffkit-app.jar. Not all of those jars are hard dependencies-- many of them will only be loaded if you are touching certain functionality. In particular, if you are embedding DiffKit in your application and only programming against the core apis, you do not need to include these jars in your application:
    • groovy-all-<release>.jar — only needed to run the embedded TestCaseRunner.
    • h2-<release>.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the H2 database.
    • db2jcc.jar,db2jcc_license_cu.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the IBM DB2 database.
    • ojdbc14.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the Oracle database.
    • mysql-connector-java-5.1.13-bin.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the MySQL database.
    • jtds-1.2.5.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the SQL Server database.
    • postgresql-9.0-801.jdbc4.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the PostgreSQL database.
    • hsqldb.jar — only needed if you want to use a DKDBSource or DKDBSink that is configured for the HyperSQL database.
    • org.springframework.*.jar — only needed if you want to configure your use of DiffKit via the Spring framework. If all of your DiffKit configuration is programmatic, then you don’t need the Spring jars.
  • The combination of diffkit-<release>.jar + it’s core dependencies (excludes Groovy, Spring and all of the JDBC drivers) is 2.6MB.

Thursday, December 2, 2010

DiffKit 0.8.5 released -- fixes several minor bugs


release 0.8.5 (12/2/2010)

fixes Issue 52: DiffKit 0.8.4- does not work with Java 1.5 and

fixes Issue 53: mysql unrecognized dbType _MYSQL_INT_UNSIGNED

fixes Issue 54: DiffKit 0.8.4- does not work with SQLServer 2005

Wednesday, December 1, 2010

Activate tracing debugging in DiffKit

By default DiffKit emits a small amount of helpful information on stdout. But DiffKit is capable of producing very fine-grained tracing and debugging information that cover all aspect of operation. You might find this helpful for diagnosing some type of problem with DiffKit. Or you might simply want a better understanding of how DiffKit works. If you are a Java programmer and interested in learning about the internals of DiffKit, reading the "logs", configured for the appropriate logging level, is a very productive technique. The DiffKit logs are designed to be highly readable and they try to tell a story.
DiffKit uses the Open Source Logback Java framework to configure and control all logging. Logback is the official successor project to the ubiquitous log4j. In DiffKit, logback is configured by editing a file named "logback.xml", which is found in the "conf/" directory under the DiffKit home (DiffKit home is usually the directory where you unzipped the distribution zip file; it usually contains the diffkit-app.jar executable).
   <logger name="org.diffkit">
      <level value="warn" />

   <logger name="user" additivity="false">
      <level value="info" />
      <appender-ref ref="USER" />
DiffKit uses two tiers of logging information: "user" and "system". "User" log messages are intended for regular DiffKit users and represent typical operational information that is presented on your console (standard out, or stdout). The standard output you see when you invoke DiffKit comes from the "user" tier. The "system" tier of messages is targeted at engineers who need to trace or debug DiffKit internals in order to better understand what is going on. Each of these tiers has it’s own "domain" (entry) in the configuration file. The "system" tier is represented by this entry:
   <logger name="org.diffkit">
      <level value="warn" />
DiffKit adheres to the standard logging "level" conventions used by logback and log4j:
  • trace: a crazy level of detail, only suitable for the deepest debugging.
  • debug: high level of detail, useful for debugging, not suitable for day-to-day operations.
  • info: a normal, conversational, level of information. Includes routine operational messages that help orient and inform the user about normal operating parameters and outcomes.
  • warn: something that needs to be looked into. Represents an abnormal operating condition, but not necessarily fatal.
  • error: something is totally broken, and things are probably not working, but there is some slim chance the program can still stagger on.
  • fatal: game over.
As you can see from the logback.xml configuration file, DiffKit normally logs info level messages in the "user" tier, but only warn (or worse) level messages in the system tier. If you want to see what’s going on at the system level, you can do this:
   <logger name="org.diffkit">
      <level value="info" />
If you want a lot more information, you can do this:
   <logger name="org.diffkit">
      <level value="debug" />

Monday, November 29, 2010

DiffKit 0.8.4 released -- adds support for HyperSQL 2

This release includes fully tested support for the HyperSQL 2 DB
(formerly known as HSQLDB). That's the database that is embedded in
Open Office. 

Tuesday, November 16, 2010

Configuring database connectivity

User feedback has indicated that the DBConnectionInfo configuration element could use some documentation. DBConnectionInfo is particularly inscrutable, compared with other XML plan file elements, because it uses "constructor-arg"s that are referenced by index, instead of some type of descriptive property name. The user sees "<constructor-arg index="0"" and has no idea what that means. So I’ll detail the elements right here:
<bean id="connectionInfo" class="org.diffkit.db.DKDBConnectionInfo">
  <constructor-arg index="0" value="oracle" />  1
  <constructor-arg index="1" value="ORACLE" />  2
  <constructor-arg index="2" value="XE" />      3
  <constructor-arg index="3" value="" />       4
  <constructor-arg index="4" value="1521" />    5
  <constructor-arg index="5" value="diffkit" /> 6
  <constructor-arg index="6" value="diffkit" /> 7
1Your name for this ConnectionInfo element. You can select an abitrary name, and it has no relationship with any real DB construct. This name is merely for presentation-- it will appear in the logs and some types of reports.
2The "Flavor" of connection. Can be any one of: H2, MYSQL, ORACLE, DB2, SQLSERVER, POSTGRES. These correspond to the constants in the Java enum org.diffkit.db.DKDBFlavor.
3The database name. This is what the vendor refers to as the database name.
4The hostname. Can also be an ip address.
5The port.
6The DB username you want to log in as.
7The DB password for username in 6.

For Java programmers

DiffKit is configured using the "core" Spring framework. The above <bean></bean> XML element simply plugs the constructor-args into the longest Constructor of the org.diffkit.db.DKDBConnectionInfo class.

Saturday, November 13, 2010

Friday, November 12, 2010

How to regression test a database application (part 2)

This is part 2 of 2. In part 1 I characterize what is meant by "database application". I also quickly review some different types of regression testing, and then describe a hypothetical retail inventory example system which will serve as the subject of detailed discussion. In part 2, I explore the "final table" type of regression test in detail, and apply it to our retail inventory system. I conclude with an argument for why "final table" testing should always be included on your project.

The challenge

I’m the project manager for the AIR system, which was described in some detail in part 1. The system has been humming along in production for years, and now my developers are working on some significant functional changes to the system which are scheduled to be released shortly. My biggest concern is to not break something that is already working. For that I need some type of regression testing, but I"m unsure what the best regression strategy is. Some, but not all, of my system is already covered by Unit tests, and there are a couple of higher level functional tests that cover a slice. But there is no end-to-end fully integrated systems test.
In an ideal (fantasy) world, I would now instruct my developers to write regression tests, at all levels, covering every aspect of the system. They would write low level unit tests to test every DB function, DB stored procedure, and Java method. They would then write a higher level functional test for each slice. Finally, they would script, or automate, a highest level fully integrated end-to-end systems test.
Figures 1 & 2 (same as in part 1) diagram the system.
Figure 1: System
Figure 2: Forecast slice
Taking the Forecast slice as an example, if I want to competely cover this slice with Unit tests, my developers will need to write several for the getProductList() stored procedure. More than one Unit will be needed in order to exercises all of the possible boundary and error conditions. The test harness will need to somehow populate the underlying tables/views that are read by getProductList(). Just the Unit tests required to provide decent coverage for the getProductList() could take an entire day to write. Next, my Java developers need to provide Units to blanket the forecast() method. Since forecast() calls the getProductList() stored procedure, the forecast() Units will need to mock the stored procedure. Setting up mock stored procedures is itself an exacting and laborious endeavour. Again, providing comprehensive Unit coverage for just forecast() might pin down a Java developer for an entire day. Finally, all of the tables and views touched by this slice will also need their own Unit tests. These Unit tests will ensure that defaults, invariants, and constraints are always observed within the tables and views. If the views implement any business logic, that too must be covered by Unit testing.
All of the above Unit tests are data intensive. So in each case, the test writer must think long and hard about the possible range of data values that the subject function might be exposed to. The test writer will need to think about both expected values for each parameter or data set, and also possible values for each. After all, one of the goals of software testing is to ensure that the test subject continues to operate properly even in the face of unexpected or "flaky" data or paramters.
It’s clear that providing deep Unit test coverage for just the Forecast slice will entail significant effort. 3 developer days (1 for java, 1 for stored procedure, and 1 for tables/views) might be generous, but it’s not an unrealistic estimate, considering all of the different concerns involved. The Forecast slice is just one of many slices that constitute the whole system. In a complex enterprise system, and that includes most Big Enterprise systems, there will be hundreds or thousands of slices.
If my whole system has just 100 slices, it will take 300 developer days to achieve exquisite, comprehensive, rock-solid, Unit test coverage. Current software testing theory dictates that this is exactly the type of Unit testing you should have on a well run project. But as the manager of project AIR, I don’t have 300 developer days to write Unit tests. Purists will object that most of the Unit tests should already have been written before the current iteration, but on project AIR (as on many, if not most, Enterprise systems) they weren’t. That’s the fault of the previous project manager, but he has since moved on, so now it’s my problem. The theory dictates perfect Unit testing, but it’s hard to square that with time and resource constraints:
In theory there is no difference between theory and practice. In practice there is.
Jan L. A. van de Snepscheut
Alternatively, I can abandon the idea of blanket Unit test coverage. Instead, I could go to the other extreme and build just 1, fully integrated, end-to-end, systems test. One way to carry out this system test is to:
Final Table Test
  • capture a complete set of inputs (files, tables, whatever) from the current production (pre-change) system.
  • in an isolated, controlled, test environment, run those inputs through the complete current production (pre-change) system in order to produce outputs. I’ll call the outputs "final tables", because all of the outputs are captured in DB tables and those tables are at the end of a chain (or pipeline) of processing. The outputs from this run are labelled "expected".
  • in an isolated, controlled, test environment, run those same inputs through the complete current in-development (post-change) system in order to produce outputs. Label those outputs "actual".
  • compare the expected final tables with the actual final tables. They should match, except in the areas where there are intended functional changes that are meant to produce changes to the outputs.
Some benefits of this scheme are worth considering carefully. This is the ulimately "black box" test. The testers do not need to understand anything at all about the internals of the system. They only need to know which buttons to press in order to run the system, and which are the "final tables" they need to compare. So this is the type of test that an independent testing group (say QA) should be able to effect. Another interesting aspect of this approach is that of test data selection. The good part is that it relies on current production inputs, exactly the real-world data conditions that the application will experience in practice. Also, simply reusing production data alleviates the burden of the test team needing to carefully consider and synthesize test data. This could be a significant labor savings.
But perhaps the most important aspect of this type of testing is that it effectively tests the system at all levels. Put another way, the final table test can provide much of the test coverage that would be provided by the collection of Units, whereas Units cannot provide any coverage at the functional or system levels. As a result, even if you have a blanket of Unit tests, you still really do need to do some form of functional or systems test. That’s because the interfaces, or boundaries, between functions, modules, and slices have to be exercised. As an example, look at the Forecast Slice. Suppose I write individual Unit tests for getProductList(), forecast(), and the tables and views. I test the forecast() method by mocking up the getProductList(). That’s not sufficient to provide confidence that the slice is production ready. I need at least some functional tests that exercise forecast() making a real call against the getProductList() stored procedure. Otherwise, the Unit tests, in mocking the getProductList(), might make assumptions about the parameters, behaviour, or side effects, that are not consistent with the real (non-mock) getProductList(). Likewise, it’s not enough for me to simply write a functional test that just covers each slice. I need a still higher level test that exercises the interfaces between slices. So the bottom line is that, if you absolutely have to, you can live without Unit tests and rely exclusively on a single, integrated, systems test (final table test). But the converse is not true. Even if you have superbe Unit test coverage, you will still need to perform some integrated systems testing.
Of course, the final table test is hardly bulletproof. Probably the biggest problem with this test is that it is a relative test, wherease the Units are absolute. By that I mean the final table test only tells you that the changed system behaves the same as the reference (current production) system that you compare it against. If the current production system is producing incorrect results, and your changed system passes the final table test, then your changed system will also be producing incorrect results in exactly the same fashion. The Unit tests, on the other hand, are measured against absolute expectations, not relative ones. So Unit tests are better grounded. Also, the final table test only tests aspects of the system that will be exercised by the current production data inputs. It’s possible that there are bugs lurking in the newly developed (changed) system, but that the bugs are not triggered by the current production data inputs. But the production data could change at any time in a way that would then trigger the bug, but in production. One more material drawback of the final table test is that it can only be carried out late in the Software Development Cycle, whereas Unit tests are able to uncover problems very early on in the cycle. That’s important because it’s generally, though not always, more expensive to correct bugs late in the cycle than early in the cycle.
Table 1. Units versus final table test
Units Final Table
bug detection
lowest, function
highest, system
test type
white box
black box
results type
impl. expertise
The key to deriving maximum benefit from the final table test is that the final tables comparison be carried out at the actual field value level. That is, every row+column in the expected final tables must be compared with its corresponding row+column in the actual final tables. Remarkably, very few shops that utilize a fully integrated systems test actually carry out these detailed comparisons. Instead, they rely on gross (summary level) comparisons, and visual inspection of the resulting data sets. That is completely inadequate.Many bugs can, and do, elude gross and visual comparisons.
The conclusion is that you must perform some type of fully integrated systems test, and a very powerful and cost effective option for this is the "final table test". Of course, high quality projects will also have low-level Unit tests and mid-level functional tests. However, if you are in a bind, constrained by very limited developer and testing resources, I would recommend that the first and foremost test you should carry out is the final table test.

Wednesday, November 10, 2010

How to regression test a database application (part 1)

This is part 1 of 2. In part 1 I characterize what is meant by "database application". I also quickly review some different types of regression testing, and then describe a hypothetical retail inventory example system which will serve as the subject of detailed discussion. In part 2, I explore the "final table" type of regression test in detail, and apply it to our retail inventory system. I conclude with an argument for why "final table" testing should always be included on your project.

Database Applications

Here, a database application means a software system that uses an RDBMS as its primary persistence mechanism. In other words, a database application is built on top of, and around, a relational database. The application logic might be implemented within the database itself, in the form of stored procedures and other database objects, or it might be implemented outside the database in a general purpose programming language such as Java or .NET. It’s quite common that Big Enterprise sytems are implemented using a layered mixture of different programming technologies. Another characteristic of database applications is that the application data, in the database, is the primary window through which the application teams (developers and admins) monitor the health of the system and evaluate the correctness of the results.
Regression testing is a well established technique to help ensure the quality of software systems. The intent of regression testing is to assure that a change, such as a bugfix, did not introduce new bugs. At its most basic level a regression test reruns a function that used to work and verifies that it still produces the same results. The basic elements of a regression test are:
  • a function to test. The function must have identifiable inputs and outputs. Function here is meant in the broadest possible sense. It could represent a database stored procedure, a Java method, an RDBMS View, or a high level shell script job that choreographs other jobs. Any software artifact that is callable or invocable.
  • a predetermined, capturable (i.e. saveable), input data.
  • a predetermined, capturable, expected output data for the above input data.
  • an actual output, which results from applying the latest revision of the function to the input data above.
  • a means for comparing the expected output with the actual output to determine if they are the same or if they are different in a way that is expected.
Regression testing can be applied at many different levels. Applied at the lowest, most primitive, level it is called a "Unit Test". Unit testing applies a laser focus on one low level functional unit. Unit tests are very popular amongst the general purpose programming language communities, such as Java and Python. Testing above the Unit level is generally referred to as functional testing. Again, functional testing can ocurr at different levels. It can test software modules that compose the lowest level operational units into higher level functions. Here, I call a high-level functional module that implements one complete business function a slice. At the top of the testing hierarchy is the system test. The system test exercises the fully integrated, end-to-end, system at the highest possible level. It touches all functions and modules. This usually corresponds to the same level at which the users experience the system.

A Concrete Database Application Example

For the sake of better illustrating the concepts and techniques in this article, I’ll invent a hypothetical database application called the "Automated Inventory Replenishment" system — AIR. The AIR system is maintained by a midsized enterprise that is in the retail sales business. Every day, the business sells a portion of its inventory (stock) and must decide how to restock the store shelves. The AIR system is highly streamlined and simplified, in order to allow focus on just the salient points. The enterprise performs the restocking only once per day, after all of the stores have closed. As a result, AIR is run once per day, well after the close of business. AIR is run in a batch mode-- a single invocation of the system performs a complete, daily, restocking cycle.
Figure 1. shows the complete AIR system.
Figure 1: System
There are many different functional modules, or slices, needed to complete the entire cycle. The inputs are: the current level of inventory for each product, pricing and availability for all products we might want to purchase, and a listing of all available unoccupied shelf space. AIR must determine which product orders to place in order to fill all of the unoccupied shelf space. So the outputs are simply a list of product orders (SKUs and quantities). The outputs are stored in a single table in the database, which is referred to as the "final table" because it is the end (terminal) of a chain of tables used during processing. One of the slices, the "Forecast Slice", is exploded in detail in Figure 2:
Figure 2: Forecast slice
The forecast slice is responsible for guessing how much of each product will be needed (or wanted) for the next business week. These forecast results are one of many importantant inputs to the final ordering function. The final ordering determination has to include not just forecast information, but also price and availability. The forecast slice is implemented as a Java function (method), a DB stored procedure, and a collection of tables and views. Java was chosen for the detailed forecast business logic, because the calculations and flow control are very complicated and better suited for a general purpose programming language than for an embedded database procedural language (e.g. pl/sql). The Java method calls the stored procedure getProductList(), which returns a complete list of all products that we might want to forecast for. The Java forecast method writes all of the forecasts (results) to a table.
Suppose I’m the project manager for the AIR system, and it has been in production for several years. Let’s also suppose that new functional and operational requirements are fed to team AIR on a regular basis. My team is working on changing the system in order to provide new business functionality. But before I can release these changes into the production environment, I have to be confident that we are not breaking something that already works. Let’s suppose that the current level of operational integrity of AIR is fair; then, even minor bugs introduced as a result of change could very well cause noticeable degradation of operational integrity for AIR, and appreciable degradation of career prospects for staff. Clearly, some form of regression testing is called for. But it’s not at all clear which tests will give the highest level of confidence for the least amount of work. Say that some of the slices are covered by both unit and higher level functional regression tests. But, as in most real-world custom-built enterprise applications, most slices are not covered by rock-solid automated regression. For some slices, there are no regression tests at all. For others, there are some unit and functional tests that, history has shown, catch some, but not all, of the bugs that typically creep in as a result of system change.
As project manager, I now have to decide what mix of regression tests will provide the biggest bang-for-the-buck. In part 2 I dive into the details of the costs and benefits associated with the different types of regression tests.

Thursday, November 4, 2010

Big Hairy Batch (BHB)

Periodic batch processing is a common activity in Big Enterprise. "Batch", as it is usually referred to, is now recognized as a distinct enterprise specialty deserving its own tools, techniques and designs. In fact, Spring has introduced within the last few years a framework geared specifically for the unique characteristics of batch: Spring Batch.
The most common characteristics of batch are:
  • They’re periodic. Daily or monthly periods are the most common, because these periods correspond to important business cycles, such as the daily portfolio and instrument repricing that many financial firms live by, or the monthly accounting and balance sheet reconciliation that most regulated companies are required to provide for the public equity markets.
  • They are "automatic" (or at least they are supposed to be). That means they run on a timer (according to their period) rather than in response to user generated events.
  • They typically import several different sources of data, transform those inputs according to a number of pipelined steps, and produce one or more outputs that are usually stored (at least temporarily) in an RDBMS.
  • The outputs can be consumed by the same system that generated them, by external downstream systems, or both.
Batches come in many different shapes and sizes. I believe that the most challenging kind of batch, from an operational and test perspective, is what I call the Big Hairy Batch (BHB). The BHB layers the following traits on top of the standard batch:
  • The inputs come from tens, hundreds, or thousand of different sources, in many different formats.
  • The inputs are heterogeneous; many different kinds of data are sourced, even if one of those kinds is principal.
  • The inputs do not arrive in the desired format. Extensive "reformatting" is necessary to put the inputs into a shape that is amenable to further processing.
  • The inputs are not all expressed according to a common world view. For instance, the same underlying reference value or entity may be referred to differently in different sources. So an extensive "normalization" of the data from different sources is necessary before the different sources can be combined in a single view.
  • The data volumes are large. The principal entity may have tens of millions of instances (rows) and hundreds of attributes (columns).
  • The data is very messy. There are lots of breaks and inconsistencies and duplications, etc.
  • The processing necessary to tranform the inputs to outputs is, conceptually, hugely complicated and complex. Some data must travel through multiple pipelines, and each pipeline can have multiple steps. Some pipelines have dependencies on other pipelines. Some pipelines have dependencies on external callouts (to outside systems). The specific logical functions within a pipeline step might involve very sophisticated or heavy computations.
  • The processing implementations are complicated and difficult to maintain. Usually, the logic and procedures needed to carry out the batch are implemented in a rich melange of diverse technologies. There must be some type of high-level choreography system (programming in the large) to script or drive the entire process; a prominent example would be CA AutoSys. Business logic is spread across the entire application stack: some is in stored procedures and views in a database, some is in middleware in general purpose programming languages (java, c++, python), and some of it is in front-end oriented modules such as MS .NET. Invariably, at the center of the entire batch is a very complex, very overworked Very Large Database (VLDB).
  • The BHB system has very few, if any, automated systematic regression tests in place.
  • BHB has a daily period. Fresh BHB outputs are needed every day in order to carry out business as usual. Each day, the business needs the outputs as close as possible to the beginning of the workday, in order to meet all of their daily business obligations. BHB starts processing as early in the morning as possible, as soon as inputs become available, and struggles to deliver outputs by the beginning of the business day. Any delays due to the time required to process massive volumes, or due to delays from upstream systems, or due to operational failures, will ripple through to the business for the entire day.
  • The BHB team is under pressure to deliver continuous, significant, changes to how the BHB works. These pressures originate from many different sources: changes to business strategy, changes to the systems environment, changes to upstream or downstream systems, etc.
In this environment, the BHB team really has their work cut out for them in trying to maintain the operational integrity of the system while carrying out requested changes. The biggest risk is that changing something will break something that used to work. What are the most productive techniques and tools to guard against this possibility? That’s the subject for a future post.

Tuesday, November 2, 2010

What is enterprise?

A large segment of the IT industry focuses on the enterprise: enterprise software, enterprise solutions, enterprise architecture, etc. But what is "enterprise"? According to Wikipedia:
There is no single, widely accepted list of enterprise software characteristics, …
In that case, I’ll take a shot at it ;). However, my list of enterprise traits is not an attempt at a universal definition. Instead, I’m relating my experiences in "Big Enterprise", based on tours in half a dozen fortune 500 companies.
  • Big Enterprise has Big Data: many databases, many database vendor products, many many tables, huge volume, and heterogeneous types of data.
  • Big Enterprise has Messy Data. A huge number of different actors have operated on the data continuously over a decade or more. Under those circumstances, it’s impossible to keep your data perfectly rationalized. Instead, there tend to be a lot of inconsistencies, breaks, gaps, reconciliation failures, and dupcliations in the data. Frequently, it’s not possible to impose clean relational constraints.
  • Big Enterprise has messy data processes. It has accumulated many different processes to originate, deliver, and transform its data. It has connected these processes together using the cement of different programming/scripting languages and vendor tools. It has globalized, outsourced, insourced, off-shored and re-shored. It has built layer upon layer of intertwingled processing logic in a kind of palimpsest.
  • Big Enterprise has incomprehensible data and processes. Realistically, complex software and data is prohibitively expensive to document properly. Creating documentation that accurately and fully describes a complex system, to all levels of detail, in a manner that naive readers would be able to make sense of, is much more difficult than the enterprise pretends. And once you consider the fact that the software and data is constantly evolving, it’s effectively (within normal enterprise IT budgets) impossible. Instead, the knowledge of the system is captured in the brains of the staff. But staff leaves, and so some knowledge about the systems is lost. It can be (and sometimes is) reacquired, but only at a substantial cost and on a as-needed basis.
  • Big enterprise is organizationally rigid and compartmentalized. In the big enterprise, there is a lot of management hierarchy. Also, big enterprise formally separates roles and responsibilities in a way that doesn’t always align with the needs of the IT staff. For instance, most big enterprises will formally split the functions of development (changeing the IT artifacts) from the functions associated with validating that the changes are correct (usually referred to as QA). Often, the QA group does not have the skills necessary to carry out the sophisticated validations that are necessary for a rock solid quality assurance against a very complex system, especially if the system covers a domain that requires specialized business expertise to understand. It’s not uncommon that the development teams are disinclined to work on the QA aspect of the system, since it is not formally part of their job and thus amounts to unpaid work. Project managers are usually well aware of the shortcomings with this arrangement but lack the political or corporate power to overstep the existing managerial boundaries.
  • Big enterprise is a stressful, chaotic, operating environment for the people who are tasked with maintaining the data and the processes that operate on it. The forces of Big Data, Messy Data, messy data processes, incomprehensibility, staff turnover, bureaucracy, and competitive pressures on the corporation conspire to keep the staff under-the-gun.
This environment is going to dictate which tools and processes are most successful. As the project managers, team leads, and architects in the Big Enterprise confront the challenges above, they must select the tools and technologies that will give them the biggest return-on-investment within their planning horizon. A normal planning horizon for a Big Enterprise project manager or team lead is a couple of weeks to a couple of months. Anything beyond that is "strategic" and highly risky (and therefore implausible). A tool or technique that requires substantial operational or cultural change in the enterprise IT environment, rather than adapting to it, is unlikely to succeed. In other words, agile is as agile does.

Saturday, October 30, 2010

How to Regression Test a Relational Database

Scott Ambler, of The Elements of Java Style fame, has a nice article on database regression testing: How to Regression Test a Relational Database. Scott is a leading evangelist in the small, but growing, movement for regression testing databases.

The emphasis of the article is on low-level, unit style, regression tests. The idea is that database developers, those developing within the database (e.g. stored procedures) as well as against the database (applications reading and persisting data), can identify narrowly focused functional "slices" to configure for automated regression testing. In most cases, this involves first installing data as part of the test setup, then triggering the functionality that is the subject of the test, and finally comparing the actual results against the expected results.

The article makes a number of important points that are particularly relevant to enterprise software development:
  • There's a lot of specialization amongst the groups. Typically, one team is responsible for developing the software, while a totally different team is tasked with testing and Quality Assurance.
  • Frequently, the QA specialists do not have the necessary knowledge or skills to sensibly test the application and its data. But it's their job, so they have the inclination to at least try.
  • Frequently, the development team has the knowledge, but often not the skills, to test the application and the data. But it's not their job so they a lot of times they are not really inclined to think too hard about the problem, which is why they never developed the skills.
  • Therefore, most enterprise shops do not really do database testing right now.
I think most software professionals who have been at it for a while would agree that unit testing, whether application code or database artifacts, is a very productive technique. The idea that it's much less expensive to find and fix flaws early in the SDLC, rather than in the final stages of full integrated or end-user testing, is a well established wisdom. However, it seems to me that Scott gave short shrift to high-level, fully-integrated, functional testing. In my experience, it's not possible to have a blanket of low-level Unit tests that supply perfect test coverage with respect to the final functionality. Even with the best possible Units, in a complex application there are so many high level interactions amongst all the discrete components or modules, many of those interactions with subtle and in some cases unintended couplings and side-effects, that high-level end-stage functional testing is also necessary. In other words, the best defense is a multi-layered "defence in depth".

Wednesday, October 6, 2010

0.7.0 adds proper documentation

DiffKit now has decent README, Quick Start, FAQ, User Guide, and Compatibility Matrix.

Monday, September 20, 2010

0.6.12 adds some documentation, fixes minor issues

ID Type Status Priority Summary AllLabels
29 Defect Fixed Medium vhl summary reports wrong number of rows diff'd Type-Defect, Priority-Medium, OpSys-All
30 Defect Fixed Critical TestCase 23 fails on linux Type-Defect, Priority-Critical, OpSys-All
31 Enhancement Fixed High create README file Type-Enhancement, Priority-High, OpSys-All
32 Enhancement Fixed High create QuickStart document Type-Enhancement, Priority-High, OpSys-All
34 Defect Duplicate Medium create User Guide Type-Defect, Priority-Medium
35 Enhancement Fixed High create faq Type-Enhancement, Priority-High, OpSys-All

Thursday, September 16, 2010

TextDiffor in 0.6.11

0.6.11 introduces the TextDiffor, which is useful for diff'ng chunks of text that might have small formatting differences that you would like to ignore. A good example of this is programming language code. For instance, I was recently trying to diff SQL schemas (DDL) using meta data tables. A troublesome schema object to diff was the TEXT definition of stored procedures. One side looked like this:

  number1 NUMBER(2);  
  number2 NUMBER(2)  := 17;       -- value default   
  text1  VARCHAR2(12) := 'Hello world';  
  text2  DATE     := SYSDATE;    -- current date and time  
  SELECT street_number  
   INTO number1  
   FROM address  
   WHERE name = 'INU';  

while the other side looked like this:

  number1  NUMBER(2);  
  number2  NUMBER(2)  := 17;       -- value default   
  text1    VARCHAR2(12) := 'Hello world';  
  text2    DATE     := SYSDATE;    -- current date and time  
  SELECT street_number  INTO number1 FROM address WHERE name = 'INU';  

Note that the lines of the text1 and text2 variable declarations have small alignment differences between the two sides, and that the SQL SELECT statement is multiline in the first case, but only 1 line in the second case. These two snippets are identical PL SQL programmings (produce the same AST), but are different textually.

The TextDiffor will, by default, see these two snippets as identical. It uses a very simple text normalization before performing the String comparison.

1) replace all tabs and newlines ([\t\r\n]) with a single space character
2) compress all multi-character whitespace runs to a single space character
3) trim all whitespace from both ends

Saturday, September 11, 2010

Results summarization from MagicPlan

Release 0.6.10 introduces a new, optional, capacity to the MagicPlan. Previously, the file (report) produced by the MagicPlan only displayed individually itemized diffs. That is, each diff appeared, only once, in the output as a detailed description of just that discrete diff. 0.6.10 allows you to instruct the MagicPlan to produce aggregate level summary information as well as the detailed individual diffs.

If you include this:
           <property name="withSummary" value="TRUE" />  
as a property of the MagicPlan, the output file will have a header that looks like this:
 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  

Above is the output from TestCase 23, which provides functional test coverage for the results summarization feature. The input data that produced this report:
lhs:                                      rhs:
column1,column2,column3,column4           column1,column2,column3,column4
----------------------------              1,      0000,   x,      aaaa
2,      1111,   x,      aaaa              ----------------------------
3,      2222,   y,      aaaa              3,      2222,   x,      aaaa
4,      0000,   z,      bbbb              4,      3333,   x,      aaaa
5,      4444,   z,      bbbb              5,      4444,   x,      aaaa
6,      5555,   u,      aaaa              6,      5555,   x,      aaaa
7,      0000,   v,      aaaa              ----------------------------
8,      1111,   x,      aaaa              ----------------------------
Note well that the primary key on both the lhs and rhs tables is column1. So DK will use column1 as the diff'ng key, to align the rows.

Dissecting this report, section by section; first, there is the Very High Level (vhl) summary:
 --- vhl summary ---  
 diff'd 8 rows in 0:00:38.011, found:  
 !4 row diffs  
 @7 column diffs  
The first line tells us how many rows were diff'd and how long it took. In this case 8 unique rows were evaluated for diffs. If a row occurs on only one side (is a ROW_DIFF), it counts as 1 row diff'd. In the case where DK is able to match the lhs row with a rhs row, that counts as 1 row diff'd, not 2. So the 8 rows that were diff'd are: the 1 row that appears only on the rhs (1), the 3 rows that appear only on the lhs (2,7,8), and the 4 rows that appear on both sides (3,4,5,6).

0:00:38.011 is an ISO 8601 formatted time specification. It represents 0 hours, 0 minutes, 38 seconds, and 11 milliseconds. The next line, "!4 row diffs", starts with the ! mark, which is the symbol for ROW_DIFF in both the summary and detail sections of the report. The 4 row diffs are the rows of dashed lines in the tables above: 1 on the lhs, and 3 on the rhs. The terminology that DK uses is: "there are 4 rows missing". The final line of the vhl summary, "@7 column diffs", shows that there are a total of 7 individual column (or cell) value diffs. The @ sign is the symbol for COLUMN_DIFF in both the summary and detail sections of the report. The 7 column diffs are: row 3 column3, row 4 column2, row 4 column3, row 4 column4, row 5 column 3, row 5 column4, row 6 column3.

The next section is the row diff summary:
 --- row diff summary ---  
 1 row diffs <  
 3 row diffs >  
This breaks down the row diffs according to which side they occur on. The line, "1 row diffs <", tells us that there is 1 row missing from the lhs: row 1. The next line states that there are 3 rows missing from the rhs: row 2, row 7, and row 8.

Next is the column diff summary section:
 --- column diff summary ---  
 columns having diffs->(column3, column4, column2)  
 column3 has 4 diffs  
 column4 has 2 diffs  
 column2 has 1 diffs  
This is a very straightforward grouping of the COLUMN_DIFFs, grouped according to which column the diff occurs in. column3 has 4 diffs: row 3, row 4, row 5, and row 6. column4 has 2 diffs: row 4, and row 5. column2 has 1 diff: row 4.

Finally, the column diffs clustered section:
 --- column diffs clustered ---  
 columnClusters having diffs->(column3, column2.column3.column4, column3.column4)  
 column3 has 2 diffs  
 column2.column3.column4 has 1 diffs  
 column3.column4 has 1 diffs  
This groups the COLUMN_DIFF columns according to which row the diffs occur in. "Cluster" is another name for "pattern of column names having diffs all in the same row". The first line tells us that there are 3 clusters, and which columns participate in each cluster. The column3 cluster has 2 diffs. That is, there are two rows where the only COLUMN_DIFFs are in column3: row 3 and row 6. The column2.column3.column4 cluster has 1 diff: row 4. Finally, the column3.column4 cluster has 1 diff: row 5. Column diff clusters are useful for spotting patterns of linked or related column diffs, which can be helpful in understanding the origin of diffs.

Friday, September 10, 2010

diff'ng CLOBs

0.6.10 introduced a new default behavior for CLOB diff'ng. CLOBs usually represent formatted text. When diff'ng formatted text, users typically would like certain incidental aspects of the formatting to be ignored. So by default, CLOBs are now diff'd in a way that is insensitive to both *nix and Windows newlines (\n and \r, in any combination).

0.6.10 fixes the following issues











FileSink should be able to produce summaries

Type-Enhancement, Priority-Critical, OpSys-All





displayColumnNames should be validated

Type-Defect, Priority-High, OpSys-All





extend TestCaseRunner to test for failures, exceptions

Type-Enhancement, Priority-High, OpSys-All





add cluster information to column diff summary

Type-Defect, Priority-Medium





add group by (column list) option to Sink summary

Type-Enhancement, Priority-High, OpSys-All





Replace newline characters with spaces in clobs

Type-Defect, Priority-High

Wednesday, September 1, 2010

0.6.9 fixes following issues

ID Type Priority Summary AllLabels
13 Enhancement High ant build target to execute JUnit tests Type-Enhancement, Priority-High, OpSys-All
14 Defect High ant build target to execute TestCases Type-Defect, Priority-High, OpSys-All
15 Enhancement Medium add elapsed diff time to user output from standalone app Type-Enhancement, Priority-Medium, OpSys-All
16 Defect Medium add diff progress indicator to output from standalone app Type-Defect, Priority-Medium, OpSys-All
17 Defect Critical MagicPlan does not accept diffKind parameter Type-Defect, Priority-Critical, OpSys-All
18 Defect Critical maxDiffs property does not work in MagicPlan Type-Defect, Priority-Critical, OpSys-All

Friday, August 27, 2010

0.6.6 allows composing plans from multiple files

It's fairly common to have information needed for a plan that you would like to reuse across multiple plans. For instance, if you are diff'ng DB sources, it would be nice to isolate the DB connection information to one file that could be reused across different comparisons.

0.6.6 now allows this via the -planfiles argument. -planfiles can take as a value either a single file name or multiple comma separated file names. It then composites the files to create a single plan. Here's an example:

 joe$ java -jar diffkit-app.jar -planfiles test3.plan.xml,sources.xml  

file: test3.plan.xml
      <bean id="plan" class="org.diffkit.diff.conf.DKPassthroughPlan">  
           <property name="lhsSource" ref="lhs.source" />  
           <property name="rhsSource" ref="rhs.source" />  
           <property name="sink" ref="sink" />  
           <property name="tableComparison" ref="table.comparison" />  

file: sources.xml
      <bean id="rhs.source" class="org.diffkit.diff.sns.DKFileSource">  
           <constructor-arg index="0"  
                value="./test3.rhs.csv" />  
           <constructor-arg index="1" ref="lhs.table.model" />  
           <constructor-arg index="2">  
                <null />  
              <constructor-arg index="3">  
                <null />  
              <constructor-arg index="4" value="\," />  
              <constructor-arg index="5" value="true" />  
              <constructor-arg index="6" value="true" />  

Note that the bean id="rhs.source" is nowhere defined in the file test3.plan.xml. It's only defined in the file sources.xml

much improved command line interface in 0.6.6

0.6.6 was just released. By default, all of the logback INFO level spam is gone from the standard output. Instead, there is now a minimalist output that includes a simple, high-level, diff report. e.g.

 joe$ java -jar diffkit-app.jar -planfiles test9.plan.xml  
 diff'd 9 rows, found:  
 !4 row diffs  
 @2 column diffs  

It's still possible to get the logback spam. You need to edit conf/logback.xml to change the  root logger level to info (instead of warn), and then specify the command line param to force diffkit-app.jar to use that logback.xml conf file.

Thursday, August 26, 2010

0.6.5 released -- add support for most SQL data types


Saturday, August 21, 2010

0.6.2 released-- DiffKit now diffs across different databases

LHS table can be in one DB, while RHS table can be in another. TestCase 18 demos the feature.

Friday, August 20, 2010

DB2 -- more trouble

Turns out that the version of the jdbc driver we are using has a different behavior than all other jdbc drivers. The javadoc indicates that you should be able to make repeated calls to and simply get a null when the end of the RS has been reached. But the db2 driver blows up after the first call to an exhausted (consumed) RS:

Caused by: [ibm][db2][jcc][10120][10898] Invalid
 operation: result set is closed.
        at [diffkit-app.jar:na]
        at [diffkit-app.jar:na]
        at [diffkit-app.jar:na]
        at org.diffkit.diff.sns.DKDBSource.getNextRow( [diff

According to this, there is a new db2 jdbc driver that cures this behavior, and as a bonus doesn't require a license file!

db2 9.5

This is part of the 9.5 release, so I'm not sure it will be compatible with lower versions, but seems worth a try.

DiffKit now builds and executes under Java 1.5

The 0.5.3 release builds and passes all TCs under JDK/JRE 1.5 on Windoze.

Thursday, August 19, 2010

0.5.1 released

There is now a source and a bin distribution. The standalone executable application successfully executes the TestCases on Windoze (at least under JVM 1.6).

Monday, August 16, 2010

Standalone application now runs the TestCases

Previously, TestCases could only be run through the DiffKit project. You needed to download the source code, have groovy installed, and then figure out how to invoke the TestCaseRunner.

Now, the standalone executable diffkit-app.jar carries the whole TestCase suite with it. All TestCases can be executed in place simply:

java -jar diffkit-app.jar -test

So now it's very easy for us to collect end-user supplied validation of the kit from different environments.

Saturday, August 14, 2010

JarClassLoader -- it's not Groovy related

Further experimentation turned up that the problem does not stem from an interaction between JarClassLoader and Groovy. The same effects can manifest with JarClassLoader even if the target class is Java.

Simply put, Class.getPackage() does not always work with non-default ClassLoaders. But if you only need the package name, ClassUtils.getPackageName() does. Here's a link on the problem:

JarClassLoader & Groovy -- side effects

It appears that when you ask JarClassLoader to load compiled Groovy classes (compiled with Ant groovyc), the resulting classes do not behave the same way as regular java classes. In particular, the Groovy classes don't know what package they belong to: 
 println MyClass.class.getPackage() 
 println MyClass.class.getPackage() 

The Java class gives the correct answer whether it is loaded by ClasspathClassLoader or by JarClassLoader. However, the Groovy class only works when called by ClasspathClassLoader, not by JarClassLoader.

Wednesday, August 11, 2010

application now processes command line using Apache commons CLI

Entry point to application is now conf/DKApplication:

   private static final String VERSION_OPTION_KEY = "version";  
   private static final String HELP_OPTION_KEY = "help";  
   private static final String TEST_OPTION_KEY = "test";  
   private static final String PLAN_FILE_OPTION_KEY = "planfile";  
   private static final Options OPTIONS = new Options();  
   static {  
    OPTIONS.addOption(new Option(VERSION_OPTION_KEY,  
      "print the version information and exit"));  
    OPTIONS.addOption(new Option(HELP_OPTION_KEY, "print this message"));  
    OPTIONS.addOption(new Option(TEST_OPTION_KEY, "run embedded TestCase suite"));  
   public static void main(String[] args_) {  
    LOG.debug("args_->{}", Arrays.toString(args_));  
    try {  
      CommandLineParser parser = new PosixParser();  
      CommandLine line = parser.parse(OPTIONS, args_);  
      if (line.hasOption(VERSION_OPTION_KEY))  
      else if (line.hasOption(HELP_OPTION_KEY))  
      else if (line.hasOption(TEST_OPTION_KEY))  

commons CLI is very easy to use.

Tuesday, August 10, 2010

controlling logback configuration in standalone app

Application end-users must have the ability to change the logback logging level applied to all of the DK code, without having to explode the application jar.

Logback looks for a config file named logback.xml on the classpath. So first we copy a new logback.xml file into cwd and try setting the classpath on the command line:

java -cp . -jar diffkit-app.jar

No joy.

But this works: 

java -Dlogback.configurationFile=./logback.xml -jar diffkit-app.jar
Simple enough.

Also, logback can find the logback.xml file embedded in the application jar if the System property is defined this way:

java -Dlogback.configurationFile=conf/logback.xml -jar diffkit-app.jar

In this case, logback treats the property value as a classpath resource specification.

Saturday, August 7, 2010

First download released on google code

0.5.0 is now available:

Download the zip and unzip. diffkit-app.jar is a completely self-contained executable jar. In order to run it, you only need have java 1.6 installed on your system (might even work with java 1.5, but not yet tested). Here's an example invocation:

java -jar diffkit-app.jar example.plan.xml

=== example.plan.xml ===

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns=""

<bean id="plan" class="org.diffkit.diff.conf.DKMagicPlan">
<property name="lhsFilePath" value="./example.lhs.csv" />
<property name="rhsFilePath" value="./example.rhs.csv" />
<property name="keyColumnNames">
<property name="sinkFilePath" value="./example.sink.diff" />


Hat's off to JDotSoft

Integrated JarClassLoader from JDotSoft. Only took me about an half hour to figure the whole thing out. For my tastes, much simpler and more embeddable than One-JAR.

Now the dist ant target builds a standalone, executable jar-- diffkit-app.jar. For embedded applications, users simply explode the diffkitapp.jar, the diffkit library jar is in the root, and all dependent jars in lib/.

One-JAR 0.97

One-JAR works exactly as advertised. Unfortunately, it's a bit of a pain to work with. The fundamental problem is that assumes a particular world view/structure. For instance, the jar that is produced from the target project source is named 'main.jar' in the output, and stored in main/main.jar. That's not ideal for DiffKit distribution purposes, since I really need the DiffKit jar to carry version information in its name. Documentation for One-JAR is sparse. I can't find any reference in the documentation regarding how I can change the name of the main.jar.

Verdict; One-JAR is a nifty product that just works. But it's not very malleable, which can be a problem if your desired output does not have exactly the structure envisioned by the One-JAR author.

I'm going to look for a simpler alternative that is easier to bend to my purposes.

Here's one place to start

Thursday, August 5, 2010

packaging DiffKit for distribution

Right now, DiffKit is source code only. We need some way of packaging a binary distribution to satisfy two audiences:

  1. Users who run DiffKit as a standalone application.
  2. Users who access DiffKit programmatically and need to embed a DiffKit jar in their own application.
Remarkably, there are no standard Java mechanisms for this in Java 1.6. JSR 277 seems to have gone comatose. Apparently Java 1.7 introduces some kind of modularized packaging implementation: java module system in java 7; but 1.7 is not in sight.

After reading about OSGi for about an hour, I fatigued and decided on something simple and immediate for the short run. One-JAR™ uses a custom class loader to allow references to jars that are embedded in other jars. So the plan is to create an executable binary (jar) using One-JAR for the application users, and embedded developers can unjar the One-JAR to access the diffkit.jar and any of its dependent jars that they might require.