Thursday, November 4, 2010

Big Hairy Batch (BHB)

Periodic batch processing is a common activity in Big Enterprise. "Batch", as it is usually referred to, is now recognized as a distinct enterprise specialty deserving its own tools, techniques and designs. In fact, Spring has introduced within the last few years a framework geared specifically for the unique characteristics of batch: Spring Batch.
The most common characteristics of batch are:
  • They’re periodic. Daily or monthly periods are the most common, because these periods correspond to important business cycles, such as the daily portfolio and instrument repricing that many financial firms live by, or the monthly accounting and balance sheet reconciliation that most regulated companies are required to provide for the public equity markets.
  • They are "automatic" (or at least they are supposed to be). That means they run on a timer (according to their period) rather than in response to user generated events.
  • They typically import several different sources of data, transform those inputs according to a number of pipelined steps, and produce one or more outputs that are usually stored (at least temporarily) in an RDBMS.
  • The outputs can be consumed by the same system that generated them, by external downstream systems, or both.
Batches come in many different shapes and sizes. I believe that the most challenging kind of batch, from an operational and test perspective, is what I call the Big Hairy Batch (BHB). The BHB layers the following traits on top of the standard batch:
  • The inputs come from tens, hundreds, or thousand of different sources, in many different formats.
  • The inputs are heterogeneous; many different kinds of data are sourced, even if one of those kinds is principal.
  • The inputs do not arrive in the desired format. Extensive "reformatting" is necessary to put the inputs into a shape that is amenable to further processing.
  • The inputs are not all expressed according to a common world view. For instance, the same underlying reference value or entity may be referred to differently in different sources. So an extensive "normalization" of the data from different sources is necessary before the different sources can be combined in a single view.
  • The data volumes are large. The principal entity may have tens of millions of instances (rows) and hundreds of attributes (columns).
  • The data is very messy. There are lots of breaks and inconsistencies and duplications, etc.
  • The processing necessary to tranform the inputs to outputs is, conceptually, hugely complicated and complex. Some data must travel through multiple pipelines, and each pipeline can have multiple steps. Some pipelines have dependencies on other pipelines. Some pipelines have dependencies on external callouts (to outside systems). The specific logical functions within a pipeline step might involve very sophisticated or heavy computations.
  • The processing implementations are complicated and difficult to maintain. Usually, the logic and procedures needed to carry out the batch are implemented in a rich melange of diverse technologies. There must be some type of high-level choreography system (programming in the large) to script or drive the entire process; a prominent example would be CA AutoSys. Business logic is spread across the entire application stack: some is in stored procedures and views in a database, some is in middleware in general purpose programming languages (java, c++, python), and some of it is in front-end oriented modules such as MS .NET. Invariably, at the center of the entire batch is a very complex, very overworked Very Large Database (VLDB).
  • The BHB system has very few, if any, automated systematic regression tests in place.
  • BHB has a daily period. Fresh BHB outputs are needed every day in order to carry out business as usual. Each day, the business needs the outputs as close as possible to the beginning of the workday, in order to meet all of their daily business obligations. BHB starts processing as early in the morning as possible, as soon as inputs become available, and struggles to deliver outputs by the beginning of the business day. Any delays due to the time required to process massive volumes, or due to delays from upstream systems, or due to operational failures, will ripple through to the business for the entire day.
  • The BHB team is under pressure to deliver continuous, significant, changes to how the BHB works. These pressures originate from many different sources: changes to business strategy, changes to the systems environment, changes to upstream or downstream systems, etc.
In this environment, the BHB team really has their work cut out for them in trying to maintain the operational integrity of the system while carrying out requested changes. The biggest risk is that changing something will break something that used to work. What are the most productive techniques and tools to guard against this possibility? That’s the subject for a future post.