Frequently, when a system is stood up and used, the initial configuration ends up as the production configuration. Because initial testing and usage were successful, there's no budget for further investigation into whether it is the best configuration for the long term. As a result, the system may not be suited to future needs: usage may continue to expand, or the system's function may change (like moving from internal to customer-facing), and the system isn't scalable. How, then, do you move from a limited system to something that can evolve and expand as business needs change?
As with just about any question in the IT realm, the answer is “it depends.” What is the tolerance for downtime? How will the system be used? Is it an internal system or an external, customer-facing system? What about disaster recovery? How much do you want to spend?
Scalability is the key to any robust IT system. Moving from a simple, stand-alone system to a scalable, distributed system can be costly (and painful) if done in a single grand step. Breaking the move into many smaller steps, however, not only limits the impact on the current user base but also spreads the expense into more palatable increments.
One of the keys to a fault-tolerant system is removing single points of failure. Clustering is one of the simplest ways to minimize them: workload is split between two or more processes so that if one process fails, the remaining processes handle the entire workload (see the sketch below). Separating the VMs that host similar processes onto different pieces of hardware eliminates the hardware itself as a failure point. Different degrees of separation can be used for development, testing, and production. It would be prohibitively expensive to have all environments mirror the production configuration; however, a smaller, similarly configured system is still wise.
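As a minimal sketch of that idea (the member names and health model are illustrative, not any product's API), the snippet below spreads requests across cluster members and lets the survivors absorb the load when one member fails:

```python
from itertools import cycle

class Cluster:
    """Round-robin dispatch across members, skipping failed ones."""

    def __init__(self, members):
        self.members = list(members)
        self.healthy = set(members)
        self._ring = cycle(self.members)

    def fail(self, member):
        self.healthy.discard(member)

    def route(self, request):
        # Try each member at most once; skip any that have failed.
        for _ in range(len(self.members)):
            member = next(self._ring)
            if member in self.healthy:
                return f"{request} -> {member}"
        raise RuntimeError("no healthy members left")

cluster = Cluster(["jvm1-vmA", "jvm2-vmB"])
print(cluster.route("req-1"))   # req-1 -> jvm1-vmA
cluster.fail("jvm1-vmA")        # one process dies...
print(cluster.route("req-2"))   # ...the survivor handles the load
```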
To provide the best quality of service for production, the following recommendations can be made:
- Add an additional WESB cell to allow for side-by-side upgrades and an immediate fall-back should an upgrade cause issues.
- Add IBM HTTP Servers (IHS) with the WebSphere plug-in in front of the WESB JVMs so that WESB can scale without updating the calling systems (see the sketch after this list).
- Cluster both the MQ and DB2 subsystems to provide fault tolerance for those components as well.
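The value of the IHS layer is easiest to see in miniature: callers keep a single stable address while the pool of WESB JVMs behind it grows or shrinks. The sketch below is a simplified stand-in for IHS and the WebSphere plug-in, with hypothetical hostnames; it is not the plug-in's actual routing logic.

```python
class FrontEnd:
    """Stands in for IHS + WebSphere plug-in: one address, many back ends."""

    def __init__(self, address):
        self.address = address
        self.pool = []
        self._next = 0

    def add_backend(self, jvm):
        # Scaling WESB becomes a front-end change only; callers are untouched.
        self.pool.append(jvm)

    def forward(self, request):
        jvm = self.pool[self._next % len(self.pool)]
        self._next += 1
        return f"{request} @ {self.address} -> {jvm}"

ihs = FrontEnd("esb.example.com")
ihs.add_backend("wesb-jvm1")
print(ihs.forward("order-123"))
ihs.add_backend("wesb-jvm2")   # capacity added; callers still hit esb.example.com
print(ihs.forward("order-124"))
```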
Each of these activities can be done independently and at different times; the priority assigned to each can drive its scheduling.
In this example, we start with the following:
- Test environment as a single WebSphere Enterprise Service Bus (WESB) JVM, connecting to a single MQ server, backed by a single DB2 database.
- Production as a two-node, multi-cluster WESB, connecting to a single MQ server, backed by a single DB2 database.
- Although everything runs on virtual machines, all of those VMs reside on the same host.
For the first step, a second production WESB cell is added. JVM clusters of three members are created, with one member of each cluster residing in a given VM and each VM residing on a different host. This stripes the clusters across a set of VMs and hosts (see the placement sketch below) and allows the new clustered architecture to be tested without impact on the currently running system. It may seem counter-intuitive to start with changes to production, but this allows for an accelerated roll-out of applications running on the middleware layer.
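As a rough illustration of the striping, the sketch below places one member of every cluster on each VM, with each VM on its own host, so losing any single host removes only one member from each cluster. All names are placeholders for whatever the real topology uses.

```python
def stripe(clusters, vms):
    """Place one member of each cluster on every VM."""
    placement = {vm: [] for vm in vms}
    for cluster in clusters:
        for i, vm in enumerate(vms):
            placement[vm].append(f"{cluster}.member{i + 1}")
    return placement

vms = ["vm1@hostA", "vm2@hostB", "vm3@hostC"]   # one VM per physical host
for vm, members in stripe(["clusterX", "clusterY"], vms).items():
    print(vm, members)
# vm1@hostA ['clusterX.member1', 'clusterY.member1']
# vm2@hostB ['clusterX.member2', 'clusterY.member2']
# vm3@hostC ['clusterX.member3', 'clusterY.member3']
```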
The second step shores up the back-end systems. The MQ system is clustered by adding an offline, passive instance on a separate VM and host; DB2 is clustered in a similar manner. A sketch of the active/passive pattern follows.
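The sketch below illustrates the pattern used for both MQ and DB2 here: a standby instance on a separate VM and host is promoted when the active one stops responding. The instance names and heartbeat logic are stand-ins, not actual MQ or DB2 interfaces.

```python
class ActivePassivePair:
    """One active instance doing the work, one passive instance waiting."""

    def __init__(self, active, passive):
        self.active, self.passive = active, passive
        self.up = {active: True, passive: True}

    def heartbeat(self):
        # On a missed heartbeat, promote the passive instance.
        if not self.up[self.active]:
            self.active, self.passive = self.passive, self.active
            print(f"failover: {self.active} is now active")

pair = ActivePassivePair("mq-vm1@hostA", "mq-vm2@hostB")
pair.up["mq-vm1@hostA"] = False   # simulate the active instance dying
pair.heartbeat()                  # failover: mq-vm2@hostB is now active
```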
The third step turns our focus to the test system. A smaller WESB cell with a set of clusters is configured, with only two cluster members per cluster, spread across two VMs. All of this may reside on the same host, since uptime requirements for the test system are not as stringent as production's. This configuration still allows for testing operational procedures and functions that rely on clustering, but consumes no more than the minimum amount of resources; a rough sizing sketch follows.
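To make the trade-off concrete, here is a rough sizing sketch comparing the production shape from step one with this scaled-down test cell. The cluster count and member counts are placeholder assumptions, not figures from the example environment.

```python
def footprint(clusters, members_per_cluster, vms, hosts):
    """Total JVMs, VMs, and hosts a cell consumes."""
    return {"jvms": clusters * members_per_cluster, "vms": vms, "hosts": hosts}

prod = footprint(clusters=4, members_per_cluster=3, vms=3, hosts=3)
test = footprint(clusters=4, members_per_cluster=2, vms=2, hosts=1)
print("prod:", prod)   # prod: {'jvms': 12, 'vms': 3, 'hosts': 3}
print("test:", test)   # test: {'jvms': 8, 'vms': 2, 'hosts': 1}
```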
The fourth step returns the focus to the production system, building out another WESB cell as was done in the first step. Once it is complete, the original two-node production cell can be phased out and decommissioned, as the cutover sketch below illustrates.
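One way to picture the side-by-side cutover this enables: traffic shifts from the old cell to the new one in stages, with an instant path back if the new cell misbehaves. The routing function and percentages below are illustrative assumptions, not a description of the plug-in's behavior.

```python
def weighted_route(request_id, weights):
    """Split traffic between cells by approximate percentage."""
    bucket = hash(request_id) % 100
    cutoff = 0
    for cell, share in weights.items():
        cutoff += share
        if bucket < cutoff:
            return cell
    return cell  # fallback: last cell listed

for weights in ({"old-cell": 90, "new-cell": 10},    # canary
                {"old-cell": 50, "new-cell": 50},    # ramp up
                {"old-cell": 0, "new-cell": 100}):   # decommission old
    routed = [weighted_route(f"req-{i}", weights) for i in range(1000)]
    print(weights, "->", routed.count("new-cell"), "of 1000 to new cell")
```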
The fifth and final step creates a second test cell of similar proportions to the one created in step three. This gives a complete test system that replicates the production system, only on a smaller scale.
With all steps completed, there is now a production system that has mitigated failure at the following points:
- Hardware - by using multiple hosts
- VM - by spreading the multiple VMs over multiple hosts
- JVM/application - by striping clusters across VMs
- Database - by creating a high availability, disaster recovery cluster
- Messaging - by creating a cluster with a passive node
Each of the steps in the process can be done in any order and at any time. In any case, the end game is the same: a robust environment that can tolerate most interruptions or failures with little to no impact on the end user.