TechEncyclopedia

Failover Beyond Five Nines

Performance Technologies' PICMG 2.12 'Redundant Host'-compliant platform offers fast hardware failover at (ultimately) lower cost.

By John Jainschigg


04/04/2003, 4:07 PM ET

Software-based failover is applicable in many high availability contexts. When it's construed as 'failover between threads or processes within a single system,' as 'failover and load-sharing among a cluster of servers,' or as 'failover between processing nodes in an opportunistic peer-to-peer network' (among other constructions), it can be very cost-effective: you get usable parallelism and useful redundancy from the same hardware.

True, software failover takes time - sometimes seconds (or tens of seconds) to detect and spin up a new process; longer to re-establish peripheral communications and restore full application availability. But a whole class of distributed network applications can tolerate this. In many cases, even big parts of 'realtime' apps can be broken down into fail-overable components. In a distributed IP PBX, for example, the softswitches - essentially database apps - are often engineered to fail over to one another, without affecting the gateways doing transcoding and call-termination.
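To see where those seconds go, consider a minimal sketch of heartbeat-based failure detection, one common way for software to notice that a peer has died. The class name, the one-second interval, and the three-miss tolerance are all illustrative assumptions, not values from any particular product.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Declares a peer failed after `max_missed` silent heartbeat intervals.

    Hypothetical helper for illustration; timings are assumptions.
    """
    interval_s: float        # how often the peer is expected to send a heartbeat
    max_missed: int          # silent intervals tolerated before declaring failure
    last_seen_s: float = 0.0

    def beat(self, now_s: float) -> None:
        """Record that a heartbeat arrived at time `now_s`."""
        self.last_seen_s = now_s

    def detection_delay_s(self) -> float:
        """Silence (measured from the last heartbeat) needed to declare failure."""
        return self.interval_s * self.max_missed

    def peer_failed(self, now_s: float) -> bool:
        """True once the peer has been silent for the full detection delay."""
        return (now_s - self.last_seen_s) >= self.detection_delay_s()


monitor = HeartbeatMonitor(interval_s=1.0, max_missed=3)
monitor.beat(now_s=10.0)
print(monitor.peer_failed(now_s=12.0))   # False: only 2.0 s of silence
print(monitor.peer_failed(now_s=13.5))   # True: 3.5 s of silence
print(monitor.detection_delay_s())       # 3.0
```

Shortening the interval or the miss tolerance speeds up detection but makes false alarms more likely on a busy network, and detection is only the first step: the replacement process still has to start and reconnect.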

Software failover also may impose stringent engineering requirements at the OS, protocol, application, and hardware levels, as well as special monitoring and management skills. It often makes sense to integrate hardware specifically for use in software-failover scenarios - for example, incorporating lots of RAM for precautionary buffering. In a software failover situation, you need a point person who understands the OS and the application, as well as the hardware. This militates against simple-looking, easy-to-install-and-babysit, one-box solutions. But in some development/deployment scenarios, that's not a problem.

Problems with software failover emerge when applications are less tolerant of latency and more closely meshed with hardware and connectivity. VRU apps, for example, are often most cost-effectively built by sticking a CPU card in a box with resource boards, and cabling the latter to a demarc. But using high-level software failover in this situation can mean duplicating the VRU and the connections, as well as doing funny things with the provisioning. In a failure, you'll almost certainly lose some calls or sessions in progress, you may lose incoming data or corrupt your database - and even in a best-case, the performance hiccup will be readily perceptible to users. The same thing holds for many mission-critical applications, esoteric (military, aviation, medical) to commonplace (any app that fields a non-RTX-able 'firehose' of data - e.g., billing records from legacy network elements).

In these scenarios, you need hardware failover. Ideally, you need really fast, clever hardware failover between redundant (relatively inexpensive) SBCs in a single box, with clean reconnection to surviving peripheral components. In the VRU example, above, the resource boards are intrinsically redundant, cost a lot of money, and may be plugged into the wall - so duplicating them in a second box and keeping them on standby makes little sense.

If your app looks like this, you need to check out Performance Technologies' (Rochester, NY - 585-256-0200, www.pt.com) Redundant Host line of software and hardware. Based on the PICMG 2.12 Redundant Host spec, the line presently includes the ZT5524e SBC - a high-performance computer-on-a-board; the ZT4901e mezzanine bridge card; and the ZT5085e Redundant Host chassis: a split-backplane (2.13 CompactPCI) chassis with dual H.110/2.16 bus-segments (full 2.16 support will arrive shortly). Each segment hosts six general-purpose slots, plus additional slots for the SBCs and their mated bridges (two SBCs, two bridges per chassis), a switch and a 2.16 node board. The chassis also accommodates dual (under/over) CMMs at the left side of the card cage.

The SBCs ride side-by-side in the middle of the chassis, interleaved with the bridge cards. Your application - written with the help of PT's development kit for the PICMG 2.12 Redundant Host API and 2.9 IPMI monitoring/messaging - runs on one board as 'master' and on the other, either on active standby or monitoring the master by checkpointing. The master app typically watches all the peripheral boards, though you can also engineer a system with dual masters, each controlling one bus segment normally, and prepared to pick up peripherals in the other, in the event of failure. In failure, the standby SBC takes over; the bridge cards negotiate access to the backplane and peripherals, and you're running again in 10 ms. The failover process typically ensures that peripherals are mapped into the new SBC's memory exactly as they were in the former master's, so resource boards don't even know they've undergone a brain-swap. The boards are all hot-swappable too, of course, so it's easy to fix the hardware.
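The master/standby arrangement with checkpointing can be sketched in a few lines. This is a toy model only: the `Node` class, the `checkpoint` and `fail_over` functions, and the `calls_in_progress` field are hypothetical names invented for illustration, and none of them correspond to the PICMG 2.12 Redundant Host API or PT's development kit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for an SBC running the application (hypothetical)."""
    name: str
    role: str                                   # "master", "standby" or "failed"
    state: dict = field(default_factory=dict)   # application state to preserve


def checkpoint(master: Node, standby: Node) -> None:
    """Copy the master's current state to the standby (shallow copy)."""
    standby.state = dict(master.state)


def fail_over(master: Node, standby: Node) -> Node:
    """Promote the standby; it resumes from the last checkpoint it received."""
    master.role = "failed"
    standby.role = "master"
    return standby


a = Node("sbc-a", "master")
b = Node("sbc-b", "standby")

a.state["calls_in_progress"] = 12
checkpoint(a, b)                      # standby now mirrors 12 calls
a.state["calls_in_progress"] = 13     # change made after the last checkpoint

active = fail_over(a, b)
print(active.name, active.role, active.state)
# sbc-b master {'calls_in_progress': 12}
```

The thirteenth call illustrates the design trade-off: whatever changed since the last checkpoint is not known to the standby, so checkpoint frequency determines how much state a takeover can lose.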

The IPMI management standard, meanwhile, supports proactive failover by making each application aware of operating conditions. If the standby app perceives, for example, that the temperature of the master SBC is rising and that failure is imminent, it can take over gracefully. The result is just what the doctor ordered: you get trans-five-nines availability without excessive hardware redundancy. More RH-compliant products are in PT's pipeline.
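As a rough sketch of that kind of proactive decision, the function below looks at a short run of temperature readings and recommends takeover only when they are climbing and have reached a limit. In a real system the readings would come from IPMI sensors; here they are a plain list, and the function name, the three-reading window, and the 70-degree limit are assumptions made for illustration.

```python
from typing import Sequence


def should_take_over(temps_c: Sequence[float], limit_c: float, window: int = 3) -> bool:
    """Return True if the last `window` readings rise strictly and the
    most recent one is at or above `limit_c`. Hypothetical policy."""
    if len(temps_c) < window:
        return False                      # not enough history to judge a trend
    recent = temps_c[-window:]
    rising = all(earlier < later for earlier, later in zip(recent, recent[1:]))
    return rising and recent[-1] >= limit_c


print(should_take_over([55, 61, 68, 74], limit_c=70))   # True: rising and over the limit
print(should_take_over([70, 72, 71], limit_c=70))       # False: hot, but no longer rising
print(should_take_over([40, 45, 50], limit_c=70))       # False: rising, but still cool
```

Requiring both a trend and a threshold is one way to avoid swapping masters on a single noisy reading; other policies are equally plausible.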


Great Fault-Rez Site

If you're into fault-rez computing, core components, resilient operating systems, board-level IP telephony, OEM componentry, or other topics relating to the design and construction of survivable telecom applications and products, check out www.Zipster.com, newest brainchild of Richard Grigonis, author of the definitive guide to Fault-Resilient Computing (CMP Books).

