FAULT TOLERANT POWER SYSTEMS
by: Paul O'Boyle, Senior Design Engineer, Kepco, Inc.
Recognizing the Need.
The growing dependence of our society upon electronic data processing has
created a need for continuously operational systems. The original applications
for this level of performance (financial institutions, air and rail transportation,
telephone systems, etc.) have been joined by requirements from almost every area of
business. Local area networks, process controls, daily business transaction
recording and many other systems have the potential to suffer substantial loss
if system operation is disrupted for any reason. A failed burn-in process for microcomputer
ICs could cost many thousands in lost revenue. Down time on a computer
which provides credit card authorizations could translate to a substantial loss in sales.
Fault Tolerance as a Solution.
The solution for these and the many other applications that need dependable operation
is to design fault tolerance into as many portions of the system as possible. Here
we will discuss the implementation of fault tolerance in the power conversion portion
of such systems. Fault tolerance has existed in software and in digital hardware for
a number of years, but there has been a growing realization over the past decade that
the reliability of any electronic system is dictated first by the reliability of the
power conversion system. All modern electronic systems require some form of power
conversion to convert the source power delivered by the a-c mains to isolated, regulated
and conditioned low-voltage d-c power. Any interruption of this power conversion
process can result in immediate and unexpected loss of system operation. The need
to minimize this potential failure mode has resulted in the development of fault-tolerant power systems.
A fault-tolerant power system is one comprised of multiple power converters configured
so as to maintain the integrity of the output power bus in the event of any one
single-point failure within the power conversion system. Fault-tolerant power
systems can be designed in a variety of configurations, but all share the following traits:
- Sufficient capacity to sustain power bus operation in the event of any single-point power system fault (redundancy);
- Ability to isolate and localize a failure to a single replaceable module (fault isolation and detection);
- Design which permits extraction of the faulty module and insertion of a replacement without interruption of the power bus (on-line replacement, or "hot-swap").
A review of these traits is provided along with some insight as to the tradeoffs
which system designers must weigh in matching the appropriate power conversion system to the application.
Much attention has been given by power supply designers in recent years to redundancy issues.
The simplest implementation of redundancy is to operate multiple power converters in output-parallel
configuration with sufficient load current capacity (ampacity) to support the load in the event
of the loss of one or more converter modules (see Figure 1). The power converters must provide
a constant-current or "rectangular" type of overload characteristic, since load current delivery
is based on the effective output voltage setting of each power converter in descending priority
and thus only one power converter is actually operating in voltage-stabilized mode. The chief
disadvantages of this technique are unequal operating stress which leads to increased power
converter failure rates, and degraded transient recovery in the event of a power module failure.
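The descending-priority behavior described above can be illustrated with a minimal sketch. It
assumes ideal rectangular current limiting and ignores wiring drops and output impedance; the
function name and the module values are hypothetical and serve only as an illustration.

```python
def unforced_parallel_currents(load_amps, modules):
    """Split a load among output-paralleled converters WITHOUT forced sharing.

    modules: list of (voltage_setpoint, current_limit_amps) tuples (illustrative values).
    Idealized assumption: the converter with the highest setpoint delivers current
    until it reaches its current limit, then the next highest takes over, and so on.
    Returns the per-module currents in the original module order.
    """
    # Visit modules from highest to lowest voltage setpoint.
    order = sorted(range(len(modules)), key=lambda i: modules[i][0], reverse=True)
    currents = [0.0] * len(modules)
    remaining = float(load_amps)
    for i in order:
        delivered = min(remaining, float(modules[i][1]))
        currents[i] = delivered
        remaining -= delivered
    if remaining > 1e-9:
        raise ValueError("load exceeds the combined current limits of the modules")
    return currents


# Three 50 A modules with slightly different setpoints supporting an 80 A load.
modules = [(5.02, 50.0), (5.00, 50.0), (4.98, 50.0)]
print(unforced_parallel_currents(80.0, modules))  # [50.0, 30.0, 0.0]
```

In this example the first module runs continuously at its current limit, the second is the only
one regulating voltage, and the third sits idle, which is the unequal stress noted above.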
Nearly all new power supply designs incorporate special circuitry to provide forced load sharing,
either passive or active (see Figure 2), and current-stabilized overload protection to
optimize their use in parallel-redundant power system applications. Load (or current)
sharing permits an even distribution of the system load current among multiple
output-paralleled power converters. This results in lower operating temperatures
and reduced failure rates as well as improved response time. The cost and complexity
of the actual circuitry is low; some manufacturers now provide dedicated control circuits
which allow implementation of load sharing with existing power converters using external
circuitry. One example is the Kepco FCS board assembly, which will add forced
load sharing to any power supply equipped with remote error sensing.
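As a rough illustration of redundancy sizing with forced load sharing, the sketch below computes
the module count for an N+M system and the per-module current before and after a module failure.
It assumes ideal, perfectly even sharing; the function names and the 120 A / 50 A figures are
hypothetical.

```python
import math


def modules_required(load_amps, module_rating_amps, spares=1):
    """Number of modules for N+spares redundancy (N modules carry the full load)."""
    n = math.ceil(load_amps / module_rating_amps)
    return n + spares


def shared_current(load_amps, active_modules):
    """Per-module current assuming ideal forced load sharing among active modules."""
    if active_modules < 1:
        raise ValueError("at least one active module is required")
    return load_amps / active_modules


# Hypothetical example: 120 A load, 50 A modules, N+1 redundancy.
total = modules_required(120.0, 50.0, spares=1)
print(total)                               # 4
print(shared_current(120.0, total))        # 30.0 A each in normal operation
print(shared_current(120.0, total - 1))    # 40.0 A each after one module fails
```

Under these assumptions each module runs well below its rating in normal operation and remains
within rating after a single failure.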
Source Power Loss.
Decisions regarding redundancy are not limited to power conversion issues. True fault-tolerant
power systems should address the possible loss of source power as well as loss of power conversion.
Indeed, many fault-tolerant power systems require separately generated and protected power
sources for each of the multiple power converters used to generate the d-c power bus.
Others use either on-line or off-line uninterruptible power sources (UPS) with battery- or
generator-backup in the event of primary power loss. Still others, most notably
telecommunication (telcom) systems, use a distributed power architecture consisting of a
combination of all of the above applied to both source and load circuits.
Life Cycle Costs.
The burden of these additional protective functions (battery chargers, maintenance, wiring,
etc.) adds significant life-cycle cost which the system designer must consider against the
actual performance required when determining which protection to specify. For instance,
use of on-line UPS for source power redundancy requires accounting for the inrush start-up
current of the power converters, while specification of off-line UPS requires specifying the correct
relationship between output ride-through time and UPS transfer time to preserve power
bus integrity. Batteries create their own overhead burdens in the form of maintenance,
charging requirements and environmental considerations.
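The relationship between ride-through time and UPS transfer time can be checked with a simple
calculation. The sketch below assumes that ride-through is set by the energy stored in the
converter's input bulk capacitor, which is a common but not universal arrangement; the function
names, the safety margin and all numeric values are illustrative assumptions.

```python
def holdup_time_s(bulk_capacitance_f, v_start, v_min, output_power_w, efficiency=1.0):
    """Estimate ride-through time from stored capacitor energy.

    Usable energy is 0.5 * C * (v_start^2 - v_min^2), where v_min is the lowest bulk
    voltage at which the converter still regulates (an assumed parameter).
    """
    energy_j = 0.5 * bulk_capacitance_f * (v_start ** 2 - v_min ** 2)
    return energy_j * efficiency / output_power_w


def rides_through(holdup_s, ups_transfer_s, margin=1.5):
    """True if the hold-up time covers the UPS transfer time with an assumed margin."""
    return holdup_s >= margin * ups_transfer_s


# Illustrative values only: 1000 uF bulk capacitor, 380 V falling to 300 V,
# 500 W output, 90% efficiency, 10 ms off-line UPS transfer time.
t_hold = holdup_time_s(1000e-6, 380.0, 300.0, 500.0, efficiency=0.9)
print(round(t_hold * 1000, 1), "ms")       # 49.0 ms
print(rides_through(t_hold, 0.010))        # True
```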
"Hot Swap" Considerations.
The recent proliferation of "hot-swap" power systems indicates a growing need for continuous
system operational readiness. This requirement embodies several related functions involving
human engineering as well as power bus performance. Issues such as module form factor and
weight, connector insertion/extraction force and module retention mechanisms are typical
intangibles which enter into the equation. The use of self-aligning (blind-mate) connectors
with integral or separate mechanical keying mechanisms to prevent insertion of an incorrect
replacement module is an additional factor. The ultimate goal is to enable transparent
replacement of a faulty module, that is, with minimal disturbance to the power bus. The
most common standard is to limit the bus transients induced by the replacement actions to
the amplitude and recovery times normally associated with step-load response, although the
specific system requirements will dictate to the system engineer the allowable disturbance levels.
The most critical functions of any fault tolerant power system are fault detection and fault
isolation. Fault detection is the ability to accurately and consistently identify and localize
a failure to a specific replaceable module, while fault isolation, as the words imply, isolates
the system from any adverse effects of the failure. These functions are the basic elements of
any fault-tolerant power system. Properly designed, they will maintain power bus integrity;
without them, no amount of redundancy or replaceability will salvage system operation.
The fault detector function involves a complex interrelationship between the power converter
modules and the power system itself. By nature, a properly designed fault-tolerant power
system will endure the failure of one or more elements with no apparent effect on the power bus,
yet the power system must be capable of detecting and localizing this failure to a single
replaceable subassembly without the benefit of direct observation of the output. This requires
simultaneous measurement of multiple parameters, both internal and external, and interpretation
of their combined values to determine if the power converter is operating properly. The
following examples are offered to illustrate the difficulties involved.
Consider the method for detecting an output-low failure. For a simple, non-redundant power
system employing a single power converter, the fault detector need only monitor output voltage
(or current, in current-stabilized applications) to determine if the output is operating within
specification. Any power bus fault must be the result of failure of the one and only power
converter in the system. In the case of the simplest redundant power system, that of two
output-paralleled power converters, the task becomes much more complex. Assuming that N+1
redundancy is provided, if the output of one of the two power supplies fails low, the
second power supply will continue to support the load, hence no output fault is present.
One of the power supplies has failed, however, and the fault detector design must be
capable of determining which of the two power converters is defective so that the power
system can be serviced. The problem intensifies when three or more power converters
comprise the power system.
Several methods can be implemented to address this problem, each with its own disadvantages.
The most direct method is to insert a diode in series with each output between the power
converter output and the power bus itself, and to monitor the output of the power converter
itself: in the event of a low-output fault, the diode blocks the power bus voltage from
forcing the power converter output high, and the fault detector of the defective converter
can now measure and report the output failure.
There are problems with this approach, however. The series diode introduces a significant
power dissipation penalty, since all of the load current drawn by the power bus flows
through these diodes. They are therefore normally quite large and expensive and in
most applications require some amount of heat sinking. These diodes are essential
for on-line replacement applications. If the system does not demand on-line replacement
of power modules, only redundancy and fault indication, then this is a tremendous
efficiency burden to carry. Furthermore, if the diode failure mode is to be shorted,
as is most common in these applications, the fault detector now monitors the power
bus directly and can no longer detect low-output failures. Additional circuitry can
be employed to monitor the voltage drop across the diode in order to detect a shorted
device. However, the circuitry must be capable of distinguishing voltage drops of the
same order of magnitude as the output ripple voltage, and in applications where
power bus load current varies significantly this technique can be very inconsistent.
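The efficiency penalty of the series diode can be estimated as follows. This is a sketch under
simple assumptions: a constant forward drop, and a converter whose efficiency is unchanged when
it raises its terminal voltage to make up for that drop. The function names and the 0.5 V, 50 A
and 80% figures are illustrative, not taken from any data sheet.

```python
def diode_loss_w(forward_drop_v, module_current_a):
    """Power dissipated in the series output diode."""
    return forward_drop_v * module_current_a


def efficiency_with_diode(bus_v, forward_drop_v, converter_efficiency):
    """Overall efficiency when the converter must supply bus_v + forward_drop_v.

    Load power is bus_v * I, while the converter delivers (bus_v + forward_drop_v) * I.
    """
    return converter_efficiency * bus_v / (bus_v + forward_drop_v)


# Illustrative: 5 V bus, 0.5 V diode drop, 50 A per module, 80% converter efficiency.
print(diode_loss_w(0.5, 50.0))                          # 25.0 W per module
print(round(efficiency_with_diode(5.0, 0.5, 0.80), 3))  # 0.727
```

On a low-voltage bus the drop is a large fraction of the output voltage, which is why the
penalty is so pronounced in this example.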
Similar problems exist for output-high (overvoltage) failures of the power converters.
Consider again the case of two power converters operating in N+1 redundancy, now with
output blocking diodes installed. If one converter fails output-high, the second
converter senses an overvoltage condition and stops delivering output power, thereby
avoiding the pitfall of having all of the output-paralleled power converters follow
the defective module into overvoltage (often termed "selective overvoltage"). The
problem is that both power converters now show an output fault. If the output-high
fault generates a power converter shutdown, the second converter will recover and
the fault signal will be valid. If overvoltage shutdown is not achieved, however,
either by design omission or failure characteristic, the system operator will not
be able to determine which power converter has failed.
A better way is for the fault detector to monitor both the power bus voltage and
the current delivered by each power converter, and to determine whether or not each
power converter is operating properly based on logical analysis of these two readings.
Fault indications are then only issued for conditions which indicate abnormal module
operating conditions, significantly improving the accuracy of the detector circuit
and negating the need for the blocking diode except in "hot-swap" applications.
This method is not entirely foolproof, since it cannot detect shorted blocking
diodes nor does it eliminate their need in on-line replaceable power systems; yet
it is the most complete and accurate method presently available to determine operating
status of output-parallel power converters while on-line, and represents only a modest
increase in circuit complexity.
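One possible form of this logical analysis is sketched below. The decision table is an
assumption made for illustration and is not the logic of any particular product; the status
names, thresholds and the min_share_a parameter are hypothetical. In practice the minimum
current threshold would have to track the actual load, since a lightly loaded system gives
each module very little current to deliver.

```python
from enum import Enum


class Status(Enum):
    OK = "module operating normally"
    MODULE_FAULT = "module fault"
    BUS_FAULT = "bus out of limits, module not implicated"


def classify(bus_v, module_a, v_low, v_high, min_share_a):
    """Classify one module from the bus voltage and that module's output current.

    Assumed logic:
      bus high  + module delivering     -> module is driving the overvoltage
      bus high  + module not delivering -> healthy module that has backed off
      bus low   + module delivering     -> overload or lost capacity elsewhere
      bus low   + module not delivering -> module has failed low
      bus in window: a module delivering its share is healthy, otherwise failed low
    """
    delivering = module_a >= min_share_a
    if bus_v > v_high:
        return Status.MODULE_FAULT if delivering else Status.OK
    if bus_v < v_low:
        return Status.BUS_FAULT if delivering else Status.MODULE_FAULT
    return Status.OK if delivering else Status.MODULE_FAULT


# Illustrative 5 V bus with a +/-5% window and a 1 A minimum-share threshold.
limits = dict(v_low=4.75, v_high=5.25, min_share_a=1.0)
print(classify(5.00, 20.0, **limits))   # Status.OK
print(classify(5.00, 0.0, **limits))    # Status.MODULE_FAULT (failed low, bus still supported)
print(classify(5.60, 0.0, **limits))    # Status.OK (healthy module backed off)
```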
The issues affecting fault-tolerant power system design and selection are often a result
of the basic performance required by the power bus. An example of this is power bus
overload/short circuit protection. In conventional single-converter power systems,
the maximum overload current delivered to the power bus in the event of a load failure
is determined by the power rating of the converter and/or adjustment of the maximum
current limit value. The use of high-redundancy power systems (N+2, N+3, etc.) creates
special handling problems, especially in telecommunications applications where the power
converter must operate in both voltage- and current-stabilized output regulation modes.
The concept of excess capacity becomes a dangerous problem if the power bus is shorted and
all of the power converters now deliver their maximum output current through the system's
load wiring. Significant thermal damage and even insulation fires are possible in this
event unless the system engineer recognizes the danger.
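The size of this hazard is easy to estimate. The sketch below sums the current limits of all
installed modules and compares the result with the ampacity of the load wiring; the function
names and the example figures (a 150 A load served by five modules limited to 55 A each) are
hypothetical.

```python
def worst_case_bus_current(module_current_limits_a):
    """Maximum current into a shorted bus: every module delivers its current limit."""
    return sum(module_current_limits_a)


def wiring_is_adequate(module_current_limits_a, wiring_ampacity_a):
    """True if the wiring can carry the worst-case current from all modules."""
    return wiring_ampacity_a >= worst_case_bus_current(module_current_limits_a)


# Hypothetical N+2 system: 150 A load, five modules each limited to 55 A.
limits = [55.0] * 5
print(worst_case_bus_current(limits))       # 275.0 A into a bus short
print(wiring_is_adequate(limits, 150.0))    # False: wiring sized for the load alone
print(wiring_is_adequate(limits, 300.0))    # True
```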
Solutions include distributed load protection devices (fuses, circuit breakers, thermistors,
etc.) and sizing of load wiring based on maximum possible current delivery of the power system.
Many power converter designs include either fixed or optional timeout circuits as part of the
overcurrent protection circuitry which shuts down the power converter after a time period of
10-30 seconds on the assumption that long-term overloads represent major load problems and
that the system has already been compromised. This is not a viable option for power converters
supporting battery-based power buses such as are used in many telecommunication applications,
as long-term current-stabilized operation is a normal operating condition.
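The behavior of such a timeout circuit can be modeled in a few lines. This is a behavioral
sketch only, assuming a timer that resets whenever the converter leaves current limit and that
can be disabled for battery-based buses; the class and attribute names are hypothetical and do
not describe any specific hardware.

```python
class OvercurrentTimer:
    """Latches a shutdown after continuous current-limit operation for timeout_s."""

    def __init__(self, timeout_s, enabled=True):
        self.timeout_s = timeout_s
        self.enabled = enabled      # False models continuous current-stabilized operation
        self.shut_down = False
        self._elapsed_s = 0.0

    def update(self, in_current_limit, dt_s):
        """Advance the model by dt_s seconds; returns True once shut down."""
        if self.shut_down:
            return True
        if not in_current_limit:
            self._elapsed_s = 0.0   # overload cleared, so the timer restarts (assumed)
            return False
        self._elapsed_s += dt_s
        if self.enabled and self._elapsed_s >= self.timeout_s:
            self.shut_down = True
        return self.shut_down


# A 20-second timeout stepped in 1-second increments during a sustained overload.
timer = OvercurrentTimer(timeout_s=20.0)
states = [timer.update(True, 1.0) for _ in range(20)]
print(states[18], states[19])   # False True
```

With `enabled=False` the same overload never causes a shutdown, which corresponds to the
battery-based case described above.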
Modern power converter designs incorporating many of the features discussed above are available
from several manufacturers, among them HC Power (HC1010 Series), Lambda/Qualidyne (MPS Series)
and Kepco, Inc. (HSP Series). All represent products specifically designed for fault-tolerant
power systems used in the international marketplace. They include such features as wide-range
(universal) input with power factor correction, internally-mounted output isolation diodes,
forced load sharing circuitry, blind-mate connectors and fault detector circuitry with both
visual and electrical indicators. The Kepco HSP Series logical fault detector with selective
overvoltage shutdown provides accurate fault detection and fault isolation both with and without
the optional isolation diode. The current limiting circuit includes a switch-selectable 20-second
lockout timer to provide the user the option of continuous current-stabilized operation (as
for battery charger applications) or delayed shutdown for load/wiring protection if required.
Another switch-selectable function activates a "current walk-in" circuit to provide slow output
current rise rates, a requirement of Bellcore-type telcom battery rectifiers.
Other features of the Kepco HSP Series power converters include remote analog programming of both
output voltage and current limit regulation levels, Bellcore-style signal outputs (isolated Form-C
relay contacts) and an isolated remote inhibit circuit which includes TTL-level inputs for both
positive and negative logic. The inhibit circuit operates from a separate 5V supply which is
available to the user for loads up to 100 mA. Kepco rack adapters are available for both plug-in
and fixed applications. Kepco also makes a version of these power converters called HSM.
The HSM power converters incorporate many of the features of the HSP except that they are designed
for fixed installation rather than pluggable applications. The elimination of visual indicators,
extraction handle and retention mechanisms results in a size reduction, specifically in module
length. The modules are about 15% shorter.
As the world's dependence on electronic data and control increases, fault-tolerant power system
applications will continue to expand both in size and performance requirements. Just as with
distributed power architectures, the change will be evolutionary rather than revolutionary.
Future improvements should include greater intelligence in the area of fault isolation, with
feedback from all areas applied to a central controller which will have authority to reconfigure
the overall system "on the fly". Multiple redundant power converters will be held in stand-by
mode, to be brought on-line or off-line as power bus load conditions warrant. Bus voltage should be
controllable to optimize the operation of the system based on external influences (temperature,
source power conditions, etc.).
Electrical Engineer specializing in new product development
Polytechnic Institute of Brooklyn, BSEE program
Design experience including both military and industrial power converters
Presently Engineering Group Leader for Switchmode Power Supply Development