2. On Defense-in-Depth

Having spent my entire adult life thinking about everything that can go wrong in a nuclear reactor, I find it a bit amusing when someone says, “Sure, nuclear is fine as long as everything goes as planned, BUT WHAT IF SOMETHING GOES WRONG??”

I think it is good when people challenge a design, but the idea that everything is supposed to go according to plan is so misguided that it isn’t even wrong.

Things are expected to fail. Pumps trip, pipes leak, power goes out. The real challenge is how to prepare for the unknown.

You start by naming what can go wrong.
Power goes up. Pressure drops. Coolant leaves. Systems disappear.

Then, you figure out which events would cause each of them.
Unwanted control rod withdrawal. Sudden increase in secondary flow. And so on.

Then you apply defense-in-depth.

When I was lecturing on nuclear safety, I used a simple “99% rule” to explain how the whole structure fits together.

An anticipated operational occurrence (AOO) is a disturbance that the plant is explicitly designed to handle without activating safety systems. These are events expected to occur over the plant’s lifetime, generally around once a year. Examples include loss of offsite power, pump trips, or turbine trips.

At each level, prevention works about 99% of the time. Thus:

Approximately 1% of AOOs escalate into accidents (about once every 100 years).
Of those accidents, roughly 1% escalate into severe accidents (about once every 10,000 years).
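
The arithmetic behind the rule can be written out as a minimal sketch. It uses only the round figures above (one AOO per year, 1% escalation per layer); the function and variable names are illustrative, not taken from any licensing tool.

```python
# Round numbers from the text, not plant-specific data.
AOO_FREQUENCY_PER_YEAR = 1.0       # about one AOO per year
LAYER_FAILURE_PROBABILITY = 0.01   # each layer lets about 1% through


def escalation_frequencies(initiating_frequency, layer_failure_probability, layers):
    """Return yearly frequencies of events that get past 0, 1, ..., `layers` layers."""
    frequencies = [initiating_frequency]
    for _ in range(layers):
        # Each layer only sees what the previous one failed to stop.
        frequencies.append(frequencies[-1] * layer_failure_probability)
    return frequencies


aoo, accident, severe = escalation_frequencies(
    AOO_FREQUENCY_PER_YEAR, LAYER_FAILURE_PROBABILITY, layers=2
)

for name, frequency in (("AOO", aoo), ("Accident", accident), ("Severe accident", severe)):
    print(f"{name}: about once every {round(1 / frequency):,} years")
```

The multiplication is only valid if each layer fails independently of the one before it, which is exactly the assumption the rest of this chapter questions.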

Each step represents a failure of the previous layer, but the key is how those layers are constructed:

AOOs are managed through design and basic control.
Accidents are addressed by safety systems, which include diverse backups to prevent common-cause failures.
Severe accidents are managed through Severe Accident Management (SAM).

That middle step is where diversity plays a crucial role. Failures are not always random; shared sensors, shared hardware, and shared assumptions can lead to shared failures.

Diversity helps break that chain by introducing different principles, different hardware, and different dependencies.
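
One common way to put a number on this is the beta-factor model, which I am bringing in here as an assumption for illustration; it is not part of the 99% rule itself. In this sketch, two redundant trains each fail on demand with probability `p`, and a fraction `beta` of those failures is assumed to come from a shared cause that takes out both trains at once. The values of `p` and `beta` are hypothetical.

```python
def both_trains_fail(p, beta):
    """Approximate probability that two redundant trains fail on the same demand.

    p    -- failure probability of a single train (hypothetical value)
    beta -- fraction of train failures assumed to be common-cause
    """
    independent_part = ((1.0 - beta) * p) ** 2  # both fail for unrelated reasons
    common_cause_part = beta * p                # one shared cause fails both
    return independent_part + common_cause_part


p = 0.01
fully_independent = both_trains_fail(p, beta=0.0)
with_shared_cause = both_trains_fail(p, beta=0.1)

print(f"Independent trains:     {fully_independent:.6f}")
print(f"10% common-cause share: {with_shared_cause:.6f}")
print(f"Ratio: {with_shared_cause / fully_independent:.1f}x")
```

Even a modest shared fraction dominates the result, because the common-cause term scales with `p` while the independent term scales with `p` squared. Diversity is the attempt to push `beta` toward zero.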

By the time you reach severe accident management, the mindset has already shifted.

Assume systems have failed for a reason.
Assume information is incomplete or incorrect.
Focus on simple, robust actions that still work.

The structure is not about adding more layers; three layers are sufficient. It’s about ensuring that each layer can withstand the failure modes of the previous one. This approach is what keeps the 99% rule meaningful.

Otherwise, it’s merely the same logic failing in the same way—three times in a row.

***

Historically, the nuclear industry’s formal safety analysis has focused on design basis accidents (DBAs)—the carefully chosen scenarios for which a plant’s response is explicitly demonstrated in advance. However, when we examine the accidents that have shaped public perception of nuclear risk, a different pattern becomes apparent.

None of the significant nuclear accidents started as textbook design basis accidents that unfolded neatly within their expected parameters. Instead, they originated from anticipated operational occurrences, equipment malfunctions, operational transients, loss of support functions, maintenance vulnerabilities, or external disturbances that initially fell into the realm of ordinary plant behavior. What transformed these situations into severe events was their escalation.

The critical takeaway is that the levels of defense-in-depth, which are often treated analytically as separate, do not remain truly independent when under stress. Shared dependencies, common-cause failures, environmental impacts, operator workload, instrumentation degradation, loss of power, flooding, overheating, and organizational pressures create connections between protection levels that traditional safety models typically keep distinct. Once these connections emerge, barriers can fail simultaneously rather than sequentially.

There are also implications for how we frame probability in safety analyses. Traditional safety evaluations devote considerable attention to design basis accidents as representative challenges. However, real operating history indicates that classic DBAs may be even less indicative of how accidents are initiated than previously thought. The industry has witnessed fewer events resembling the clean scenarios of DBAs and more situations where familiar disturbances combine, leading to unexpected escalation chains.

This suggests a shift in priority for future reactor safety efforts. System-level resilience should receive at least as much focus as component-level qualifications. The central question should increasingly be: what mechanisms prevent an otherwise manageable operational upset from leading to additional failures?

A particular weakness arises when both accident prevention and accident management rely on the same electricity distribution systems. Incorporating passive safety measures or engine-driven pumps in accident management can enhance practical resilience.

Preventing escalation entails identifying hidden connections among electrical systems, cooling functions, instrumentation, operator actions, maintenance states, digital controls, and site infrastructure. It requires examining where redundancy has a common vulnerability, even if it appears diverse on paper. It also means understanding how a single degraded function creates increased demand, confusion, or environmental stress elsewhere in the plant.

The most challenging safety problem is preventing an event from gaining allies.

Experience from severe accidents consistently teaches the same lesson: risk increases through the connections between systems, not merely through the failure of individual systems viewed in isolation.

***

Some reactor designs have a subtle vulnerability that sits between what is classified as a disturbance and what is recognized as a loss of power. In these designs, a sustained grid undervoltage, in the range of 80% to 90% of nominal voltage, does not automatically trigger a transfer to emergency diesel generators. The grid is still technically “there,” so the plant remains connected to it. However, it is precisely in this gray area that the plant can begin to break down.

The issue arises from gradual thermal stress. When voltage levels are depressed, electric motors throughout the plant operate less efficiently. To maintain torque, these motors draw higher currents, which increases internal losses and raises winding temperatures. No single component fails immediately; instead, components gradually heat up, quietly and continuously, as long as the low voltage condition persists.
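
A first-order estimate shows why the heating matters. The sketch below assumes a motor driving a constant mechanical load with unchanged efficiency and power factor, so that current scales as 1/V and resistive winding losses as 1/V². Real motors deviate from this (slip, saturation, and load characteristics all play a role), so treat the numbers as indicative only.

```python
def relative_current(voltage_pu):
    """Motor current relative to nominal, assuming constant power drawn (I ~ 1/V)."""
    return 1.0 / voltage_pu


def relative_copper_loss(voltage_pu):
    """Resistive winding losses relative to nominal (I^2 * R with fixed R)."""
    return relative_current(voltage_pu) ** 2


for voltage_pu in (1.00, 0.90, 0.85, 0.80):
    current = relative_current(voltage_pu)
    loss = relative_copper_loss(voltage_pu)
    print(f"V = {voltage_pu:.2f} pu -> current x{current:.2f}, winding losses x{loss:.2f}")
```

Under these assumptions, a grid at 80% of nominal voltage means roughly a quarter more current and over half again as much winding heat, applied to every running motor at once.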

Over time, motors may trip on thermal protection, but not all at the same time. Each motor’s response depends on its individual margins, loading, and cooling conditions. One function is lost, then another, and so on. Each system behaves as designed in isolation, responding to its own local conditions. However, the underlying cause, grid undervoltage, is still there the whole time.

This issue is fundamentally architectural. Multiple layers of defense in depth depend on the same external input: grid voltage. Since the undervoltage does not meet the criteria for a loss of offsite power, the plant does not switch to its emergency supply. Instead, it remains connected to a degraded grid long enough for thermal stress to accumulate across several systems.

Accident management strategies assume that either the grid is fully operational or that the plant has already transitioned to its own independent power sources. In this intermediate state, neither assumption is valid. The grid is present but not usable, and the emergency supply has not been activated. Consequently, there may not be reliable electrical power available for accident management functions at the critical moment when they are most needed.

In effect, the external grid is allowed to impact multiple defense levels simultaneously. This does not happen due to a single failure, but through a persistent condition that each layer can tolerate only temporarily. Independence is not lost all at once; it is gradually eroded, function by function, as heating drives equipment out of service, while no stable power source is secured to replace what is being lost.

This type of vulnerability does not require an extreme initiating event. It stems from how thresholds and transitions are defined. If the criteria for switching to emergency power are too narrow or too slow to detect a sustained degradation, the plant remains dependent on a “limping” grid that can no longer support its safety functions or accident management.
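
The effect of the transfer criteria can be illustrated with a toy model. The thresholds and delay below are hypothetical values chosen for the example, not setpoints from any actual plant, and the function is a sketch of the logic rather than a description of real protection relays.

```python
from typing import Iterable, Optional, Tuple


def transfer_time(
    samples: Iterable[Tuple[float, float]],
    loss_threshold_pu: float = 0.70,
    degraded_threshold_pu: Optional[float] = None,
    degraded_delay_s: float = 0.0,
) -> Optional[float]:
    """Return the time (s) at which a transfer to emergency power is initiated.

    samples -- (time_s, voltage_pu) pairs in time order
    Returns None if no criterion is ever met.
    """
    degraded_since = None
    for time_s, voltage_pu in samples:
        if voltage_pu < loss_threshold_pu:
            return time_s  # clear loss of voltage: transfer immediately
        if degraded_threshold_pu is not None and voltage_pu < degraded_threshold_pu:
            if degraded_since is None:
                degraded_since = time_s
            if time_s - degraded_since >= degraded_delay_s:
                return time_s  # degradation has persisted long enough
        else:
            degraded_since = None  # voltage recovered: reset the timer
    return None


# Ten minutes of grid voltage held at 0.85 pu, sampled every 10 seconds.
sag = [(float(t), 0.85) for t in range(0, 601, 10)]

narrow = transfer_time(sag)
wide = transfer_time(sag, degraded_threshold_pu=0.90, degraded_delay_s=60.0)

print("Loss-of-voltage criterion only:", narrow)
print("With sustained-degradation criterion:", wide)
```

With only the loss-of-voltage criterion, the plant never leaves the degraded grid in this scenario. Adding a sustained-degradation criterion moves it to its own supply after the chosen delay, before the heating has had time to work through the motors.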

Draw as many boxes as you like. If they lose power from the same place, they go dark together.

***

We like to say we don’t trust the grid.

Yet most emergency core cooling pumps still depend on it—indirectly.

On paper, the logic is clean: multiple trains, separate buses, backup diesels. In reality, the dependency is just pushed one layer down. A long undervoltage does not trigger a clean transfer; it quietly degrades performance and can peel off motors across defense-in-depth levels without ever declaring a clear failure.

So why not put engines directly on the pumps?

It’s not one decision—it’s a system of incentives.

Electric motors are predictable. They start fast, follow commands precisely, and integrate cleanly with protection and control logic. You can test them often and under realistic conditions. That matters, because in safety, what you can test repeatedly is what you can claim as reliable.

Engine-driven pumps bring a different profile. They remove one dependency—the grid—but introduce others: fuel, cooling, starting systems, maintenance practices. None of these are insurmountable, but they are harder to model cleanly and harder to exercise frequently without disturbance.

And this is where it gets interesting.

Our probabilistic models tend to reward architectures that can be decomposed into well-behaved components. Electrical systems fit that mindset. Cross-boundary effects, such as a grid held at 0.85 pu (85% of nominal voltage) that sequentially weakens multiple safety layers, are real, but they are awkward to represent. So they end up underweighted.

The result is not a conspiracy of procurement bureaucracy or probabilistic risk assessment (PRA). It’s alignment. We choose what is easy to specify, easy to test, and easy to defend, even if it quietly reconnects the layers we claim are independent.

An engine on the pump shaft would break some of those hidden couplings.

But until we make those couplings visible in the safety case—and give them proper weight—the system will keep drifting toward electrically “independent” designs that share the same vulnerability, just one layer deeper.

Independence is not about how many boxes you draw.

It’s about what actually fails together.
