Having spent all my adult life thinking about all the things that can go wrong in a nuclear reactor, I find it a bit amusing when someone says, “Sure, nuclear is fine as long as everything goes as planned, but what if something goes wrong?”

That question is based on a false premise. 

Things are expected to go wrong. Pumps trip, pipes leak, power may go off. The real challenge is how to prepare for the unknown.  

You start by naming what can go wrong.
Power goes up. Pressure drops. Coolant leaves. Systems disappear.

Then, you figure out which events would cause each of them.  
Unwanted control rod withdrawal. Sudden increase in secondary flow. And so on.

Then you apply defense-in-depth.  

When I was lecturing on nuclear safety, I used a simple “99% rule” to explain how the whole structure fits together.

An anticipated operational occurrence (AOO) is a disturbance that the plant is explicitly designed to handle without activating safety systems. These are events expected to occur over the plant’s lifetime, generally around once a year. Examples include loss of offsite power, pump trips, or turbine trips.

Prevention works about 99% of the time. Thus:

Approximately 1% of AOOs escalate into accidents (about once every 100 years).
Of those accidents, roughly 1% escalate into severe accidents (about once every 10,000 years).
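
The arithmetic behind the rule can be written out in a few lines. This is only a sketch: it assumes one AOO per reactor-year and a 1% pass-through at each step, as stated above, and the function and variable names are illustrative.

```python
def escalation_frequencies(aoo_per_year=1.0, pass_fraction=0.01):
    """Event frequencies per reactor-year under the '99% rule'.

    Assumption: each layer stops 99% of what reaches it, and the layers
    fail independently of each other.
    """
    accident = aoo_per_year * pass_fraction
    severe = accident * pass_fraction
    return {"aoo": aoo_per_year, "accident": accident, "severe_accident": severe}


freqs = escalation_frequencies()
for name, per_year in freqs.items():
    print(f"{name}: about once every {1 / per_year:,.0f} years")
```

The independence assumption in the docstring carries the whole result; the rest of the argument is about what happens when it does not hold.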

Each step represents a failure of the previous layer, but the key is how those layers are constructed:

AOOs are managed through design and basic control.
Accidents are addressed by safety systems, which include diverse backups to prevent common-cause failures.
Severe accidents are managed through Severe Accident Management (SAM).

That middle step is where diversity plays a crucial role. Failures are not always random; shared sensors, shared hardware, and shared assumptions can lead to shared failures.

In reality, the preventive layer has performed much better than 99%. With a global fleet of a little over 400 reactors, the rule would predict four or five accidents per year, and there have been far fewer. But a significant share of the events that did get past prevention went on to pass through the accident level as well. The level we spend the most effort on turns out to be the weakest, mainly because it shares dependencies with the preceding level.

Diversity helps break that chain by introducing different principles, different hardware, and different dependencies.
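
A deliberately crude model shows how much a shared dependency can cost. The sketch below assumes that some fraction of the failures of the preventive layer come from a cause that also disables the safety systems; that fraction (`shared_fraction`) is a hypothetical parameter for illustration, not a measured value.

```python
def severe_accident_frequency(aoo_per_year, p_fail, shared_fraction):
    """Frequency of events that pass both the preventive and the accident layer.

    shared_fraction (hypothetical): share of first-layer failures whose
    cause also takes out the second layer.
    """
    accident = aoo_per_year * p_fail
    # Shared cause: the second layer fails with certainty.
    # Independent cause: the second layer fails with probability p_fail.
    p_second = shared_fraction + (1.0 - shared_fraction) * p_fail
    return accident * p_second


for shared in (0.0, 0.1, 0.5):
    per_year = severe_accident_frequency(1.0, 0.01, shared)
    print(f"shared fraction {shared:.0%}: once every {1 / per_year:,.0f} years")
```

With no shared causes the model returns the familiar 10,000 years. If only one in ten first-layer failures is shared, the interval falls to roughly 900 years, which is why diversity matters more than another redundant copy of the same design.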

By the time you reach severe accident management, the mindset has already shifted.

Assume systems have failed for a reason.
Assume information is incomplete or incorrect.
Focus on simple, robust actions that still work.

The structure is not about adding more layers; three layers are sufficient. It’s about ensuring that each layer can withstand the failure modes of the layer below it. This approach is what keeps the 99% rule meaningful.

Otherwise, it’s merely the same logic failing in the same way—three times in a row.

***

Historically, the nuclear industry’s formal safety analysis has focused on design basis accidents (DBAs)—the carefully chosen scenarios for which a plant’s response is explicitly demonstrated in advance. However, when we examine the accidents that have shaped public perception of nuclear risk, a different pattern becomes apparent.

None of the significant nuclear accidents began as textbook design-basis accidents that unfolded neatly within their expected parameters. Instead, they originated from anticipated operational occurrences, equipment malfunctions, operational transients, loss of support functions, maintenance vulnerabilities, or external disturbances that initially fell into the realm of ordinary plant behavior. What transformed these situations into severe events was their escalation.

The critical takeaway is that the levels of defense-in-depth, which are often treated analytically as separate, do not remain truly independent when under stress. Shared dependencies, common-cause failures, environmental impacts, operator workload, instrumentation degradation, loss of power, flooding, overheating, and organizational pressures create connections between protection levels that traditional safety models typically keep distinct. Once these connections emerge, barriers can fail simultaneously rather than sequentially.

There are also implications for how we frame probability in safety analyses. Traditional safety evaluations devote considerable attention to design basis accidents as representative challenges. However, real operating history indicates that classic DBAs may be even less indicative of how accidents are initiated than previously thought.

This suggests a shift in priority for future reactor safety efforts. System-level resilience should receive at least as much focus as component-level qualifications. The central question should increasingly be: what mechanisms prevent an otherwise manageable operational upset from leading to additional failures?

A particular weakness arises when both accident prevention and accident management rely on the same electricity distribution systems. Incorporating passive safety measures or engine-driven pumps in accident management can enhance practical resilience.

Preventing escalation entails identifying hidden connections among electrical systems, cooling functions, instrumentation, operator actions, maintenance states, digital controls, and site infrastructure. It requires examining where redundancy has a common vulnerability, even if it appears diverse on paper. It also means understanding how a single degraded function creates increased demand, confusion, or environmental stress elsewhere in the plant.
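
Looking for hidden connections can start very simply: list what each function needs in order to work, then intersect the lists. The dependency map below is entirely hypothetical and far coarser than a real plant model; it only illustrates the bookkeeping.

```python
from itertools import combinations

# Hypothetical map: safety-relevant function -> support systems it relies on.
DEPENDENCIES = {
    "normal_feedwater": {"grid", "instrument_air", "dc_bus_A"},
    "emergency_core_cooling": {"diesel_A", "dc_bus_A", "service_water"},
    "accident_management_pump": {"diesel_B", "service_water"},
}


def shared_dependencies(deps):
    """Return {(function_1, function_2): shared supports} for every overlapping pair."""
    shared = {}
    for (a, deps_a), (b, deps_b) in combinations(deps.items(), 2):
        common = deps_a & deps_b
        if common:
            shared[(a, b)] = common
    return shared


for pair, common in shared_dependencies(DEPENDENCIES).items():
    print(pair, "share", sorted(common))
```

In this made-up example the functions look separate by name, yet two pairs turn out to lean on the same DC bus or the same service water. A pairwise check like this only catches direct overlaps; indirect ones, such as a diesel that needs service water that needs the grid, require following the chain further.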

The most challenging safety problem is preventing an event from gaining allies.

Experience from severe accidents consistently teaches the same lesson: risk increases through the connections between systems, not merely through the failure of individual systems viewed in isolation.

***

One example of couplings between defense-in-depth levels sits in the gap between ‘grid available’ and ‘loss of offsite power’.

Some reactor designs have a subtle vulnerability that exists between what is classified as a disturbance and what is recognized as a loss of power. A sustained grid undervoltage—ranging from 80% to 90% of the nominal voltage—does not automatically trigger a transfer to emergency diesel generators. The grid is still technically “there,” meaning the plant remains connected to it. However, it is precisely in this gray area that the plant can begin to break down.

The issue arises from gradual thermal stress. When voltage levels are depressed, electric motors throughout the plant operate less efficiently. To maintain torque, these motors draw higher currents, which increases internal losses and raises winding temperatures. No single component fails immediately; instead, components gradually heat up, quietly and continuously, as long as the low voltage condition persists.
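
A back-of-the-envelope estimate gives a feel for the size of the effect. The sketch assumes a motor that must keep delivering the same shaft power at constant power factor and efficiency, so that current scales as 1/V and resistive winding losses as 1/V². Real motors deviate from this (slip, saturation, and load characteristics all matter), so treat the numbers as indicative only.

```python
def undervoltage_stress(voltage_pu):
    """Relative current and winding (I^2 R) loss for a constant-power motor load.

    Assumptions: constant shaft power, power factor and efficiency.
    Results are per unit of the values at nominal voltage.
    """
    if voltage_pu <= 0:
        raise ValueError("voltage_pu must be positive")
    current = 1.0 / voltage_pu
    winding_loss = current ** 2
    return current, winding_loss


for v in (1.00, 0.90, 0.85, 0.80):
    current, loss = undervoltage_stress(v)
    print(f"V = {v:.2f} pu -> current {current:.2f} pu, winding loss {loss:.2f} pu")
```

Under these assumptions, a grid at 0.85 of nominal voltage pushes winding losses nearly 40% above their normal level. Nothing trips right away, but every motor on that supply runs hotter for as long as the condition lasts.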

Over time, motors may trip on thermal protection, but not all at the same time. Each motor’s response depends on its individual margins, loading, and cooling conditions. One function is lost, then another, and so on. Each system behaves as designed in isolation, responding to its own local conditions. However, the initial cause—grid undervoltage—remains constant and does not disappear.

This issue is fundamentally architectural. Multiple layers of defense in depth depend on the same external input: grid voltage. Since the undervoltage does not meet the criteria for a loss of offsite power, the plant does not switch to its emergency supply. Instead, it remains connected to a degraded grid long enough for thermal stress to accumulate across several systems.

Accident management strategies assume that either the grid is fully operational or that the plant has already transitioned to its own independent power sources. In this intermediate state, neither assumption is valid. The grid is present but not usable, and the emergency supply has not been activated. Consequently, there may not be reliable electrical power available for accident management functions at the critical moment when they are most needed.

In effect, the external grid is allowed to impact multiple defense levels simultaneously. This does not happen due to a single failure, but through a persistent condition that each layer can tolerate only temporarily. Independence is not lost all at once; it is gradually eroded, function by function, as heating drives equipment out of service, while no stable power source is secured to replace what is being lost.

This type of vulnerability does not require an extreme initiating event. It stems from how thresholds and transitions are defined. If the criteria for switching to emergency power are too narrow or too slow to detect a sustained degradation, the plant remains dependent on a “limping” grid that can no longer support its safety functions or accident management.
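
How the threshold definition decides the outcome can be shown with a toy transfer logic. The sketch below compares a criterion that only reacts to a deep voltage loss with one that adds a slower stage for sustained degradation. All thresholds and delays are hypothetical, chosen for illustration and not taken from any plant.

```python
def transfer_time(samples, stages):
    """Return the time at which transfer to emergency power is ordered, or None.

    samples: list of (time_s, voltage_pu) in increasing time order.
    stages:  list of (threshold_pu, delay_s); a stage acts once the voltage
             has stayed below its threshold for at least delay_s.
    """
    below_since = [None] * len(stages)
    for t, v in samples:
        for k, (threshold, delay) in enumerate(stages):
            if v < threshold:
                if below_since[k] is None:
                    below_since[k] = t
                if t - below_since[k] >= delay:
                    return t
            else:
                below_since[k] = None  # voltage recovered, reset this stage's timer
    return None


# Hypothetical settings, for illustration only.
LOSS_OF_VOLTAGE_ONLY = [(0.70, 2.0)]
WITH_DEGRADED_STAGE = [(0.70, 2.0), (0.90, 60.0)]

# A grid that sags to 0.85 pu and stays there for ten minutes (one sample per second).
sagging_grid = [(float(t), 0.85) for t in range(600)]

print(transfer_time(sagging_grid, LOSS_OF_VOLTAGE_ONLY))
print(transfer_time(sagging_grid, WITH_DEGRADED_STAGE))
```

With the deep-dip criterion alone, the sagging grid never qualifies as a loss of power and the plant stays connected for the full ten minutes. With the added slow stage, the same grid is abandoned after one minute. The thermal stress is identical in both cases; only the definition of "lost" differs.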

Draw as many boxes as you like. If they lose power from the same place, they go dark together.

***

At Forsmark Nuclear Power Plant Unit 1 on 25 July 2006, the reactor was operating at full power when a fault in the 400 kV switchyard caused a sharp voltage dip with a phase disturbance. Generator protection reacted immediately: the unit disconnected from the grid and the reactor scrammed. In a fraction of a second, the plant moved from steady operation to decay heat conditions.

That transition is expected. The assumption is that loss of offsite power is met by a clean handover to emergency systems. Diesel generators start, safety buses are energized, and cooling continues with full instrumentation.

Here, the transition did not remain orderly.

Not all emergency diesel generators were available. Only two of the four started automatically; the others did not respond as intended under the disturbed electrical conditions and had to be brought in later by operator action. In principle, two diesels are sufficient: with n+2 redundancy, the plant needs only two of its four trains.

But the disturbance reached further than generation.

The uninterruptible power supply (UPS) systems reacted in an unexpected way. Certain inverters dropped out during the transient and did not reconnect properly. As a result, even the trains backed by running diesels did not fully energize all their loads. Critical instrumentation and control systems were partially lost.

At the same time, the main circulation pumps did not provide the smooth flywheel coast-down normally expected after a trip. The electrical disturbance and associated protection logic curtailed that inertia. Core flow decreased more abruptly, pushing the reactor more quickly toward natural circulation.

Within seconds of the scram, the plant occupied an uneasy state:

the chain reaction had stopped,
decay heat remained to be removed,
only partial emergency power was initially available,
fewer than two safety trains were fully functional in practice,
and the control room was partially blind, with key indications—such as reliable reactor water level—unavailable or uncertain.

For roughly twenty minutes, operators worked to stabilize the situation, restore the electrical configuration, and rebuild a trustworthy picture of the plant.

There was no core damage, no abnormal release of radioactivity, and no evidence of dryout. After a scram, power drops rapidly to decay heat, and the margin to critical heat flux increases. Even with degraded pump coast-down, flow inertia and rapidly decreasing power ensured adequate cooling.

The significance lies elsewhere.

On paper, the plant’s safety systems were redundant and independent. In reality, the event revealed hidden couplings—dependencies introduced by power electronics, protection logic, and the dynamics of switching between power sources. 

The UPS systems—intended to guarantee continuity—became a point where multiple safety functions were simultaneously affected. The plant did not lose its ability to cool the core; it lost, for a time, the assured and observable pathway by which that cooling is controlled and verified.

The event was classified as INES Level 2. The rating reflects the absence of damage, but the lesson runs deeper: a system can meet its design requirements on paper and still pass, briefly, into a state where fewer safety functions are truly available than assumed.

The reactor shut down as intended. Cooling was maintained. The plant recovered.

What remained was a sharper understanding that the decisive vulnerabilities often lie not in components themselves, but in the interfaces—electrical, logical, and temporal—that govern how a plant moves from normal operation to safety mode under stress.

After the incident, some insisted that the main lesson was to update the set of electrical disturbances that components must be designed to withstand.

It is not. It is to design the systems so that no individual disturbance can degrade all of them at once. 

***

We like to say we don’t trust the grid.

Yet most emergency core cooling pumps still depend on it—indirectly.

On paper, the logic is clean: multiple trains, separate buses, backup diesels. In reality, the dependency is just pushed one layer down. A long undervoltage does not trigger a clean transfer; it quietly degrades performance and can peel off motors across defense-in-depth levels without ever declaring a clear failure.

So why not put engines directly on the pumps?

It’s not one decision—it’s a system of incentives.

Electric motors are predictable. They start fast, follow commands precisely, and integrate cleanly with protection and control logic. You can test them often and under realistic conditions. That matters, because in safety, what you can test repeatedly is what you can claim as reliable.

Engine-driven pumps bring a different profile. They remove one dependency—the grid—but introduce others: fuel, cooling, starting systems, maintenance practices. None of these are insurmountable, but they are harder to model cleanly and harder to exercise frequently without disturbance.

And this is where it gets interesting.

Our probabilistic models tend to reward architectures that can be decomposed into well-behaved components. Electrical systems fit that mindset. Cross-boundary effects, such as a grid held at 0.85 per unit (85% of nominal voltage) that sequentially weakens multiple safety layers, are real, but they are awkward to represent. So they end up underweighted.

The result is not a conspiracy of procurement bureaucracy or probabilistic risk assessment (PRA). It’s alignment. We choose what is easy to specify, easy to test, and easy to defend, even if it reconnects the layers we claim are independent.

An engine on the pump shaft would break some of those hidden couplings.

But until we make those couplings visible in the safety case—and give them proper weight—the system will keep drifting toward electrically “independent” designs that share the same vulnerability, just one layer deeper.

Independence is not about how many boxes you draw.

It’s about what actually fails together.

***

Automation wins comfortably inside the design basis.

That is where it was built to operate. The space is defined, the variables are known, the responses are pre-engineered. In that domain:

  • it reacts faster than any human
  • it does not get tired, distracted, or overloaded
  • it executes sequences exactly as intended, every time
  • it enforces consistency across all similar events
  • it removes hesitation when the correct action is already known

Within that envelope, human involvement is often the weak link. Delay, doubt, and variation add nothing when the situation matches the assumptions. Automation is not just helpful there — it is superior.

But that advantage is conditional.

It depends on the world behaving as expected.

Step outside the design basis, and the structure flips.

The plant is no longer a set of predefined states. It becomes a system with broken assumptions:

  • signals may contradict each other
  • instrumentation may be partially wrong or misleading
  • cause and effect are no longer aligned with procedures
  • actions that are “correct” in one context may be harmful in another
  • the sequence itself may be unknown

At that point, automation does not degrade gracefully. It continues to apply logic that was built for a different reality. It does exactly what it was designed to do — and that is precisely the problem.

Because what is needed now is not execution, but interpretation.

Not speed, but judgment.

Not predefined response, but reconstruction of the situation.

This is where general intelligence becomes essential:

  • to question whether the signals make sense at all
  • to recognize patterns that were never explicitly modeled
  • to infer hidden states from incomplete information
  • to abandon procedures when their assumptions no longer hold
  • to create new courses of action in real time

A human can do this. Not perfectly, not always correctly — but flexibly. The core may still be saved.

Automation cannot. Not because it is slow or limited, but because it is bounded by its design space.

Without human operators, the outcome is uncertain as soon as the situation degrades beyond what the designers accounted for. 

The goal is not to replace one with the other but to separate their roles clearly:

  • Automation owns the known 
  • Humans own the unknown

Blurring that boundary is what creates risk. Claiming that everything is accounted for makes the risk even bigger.
 
But humans need time to reflect, and the design must ensure they have it.

Because when a plant leaves the design basis, it needs someone who can understand what is actually happening.