When You Can't Roll a Truck: OT Resilience for Remote and Distributed Operations
"Your incident response plan assumes a truck can get to you in 1 hour. Your geography disagrees."
Most OT resilience frameworks were written by people who assumed help is an hour away. For operators running remote utilities, distributed pipelines, water districts, telecom infrastructure, ports, seafood processors, or any facility where the road closes in January and the barge comes twice a week, that assumption quietly breaks everything downstream of it.
This article is about designing resilience around the site you actually have, not the site someone in an office imagined when they wrote the plan. We have seen firsthand how wildly the two can differ, and that gap can leave you and the people you serve high and dry when a cyber crisis hits.
The On-Site Assumption and Why It Breaks
Pull up almost any OT incident response plan and read it carefully. Somewhere in there, in the containment steps, the recovery timeline, or the escalation tree, it assumes that a qualified person can get on-site quickly. It might not say that explicitly. It just assumes it.
For remote and distributed operators, that assumption is often off by days, not hours.
- Fly-in access, seasonal road closures, and weather windows can delay physical response by 24 to 72 hours or more.
- One field technician may cover hundreds of miles of line, pipe, fiber, or infrastructure.
- Critical spare parts may travel by barge, air freight, or regional staging — not same-day courier.
- Headquarters-written plans often get copied to remote sites the plan author has never set foot in.
This is not purely a cyber problem. It is a geography, logistics, and operations problem that cyber resilience planning must account for. As we have written before, OT cannot simply mirror IT approaches; the physical consequences of getting it wrong are fundamentally different. That logic applies doubly when the physical environment itself is part of the risk picture.
Read more: The Path to OT Resiliency: Why OT Cannot Mirror IT and What to Do Instead
What Distance Changes About Detection, Response, and Recovery
Remote conditions do not just slow things down. They change what the words detection, response, and recovery actually mean in practice.
Detection at remote sites means you cannot centralize every log. Low-bandwidth WAN links make full telemetry unrealistic. The answer is not to give up on visibility — it is to be ruthless about which signals matter. A handful of high-confidence detections tied to meaningful response steps is worth far more than a flood of alerts no one can act on. We covered this tradeoff in depth in our piece on OT logging.
Read more: OT Logging That Matters: Less Noise, More Signal
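The detection tradeoff above can be sketched in a few lines of Python. This is a minimal illustration, not a product configuration: the detection names, confidence scores, data volumes, and the 0.8 threshold are all hypothetical assumptions. It forwards only detections that have a response step attached and fit inside a daily bandwidth budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    name: str
    confidence: float      # 0.0 to 1.0, how often an alert is a real problem (assumed score)
    kb_per_day: int        # estimated telemetry volume over the WAN link
    response_step: str     # empty string if nobody knows what to do with the alert


def select_for_forwarding(candidates, budget_kb_per_day, min_confidence=0.8):
    """Keep high-confidence, actionable detections that fit the link budget."""
    actionable = [
        d for d in candidates
        if d.response_step and d.confidence >= min_confidence
    ]
    chosen, used = [], 0
    # Highest-confidence signals claim the budget first.
    for d in sorted(actionable, key=lambda d: d.confidence, reverse=True):
        if used + d.kb_per_day <= budget_kb_per_day:
            chosen.append(d)
            used += d.kb_per_day
    return chosen


# Hypothetical candidates for one remote site.
candidates = [
    Detection("new device on control network", 0.95, 50, "isolate the switch port"),
    Detection("PLC program download outside change window", 0.90, 20, "call on-call engineer"),
    Detection("burst of failed HMI logins", 0.85, 200, "lock account and call"),
    Detection("generic IDS signature hits", 0.50, 80_000, "triage"),
    Detection("full flow export", 0.30, 500_000, ""),
]

forwarded = select_for_forwarding(candidates, budget_kb_per_day=300)
for d in forwarded:
    print(f"{d.name} -> {d.response_step}")
```

The noisy, high-volume sources stay on local storage for later forensics; only the signals someone can act on cross the link.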
Response at remote sites means that containment steps must be executable by whoever is physically present — usually an operator, not a security analyst. If your response playbook requires a senior engineer to log into a jump host and make firewall changes, and that engineer is three flight connections away, you have a plan that will not survive first contact with reality.
Recovery means your restore timeline is governed by logistics, not by what your backup software marketing materials say. If the replacement PLC needs to come by air freight and clear customs, your recovery time objective needs to reflect that — and your business continuity plan needs to account for manually operating the process in the meantime.
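The logistics arithmetic is worth writing down explicitly. The sketch below, in Python, sums the serial steps of a hardware replacement and compares the total with a stated recovery time objective. Every step name and duration is an illustrative assumption, not a figure from a real site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    """One serial step in getting a failed component back in service."""
    name: str
    hours: float


def realistic_recovery_hours(steps):
    """Total elapsed time when each step must finish before the next starts."""
    return sum(step.hours for step in steps)


def manual_operation_gap(steps, stated_rto_hours):
    """Hours the site must run manually beyond what the stated RTO assumes."""
    return max(0.0, realistic_recovery_hours(steps) - stated_rto_hours)


# Illustrative numbers only; replace with your own site's logistics.
plc_replacement = [
    RecoveryStep("order and vendor pick", 8),
    RecoveryStep("air freight", 48),
    RecoveryStep("customs clearance", 24),
    RecoveryStep("last-mile delivery to site", 12),
    RecoveryStep("install, restore configuration, test", 6),
]

total = realistic_recovery_hours(plc_replacement)
gap = manual_operation_gap(plc_replacement, stated_rto_hours=24)
print(f"Realistic recovery: {total:.0f} h")
print(f"Manual operation beyond the stated RTO: {gap:.0f} h")
```

If the gap is larger than the site can safely run by hand, either the objective or the spares strategy has to change.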
Design Principle: Local Autonomy and Graceful Degradation
The single most important design principle for remote OT resilience is this: the site must be able to operate safely with the WAN link down for days, not minutes.
This sounds obvious. In practice, many remote sites have quietly accumulated dependencies on central systems that were never meant to be critical — historian connections, centralized authentication, cloud-based SCADA dashboards, remote patch servers. When the link fails, they find out.
Graceful degradation needs to be designed deliberately:
- Document the degraded operating mode for each critical site. What keeps running? What pauses? What switches to manual?
- Give operators clear authority to make decisions locally during a link outage. Ambiguity about who can authorize what creates dangerous hesitation.
- Test the islanded state before an incident forces you to. Deliberate islanding exercises reveal dependencies you did not know existed.
- Build local authentication fallback. If the site uses centralized identity and the link drops, can operators still log into the systems they need?
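One way to prepare for an islanding exercise is to keep the dependency inventory as data and ask which functions have no local fallback. The Python sketch below assumes a hypothetical inventory format; the field names and example entries are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    function: str                          # what the operator needs to do
    relies_on: str                         # system that provides it
    needs_wan: bool                        # True if that system sits across the WAN link
    local_fallback: Optional[str] = None   # documented fallback, if any


def breaks_when_islanded(deps):
    """Functions that depend on the WAN and have no documented local fallback."""
    return [d for d in deps if d.needs_wan and d.local_fallback is None]


# Hypothetical inventory for one remote site.
site_inventory = [
    Dependency("operator HMI login", "central identity provider", True,
               local_fallback="local break-glass accounts"),
    Dependency("process trending", "central historian", True),
    Dependency("pump control", "on-site PLC", False),
    Dependency("alarm paging", "cloud SCADA dashboard", True),
]

at_risk = breaks_when_islanded(site_inventory)
for dep in at_risk:
    print(f"No fallback when islanded: {dep.function} (relies on {dep.relies_on})")
```

An inventory like this only lists the dependencies you already know about; the islanding test is what surfaces the rest.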
Design Principle: Out-of-Band Access Designed Up Front
When the primary WAN link is down or has been compromised, responders need a safe path back into the site. Out-of-band access is that path — but only if it was designed before the incident, not improvised during it.
There is an important distinction here between designed access and discovered access. Undocumented cellular modems, forgotten vendor tunnels, and engineer workarounds are attacker paths. Designed, brokered, and logged out-of-band access is a defender path. The difference is governance.
- Cellular or satellite backup paths should terminate at the firewall and use the same identity controls, MFA requirements, and session logging as the primary path.
- Access should be brokered, meaning it flows through a jump host or access broker that can record sessions and enforce least-privilege destination rules.
- Document the out-of-band path in the response playbook explicitly. Responders at 2 a.m. should not be guessing.
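The governance difference between designed and discovered access can be expressed as a simple audit. This Python sketch checks each known access path against the controls listed above; the control names, path names, and record layout are assumptions made for illustration.

```python
from dataclasses import dataclass

# Controls drawn from the list above; the identifiers are hypothetical.
REQUIRED_CONTROLS = (
    "terminates_at_firewall",
    "requires_mfa",
    "brokered",
    "session_logged",
    "in_playbook",
)


@dataclass(frozen=True)
class AccessPath:
    name: str
    terminates_at_firewall: bool = False
    requires_mfa: bool = False
    brokered: bool = False
    session_logged: bool = False
    in_playbook: bool = False


def missing_controls(path):
    """Return the required controls this path does not satisfy."""
    return [c for c in REQUIRED_CONTROLS if not getattr(path, c)]


# Hypothetical paths found during a site survey.
paths = [
    AccessPath("satellite backup", True, True, True, True, True),
    AccessPath("LTE failover", terminates_at_firewall=True, requires_mfa=True,
               session_logged=True),
    AccessPath("vendor cellular modem in panel 3"),
]

for path in paths:
    gaps = missing_controls(path)
    status = "designed" if not gaps else "needs work: " + ", ".join(gaps)
    print(f"{path.name}: {status}")
```

A path that fails every check, like the forgotten vendor modem here, is a candidate for removal rather than remediation.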
Design Principle: Spares, Manual Fallback, and People
Cyber resilience fails if the physical and human elements are ignored. A technically sound recovery plan that depends on a replacement part sitting four days and twelve hundred miles away is not actually a recovery plan.
- Stage critical spares at the site or a regional hub based on honest logistics math, not best-case scenarios.
- Manual operating procedures should be current, printed, and physically present on-site. A procedure stored only in a network share does not help when the network is the problem.
- Operators should be cross-trained on the first thirty minutes of cyber response: isolate, preserve, call. Those three steps done well limit damage and give responders something to work with. The rest can wait for qualified help.
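The "honest logistics math" for spares can be reduced to one comparison: worst-case delivery time against how long the site can tolerate running without the part. The Python sketch below uses hypothetical parts and hour values, and the three staging tiers are an assumption about how a typical operator might organize inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spare:
    part: str
    tolerable_outage_hours: float    # how long manual operation can safely continue
    hub_worst_case_hours: float      # regional hub to site, bad-weather case
    vendor_worst_case_hours: float   # vendor warehouse to site, bad-weather case


def staging_location(spare):
    """Pick the farthest staging point whose worst case still meets the tolerance."""
    if spare.vendor_worst_case_hours <= spare.tolerable_outage_hours:
        return "vendor warehouse"
    if spare.hub_worst_case_hours <= spare.tolerable_outage_hours:
        return "regional hub"
    return "on-site"


# Illustrative values only.
spares = [
    Spare("PLC CPU module", tolerable_outage_hours=24,
          hub_worst_case_hours=36, vendor_worst_case_hours=96),
    Spare("HMI panel", tolerable_outage_hours=72,
          hub_worst_case_hours=36, vendor_worst_case_hours=96),
    Spare("enclosure fan", tolerable_outage_hours=240,
          hub_worst_case_hours=36, vendor_worst_case_hours=96),
]

plan = {s.part: staging_location(s) for s in spares}
for part, location in plan.items():
    print(f"{part}: stage at {location}")
```

Using worst-case rather than average transit times is the point: the January number is the one that matters.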
The safety and cyber intersection matters here too. An operator making isolation decisions at a remote site under pressure needs to know which containment steps are safe to take without creating a hazardous process condition.
Read more: The 4 Reasons Cyber and Safety Collide in OT
Writing the Playbook for the Person Who Is Actually There
The best incident response playbook for a remote site is the one an operator can use at 2 a.m., under pressure, without having called the help desk yet.
That means it cannot be 40 pages. It cannot assume specialized knowledge. It needs to be on paper, not on the network.
- One page. Laminated. Verb-first. 'Isolate. Document. Call.'
- Ranked by safety impact: the first steps protect people and the process, not necessarily data.
- Clearly defined authority: what an operator may do without waiting for approval, and when they must wait.
- A communication tree that works when the site link is the thing that failed: satellite phone, radio, or an alternate contact.
Every section of the playbook should pass a single test: can the person who will actually be there, in that moment, execute this step? If the answer is no, rewrite it.
30-Day 'Do This Now' Checklist
- Pick your most remote critical site and answer honestly: how long does it take to get a qualified technician there in January?
- Document the degraded operating mode for that site and walk through it with the operators who would actually execute it.
- Verify the out-of-band access path exists, terminates at the firewall, and is logged. If it does not exist, design it now.
- Stage or confirm that critical spares and current manual procedures are physically on-site, not just on a shared drive.
- Run one tabletop where the site is physically unreachable for 72 hours and the WAN link is down. Track every assumption that breaks.
Plan for the Site You Actually Have
Resilience for remote and distributed operations is not a harder version of central-site resilience. It is a different design problem. Distance, weather, staffing, and logistics are not inconveniences to route around — they are design constraints to build into the plan from the start.
The organizations that handle incidents well at remote sites planned for the one person who would be there alone, with whatever was already on the shelf, with the link that was already down. They did not plan for help that was not coming.
If your team wants to validate how your remote site resilience holds up against these principles, the Koniag Cyber OT/ICS assessment practice works specifically with operators in remote and distributed environments.


