Air-gapped deployment: what actually breaks

Most web apps that claim to run on-premise have not been tested with the network unplugged. The result is a long tail of things that silently fail, hang, or quietly phone home. Here is the audit checklist we run before every on-prem release.

Antoine Bedaton
05 Mar 2026 · 11 min read

Part of our complete guide to negative news screening for Swiss banks. This post is the deep dive on on-premise and air-gapped deployment; the guide covers the end-to-end picture.

Most products that claim to run "on-premise" have only ever been tested with internet access. The result is a long tail of network calls that nobody noticed in development (fonts pulled from a CDN, GeoJSON pulled from GitHub, telemetry SDKs that initialise on import, CRL fetches blocked by an outbound firewall) and that, on a Swiss bank's segregated network, either silently fail or, worse, hang the page for fifteen seconds while a TLS stack waits for an OCSP responder it cannot reach.

This post is the checklist we run before every on-prem release. It covers what genuinely breaks, what doesn't, and the categories most people forget to look at.

A note on terminology

True air-gap (no network path to the internet at any layer) is rare in financial institutions. What Swiss banks usually mean is a no-egress segment: outbound connections to the internet are blocked at the perimeter, but the application still has working DNS, NTP, an internal CA, an internal package registry, and possibly a tightly controlled outbound proxy for explicitly-listed destinations. The NIST glossary keeps the strict definition; the NIST SP 800-82 guidance for industrial control systems and SP 800-53 PE-family controls describe the realistic version most enterprises operate.

The distinction matters because the failures look different. A no-egress segment usually has working DNS but no route to cdn.jsdelivr.net; a true air-gap has no DNS at all for external names. A few tests below behave differently in each.

Browser-side: what the browser fetches by default

The hardest leaks to catch are the ones the browser does for you. The backend logs nothing because the request never reaches the backend.

Fonts

@font-face declarations against fonts.googleapis.com or fonts.gstatic.com are the most common leak. Modern frameworks mitigate this: next/font/google in Next.js 13+ downloads the font files at build time and serves them from the same origin, so the runtime is clean. But the build is not. If your CI runs in a no-egress environment, the build itself fails (with a notably bad error: "download timed out after 3 seconds" in some Next.js dev modes). The fix is either next/font/local with the font files committed to the repo, or a build-time mirror.

The same logic applies to icon packs distributed as fonts (Font Awesome's CDN delivery, classic Material Icons). Bundle them. Don't <link> them.

Icons

Most modern icon libraries (lucide-react, react-icons, @heroicons/react) ship as npm packages and bundle into the JS output. These are safe. The leak is the older pattern: a single <i class="fas fa-search"> somewhere in the codebase that loads the full Font Awesome stylesheet from a CDN to render one icon. Grep for use.fontawesome.com, kit.fontawesome.com, and cdn.jsdelivr.net.

Map tiles and country data

This is the category that catches almost everyone. Map tiles are huge, so they live on tile servers. The default tile servers for Leaflet, Mapbox GL, Google Maps, and OpenStreetMap are all external. If the on-prem deployment shows a "world map" with country boundaries or markers, and the development version uses any of these, it will silently fail behind a no-egress firewall.

The same applies to country boundary data. The most common source is Natural Earth served via raw GitHub URLs (raw.githubusercontent.com/nvkelso/natural-earth-vector/...), which feels like a stable static URL but is in fact a CDN egress that no enterprise firewall whitelist will allow. The fix is to ship the GeoJSON inside the application bundle (a few hundred KB compressed) and load it from the same origin.

For tile rendering specifically, the on-prem option is either self-hosting a tile server (tileserver-gl over your own MBTiles) or disabling the map view entirely on no-egress deployments. The in-between option (using a third-party tile provider over the bank's controlled outbound proxy) is sometimes acceptable, but it introduces an external dependency that the bank's procurement team will need to vet under FINMA Circular 2018/3 on outsourcing. The vendor due-diligence checklist maps each clause back to a procurement question.

Telemetry SDKs

The script tags everyone forgets:

  • Google Analytics, Plausible, Fathom, Matomo (when hosted), Vercel Analytics, Cloudflare Analytics
  • Sentry, Datadog RUM, LogRocket, FullStory, Hotjar
  • Stripe, Intercom, Calendly, HubSpot widgets

Even when the analytics product itself is innocuous, the SDK's initial GET can hang in the browser tab if the host is unreachable: an async attribute on the script tag does not help when something at render time waits for the script to load. The right pattern is conditional initialisation gated on a build flag (if (process.env.NEXT_PUBLIC_TELEMETRY === 'enabled')). The on-prem build sets the flag to disabled, and the SDK imports are tree-shaken out of the bundle entirely.

Favicons

Rare but real: a favicon <link rel="icon" href="https://..."> hardcoded to a marketing-site CDN that is not reachable from the on-prem deployment. The browser will retry, log a network error, and otherwise behave fine, but the network panel of any user inspecting the page shows an outbound failure to a third-party domain. That finding alone has killed pilot deployments at conservative banks.

CRL and OCSP

This one is subtle. When the browser establishes TLS to the on-prem backend, it will (depending on the certificate, the browser, and the policy) try to validate the certificate chain by fetching a Certificate Revocation List or hitting an OCSP responder. If the certificate was issued by a public CA, those URLs are external. If the firewall blocks them, the browser falls back to soft-fail by default (the connection succeeds, the certificate is treated as valid), but the user-visible delay can be up to ~15 seconds while the request times out.

There are two clean fixes:

  1. OCSP stapling at the reverse proxy. The proxy fetches the OCSP response on a schedule and serves it inline during the TLS handshake. The browser never makes its own OCSP request. nginx does this with ssl_stapling on; ssl_stapling_verify on;
  2. Internal CA. If the on-prem deployment is served by a certificate from the bank's own CA, the CRL and OCSP URLs are on the internal network. There is nothing for the browser to fail to reach.

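Most Swiss banks run an internal PKI for exactly this reason. The bug to avoid is shipping a default deployment that uses a public-CA certificate (Let's Encrypt is the usual culprit) and forgetting that its revocation endpoints live on the public internet: historically the OCSP responder at r3.o.lencr.org, and, in certificates issued since Let's Encrypt retired OCSP in 2025, CRL distribution points under lencr.org.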

Server-side: what the backend fetches

The browser leaks are visible. The server-side ones are the ones that break the install at startup.

OIDC discovery

If the application supports external identity providers (Google, Okta, Auth0, Azure AD), Spring Security's standard pattern is to fetch {issuerUri}/.well-known/openid-configuration at startup and discover the rest of the endpoints from the response. On a no-egress deployment, that fetch fails and the application refuses to start. The mitigation is straightforward (explicitly configure the endpoints rather than relying on discovery), but the default behaviour catches teams off-guard.

If the on-prem deployment uses a local Keycloak (which is what we recommend), the discovery endpoint is on the internal network and the problem doesn't arise. The risk is the buyer who insists on SSO against their existing cloud IdP and didn't tell anyone the backend would need outbound access to it.

Vendor SDK telemetry

Every database driver, cloud SDK, and infrastructure tool ships with a phone-home behaviour these days. Most of them respect an environment variable. None of them respect the same one. The checklist for a typical Java/Node stack:

| Component | Disable with |
|-----------------------|---------------------------------------|
| Next.js | NEXT_TELEMETRY_DISABLED=1 |
| Liquibase | LIQUIBASE_ANALYTICS_ENABLED=false |
| MinIO | MINIO_PROMETHEUS_AUTH_TYPE=public, telemetry off via server flags |
| HashiCorp Vault | disable_mlock is unrelated; telemetry stanza in config.hcl |
| Confluent Kafka | confluent.support.metrics.enable=false |
| Elasticsearch | xpack.ml.enabled=false (and others) |
| Kibana | telemetry.enabled: false |
| Spring Boot Actuator | Don't expose /actuator/info over public ingress |

We set every one of these in our base Compose file and then verify on a clean deployment that no outbound connection attempts hit the firewall logs in the first ten minutes after startup. That last step is the only one that actually proves it.
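
Before the behavioural check, a cheap pre-flight can confirm that the switches are present at all. The following is a minimal sketch in Python, not our actual tooling: the service names ("frontend", "migrations") and the shape of the input mapping are assumptions made for illustration, and only the two variable names are taken from the table above.

```python
# Hypothetical pre-flight check. Service names and the input shape are
# assumptions for illustration; the variable names come from the table above.
REQUIRED_SETTINGS = {
    "frontend": {"NEXT_TELEMETRY_DISABLED": "1"},
    "migrations": {"LIQUIBASE_ANALYTICS_ENABLED": "false"},
}


def missing_telemetry_settings(services):
    """Return (service, variable, expected, actual) for each absent or wrong setting.

    `services` maps a service name to its environment, e.g. as parsed from
    the `environment:` sections of a Compose file.
    """
    findings = []
    for service, required in REQUIRED_SETTINGS.items():
        env = services.get(service, {})
        for name, expected in required.items():
            actual = env.get(name)
            if actual != expected:
                findings.append((service, name, expected, actual))
    return findings
```

An empty list means every listed switch is set as expected. It does not prove the stack is quiet; only the firewall log does that.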

NTP, DNS, log shipping

These are infrastructure rather than application concerns, but they break in the same ways. NTP defaults to *.pool.ntp.org; the bank runs its own NTP and expects the application servers to use it. DNS defaults to whatever the host has configured; on a no-egress segment that should resolve internal names only, but containers occasionally ship with 8.8.8.8 baked into a config file. Log shipping defaults to wherever the SaaS log vendor lives.

The audit step here is grep-based, not behavioural. Search every config file for pool.ntp.org, 8.8.8.8, 1.1.1.1, *.googleapis.com, *.cloudflare.com, *.amazonaws.com, *.azure.com, and the names of any vendor your stack might have inherited from a tutorial.
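
As a rough illustration of that grep, here is a sketch in Python using only the standard library. The pattern list mirrors the names in the paragraph above; the function names scan_text and scan_tree are made up for this example, and a real audit would extend the patterns with the vendors specific to its stack.

```python
import re
from pathlib import Path

# Patterns taken from the list above. Extend with vendor names from your stack.
EXTERNAL_PATTERNS = [
    r"pool\.ntp\.org",
    r"\b8\.8\.8\.8\b",
    r"\b1\.1\.1\.1\b",
    r"[\w.-]*\.googleapis\.com",
    r"[\w.-]*\.cloudflare\.com",
    r"[\w.-]*\.amazonaws\.com",
    r"[\w.-]*\.azure\.com",
]
EXTERNAL_RE = re.compile("|".join(EXTERNAL_PATTERNS))


def scan_text(text):
    """Return (line_number, matched_string) for every external reference in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in EXTERNAL_RE.finditer(line):
            hits.append((number, match.group(0)))
    return hits


def scan_tree(root):
    """Scan every readable file under root; return (path, line_number, match) tuples."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the audit
        findings.extend((str(path), n, m) for n, m in scan_text(text))
    return findings
```

Expect false positives (a version string such as 1.1.1.1, a URL in a comment). That is acceptable for an audit: each hit gets a human decision, and the cost of a missed hit is much higher than the cost of a spurious one.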

Container image pulls

A no-egress deployment cannot pull from Docker Hub, GHCR, Quay, or the Confluent and Elastic registries. Every image referenced in the Compose stack (Postgres, Redis, Keycloak, Yente, Vault, Elastic, Kafka) has to be mirrored to the bank's internal registry first. This is the single most operationally painful step of an on-prem delivery. We ship a manifest of every image SHA the release uses, and the bank's container team mirrors them before deployment.
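
A manifest like that is easy to lint before it goes to the bank. The sketch below, in Python, flags image references that would pull from outside the internal registry or that are not pinned to a digest. The registry host registry.bank.internal is a placeholder, and the check is deliberately naive: it treats everything before the first slash as the registry host.

```python
INTERNAL_REGISTRY = "registry.bank.internal"  # hypothetical internal registry host


def image_findings(image_refs, registry=INTERNAL_REGISTRY):
    """Return (image_ref, problem) pairs for references that would pull from
    an external registry or that are not pinned to an immutable digest."""
    findings = []
    for ref in image_refs:
        # Naive split: "postgres:16" has no registry host and implies Docker Hub.
        host = ref.split("/", 1)[0]
        if host != registry:
            findings.append((ref, "not served from the internal registry"))
        if "@sha256:" not in ref:
            findings.append((ref, "not pinned by digest"))
    return findings
```

Run against the image: values of the Compose stack after mirroring, an empty result means every reference points at the mirror and names an exact digest rather than a mutable tag.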

The thing to verify after mirroring is that no image also pulls something at runtime, for example a Liquibase image that fetches JDBC drivers from Maven Central on first run, or an Elastic image that downloads a plugin on startup. If it does, the install fails silently after the network has been disconnected.

Build-time is not runtime

A lot of "air-gap" debates miss this distinction. The build can have internet. The runtime cannot. Anything fetched at build time (fonts, npm packages, Maven dependencies, container layers) ends up inside an artifact that ships to the bank. As long as the artifact is reproducible and the dependencies are pinned (lockfiles, hash verification), the runtime sees only what the build baked in.

The mistake is treating runtime fetches as if they were build fetches. A library that "downloads its data on first use" is fine for SaaS and broken for on-prem. The fix is to ship the data inside the artifact, not to retry the fetch with a longer timeout.

What air-gap does not protect against

This is the part the security team has to hear, even if procurement doesn't want to. Air-gap is a perimeter control. It does not substitute for:

  • Supply chain integrity at build time: malicious npm packages, poisoned Maven dependencies, a compromised CI runner. All of these enter the artifact before it crosses the boundary. The bank's CSO cares about SBOM, signed releases, and reproducible builds more than they care about the perimeter rule.
  • Removable media: USB-shaped attacks, vendor laptops, the consultant who "just needs to grab the logs" with their personal drive. The INSA / ICS guidance on air-gap myths spends most of its time on this.
  • Insider risk: nothing about the network architecture changes the access an authorised user has to the data.
  • Data exfiltration via legitimate channels: an analyst exporting a report and emailing it out is not an air-gap problem, it is a DLP problem.

The point of air-gap is to remove a class of attack and a class of data leak. It is not a complete posture, and pretending it is loses credibility with the security team faster than admitting the limits.

The audit we actually run

Three steps, in order, on a freshly-built release:

  1. Static grep. Every external domain pattern listed above against the entire repo, the entire build output, and every container image. Any hit is a finding.
  2. Network egress test. Bring up the full stack on a host with an outbound firewall in deny-and-log mode. Walk through every feature in the UI and watch the firewall logs. Any attempted outbound connection that isn't to a configured external service is a finding.
  3. Disconnected install. Disconnect the build host's network, re-run the install from artifacts, and verify the application starts cleanly and every primary feature works. The features that require external services (sanctions feeds, OIDC against an external IdP) should fail with a clear "this feature is disabled in this deployment mode" message, not with a network error.
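
Step 2 produces a firewall log that someone has to read. A small filter helps, sketched here in Python under a loud assumption: the log line shape ("<timestamp> <action> <host>:<port>") is invented for this example, and real appliances log differently, so the parsing would need adapting.

```python
# Assumed (hypothetical) log line shape: "<timestamp> <action> <dst_host>:<dst_port>"
# Real firewall logs differ; adapt the parsing to your appliance.


def egress_findings(log_lines, allowed_destinations):
    """Return sorted, de-duplicated (host, port) pairs for denied outbound
    attempts whose destination is not a configured external service."""
    findings = set()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not in the assumed shape; ignore
        _timestamp, action, destination = parts
        if action != "DENY":
            continue
        host, _, port = destination.rpartition(":")
        if not host or not port.isdigit():
            continue
        if (host, int(port)) not in allowed_destinations:
            findings.add((host, int(port)))
    return sorted(findings)
```

The allowed set holds only the destinations the deployment is explicitly configured to reach. Everything else that shows up is a finding, however harmless it looks.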

The first two are cheap. The third is the only one that catches the runtime fetches. We have caught real things on every release, in our own product, on the third step. That is the point of running it.
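
The "clear message, not a network error" requirement in step 3 comes down to checking the deployment mode before any socket is opened. A minimal sketch in Python follows; the feature names, the "no-egress" mode string, and the function names are all hypothetical and stand in for whatever the application actually uses.

```python
class FeatureDisabledError(RuntimeError):
    """Raised when a feature needs an external service that this deployment mode turns off."""


# Hypothetical feature names, for illustration only.
EXTERNAL_FEATURES = {"sanctions_feed", "external_oidc"}


def require_feature(feature, deployment_mode):
    """Fail fast, before any network call, if the feature is unavailable in this mode."""
    if deployment_mode == "no-egress" and feature in EXTERNAL_FEATURES:
        raise FeatureDisabledError(
            f"{feature}: this feature is disabled in this deployment mode ({deployment_mode})"
        )


def refresh_sanctions_feed(deployment_mode, fetch):
    """Call fetch() only when the deployment mode allows outbound access."""
    require_feature("sanctions_feed", deployment_mode)
    return fetch()
```

The design choice is that the gate runs first and the fetch is passed in, so a disabled feature never reaches the code that would time out. The user sees the reason immediately instead of a stack trace fifteen seconds later.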

Bottom line

Air-gap is not hard, but it is unforgiving. The costly leaks are not the obvious ones (Google Analytics in _app.tsx gets caught in review) but the inherited defaults: a build tool's font download, a TLS library's revocation check, a container image's first-run plugin fetch, an OIDC discovery URL on a public IdP. The fix in each case is the same: make the network call opt-in, fail loudly when it is configured but unreachable, and ship the disabled default.

#on-premise #air-gap #deployment #security #swiss