MSc Thesis Defence · University of Amsterdam

Stable-edge filtering for passive device-class classification in OT networks under operational change

Jonathan van den Heuvel · supervisors dr. Cyril Hsu & dr. Chrysa Papagianni
System & Network Engineering × KPMG Cyber · 7 July 2026

One-line framing: this is a reproducible negative result with a mechanism and a fix. Open with the problem, build to the contribution; do not bury the lede but do not over-claim it either.

The setting

Operators need to know what is on their OT network, without touching it

Operational-technology networks run power grids, water plants, factories
Asset visibility is a prerequisite for segmentation, patching, and IEC 62443 / NIS2
Active scanning can disturb the physical process, so it is risky in OT
Passive traffic analysis is the pragmatic default: classify devices from the traffic a tap already sees

This work is hosted by KPMG's OT-security practice, where operational traces are never a clean steady state: laptops connect and disconnect, scanners sweep, configurations drift. Good accuracy on a curated one-hour capture is not yet evidence that a classifier will behave when those things happen.

Motivation. Keep it short: passive classification matters for asset inventory and segmentation; the realistic question is robustness to operational change, not accuracy on a clean trace.

The idea under test

Real traffic mixes stable links with transient ones, so clean the graph first?

A passive classifier works on the observed communication graph
It mixes stable operational links (an HMI polling a PLC) with transient ones (engineering sessions, scans)
A natural preprocessing step: keep only the edges that persist in time, the stable-edge filter

Hypothesis: removing temporally non-persistent edges before classification makes classification more accurate, and more robust to operational change, than keeping them. A single, falsifiable hypothesis. This talk reports that it is falsified, and traces why.

State the hypothesis explicitly and as falsifiable. The whole talk is a clean falsification plus the mechanism behind it.

Research questions

Three questions operationalise the hypothesis

RQ1 · Does the filter improve classification on steady-state traffic?
RQ2 · Does it make classification of held-out hosts robust when their traffic changes?
RQ3 · Where does the classification signal, and the filter's effect, live?

Held-out hosts = devices whose traffic is never seen in training; one per class is reserved for testing (host-stratified inductive split). Define this if asked.

How devices are classified

A graph neural network reads each host and its neighbourhood

Classifier: GraphSAGE, an inductive graph neural network, so it generalises to hosts unseen in training
Each host has six features: in/out-degree, in/out-bytes, distinct source/destination ports
Message passing aggregates a host's neighbours' features with its own; two layers reach the 2-hop neighbourhood
A graph-free random forest on the same features is the graph-versus-no-graph reference
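
A minimal sketch of how the six per-host features could be computed from one window of flow records. The `Flow` layout, the host names, and the choice to count ports over every flow a host takes part in are assumptions for illustration, not the thesis code.

```python
from collections import defaultdict
from typing import NamedTuple

class Flow(NamedTuple):  # hypothetical record layout
    src: str
    dst: str
    src_port: int
    dst_port: int
    n_bytes: int

def host_features(flows):
    """Return {host: [in_deg, out_deg, in_bytes, out_bytes, n_src_ports, n_dst_ports]}."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    in_bytes, out_bytes = defaultdict(int), defaultdict(int)
    src_ports, dst_ports = defaultdict(set), defaultdict(set)
    for f in flows:
        out_nbrs[f.src].add(f.dst)
        in_nbrs[f.dst].add(f.src)
        out_bytes[f.src] += f.n_bytes
        in_bytes[f.dst] += f.n_bytes
        # Assumption: ports are counted over all flows touching the host.
        for h in (f.src, f.dst):
            src_ports[h].add(f.src_port)
            dst_ports[h].add(f.dst_port)
    hosts = sorted(set(in_nbrs) | set(out_nbrs))
    return {h: [len(in_nbrs[h]), len(out_nbrs[h]), in_bytes[h], out_bytes[h],
                len(src_ports[h]), len(dst_ports[h])] for h in hosts}

flows = [
    Flow("hmi-1", "plc-1", 40001, 502, 120),
    Flow("hmi-2", "plc-1", 40002, 502, 120),
    Flow("hmi-1", "plc-2", 40003, 502, 120),
]
features = host_features(flows)
print(features["plc-1"])  # [2, 0, 240, 0, 2, 1]
```

In this toy window the controller's whole vector comes from the polls it receives, which is the property the maintenance result later depends on.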

Prior passive OT classifiers take the observed graph as given, or prune edges that are rare in aggregate. None treats temporal persistence, whether an edge recurs across time windows, as the filter, and none tests robustness to operational change. That is the gap this thesis addresses.

Why GraphSAGE: inductive (held-out hosts), and the mean aggregator keeps the filter's effect interpretable. Keep this slide brief.
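
To illustrate why a mean aggregator keeps the filter's effect easy to read, here is a stripped-down sketch of one mean-aggregation step. It leaves out the learned weights and non-linearity of a real GraphSAGE layer, and the two-feature vectors and host names are hypothetical.

```python
def mean_aggregate(features, neighbours):
    """Concatenate each host's own vector with the mean of its neighbours' vectors."""
    out = {}
    for host, own in features.items():
        nbrs = neighbours.get(host, [])
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(own))]
        else:
            mean = [0.0] * len(own)  # no neighbours left: neighbourhood part is zero
        out[host] = list(own) + mean
    return out

features = {"plc-1": [2.0, 0.0], "hmi-1": [0.0, 1.0], "hmi-2": [0.0, 3.0]}
neighbours = {"plc-1": ["hmi-1", "hmi-2"], "hmi-1": ["plc-1"], "hmi-2": ["plc-1"]}

agg = mean_aggregate(features, neighbours)
isolated = mean_aggregate(features, {})  # every edge filtered away
print(agg["plc-1"])       # [2.0, 0.0, 0.0, 2.0]
print(isolated["plc-1"])  # [2.0, 0.0, 0.0, 0.0]
```

Removing an edge changes the neighbourhood mean in a directly traceable way, which is the interpretability argument made above.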

The testbed

A reproducible OT lab with edge-level ground truth

20 always-on hosts, five device classes, four per class
Modbus/TCP and S7 control traffic; HTTP and DNS on IT endpoints
One passive tap on a shared segment, no active scanning
Steady state plus four scripted change scenarios, with edge-level ground truth
Seeded, fingerprinted, released for reproduction

controller: PLC · Modbus/S7 · polled
supervisory: HMI · polls controllers
engineering: workstation · sessions
historian: periodic snapshots
IT endpoint: HTTP / DNS

scenarios: maintenance, onboarding, configuration drift, benign noise

If asked whether it is based on a paper: inspired by ICSSIM and MiniCPS (same testbed tradition, protocols, Purdue roles) but built custom so the scenario phases can be scripted; not a reproduction.

Filter + evaluation

Observation-window persistence, evaluated on hosts the model never saw

An edge is kept if it is present in at least a fraction θ of the windows of the captured trace: the realistic passive case, where a single tap cannot segment phases
Train on steady-state only; apply without retraining under change
Host-stratified inductive split: 3 train + 1 held-out host per class, redrawn each seed
10 lab × 10 model seeds; paired Wilcoxon within seed
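
The persistence rule in the first bullet can be sketched as follows. This is an illustrative reading of the rule, not the released implementation; the edge representation (a set of `(src, dst)` pairs per window), the host names, and θ = 0.8 are assumptions.

```python
from collections import Counter

def stable_edge_filter(windows, theta):
    """Keep an edge only if it is present in at least a fraction theta of all windows."""
    counts = Counter(edge for window in windows for edge in set(window))
    stable = {e for e, c in counts.items() if c / len(windows) >= theta}
    return [set(w) & stable for w in windows]

poll = ("hmi-1", "plc-1")   # recurring operational link
scan = ("laptop", "plc-1")  # one-off transient link
windows = [{poll} for _ in range(9)] + [{poll, scan}]

filtered = stable_edge_filter(windows, theta=0.8)
print(filtered[-1])  # {('hmi-1', 'plc-1')}
```

Presence is counted over the whole captured trace, matching the single-tap setting in which phases cannot be segmented.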

0.772 → 0.490 macro-F1 (training hosts → held-out hosts). The model learns, then hits a generalisation ceiling (a supervisory / engineering / historian three-way confusion). Five-class chance is 0.20.

The split is host-level, not window-level: a window-stratified split would just memorise hostnames. The 0.772 vs 0.490 shows the model learns and then hits an honest ceiling.
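
A small sketch of a host-stratified inductive split with one held-out host per class, redrawn per seed. The host naming scheme and the function name are hypothetical.

```python
import random

def host_stratified_split(hosts_by_class, seed):
    """Reserve one host per class for testing; the remaining hosts are used for training."""
    rng = random.Random(seed)
    train, held_out = [], []
    for cls in sorted(hosts_by_class):
        hosts = sorted(hosts_by_class[cls])
        test_host = rng.choice(hosts)
        held_out.append(test_host)
        train.extend(h for h in hosts if h != test_host)
    return train, held_out

classes = ["controller", "supervisory", "engineering", "historian", "it"]
hosts_by_class = {c: [f"{c}-{i}" for i in range(1, 5)] for c in classes}

train, held_out = host_stratified_split(hosts_by_class, seed=0)
print(len(train), len(held_out))  # 15 5
```

Because the split is over hosts rather than windows, no window of a held-out host can reach the training set.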

RQ1 · steady state

On stationary traffic the filter does nothing, and the graph adds nothing

The filter removes zero edges from a stationary phase, so RQ1 is a sanity check, not a comparison
A graph-free random forest is at least as accurate as GraphSAGE on held-out hosts (paired p = 0.037)
So the classification signal lives in the host features, which makes the maintenance finding classifier-independent

Held-out macro-F1, random forest vs GraphSAGE (10 seeds, paired)

Own the inert graph: it strengthens the thesis, because the main finding then holds for any classifier reading the features, not just a GNN. Use "no measurable benefit", not "RF is better".

RQ2 · robustness under change

No benefit anywhere, and a significant penalty under maintenance

Maintenance, held-out macro-F1: 0.45 → 0.36 · Δ −0.089 · p = 0.027 · worse in 8/10 seeds

Neutral on four of five scenarios; significantly harmful in the one containing a genuine equipment outage. The filter delivers no robustness benefit, and a real cost.

Δ held-out macro-F1 (filtered − baseline), per scenario

The headline. Four scenarios at zero, maintenance significantly negative. Do not claim per-class movements on the weak classes; they are within the noise floor.
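
For readers who want to see what a paired Wilcoxon test computes, here is a self-contained sketch of the exact two-sided signed-rank test by enumerating sign patterns. It is not the thesis analysis code, and the per-seed scores below are made-up integers (macro-F1 × 1000) used only to exercise the function.

```python
from itertools import product

def wilcoxon_signed_rank_exact(baseline, filtered):
    """Exact two-sided Wilcoxon signed-rank test on paired scores; zero differences are dropped."""
    diffs = [f - b for b, f in zip(baseline, filtered) if f != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank within a tie group
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    w_min = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_min:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)

# Hypothetical paired scores, one pair per seed (not the thesis data).
baseline = [450, 440, 470, 430, 460, 455, 445, 465, 435, 452]
filtered = [360, 350, 400, 340, 330, 380, 300, 370, 440, 460]
w_plus, p_value = wilcoxon_signed_rank_exact(baseline, filtered)
print(w_plus, p_value)
```

Pairing within seed removes the between-seed variance, so the test looks only at the sign and rank of each seed's difference.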

Why maintenance breaks it

A paused controller, stripped of its polls, looks like an idle IT endpoint

1. During the 40-min outage the polls fall to 22/30 = 73% < θ
2. The filter cuts them from every window: in-degree 20→0, in-bytes 2.1M→0
3. The feature vector collapses, and is misread as an idle IT endpoint

plc-1 inbound polls across one maintenance run (30 windows)

The mechanism is exact, not statistical. The all-zero collapse is exact for the paused controller's outage windows; pooled controller F1 falls to 0.82, not to 0. Confusion matrix is in the backup.
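
The three steps above can be replayed as arithmetic. The sketch assumes the 5-minute windows used in the θ sweep (consistent with the 22/30 figure), that polls are absent from every outage window, and a hypothetical θ = 0.8; any θ above 22/30 gives the same outcome.

```python
WINDOW_MIN = 5   # window length in minutes, as in the theta sweep
N_WINDOWS = 30   # windows in one maintenance run
OUTAGE_MIN = 40  # length of the controller outage
THETA = 0.8      # hypothetical threshold, not taken from the thesis

outage_windows = OUTAGE_MIN // WINDOW_MIN  # windows with no polls
present = N_WINDOWS - outage_windows       # windows in which polls are seen
presence = present / N_WINDOWS
poll_kept = presence >= THETA

# Inbound features of the paused controller, values from the slide.
before = {"in_degree": 20, "in_bytes": 2_100_000}
after = before if poll_kept else {name: 0 for name in before}

print(outage_windows, present, round(presence, 2), poll_kept)  # 8 22 0.73 False
print(after)  # {'in_degree': 0, 'in_bytes': 0}
```

The edge fails the threshold once, on the whole trace, and is then removed from all 30 windows, including the 22 in which the controller was polled normally.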

Is it real?

Controls isolate the cause

Random, same count removed → harmless. So it is not that removing edges hurts.
Byte-volume → also harmful. A second content-blind proxy strips the low-volume polls.
Phase-local (the idealised filter) → removes nothing. The penalty comes from measuring persistence over the whole observation window.

Maintenance Δ macro-F1, same count removed per window, four selection rules

The rigour highlight. The harm is about which edges are removed, not how many, and it is shared by content-agnostic proxies that key on low volume.
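
A sketch of the count-matched random control: remove the same number of edges per window as the filter under test, chosen uniformly at random. The function name and the toy edges are hypothetical.

```python
import random

def count_matched_random_removal(windows, n_removed, seed):
    """Drop n_removed[i] edges from window i uniformly at random."""
    rng = random.Random(seed)
    out = []
    for window, k in zip(windows, n_removed):
        edges = sorted(window)  # fixed order so a given seed is reproducible
        dropped = set(rng.sample(edges, min(k, len(edges))))
        out.append(set(edges) - dropped)
    return out

windows = [{("a", "b"), ("a", "c"), ("b", "c")}, {("a", "b")}]
n_removed = [2, 1]  # counts that the filter under test removed from each window

control = count_matched_random_removal(windows, n_removed, seed=0)
print([len(w) for w in control])  # [1, 0]
```

Edge density at test time is matched, so any difference from the persistence filter is due to which edges are removed.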

Ruling out confounds

Not distribution shift; concentrated on the paused host

Train on sparsified graphs? · penalty persists, Δ −0.126 · so it is feature destruction, not a dense-train / sparse-test shift
Leave-one-controller-out · plc-1 (paused) −0.229 vs others ~−0.037 · a clean per-controller estimate; the effect tracks the paused host
Classifier-independent · random forest shows the same penalty, −0.104 · not an artefact of graph aggregation

B1 addresses the distribution-shift objection but does not fully close it (a direct retrain-on-filtered control would settle it, and is future work). B5 replaces a weak n=3 dose argument with clean folds.

RQ3 · where the signal lives

Node-local features carry the signal, and take the damage

Node-local features alone (bytes, ports) generalise as well as the full set (0.505 vs 0.490)
The neighbourhood degree features raise the training fit but not held-out generalisation
The maintenance penalty appears under both feature subsets

Filtering the polls into the paused controller zeroes its degree features and its byte and port features at once, because in this lab a controller's entire observable footprint is the polls it receives. The harm is feature destruction, which is why a classifier with no graph aggregation suffers it too.

RQ3 ties the mechanism to the features: the damage is to the host features, which is the same reason the graph adds nothing on top of them.

The contribution

Content-agnostic edge filtering is fragile for passive OT classification.

A controller's class-defining inbound polls are both low-volume and event-sensitive, so multiple natural proxies (observation-window persistence, byte-volume) preferentially strip them; a count-matched random removal does not, but is useless as a denoiser. A useful filter must be content / semantics-aware.

The reframed central claim, conditioned on the mechanism. Avoid the bare "temporal persistence is the wrong abstraction"; the conditioned version is what the controls support.

From negative result to design principle

A content-aware filter removes the failure mode

Keep any edge to a control-protocol port (Modbus, S7) regardless of persistence; apply the persistence test only to the rest
Under maintenance it removes no edges, so the penalty disappears: Δ +0.000 vs −0.089
It still prunes the benign-noise scanner edges, so it is selective, not disabled

−0.089 → +0.000 maintenance Δ (content-agnostic → content-aware). Safe where the content-agnostic filter is harmful, while keeping its pruning behaviour elsewhere. The principled successor is a learned edge filter on protocol, direction, rate, and endpoint roles.

Honest framing: the content-aware filter removes a failure mode, it does not raise accuracy (the graph is inert at this scale). The static port allow-list is the crudest form of content-awareness; the learned version is future work.
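
A minimal sketch of the static port allow-list described above. Edges are assumed to be `(src, dst, dst_port)` triples; ports 502 (Modbus/TCP) and 102 (S7 over ISO-TSAP) are the standard assignments, while the host names, the scanner port, and θ = 0.8 are hypothetical.

```python
from collections import Counter

CONTROL_PORTS = {502, 102}  # Modbus/TCP and S7 (ISO-TSAP)

def content_aware_filter(windows, theta):
    """Control-port edges are always kept; the persistence test applies to the rest."""
    counts = Counter(edge for window in windows for edge in set(window))
    keep = {e for e, c in counts.items()
            if e[2] in CONTROL_PORTS or c / len(windows) >= theta}
    return [set(w) & keep for w in windows]

poll = ("hmi-1", "plc-1", 502)    # present in 22 of 30 windows
scan = ("scanner", "plc-1", 8080) # present in 2 of 30 windows
windows = [{poll, scan}] * 2 + [{poll}] * 20 + [set()] * 8

filtered = content_aware_filter(windows, theta=0.8)
print(sum(poll in w for w in filtered))  # 22
print(sum(scan in w for w in filtered))  # 0
```

The poll survives despite its 73% presence, while the scanner edge is still pruned, which is the "selective, not disabled" behaviour.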

What it claims, and what it doesn't

The magnitude is specific to this lab's near-bipartite poll topology
In a field plant a paused PLC keeps peer and historian traffic, so the collapse would be partial
One observation point, five device classes, a fixed window length

What is expected to transfer

The mechanism: content-agnostic proxies strip low-volume, event-sensitive class-defining edges
Supported by byte-volume reproducing the harm through a different proxy
Field validation on a real trace is the primary next step

Be candid: concede the magnitude is lab-specific, defend only the qualitative mechanism. This is the strongest external-validity attack; meet it head-on.

Contributions

A reproducible OT lab with five device classes, four operational-change scenarios, and edge-level ground truth, released with all code
The finding, with controls, that content-agnostic edge filtering is fragile, established inductively over 10 × 10 seeds and classifier-independent
A content-aware remedy that removes the failure mode and turns the negative result into a design principle

Future work

A learned content-aware edge filter, using the lab's edge-level labels
Validation on a real, NDA-constrained OT trace
A window-length sweep; generalisation to unseen classes and topologies

Three contributions: the artefact, the controlled negative finding, the constructive remedy. A negative result with a mechanism and a fix is a stronger contribution than a marginal positive.

In one line

A content-agnostic filter strips the edges that define a controller; a content-aware one does not.

Jonathan van den Heuvel · University of Amsterdam · 2026
Thank you. Questions?

Close on the design principle. The backup slides follow for Q&A: graph-utility, distribution-shift, scope, the control table, the confusion matrix, and the θ sweep.

Backup

Backup slides, for questions.

Graph utility · distribution shift · external validity · the control table · the confusion matrix · the θ sweep.

Reachable with the number keys or by advancing past the close. Use during Q&A.

Backup · graph utility

"Why a graph thesis if the graph adds nothing?"

RF 0.512 vs GraphSAGE 0.490 on held-out hosts (paired p = 0.037)
Treat it as a result: the graph adds no accuracy at this scale (20 hosts, 4 per class, 6 features)
The maintenance penalty reproduces in the random forest, so it is feature destruction, not a message-passing artefact
That makes the finding hold for any classifier reading the features, which is more robust, not less

Held-out macro-F1, random forest vs GraphSAGE

Do not defend the graph. Owning that it is inert reads as honesty and broadens the result.

Backup · distribution shift

"Isn't it just train-dense / test-sparse shift?"

The filter removes nothing from steady state, so the filtered model trains on dense graphs and tests on sparse ones
Control: train on randomly-sparsified steady-state graphs, so the model has seen sparse neighbourhoods, then test on filtered maintenance
The penalty persists, Δ −0.126, so it is not pure covariate shift
A direct retrain-on-filtered control would settle it completely, and is future work (stated honestly)

The count-matched random removal also argues against pure sparsity: equal test-time edge density, but no harm. Two controls point the same way; one is still left open, and the talk does not over-claim closure.

This is the cleanest alternative explanation and it is only half-closed. Say so. "I don't claim it's fully settled; here is the control that will settle it" beats over-claiming.

Backup · external validity

"n = 1 lab, n = 1 paused controller, isn't the failure engineered?"

The magnitude is manufactured by the near-bipartite topology, conceded
In a real plant a paused PLC keeps peer, historian, and management traffic, so the collapse is partial and the penalty smaller
The mechanism is the claim, and a different proxy (byte-volume) reproduces it
The bipartite simplification is a stated scope condition on the headline, not a hidden assumption

The split between magnitude (lab-specific, concede it) and mechanism (general, defend it) is the move for every external-validity question. Field validation on a real trace is the primary future work.

Concede the magnitude, defend the mechanism. Be most humble here; this is the hardest honest question.

Backup · the numbers

Maintenance Δ by selection rule

All 10 lab × 10 model seeds, paired Wilcoxon. Same count of edges removed per window; only rules that target the low-volume polls are harmful.

stable-edge (persistence) · −0.089 · p = 0.027 · worse 8/10 · the thesis filter
byte-volume · −0.060 · p = 0.004 · worse 9/10 · also harmful
random count-matched · +0.051 · p = 0.16 · harmless, rules out "any removal"
phase-local · 0.000 · removes nothing by construction · the penalty is the window
content-aware · +0.000 · the fix: keeps the polls, still prunes scanners

Have these cold. The contrast between random (harmless) and persistence / byte-volume (harmful) is the heart of the controls.

Backup · the misclassification

Filtering shifts the paused controller toward "IT endpoint"

Controller recall (the diagonal) falls 0.92 → 0.70
0.30 of controller windows leak to the IT-endpoint class (the ctrl→it cell rises 0.08 → 0.30)
High precision keeps controller F1 at 0.82, not 0: only the paused host's windows flip

maintenance confusion matrices, baseline vs filtered

Reconciles the confusion recall (0.70) with the per-class F1 (0.82): 0.30 is the leak, not the recall. Only plc-1's windows flip, so pooled F1 stays high.
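
The reconciliation can be checked by solving the F1 formula for precision. The precision below is implied by the rounded recall and F1 on the slide and is therefore approximate; it is not a figure reported in the thesis.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_precision(f1_score, recall):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1_score * recall / (2 * recall - f1_score)

recall = 0.70  # controller recall after filtering
score = 0.82   # pooled controller F1 after filtering

precision = implied_precision(score, recall)
print(round(precision, 2))  # 0.99
```

A recall of 0.70 with precision near 0.99 gives an F1 of about 0.82, so the two numbers on the slide are consistent.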

Backup · operating point

The penalty holds at every threshold that removes edges

θ swept from 0.3 to 0.9, window length fixed at 5 minutes
Harmful at every θ that removes anything; beneficial at none
So the result is not an artefact of the default operating point

Maintenance macro-F1 versus presence threshold θ

The θ sweep pre-empts "did you just pick a bad threshold?". Varying the window length W is future work.
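
A sketch of which thresholds in the sweep remove the paused controller's polls, given their 22/30 presence. The 0.1 grid step is an assumption; the slide states only the range 0.3 to 0.9.

```python
def presence_fraction(windows, edge):
    """Fraction of windows in which the edge is observed."""
    return sum(edge in w for w in windows) / len(windows)

poll = ("hmi-1", "plc-1")                  # hypothetical host names
windows = [{poll}] * 22 + [set()] * 8      # polls missing from 8 outage windows

thetas = [i / 10 for i in range(3, 10)]    # 0.3 ... 0.9, assumed step of 0.1
removed_at = [t for t in thetas if presence_fraction(windows, poll) < t]
print(removed_at)  # [0.8, 0.9]
```

On this grid the polls survive up to θ = 0.7 and are stripped above it; the sketch says nothing about other edges that lower thresholds may remove.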