// RESEARCH STATUS — COMPLETE · rev 2026-06-10
// Why this is in a security portfolio — The imagery is satellite land-cover; the
transferable asset is the detection methodology. Engineering for class imbalance, optimising
for minority-class recall, validating that a model learns genuine structure rather than
spurious correlation, and weighing deep against classical learners under identical splits are
exactly the disciplines that govern intrusion detection, malware classification, and anomaly
detection. A production IDS where 0.01% of traffic is malicious must not score 99.99% accuracy
by calling everything benign — the imbalance problem solved here is the imbalance problem in
the SOC. The dataset is the vehicle; the method is the point.
Key Results
| Metric | Value |
|---|---|
| ResNet-18 Accuracy | 99.11% |
| TerraCNN Accuracy | 93.97% |
| TerraCNN F1 Macro | 0.9390 |
| Latent Space ARI | 0.6478 |
// Metric Choice — Primary evaluation metric is macro-averaged F₁ — equal weight to all
four classes regardless of frequency. This mirrors the threat-detection requirement where rare
events (attacks) must not be drowned by majority-class accuracy. A model that labels
everything benign achieves high accuracy but zero security value.
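A minimal sketch of the metric argument above, using only the standard library. The traffic sample (1 malicious flow in 10,000) is hypothetical, and the helper names are illustrative rather than taken from the project code.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class; returns 0.0 when the class has no true or predicted members."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical traffic sample: 1 malicious flow in 10,000 (0.01%).
y_true = ["benign"] * 9_999 + ["malicious"]
y_pred = ["benign"] * 10_000  # degenerate "everything is benign" model

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

On this sample the degenerate model scores 0.9999 accuracy but a macro F1 of about 0.50, because the malicious class contributes an F1 of zero and carries the same weight as the benign class.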
Dataset — RSI-CB256
| Class | Samples | Characteristics |
|---|---|---|
| Forest | 1,500 | High saturation variance; green-dominant |
| Water | 1,500 | High brightness; spectrally distinct |
| Cloudy | 1,500 | Low saturation; significant illumination spread |
| Desert | 1,131 | Low saturation; brown-dominant; under-represented |
Total: 5,631 images. Desert has roughly 25% fewer samples than each of the other classes (1,131 vs. 1,500).
HSV saturation means span 0.12 to 0.35 — photometric diversity that rules out simple
colour-histogram classifiers. Data split: stratified 70/20/10 (train 3,941 / val 1,126 /
test 563), identical across all models.
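The dataset arithmetic above can be checked in a few lines. This is a sketch that assumes the split sizes were obtained by truncating each percentage of the 5,631-image total. Under that assumption the three quoted sizes are reproduced exactly and sum to 5,630, leaving one image whose split is not stated in the source.

```python
CLASS_COUNTS = {"Forest": 1500, "Water": 1500, "Cloudy": 1500, "Desert": 1131}
SPLIT_PERCENT = {"train": 70, "val": 20, "test": 10}

total = sum(CLASS_COUNTS.values())

# Integer arithmetic avoids floating-point surprises; // truncates toward zero here.
split_sizes = {name: total * pct // 100 for name, pct in SPLIT_PERCENT.items()}
unassigned = total - sum(split_sizes.values())

# Fraction by which Desert falls short of a 1,500-sample class.
desert_deficit = 1 - CLASS_COUNTS["Desert"] / CLASS_COUNTS["Forest"]

print(total)                    # 5631
print(split_sizes)              # {'train': 3941, 'val': 1126, 'test': 563}
print(unassigned)               # 1
print(f"{desert_deficit:.1%}")  # 24.6%
```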
TerraCNN Architecture
Input: 128 × 128 × 3 (RGB, ImageNet-normalised)
Conv Block 1: 32 filters, 3×3 kernel → ReLU → BatchNorm → MaxPool 2×2
Conv Block 2: 64 filters, 3×3 kernel → ReLU → BatchNorm → MaxPool 2×2
Conv Block 3: 128 filters, 3×3 kernel → ReLU → BatchNorm → MaxPool 2×2
Global Average Pooling
Dense: 256 units → ReLU → Dropout (p = 0.50)
Output: 4 units → Softmax
Regularisation: dropout 0.50, L₂ weight decay λ = 1e-4
Optimiser: Adam β1=0.9, β2=0.999 · LR: 1e-3 → 2.5e-4 by epoch 70
Imbalance: class-weighted cross-entropy + inverse-frequency sampler
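The layer listing above can be traced for tensor shapes and parameter counts without a deep-learning framework. The sketch assumes details the listing does not state: stride-1 convolutions with "same" padding, bias terms on every convolution and dense layer, and two learnable parameters (scale and shift) per BatchNorm channel. The resulting total is illustrative under those assumptions, not a reported figure.

```python
def conv_block(h, w, c_in, c_out, k=3):
    """Conv (same padding, stride 1) -> ReLU -> BatchNorm -> MaxPool 2x2.

    Returns the output shape (h, w, channels) and the learnable parameter count.
    """
    conv_params = k * k * c_in * c_out + c_out  # kernel weights + biases
    bn_params = 2 * c_out                       # scale (gamma) + shift (beta)
    return (h // 2, w // 2, c_out), conv_params + bn_params


def dense(n_in, n_out):
    """Parameter count of a fully connected layer with bias."""
    return n_in * n_out + n_out


shape = (128, 128, 3)
total_params = 0
for filters in (32, 64, 128):
    shape, n = conv_block(*shape, filters)
    total_params += n
    print(f"conv block -> {shape}, {n:,} params")

latent_dim = shape[2]  # global average pooling keeps one value per channel
total_params += dense(latent_dim, 256) + dense(256, 4)

print(f"GAP output: {latent_dim}-D")
print(f"total learnable parameters: {total_params:,}")
```

Under these assumptions the feature map shrinks from 128×128 to 16×16, global average pooling yields a 128-D vector, and the network has 127,748 learnable parameters.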
The augmentation policy was motivated by EDA: per-class HSV saturation means differ by up to
0.23. Colour jitter (brightness = contrast = saturation = 0.3) combined with ±20° rotation
discourages the model from keying on class-specific photometric artefacts instead of spatial
structure.
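The imbalance handling listed in the training configuration (class-weighted cross-entropy plus an inverse-frequency sampler) can be illustrated from the class counts alone. The source does not give the exact weights, so the "balanced" formula w_c = N / (K · n_c) used here is an assumption. The counts are for the full dataset, whereas a real pipeline would normally use the training split only.

```python
CLASS_COUNTS = {"Forest": 1500, "Water": 1500, "Cloudy": 1500, "Desert": 1131}

n_total = sum(CLASS_COUNTS.values())
n_classes = len(CLASS_COUNTS)

# Assumed "balanced" loss weights: w_c = N / (K * n_c).
# Majority classes land just below 1, the under-represented class above 1.
loss_weights = {c: n_total / (n_classes * n) for c, n in CLASS_COUNTS.items()}

# Inverse-frequency sampling: each image of class c is drawn with weight 1 / n_c,
# so every class carries the same total probability of appearing in a mini-batch.
per_image_weight = {c: 1.0 / n for c, n in CLASS_COUNTS.items()}
norm = sum(per_image_weight[c] * n for c, n in CLASS_COUNTS.items())
class_draw_prob = {c: per_image_weight[c] * n / norm for c, n in CLASS_COUNTS.items()}

for c in CLASS_COUNTS:
    print(f"{c:7s} loss weight {loss_weights[c]:.4f}  draw probability {class_draw_prob[c]:.2f}")
```

With these counts Desert receives a loss weight of about 1.24 against roughly 0.94 for the other classes, and the sampler draws each class with probability 0.25. Using both mechanisms together compensates for the imbalance twice, which pushes further toward minority-class recall.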
Model Comparison
| Model | Architecture | Test Accuracy | F₁_macro |
|---|---|---|---|
| ResNet-18 (transfer) | Pretrained 18-layer residual CNN, fine-tuned on RSI-CB256 | 99.11% | 0.9916 |
| TerraCNN (scratch) | Custom 3-stage CNN (32→64→128 filters) + GAP + 256-unit dense | 93.97% | 0.9390 |
| Random Forest | 100-tree ensemble, class-balanced, 49,152-D pixel vectors | — | ~0.94 |
| SVM (RBF) | RBF-kernel SVM, 49,152-D pixel vectors, grid search C×γ | — | ~0.90 |
ResNet-18 confirms ImageNet-pretrained spatial features transfer effectively to constrained
four-class remote-sensing tasks. TerraCNN reaches 0.9390 F₁_macro from scratch — within a point
of the Random Forest baseline (~0.94) and roughly five points behind the pretrained network,
the expected gap for a small custom network trained on ~3,900 images without pretraining.
All models struggle most with Water–Forest confusion (spectral overlap in low-illumination
scenes). ResNet-18 reduces this to 3.6% error through learned spatial context; SVM reaches
17.4%, demonstrating the limitation of raw pixel vectors.
Latent Space Analysis
Penultimate layer activations (128-dimensional) extracted from TerraCNN on the test split. PCA
reduced to 200 components (retaining 96.0% of variance) followed by t-SNE (perplexity=30) for
two-dimensional projection.
ARI = 0.6478, indicating substantial though imperfect agreement between TerraCNN's internal
cluster structure and the true class labels. Four coherent clusters emerge in the projection,
with residual Water–Cloudy overlap accounting for the gap from the theoretical maximum of 1.0.
This is evidence that the CNN is learning genuine discriminative features rather than spurious
correlations.
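For reference, the Adjusted Rand Index can be computed directly from the contingency table of true labels against cluster assignments. The sketch below uses only the standard library. The source does not say how cluster assignments were obtained from the latent vectors, so the cluster ids here are hypothetical and the example illustrates the metric only, not the project pipeline.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index via pair counting; 1.0 means identical partitions."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


truth = ["Forest", "Forest", "Water", "Water", "Cloudy", "Cloudy", "Desert", "Desert"]

# Hypothetical cluster ids: cluster numbering is arbitrary, only the grouping matters.
perfect = [2, 2, 0, 0, 3, 3, 1, 1]
one_water_in_cloudy = [2, 2, 0, 3, 3, 3, 1, 1]

print(adjusted_rand_index(truth, perfect))
print(round(adjusted_rand_index(truth, one_water_in_cloudy), 4))
```

Because ARI is corrected for chance, a random assignment scores near 0 and a relabelled but otherwise identical partition scores exactly 1.0, which is why it suits comparing unsupervised cluster ids against class names.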
Security Relevance
Why classifying satellite imagery is a security problem, not just computer vision. The most
direct framing: this is automated intelligence-analysis triage. GEOINT/IMINT analysts
receive far more overhead imagery than any team can read; the bottleneck is deciding which
scenes deserve a human look. Automated land-cover classification is that triage layer —
routing scenes, flagging change, and aiming scarce analyst attention. A model that sorts
forest/water/cloud/desert is a toy; the reproducible, imbalance-aware, latent-validated
method is what sorts "normal" from "anomalous" across an ISR feed.
// Two further security framings — Critical-infrastructure monitoring — the same
classifier, retrained on the right classes, drives automated change-detection around fixed
sites (ports, substations, borders); the imbalance discipline is what stops the rare,
important scene being averaged away. Adversarial robustness in contested environments —
any vision model fielded for ISR is a target, so understanding how it fails (and proving its
representations are genuine, not spurious — the ARI = 0.6478 check) is itself a security
discipline. See below.
Beneath the domain framing, the techniques map directly to detection pipelines:
| Technique | Detection-pipeline application |
|---|---|
| Macro F₁ primary metric | Rare-class recall — attacks are the minority class in any production IDS. High accuracy by labelling everything benign is not acceptable. |
| Class-weighted loss + inverse-frequency sampler | Prevents benign-class dominance in intrusion detection. Standard practice in applied threat detection pipelines. |
| SVM with RBF kernel | Classical anomaly scoring; used in network intrusion detection benchmarks (NSL-KDD, CICIDS). |
| Random Forest | Feature importance ranking in malware classification; explainability for SOC analysts reviewing flagged samples. |
| CNN spatial feature extraction | Network traffic image encoding; binary visualisation for malware classification at scale. |
| Latent space ARI analysis | Cluster validation in unsupervised threat grouping — verifying the model learns real structure. |
Transfer Learning — the 5.14-point gap
The two deep models tell a representation-learning story. ResNet-18 (ImageNet-pretrained,
fine-tuned) reached 99.11% accuracy / 0.9916 F₁; TerraCNN (trained from scratch on
~3,900 tiles) reached 93.97% / 0.9390. The pretrained network wins by +5.14 percentage
points (99.11 − 93.97). That gap is attributed here less to architecture than to representation transfer:
ResNet-18 arrives already knowing edges, textures and shapes from a million photographs and
only re-aims that vocabulary; TerraCNN must invent the entire visual vocabulary from a few
thousand satellite tiles. The gap is the measurable price of not having pretraining.
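The gap can be restated as expected error counts. This is a small worked example on a hypothetical batch of 10,000 scenes, and it assumes the batch has the same class mix and difficulty as the test split, which a real feed may not.

```python
BATCH_SIZE = 10_000  # hypothetical batch size
ACCURACY = {"ResNet-18": 0.9911, "TerraCNN": 0.9397}  # test accuracies quoted above

# Expected misclassifications per batch for each model.
expected_errors = {name: round(BATCH_SIZE * (1 - acc)) for name, acc in ACCURACY.items()}

gap_pp = round((ACCURACY["ResNet-18"] - ACCURACY["TerraCNN"]) * 100, 2)
extra_errors = expected_errors["TerraCNN"] - expected_errors["ResNet-18"]

print(expected_errors)  # {'ResNet-18': 89, 'TerraCNN': 603}
print(gap_pp)           # 5.14
print(extra_errors)     # 514
```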
// What +5.14pp means in deployment — and the trade-off — On a triage queue of 10,000
scenes, 5.14pp is roughly 514 fewer misclassifications per batch — fewer wasted analyst
hours, fewer scenes wrongly waved through. But the pretrained model is a third-party
artefact: opaque 512-D features, inherited ImageNet biases, and a supply-chain dependency
on weights you did not train. TerraCNN is ~5 points weaker yet fully owned and auditable —
a compact 128-D latent (its ARI = 0.6478 is that audit), trainable on-prem on classified data
with no external dependency. In a cleared environment the auditable from-scratch model can be
the correct choice despite −5.14pp: provenance and inspectability are security properties.
"Most accurate" ≠ "most deployable in a contested setting."
Adversarial Robustness — fooling the classifier
A 99.11% figure measures performance on clean, honest data. It says almost nothing about an
adversary actively trying to deceive the model — the case that matters for anything fielded in
a contested environment. What an attacker would need to do:
- Adversarial perturbation (cheapest). Both models classify on RGB statistics and are
  differentiable, so an attacker with model (or surrogate) gradients can compute a minimal,
  near-imperceptible pixel change (FGSM/PGD) that flips the class with high confidence; no
  physical access to the scene is required.
- Exploit the known weak boundary. Every model's dominant confusion is Water ↔ Forest
  in low illumination (SVM 17.4%, even the best CNN 3.6%). An adversary who understands the
  model operates in exactly that ambiguity; no digital tampering is needed.
- Physical-world deception. Camouflage, decoys and terrain alteration are perturbations
  applied to the scene rather than the file: the physical analogue of an adversarial example,
  historically effective against human and machine analysts alike.
- Data poisoning. Influence the training set (mislabelled / trojaned tiles) and you install
  a blind spot before deployment: a supply-chain attack on the model itself.
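The first attack in the list can be sketched on a toy model. The example below applies one FGSM step to a logistic-regression stand-in whose input gradient has a closed form. It is not the project's CNN, where the same gradient would come from backpropagation. The weights, the six-value "image" and the class names attached to the two outputs are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """P(class = 1 | x) for a logistic-regression stand-in model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = clip(x + eps * sign(dLoss/dx)), values kept in [0, 1].

    For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]


# Hypothetical weights and input; class 1 = "Water", class 0 = "Forest".
w = [1.5, -2.0, 1.0, -1.0, 2.0, -1.5]
b = 0.0
x = [0.55, 0.45, 0.50, 0.50, 0.52, 0.48]
y = 1  # true label

x_adv = fgsm(w, b, x, y, eps=0.05)

print(f"clean       P(Water) = {predict_proba(w, b, x):.3f}")
print(f"adversarial P(Water) = {predict_proba(w, b, x_adv):.3f}")
print(f"max change per value = {max(abs(a - c) for a, c in zip(x_adv, x)):.2f}")
```

A change of at most 0.05 per input value moves the prediction across the 0.5 decision threshold, because every value is nudged in the direction that increases the loss.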
// The limit of ML-based security — High benchmark accuracy is not robustness. A
99.11% classifier can be driven toward near-100% error by an adversary with gradient
access — accuracy and adversarial robustness are different properties against different
threat models. Clean-data benchmarks must be paired with adversarial evaluation;
contested-use models need defence-in-depth and a human in the loop; and a model's
known confusion structure is also its attack surface. This is the vision-model counterpart of
the mirage argument: a detector that keys on surface features is one an adversary can learn to
evade. Robustness is designed and tested for; it does not come free with accuracy.
Skills Demonstrated
| Skill | Evidence |
|---|---|
| Deep Learning | Custom CNN architecture design; PyTorch training pipeline; Adam with scheduling; early stopping. |
| Classical ML | SVM grid search; Random Forest with OOB validation; five-fold cross-validation; scikit-learn. |
| Imbalance Handling | Class-weighted cross-entropy; inverse-frequency mini-batch sampling; EDA-driven augmentation design. |
| Evaluation Methodology | Macro F₁; stratified splits identical across all models; controlled comparability; fixed seeds. |
| Latent Space Analysis | PCA (200 components, 96% variance); t-SNE projection; Adjusted Rand Index for cluster validation. |
| Security-Relevant Framing | Imbalance techniques directly applicable to IDS and malware classification; GEOINT/IMINT triage framing. |
| Transfer-Learning Analysis | Quantified the +5.14pp pretraining gap; interpretability/provenance trade-off for cleared deployment. |
| Adversarial ML Thinking | Threat-modelled the classifier (FGSM/PGD, poisoning, physical deception); "accuracy ≠ robustness". |
Repository
// GitHub — Full methodology, dataset description, architecture documentation, and
research references: github.com/rootdrifter/oracle —
one repository in the github.com/rootdrifter portfolio.