What is Antifragility in Training?
Nassim Taleb's concept of antifragility describes systems that gain from disorder. Most ML methods are fragile - inject noise, performance drops. Some are robust - inject noise, performance stays flat. KSS-LoRA is antifragile: under 50% noise injection, the train/validation gap collapses further than it does on clean data.
At 50% noise, the standard model's gap is 0.309 and KSS-LoRA's is 0.016. That's 94.7% smaller, and KSS-LoRA's gap narrows as noise increases: from 0.092 on clean data to 0.016 at 50% noise, with a plateau between 20% and 30%. Best single run: gap = 0.0087. Near-zero overfitting on half-corrupted data.
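A minimal sketch of what "50% noise" could look like in practice, assuming the noise is label corruption on a classification-style dataset. The function name `inject_label_noise` and the rule of flipping each chosen label to a different class are illustrative assumptions, not the procedure used in the experiments.

```python
import random

def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` replaced by a wrong class.

    Hypothetical helper: the corruption procedure behind the reported
    numbers is not specified here, so this is only one plausible reading.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = round(noise_rate * len(noisy))
    # Pick exactly n_corrupt distinct positions to corrupt.
    for i in rng.sample(range(len(noisy)), n_corrupt):
        # Draw from the other classes so every corrupted label is actually wrong.
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

labels = [i % 4 for i in range(1000)]
noisy = inject_label_noise(labels, num_classes=4, noise_rate=0.5)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed / len(labels))  # 0.5
```

Corrupting a fixed count of positions, rather than flipping each label independently with probability 0.5, keeps the noise level exact across seeds, which makes runs at the same noise percentage directly comparable.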
Why This Happens
Standard training memorises whatever patterns it finds - clean or noisy. Add corrupted labels and it memorises the corruptions too. KSS-LoRA's stochastic gradient masking is built to make memorisation structurally hard: the update mechanism only reinforces patterns that survive random weight perturbation. Noise amplifies this pressure. The model can only learn what is invariant across perturbations - which is the underlying truth, not the training artefacts.
Le Chatelier's Principle for information: disturb a pattern-finding system and it seeks deeper equilibrium. We observed this empirically before we had a name for it.
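To make the idea of stochastic gradient masking concrete, here is a generic sketch in plain Python. It is not KSS-LoRA's actual update rule, which is not specified on this page. It only shows the general technique of applying a fresh random mask to the gradient at every step. The name `masked_sgd_step` and the `keep_prob` parameter are hypothetical.

```python
import random

def masked_sgd_step(weights, grads, lr, keep_prob, rng):
    """One SGD step where each gradient coordinate is kept with probability keep_prob.

    Generic illustration only (assumption): a Bernoulli mask over gradient
    coordinates, redrawn on every call. Masked coordinates are left unchanged.
    """
    updated = []
    for w, g in zip(weights, grads):
        keep = rng.random() < keep_prob  # Bernoulli(keep_prob) draw per coordinate
        updated.append(w - lr * g if keep else w)
    return updated

rng = random.Random(0)
w = [0.0, 0.0, 0.0, 0.0]
g = [1.0, 1.0, 1.0, 1.0]  # toy gradient that is identical at every step
for _ in range(100):
    w = masked_sgd_step(w, g, lr=0.01, keep_prob=0.5, rng=rng)
print(w)  # each coordinate moved on roughly half of the 100 steps
```

In this toy setting, a gradient direction that recurs at every step still accumulates through the mask, while a direction that appears in only a handful of steps has a real chance of being dropped each time. That is the intuition behind "only patterns that survive perturbation get reinforced", under the assumptions stated above.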
| Noise | Baseline gap ↓ | KSS-LoRA gap ↓ | Gap change vs baseline |
| --- | --- | --- | --- |
| 0% | 0.429 | 0.092 | −78.6% |
| 10% | 0.411 | 0.087 | −78.9% |
| 20% | 0.368 | 0.053 | −85.7% |
| 30% | 0.357 | 0.054 | −85.0% |
| 50% | 0.309 | 0.016 | −94.7% |
H200 SXM · Llama-3.1-8B · 5 seeds per condition. Best seed at 50% noise: gap = 0.0087. Full data in preprint (Q2 2026).
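The last column of the table can be reproduced from the two gap columns, assuming it is the relative reduction (baseline − KSS-LoRA) / baseline. The short script below recomputes it from the rounded gaps shown above. The helper name `relative_reduction` is illustrative.

```python
# (noise %, baseline gap, KSS-LoRA gap), copied from the table above
rows = [
    (0, 0.429, 0.092),
    (10, 0.411, 0.087),
    (20, 0.368, 0.053),
    (30, 0.357, 0.054),
    (50, 0.309, 0.016),
]

def relative_reduction(baseline_gap, kss_gap):
    """Percentage by which the KSS-LoRA gap is smaller than the baseline gap."""
    return 100.0 * (baseline_gap - kss_gap) / baseline_gap

for noise, base, kss in rows:
    print(f"{noise:>3}% noise: gap is {relative_reduction(base, kss):.1f}% smaller")
```

Recomputing from the three-decimal gaps lands within about 0.15 percentage points of the published column. The small differences are presumably because the published percentages were derived from unrounded gaps.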
Implications for Production Training
Real-world corpora are never clean. Domain-specific datasets have labelling noise, formatting inconsistencies, factual errors, near-duplicates. Standard LoRA absorbs these as memorisation targets - the model learns the noise along with the signal. KSS-LoRA converts noise into additional regularisation pressure. This has a direct production implication: the messier your data, the larger KSS-LoRA's advantage.