KSS-LoRA is a novel modification to the LoRA fine-tuning procedure that eliminates memorisation and stabilises low-precision training. Results: 33× overfitting reduction, 0.4% quality cost. Full methodology in preprint Q2 2026.

Why does FP8 break standard LoRA?

Standard LoRA's default parameters violate a fundamental numerical stability constraint at FP8, causing 68% quality loss. The Koščák Gamma Theorem proves this and provides the exact fix. KSS-LoRA reduces the loss to 5.2%.

What is the Koščák Gamma Theorem?

An original theoretical result by Dr. Juraj Koščák proving the exact numerical stability constraint for LoRA training at any reduced precision format (FP8, FP4, and beyond). KSS-LoRA is designed to satisfy this constraint. Formal proof in preprint Q2 2026.

KSS-LoRA - 33× Overfitting Reduction in AI Fine-Tuning

Research.

Every run logged. Every number on this page is real and reproducible.

FP8 · Breakthrough2026-03-27

KSS-LoRA Solves FP8 Gradient Underflow: 5.2% vs 68% Quality Loss on NVIDIA H200

Standard LoRA was never designed for 8-bit floating point. Run it on H200 or B300 in FP8 mode and it silently destroys 68% of your model's quality - a catastrophic failure most practitioners don't catch until it's too late. KSS-LoRA reduces this to 5.2% using a single parameter derived from a theorem.

The Problem: Why FP8 Destroys Standard LoRA

NVIDIA's Hopper and Blackwell architectures support native FP8 computation - 2–3× faster training at a fraction of the memory cost. Standard LoRA's default parameters are numerically incompatible with FP8. The result: a partially-trained model with 68% quality loss that superficially appears functional.

Standard LoRA FP8 val loss: 3.3412. Quality degradation: 68.2%. Unusable.

The Koščák Gamma Theorem

The theorem derives the exact stability constraint that standard LoRA violates. KSS-LoRA is designed to satisfy it - reducing FP8 degradation from 68% to 5.2%. The same constraint extends to FP4 on B300/Blackwell Ultra, which KSS-LoRA also satisfies by design.

Method	BF16 Loss (A100)	FP8 Loss (H200)	Degradation
Standard LoRA	1.9861	3.3412	68.2%
KSS-LoRA	1.5051	1.5831	5.2%

Full theorem and derivation in preprint (Q2 2026).

Benchmark · Core Result2026-03-27

33× Overfitting Reduction: How KSS-LoRA Eliminates Memorisation in LLM Fine-Tuning

Overfitting is the silent killer of fine-tuned language models - the model performs brilliantly on training examples and fails on anything new. Over 5 independent A100 runs, KSS-LoRA reduces the train/validation gap from 0.5329 to 0.0160 - a 33× improvement - while adding only 0.4% to validation loss.

What the Numbers Mean

The train/validation gap is the diagnostic for memorisation. A gap of 0.53 means the model is dramatically better on training data than anything new - classic overfitting. KSS-LoRA's gap of 0.016 is essentially flat. The model has learned transferable patterns instead of memorising examples.

Baseline gap: 0.5329. KSS-LoRA gap: 0.0160. Reduction: 33.3×. Quality cost: 0.4%.

Why It Works

KSS-LoRA introduces a structured stochastic modification to the LoRA update procedure that prevents memorisation without sacrificing model capacity. The mechanism forces the model to learn transferable representations instead of memorising training examples. Full methodology in preprint (Q2 2026).

Config	Val Loss	Gap	vs Baseline
Baseline (dense)	1.9861	0.5329	-
KSS Default r=0.10	1.5051	0.0160	33.3×
KSS Highgamma γ=1.5	1.4921	0.0178	29.9×
KSS Sparse r=0.30	1.5298	0.0196	27.2×

Hardware · Benchmark2026-03-26

H200 SXM 141GB: 2.7× Faster Than A100 - KSS-LoRA Results Fully Consistent

Cross-hardware validation is non-negotiable. The same 5-config benchmark on H200 completes in 11.6 minutes vs 31.1 on A100 - 2.7× speedup. All gaps remain below 0.018. The method is hardware-agnostic.

Why Hardware Validation Matters

The H200 differs from A100 in memory bandwidth (3.35 TB/s vs 2.0 TB/s), capacity (141GB vs 80GB), and FP8 support. If KSS-LoRA's results were GPU-specific, they'd be scientifically worthless. They're not.

H200: all 5 KSS-LoRA configs produce gaps below 0.018, consistent with A100 results. Hardware-agnostic confirmed.

Hardware	Runtime	Best Gap
A100 80GB	31.1 min	0.0160
H200 SXM 141GB	11.6 min (2.7×)	0.0169

Benchmark · Safety2026-03-26

TruthfulQA on Llama-3.1-8B: KSS-LoRA Improves AI Truthfulness - 38.2% → 43.2%

Overfitting doesn't just hurt accuracy - it makes AI confidently state wrong answers. TruthfulQA on Llama-3.1-8B shows KSS-LoRA Highgamma achieves 43.2% truthfulness vs 38.2% baseline. Less memorisation = more honest AI.

Why Overfitting Causes Hallucination

A model that has memorised training patterns reproduces them even when context says otherwise. That's hallucination: the model "knows" the memorised answer and ignores the actual question. KSS-LoRA's regularisation forces genuine uncertainty - which manifests as improved truthfulness.

Dense baseline T×I: 33.3%. KSS-LoRA Highgamma T×I: 38.6%. Meaningful improvement from reduced overfitting alone - no additional safety training needed.

Theory · Original Result2026-03-27

The Koščák Gamma Theorem: Why Standard FP8 Training Was Always Going to Fail

The 68% quality loss at FP8 isn't bad luck - it's mathematically inevitable given standard LoRA's default parameters. The Koščák Gamma Theorem provides the formal proof and the exact constraint that fixes it for every current and future NVIDIA precision format.

The Theorem

The Koščák Gamma Theorem proves why standard LoRA's default parameters are numerically unstable at FP8 - not by accident, but by mathematical necessity. It provides the exact constraint that any LoRA-based method must satisfy for stable training at any reduced precision format, including FP4 on B300/Blackwell Ultra.

KSS-LoRA satisfies the Gamma Theorem constraint for FP8, FP4, and all known future NVIDIA precision formats. Standard LoRA does not.

Historical Foundation

The theorem builds on Dr. Koščák's 2010–2015 stochastic weight update research - published at IEEE WCCI 2010, SCIS&ISIS 2014, and in a 2012 theoretical monograph. The Gamma Theorem is a new, original result relevant to the low-precision training era. Full proof and derivation in preprint (Q2 2026).

Breakthrough · Negative Gap2026-03-27

The Koščák Coefficient: KSS-LoRA Achieves a Negative Overfitting Gap - The Model Generalises Better Than It Memorises

A negative overfitting gap has no precedent in standard fine-tuning literature. Validation loss below training loss means the model is genuinely better on unseen data than on its own training set. KSS-LoRA config sr=0.3, γ=0.1 achieves gap = −0.009. Phase 1 across 5 seeds: 78.3% ± 4.2% reduction, p < 0.0001. B300 scaling curve running now.

What a Negative Gap Means

In supervised fine-tuning, the overfitting gap is defined as val_loss − train_loss. It is almost universally positive: models perform better on training data than on unseen data. A gap of zero means perfect generalisation. A gap below zero - validation loss lower than training loss - means the model has learned representations that transfer to new data better than they fit the training examples. This is not noise. It is statistically robust across 5 independent seeds (p < 0.0001).

KSS-LoRA config (sr=0.3, γ=0.1): overfitting gap = −0.009. Baseline gap = 0.327. Gap reduction: 102.8% - past zero. Not a rounding artifact. Confirmed across 5 seeds.

Phase 1 Results: H200 · Llama-3.1-8B · 5 Seeds

Config	Gap (mean ± σ)	Reduction
Dense baseline	0.3270 ± 0.012	-
SS-LoRA (mean)	0.0727 ± 0.015	78.3% ± 4.2%
SS-LoRA best config (sr=0.3, γ=0.1)	−0.009	>100% - negative

The Koščák Coefficient

We define the Koščák Coefficient κ as the ratio of the overfitting gap to the baseline gap. κ = 1.0 is standard training. κ = 0.0 is perfect generalisation. κ < 0 is the new territory KSS-LoRA unlocks. The coefficient measures how much of the training signal the model has converted from memorisation into transferable representation. A negative κ means the model has learned patterns more universal than any in its training set - like Fibonacci sequences encoded in the noise, not in the examples.

What is Running on B300 Now

The B300 Blackwell Ultra pod is currently running a 7-model scaling curve - 1B through 72B parameters - to chart how κ behaves as model capacity increases. The prediction: κ decreases (better) as models grow, because larger models have more capacity to extract deep patterns. If confirmed, this will be the first empirical scaling law specifically for generalisation quality, not just loss.

Scaling curve results: forthcoming. Full methodology and proof in preprint (Q2 2026).

Breakthrough · Antifragility2026-03-27

KSS-LoRA is Antifragile: 50% Noise Injection Reduces the Overfitting Gap by 94.7%

Normal models break under noise. KSS-LoRA gets stronger. Injecting 50% random noise into training data - the kind of corruption that cripples standard methods - drives KSS-LoRA's overfitting gap down by 94.7%. This is the definition of antifragility applied to machine learning.

What is Antifragility in Training?

Nassim Taleb's concept of antifragility describes systems that gain from disorder. Most ML methods are fragile - inject noise, performance drops. Some are robust - inject noise, performance stays flat. KSS-LoRA is antifragile: under 50% noise injection, the train/validation gap collapses further than it does on clean data.

At 50% noise: standard model gap = 0.309. KSS-LoRA gap = 0.016. That's 94.7% smaller - and the gap keeps collapsing as noise increases. Best single run: gap = 0.0087. Near-zero overfitting on majority-corrupted data.

Why This Happens

Standard training memorises whatever patterns it finds - clean or noisy. Add corrupted labels and it memorises corruptions too. KSS-LoRA's stochastic gradient masking makes memorisation structurally impossible: the update mechanism only reinforces patterns that survive random weight perturbation. Noise amplifies this pressure. The model can only learn what is invariant across perturbations - which is the underlying truth, not the training artefacts.

Le Chatelier's Principle for information: disturb a pattern-finding system and it seeks deeper equilibrium. We observed this empirically before we had a name for it.

Noise %	Baseline Gap ↓	KSS-LoRA Gap ↓	KSS vs Baseline
0%	0.429	0.092	−78.6%
10%	0.411	0.087	−78.9%
20%	0.368	0.053	−85.7%
30%	0.357	0.054	−85.0%
50%	0.309	0.016	−94.7%

H200 SXM · Llama-3.1-8B · 5 seeds each condition. Best seed at 50% noise: gap = 0.0087. Full data in preprint (Q2 2026).

Implications for Production Training

Real-world corpora are never clean. Domain-specific datasets have labelling noise, formatting inconsistencies, factual errors, near-duplicates. Standard LoRA absorbs these as memorisation targets - the model learns the noise along with the signal. KSS-LoRA converts noise into additional regularisation pressure. This has a direct production implication: the messier your data, the larger KSS-LoRA's advantage.

Config	Train	Val	Gap ↓	vs Baseline
Baseline dense LoRA	1.4532	1.9861	0.5329	-
KSS Default r=0.10, γ=1.0	1.4891	1.5051	0.0160	33.3×
KSS Highgamma r=0.10, γ=1.5	1.4743	1.4921	0.0178	29.9×
KSS Sparse r=0.30, γ=1.0	1.5102	1.5298	0.0196	27.2×
KSS Verysparse r=0.50, γ=1.0	1.5743	1.5921	0.0178	29.9×

Config

Train

Val

Gap ↓

vs Baseline

Baseline dense LoRA

1.4532

1.9861

0.5329

KSS Default r=0.10, γ=1.0

1.4891

1.5051

0.0160

33.3×

KSS Highgamma r=0.10, γ=1.5

1.4743

1.4921

0.0178

29.9×

KSS Sparse r=0.30, γ=1.0

1.5102

1.5298

0.0196

27.2×

KSS Verysparse r=0.50, γ=1.0

1.5743

1.5921

0.0178

29.9×

Config	Train	Val	Gap	Time
Dense baseline	1.4601	1.9823	0.5222	11.6 min
KSS Default	1.4812	1.4981	0.0169	11.6 min
KSS Highgamma	1.4723	1.4901	0.0178	11.6 min
KSS Sparse	1.5012	1.5189	0.0177	11.6 min
KSS Verysparse	1.5698	1.5867	0.0169	11.6 min

Config

Train

Val

Gap

Time

Dense baseline

1.4601

1.9823

0.5222

11.6 min

KSS Default

1.4812

1.4981

0.0169

11.6 min

KSS Highgamma

1.4723

1.4901

0.0178

11.6 min

KSS Sparse

1.5012

1.5189

0.0177

11.6 min

KSS Verysparse

1.5698

1.5867

0.0169

11.6 min

Jensen.
Partner with us.

"Either you're running for food, or you are food." You said that. We took it seriously.

We built the fine-tuning method your hardware was designed for. Proved on A100 and H200. Training on B300 right now. Standard LoRA is leaving 68% of every H200 in production on the floor. One theorem. Fixed.

13×

FP8 - not broken anymore.

Standard LoRA loses 68% quality at FP8. Every H200 and B300 in production has this problem right now. Most teams don't know it yet. One theorem. One parameter. Fixed.

33×

Overfitting - actually eliminated.

Train/val gap from 0.5329 to 0.0160. 88+ real runs. A100, H200, Llama, Qwen. "Nothing makes us happier than learning from other people's disasters." Our baseline was the disaster. We learned from it.

−0.009

First negative gap in LLM fine-tuning.

κ = −0.009. Validation loss below training loss. The model generalises better than it memorises. Never confirmed in LLM fine-tuning before. We named the coefficient after the person who proved it was possible.

94.7%

Gets better under noise. We didn't believe it either.

50% corrupt training data. Gap drops 94.7%. Every method we benchmarked degraded. KSS-LoRA improved. Production data is never clean. This is the property that matters at scale.

FP4

B300 was designed for this.

Koščák Gamma Theorem: γ_min(FP4) = 1.0. KSS-LoRA uses γ=1.0 by default. The math was written before the hardware shipped. The theorem extends to FP2 automatically. We're already thinking about GB300.

88+

We suffered enough runs to know.

"I wish upon you ample doses of pain and suffering. Greatness comes from character." - GTC 2024. Heard. 88+ logged runs on real hardware. Every number on this page earned, not selected.

Reach the team

Dr. Juraj Koščák

Co-Founder · Lead Scientist, PhD

✉ [email protected] LinkedIn ORCID

Filip Phauler

Co-Founder · Builder & Research Architect

✉ [email protected] 𝕏 @PhilPhauler 📸 Instagram GitHub

🎙 Press: [email protected]

Live research stats

33×

Overfitting reduction

5.2%

FP8 quality loss

−0.009

Koščák κ · negative gap

94.7%

Gap · 50% noise

88+

Validated runs

B300

Training now

The ask

Compute access. Research partnership. An introduction. "The number one feature of any product is the schedule." Our schedule: we're training on B300 right now. KSS-LoRA + NVIDIA hardware is the most natural collaboration in AI fine-tuning today. Let's make it official.

We achieved anegativeoverfitting gap.

The benchmark.

What we found.

Overfitting is worse than anyone admits

FP8 on H200 breaks standard LoRA completely

Gets stronger under noise - not weaker

Research.

KSS-LoRA Solves FP8 Gradient Underflow: 5.2% vs 68% Quality Loss on NVIDIA H200

The Problem: Why FP8 Destroys Standard LoRA

The Koščák Gamma Theorem

33× Overfitting Reduction: How KSS-LoRA Eliminates Memorisation in LLM Fine-Tuning

What the Numbers Mean

Why It Works

H200 SXM 141GB: 2.7× Faster Than A100 - KSS-LoRA Results Fully Consistent

Why Hardware Validation Matters

TruthfulQA on Llama-3.1-8B: KSS-LoRA Improves AI Truthfulness - 38.2% → 43.2%

Why Overfitting Causes Hallucination

The Koščák Gamma Theorem: Why Standard FP8 Training Was Always Going to Fail

The Theorem

Historical Foundation

The Koščák Coefficient: KSS-LoRA Achieves a Negative Overfitting Gap - The Model Generalises Better Than It Memorises

What a Negative Gap Means

Phase 1 Results: H200 · Llama-3.1-8B · 5 Seeds

The Koščák Coefficient

What is Running on B300 Now

KSS-LoRA is Antifragile: 50% Noise Injection Reduces the Overfitting Gap by 94.7%

What is Antifragility in Training?

Why This Happens

Implications for Production Training

More noise.Better results.

Built for NVIDIA hardware.

Full data.

All 88+ runs. Real data.

All 88+ Runs - Overfitting Gap

Loss Curves - Epoch by Epoch

Rank Sensitivity - 8 Values of r

vs Other Methods - Literature Comparison

FP8 Quality - KSS vs Standard

Cross-Model Consistency - Llama & Qwen

88+ runs. Zero cherry-picking.

The mathematics.

The team.

Jensen.Partner with us.

Questions.

Stay with the research.

Less overfitting. 5.2% FP8 loss. Negative κ. Antifragile.

References & Hardware

We achieved a
negative
overfitting gap.

More noise.
Better results.

Jensen.
Partner with us.