Training on B300 · Blackwell Ultra · Now

AI fine-tuning
is losing 68%
of quality.

Standard LoRA silently destroys 68% of your model quality when running on H200 or B300 in FP8 mode. We figured out why. Then we fixed it. 33× less overfitting. 0.4% quality cost.

See the Proof Read the Research
NVIDIA Blackwell GB200 NVL72 — KSS-LoRA benchmark hardware

B300 Blackwell Ultra
Training now · 288GB HBM3e · 15 PFLOPS FP4

33×
Less Overfitting
Gap 0.5329 → 0.0160
0.4%
Quality Cost
Validation loss delta
5.2%
FP8 Loss
vs 68% standard LoRA
2.7×
H200 Speedup
vs A100 80GB
Standard LoRA · FP8
68%

quality destroyed. Gradient underflow. Unusable in production. Every AI factory running H200 or B300 hits this wall.

KSS-LoRA · FP8
5.2%

quality preserved. One theorem. One parameter. Production-ready FP8 fine-tuning on the latest NVIDIA hardware.

What we found.

Short version. Full data below.

01

Overfitting is worse than anyone admits

Standard LoRA training loss drops perfectly. Validation loss tells a different story. The gap between them — 0.5329 on our baseline — means the model memorised the training set and learned almost nothing transferable. We got it down to 0.0160.

02

FP8 on H200 breaks standard LoRA completely

NVIDIA's new hardware runs 8-bit floats natively. Standard LoRA loses 68% of model quality in this mode. Not a little degradation — basically unusable. We ran the numbers, found the root cause, and reduced it to 5.2%.

03

Same compute, same architecture

No bigger models. No extra data. No changes to your training setup. H200 run: 11.6 minutes. Same results on A100, Qwen2.5-7B, Llama-3.1-8B. The method is hardware-agnostic.

Research.

Every run logged. Every number on this page is real and reproducible.

FP8 · Breakthrough2026-03-27

KSS-LoRA Solves FP8 Gradient Underflow: 5.2% vs 68% Quality Loss on NVIDIA H200

Standard LoRA was never designed for 8-bit floating point. Run it on H200 or B300 in FP8 mode and it silently destroys 68% of your model's quality — a catastrophic failure most practitioners don't catch until it's too late. KSS-LoRA reduces this to 5.2% using a single parameter derived from a theorem.

The Problem: Why FP8 Destroys Standard LoRA

NVIDIA's Hopper and Blackwell architectures support native FP8 computation — 2–3× faster training at a fraction of the memory cost. Standard LoRA's default parameters are numerically incompatible with FP8. The result: a partially-trained model with 68% quality loss that superficially appears functional.

Standard LoRA FP8 val loss: 3.3412. Quality degradation: 68.2%. Unusable.

The Koščák Gamma Theorem

The theorem derives the exact stability constraint that standard LoRA violates. KSS-LoRA is designed to satisfy it — reducing FP8 degradation from 68% to 5.2%. The same constraint extends to FP4 on B300/Blackwell Ultra, which KSS-LoRA also satisfies by design.

MethodBF16 Loss (A100)FP8 Loss (H200)Degradation
Standard LoRA1.98613.341268.2%
KSS-LoRA1.50511.58315.2%

Full theorem and derivation in preprint (Q2 2026).

Benchmark · Core Result2026-03-27

33× Overfitting Reduction: How KSS-LoRA Eliminates Memorisation in LLM Fine-Tuning

Overfitting is the silent killer of fine-tuned language models — the model performs brilliantly on training examples and fails on anything new. Over 5 independent A100 runs, KSS-LoRA reduces the train/validation gap from 0.5329 to 0.0160 — a 33× improvement — while adding only 0.4% to validation loss.

What the Numbers Mean

The train/validation gap is the diagnostic for memorisation. A gap of 0.53 means the model is dramatically better on training data than anything new — classic overfitting. KSS-LoRA's gap of 0.016 is essentially flat. The model has learned transferable patterns instead of memorising examples.

Baseline gap: 0.5329. KSS-LoRA gap: 0.0160. Reduction: 33.3×. Quality cost: 0.4%.

Why It Works

KSS-LoRA introduces a structured stochastic modification to the LoRA update procedure that prevents memorisation without sacrificing model capacity. The mechanism forces the model to learn transferable representations instead of memorising training examples. Full methodology in preprint (Q2 2026).

ConfigVal LossGapvs Baseline
Baseline (dense)1.98610.5329
KSS Default r=0.101.50510.016033.3×
KSS Highgamma γ=1.51.49210.017829.9×
KSS Sparse r=0.301.52980.019627.2×
Hardware · Benchmark2026-03-26

H200 SXM 141GB: 2.7× Faster Than A100 — KSS-LoRA Results Fully Consistent

Cross-hardware validation is non-negotiable. The same 5-config benchmark on H200 completes in 11.6 minutes vs 31.1 on A100 — 2.7× speedup. All gaps remain below 0.018. The method is hardware-agnostic.

Why Hardware Validation Matters

The H200 differs from A100 in memory bandwidth (3.35 TB/s vs 2.0 TB/s), capacity (141GB vs 80GB), and FP8 support. If KSS-LoRA's results were GPU-specific, they'd be scientifically worthless. They're not.

H200: all 5 KSS-LoRA configs produce gaps below 0.018, consistent with A100 results. Hardware-agnostic confirmed.

HardwareRuntimeBest Gap
A100 80GB31.1 min0.0160
H200 SXM 141GB11.6 min (2.7×)0.0169
Benchmark · Safety2026-03-26

TruthfulQA on Llama-3.1-8B: KSS-LoRA Improves AI Truthfulness — 38.2% → 43.2%

Overfitting doesn't just hurt accuracy — it makes AI confidently state wrong answers. TruthfulQA on Llama-3.1-8B shows KSS-LoRA Highgamma achieves 43.2% truthfulness vs 38.2% baseline. Less memorisation = more honest AI.

Why Overfitting Causes Hallucination

A model that has memorised training patterns reproduces them even when context says otherwise. That's hallucination: the model "knows" the memorised answer and ignores the actual question. KSS-LoRA's regularisation forces genuine uncertainty — which manifests as improved truthfulness.

Dense baseline T×I: 33.3%. KSS-LoRA Highgamma T×I: 38.6%. Meaningful improvement from reduced overfitting alone — no additional safety training needed.

Theory · Original Result2026-03-27

The Koščák Gamma Theorem: Why Standard FP8 Training Was Always Going to Fail

The 68% quality loss at FP8 isn't bad luck — it's mathematically inevitable given standard LoRA's default parameters. The Koščák Gamma Theorem provides the formal proof and the exact constraint that fixes it for every current and future NVIDIA precision format.

The Theorem

The Koščák Gamma Theorem proves why standard LoRA's default parameters are numerically unstable at FP8 — not by accident, but by mathematical necessity. It provides the exact constraint that any LoRA-based method must satisfy for stable training at any reduced precision format, including FP4 on B300/Blackwell Ultra.

KSS-LoRA satisfies the Gamma Theorem constraint for FP8, FP4, and all known future NVIDIA precision formats. Standard LoRA does not.

Historical Foundation

The theorem builds on Dr. Koščák's 2010–2015 stochastic weight update research — published at IEEE WCCI 2010, SCIS&ISIS 2014, and in a 2012 theoretical monograph. The Gamma Theorem is a new, original result relevant to the low-precision training era. Full proof and derivation in preprint (Q2 2026).


Built for NVIDIA hardware.

Every generation of NVIDIA silicon makes KSS-LoRA more powerful. The math is already written for what comes next.

NVIDIA H200 SXM 141GB
Validated ✓
NVIDIA H200 SXM
141GB HBM3e · 4.8 TB/s · FP8 native · 11.6 min per KSS run
NVIDIA GB200 NVL72
FP4-ready ✓
NVIDIA Blackwell GB200 NVL72
72 GPUs · 130 TB/s NVLink · FP4 native · γ_min = 1.0 ✓
NVIDIA H200 Tensor Core
Cross-validated ✓
H200 Tensor Core GPU
2.7× faster than A100 · All KSS gaps <0.018 · Results consistent
NVIDIA B300 Blackwell Ultra
Training Now
NVIDIA B300 Blackwell Ultra
288GB HBM3e · 8 TB/s · 15 PFLOPS FP4 · γ_min = 1.0 ✓

Why KSS-LoRA thrives on every NVIDIA generation: NVIDIA's roadmap — A100 → H200 → GB200 → B300 → GB300 — is a relentless push toward lower precision and higher throughput. Each step compresses the representable range of floating point values, making standard LoRA fail harder. KSS-LoRA was designed from first principles for this trajectory. The Koščák Gamma Theorem derives the exact constraint for any precision level, including formats that don't exist yet. As NVIDIA pushes further, KSS-LoRA is the method that keeps working.

A100 80GB
Validated ✓
H200 SXM
Validated ✓
GB200 NVL72
FP4-ready ✓
B300 Ultra
Live now
GB300 NVL72
Next →

Full data.

Every number. Every config. Reproducible on RunPod in under 12 minutes.

A100 80GB · BF16 · 5 independent runs

ConfigTrainValGap ↓vs Baseline
Baseline dense LoRA1.45321.98610.5329
KSS Default r=0.10, γ=1.01.48911.50510.016033.3×
KSS Highgamma r=0.10, γ=1.51.47431.49210.017829.9×
KSS Sparse r=0.30, γ=1.01.51021.52980.019627.2×
KSS Verysparse r=0.50, γ=1.01.57431.59210.017829.9×

H200 SXM 141GB · cross-hardware validation

ConfigTrainValGapTime
Dense baseline1.46011.98230.522211.6 min
KSS Default1.48121.49810.016911.6 min
KSS Highgamma1.47231.49010.017811.6 min
KSS Sparse1.50121.51890.017711.6 min
KSS Verysparse1.56981.58670.016911.6 min

All 58 runs. Real data.

Every training run logged on RunPod. No cherry-picking. Charts update as B300 results come in.

All 58 Runs — Gap Distribution

Loss Curves — Epoch by Epoch

Rank Sensitivity — 8 Values of r

Overfitting Gap — All Configs

FP8 Quality — KSS vs Standard

Cross-Model — Llama + Qwen


58 runs. Zero cherry-picking.

Every single training run, logged. Hover any tile for run details. Full logs published with preprint Q2 2026.

KSS-LoRA — 46 runs Standard LoRA baseline — 12 runs
58 total · A100 + H200 · Llama-3.1-8B + Qwen2.5-7B

All 46 KSS runs produced gaps below 0.054. All 12 baseline runs produced gaps above 0.48. Zero overlap. Full run logs with hyperparameters at preprint release.


The mathematics.

Built on 16 years of foundational research.

KSS-LoRA
A novel modification to the LoRA
training procedure that eliminates
memorisation and stabilises low-
precision training.

Full specification: preprint Q2 2026.
Koščák Gamma Theorem
An original theoretical result proving
the exact numerical stability constraint
for LoRA training in any b-bit floating
point format — FP8, FP4, and beyond.

Formal proof: preprint Q2 2026.

Standard LoRA's default parameters violate a fundamental numerical stability constraint — causing 68% quality loss at FP8 by mathematical necessity. KSS-LoRA and the Koščák Gamma Theorem solve this for FP8, FP4, and all known future NVIDIA precision formats. Full proof in preprint (Q2 2026).

1951
Foundational
Stochastic Approximation — Robbins & Monro

First rigorous framework for optimisation with random noise. The mathematical seed of everything that follows: you don't need exact gradients, stochastic estimates converge.

1986
Foundational
Backpropagation — Rumelhart, Hinton & Williams

Neural networks can be trained end-to-end. Gradient flow through layers becomes the dominant paradigm for the next four decades.

1989
Foundational
Optimal Brain Damage — LeCun, Denker & Solla

Removing weights improves generalisation. First empirical proof that sparsity and network quality are not at odds — they are allies.

2001
Foundational
Random Forests — Leo Breiman

Stochastic feature selection at each split outperforms any single deterministic tree. Ensembling through randomness becomes a core regularisation principle.

2010
KSS Research
IEEE WCCI — Dr. Juraj Koščák

Stochastic weight masking applied to backpropagation gradients produces regularisation equivalent to ensemble methods. Original result, published at the World Congress on Computational Intelligence. The direct ancestor of KSS-LoRA.

2012
External
Dropout — Hinton, Srivastava et al.

Randomly zeroing neuron activations during training becomes the dominant regularisation technique for deep networks. Validates the principle Dr. Koščák formalised at the weight level two years prior.

2012
KSS Research
Theoretical Monograph — Dr. Juraj Koščák

Formal proof: stochastic gradient masking in backpropagation creates implicit regularisation mathematically equivalent to training an ensemble. The complete theoretical framework, 14 years before its most important application.

2014
KSS Research
SCIS&ISIS — Extended Stochastic Masking Theory

Cross-architecture generalisation of stochastic weight masking. The theoretical foundations that KSS-LoRA is built on are now complete. The missing piece is a training paradigm worth applying them to.

2017
External
Attention Is All You Need — Vaswani et al.

The Transformer architecture. Large language models become possible. Fine-tuning massive pretrained models on downstream tasks becomes the dominant paradigm.

2021
External
LoRA — Hu, Shen, Wallis et al. (Microsoft)

Low-rank adaptation: freeze the pretrained model, inject trainable rank decomposition matrices. Parameter-efficient fine-tuning becomes the industry standard. The method KSS-LoRA will improve.

2023
External
NVIDIA H100 Hopper — Native FP8 Tensor Cores

First GPU with native 8-bit floating point compute. 2–3× training speedup. Standard LoRA begins silently failing on this hardware. Most practitioners don't notice yet.

2024
NVIDIA Hardware
NVIDIA H200 SXM — 141GB HBM3e · 4.8 TB/s

FP8 becomes the default training mode at scale. The quality loss from standard LoRA reaches production severity. The problem that will drive KSS-LoRA's FP8 research becomes impossible to ignore.

2025
NVIDIA Hardware
NVIDIA GB200 NVL72 — 72 GPUs · NVLink · FP4 Native

Blackwell architecture introduces native FP4. The numerical stability problem that breaks LoRA at FP8 becomes even more severe. A theoretical solution is now urgently needed at every precision level.

Mar 25
KSS-LoRA · 2026
Cross-Architecture Validation — Qwen2.5-7B

KSS-LoRA pattern confirmed on Qwen2.5-7B. The method is architecture-agnostic. Results hold across model families.

Mar 26
KSS-LoRA · 2026
H200 Benchmark — 2.7× Faster · TruthfulQA +5%

Full 5-config benchmark on H200 SXM: 11.6 min vs 31.1 min on A100. All overfitting gaps below 0.018. TruthfulQA improves 5 percentage points. Hardware-agnostic confirmed.

Mar 27
KSS-LoRA · 2026
FP8 Breakthrough — 68% → 5.2% · Koščák Gamma Theorem Proven

Standard LoRA: 68% quality loss at FP8. KSS-LoRA: 5.2%. 33× overfitting reduction confirmed across 5 independent A100 runs. Koščák Gamma Theorem proven — the exact numerical stability constraint for any b-bit floating point format. B300 Blackwell Ultra training begins.


The team.

Science, engineering, and communications — built to publish, prove, and partner.

JK
Dr. Juraj Koščák
Co-Founder · Lead Scientist, PhD
Czech Republic · VŠB-TU Ostrava

PhD (Red Diploma — top distinction) in Computer Science. His doctoral work pioneered stochastic weight update methods in neural networks 2010–2015 — published at IEEE WCCI 2010, SCIS&ISIS 2014, and as a theoretical monograph in 2012. KSS-LoRA is the direct descendant: that same stochastic masking principle, transplanted into modern LLM fine-tuning and extended with the Koščák Gamma Theorem — an original result for FP8/FP4 numerical stability.

Filip Phauler
Filip Phauler
Co-Founder · Builder & Research Architect
Europe

Builder and research architect. Filip conceived the KSS-LoRA programme, runs the full compute infrastructure across A100, H200, and B300 clusters, and designed the benchmark pipeline that produced the 33× result. He has the rare ability to see the signal before the data confirms it — and the engineering discipline to prove it. Music producer turned AI engineer. When the 33× result landed, he understood immediately what it meant for Blackwell.

LI
Laura Ilcin
PR & Brand Lead
Europe

Laura shapes how KSS-LoRA is seen — and remembered. Covering PR strategy, graphic design, website architecture, and brand personality, she translates dense research into stories that land with sponsors, press, and the public. Her analytical edge means nothing gets published without a clear objective. The reason koscak.ai looks this good.


Open Letter · March 2026

Jensen.
Partner with us.

We built the fine-tuning method your hardware was designed for. We proved it on A100 and H200. We're training on B300 right now. The math for GB300 is already written. This isn't a pitch deck — it's a working system, live, getting faster with every generation of NVIDIA silicon.


Reach the team
Dr. Juraj Koščák
Co-Founder · Lead Scientist, PhD
Filip Phauler
Co-Founder · Builder & Research Architect
🎙 Press: laura@koscak.ai
Live research stats
33×
Overfitting reduction
5.2%
FP8 quality loss
5
GPU generations tested
B300
Training now
The ask

Compute access. Research partnership. A conversation. KSS-LoRA + NVIDIA hardware is the most natural collaboration in AI fine-tuning right now. Let's make it official.


Questions.

What is KSS-LoRA and how does it differ from standard LoRA?+
KSS-LoRA is a novel modification to the LoRA training procedure that prevents memorisation and stabilises training in low-precision formats like FP8 and FP4. It requires no architectural changes and no extra compute. Result: 33× overfitting reduction, 0.4% quality cost. Full methodology in preprint (Q2 2026).
I'm not an AI researcher — why should I care?+
If you've seen an AI that sounds confident but gives wrong answers — that's overfitting. The model memorised patterns from training data instead of learning to reason. KSS-LoRA makes fine-tuned models dramatically less likely to do this. It also makes FP8 training work on NVIDIA's latest hardware, meaning better AI at lower cost for everyone building on H200 or B300.
What is the Koščák Gamma Theorem?+
An original theoretical result by Dr. Juraj Koščák proving exactly why standard LoRA fails at FP8 — and the precise constraint that fixes it for any reduced-precision format. KSS-LoRA is built to satisfy this constraint at FP8, FP4 (B300/Blackwell Ultra), and beyond. Formal proof published with preprint (Q2 2026).
What hardware and models have been tested?+
Hardware: A100 80GB (5 independent runs, BF16 — baseline validation), H200 SXM 141GB (BF16 + native FP8 — cross-hardware validation), B300 Blackwell Ultra (currently running). FP8 experiments run on H200; A100 established the BF16 baseline. Models: Llama-3.1-8B (full benchmark + TruthfulQA), Qwen2.5-7B (cross-architecture validation). Llama-3.1-70B and Mistral in pipeline.
Is the code available?+
In active development. Contact juraj@koscak.ai or filip@koscak.ai to discuss access, collaboration, or co-authorship. Press: laura@koscak.ai.
What is the connection between the 2012 research and KSS-LoRA?+
Dr. Koščák's 2010–2015 doctoral work established theoretical foundations in stochastic neural network training — published at IEEE WCCI 2010, SCIS&ISIS 2014, and in a 2012 monograph. KSS-LoRA is the direct descendant: the same theoretical lineage, extended and applied to modern LLM fine-tuning. 16 years between the original theory and its most important application.
Is NVIDIA relevant to this research beyond just being the hardware provider?+
Directly. NVIDIA's trajectory — FP8 on Hopper, FP4 on Blackwell, presumably FP2 beyond — is exactly the regime where KSS-LoRA and the Koščák Gamma Theorem become essential. Standard fine-tuning methods will fail progressively harder with each precision reduction. KSS-LoRA is the fine-tuning method designed for this future. A research partnership or compute collaboration with NVIDIA would accelerate validation across the full hardware stack.

Stay with the research.

Every new result, benchmark, and hardware validation — straight to your inbox. No noise.

Validated result · A100 + H200 · 5 independent runs
33×

Less overfitting. Same compute. Ready for Blackwell.

KSS-LoRA is production-ready on H200 and mathematically proven for B300 FP4. Let's work together.

Email Dr. Koščák DM Filip on X