Research Notes
On-device AI · WebGPU

Small models, big edge:
novel use cases for
in-browser inference

A fan-out, fact-checked survey of where sub-2 billion-parameter neural networks — trained by one person in Python, exported to ONNX, and run on WebGPU — create a decisive advantage over the cloud.

01 — Method

How this was researched

A deterministic multi-agent harness: decompose into angles, fan out parallel web searches, dedup & fetch sources, then verify every load-bearing claim with a 3-vote adversarial panel before synthesis.

Scope
5 angles
Capability frontier · verticals · emerging modalities · build feasibility · contrarian limits
Search
Parallel sweep
One search agent per angle, blind to the others
Fetch
Extract claims
23 sources → 112 falsifiable claims
Verify
3-vote refute
A claim dies only if ≥2 of 3 skeptics refute it
Synthesize
Rank by confidence
Merge dupes, cite, surface caveats
112
claims extracted
25
sent to verification
23
confirmed
2
refuted & killed
7
findings after merge

02 — The headline

The stack is real, the hype isn't

As of mid-2026, the full loop you'd want — fine-tune a sub-2B model in Python → export to ONNX → run it entirely in the browser on WebGPU — is genuinely buildable, and cheap. The advantages that survived verification are narrower than the marketing, but they're concrete: on-device data privacy, sub-200ms real-time loops, zero server cost, and offline operation.

The cleanest proof point is clinical. A team LoRA-fine-tuned Llama 3.2 1B on just 1,500 synthetic pairs to generate structured medical notes fully in the browser — training in minutes on a free Colab T4. Low data, low compute, decisive privacy win. That single example is the entire thesis in miniature.

But the verifier also killed two popular claims: that local execution is unconditionally private (model weights still download once from a CDN), and the oft-repeated “100× faster than WASM” figure (real-world WebGPU gains are more like 3–15×). Honest framing matters here.


03 — What fits

The in-browser model landscape

Every model below has been demonstrated running client-side on WebGPU. Bars show parameter count — all comfortably under the 2B ceiling, most far under it.

Models proven to run in-browser on WebGPU

By parameter count (millions). Footprint shrinks ~8× more after 4-bit quantization.
Moonshine Tiny streaming ASR
26 M
Moonshine Medium streaming ASR
245 M
SmolVLM-256M vision-language
256 M
Gemma 3 270M LLM · <300MB q4
270 M
Qwen2.5-0.5B LLM · 4-bit
500 M
Llama 3.2 1B LLM · ~1.24GB q4f16
1,000 M
Scale check: SmolVLM-256M beats the 300× larger Idefics-80B on aggregate VLM benchmarks while using <1 GB of GPU memory and decoding at ~80 tok/s on an M4 Max. Capability per parameter is the whole game.
Sources: SmolVLM (arXiv 2504.05299) · Transformers.js v3 · Gemma 3 270M on-device (Google) · Moonshine

Why real-time voice goes local: the latency gap

End-of-speech latency, on a MacBook Pro. On-device streaming vs. a batch cloud-class model.
Moonshine Medium — streaming, on-device~107 ms
107 ms
Whisper Large v3 — batch11,286 ms
11,286 ms
A ~105× difference — the sliver at top is the on-device model. Caveat (per verification): this is a somewhat apples-to-oranges comparison (a batch model judged on end-of-speech latency), but the core finding — sub-200ms, fully on-device, no account or API key — is well-supported.
Source: Moonshine (moonshine-ai) · Transformers.js v3

04 — The map

Five ideas, plotted

Solo-builder feasibility against novelty & defensibility of the local angle. Bubble size ≈ potential impact. Top-right is the sweet spot; top-left is the frontier.

SWEET SPOT Solo-builder feasibility → Novelty & defensibility → harder easier 2 Splatting 1 Scribe 3 Voice 5 RAG 4 VLM
1 · Vertical scribe — best novelty-to-feasibility ratio
2 · Splatting + ONNX — frontier; high novelty, harder build
3–5 · Voice / VLM / RAG — high feasibility, solid niches

05 — The ideas

Five buildable use cases

Ranked by how novel and defensible the local angle is — not by how easy. Each carries an honest read on realism, model class, the local moat, and solo difficulty.

1
🩺 Privacy-decisive vertical scribes
Clinical, legal & HR note generation that never leaves the device

A regulated professional dictates → on-device ASR → a small fine-tuned LLM emits a structured note (SOAP note, legal memo, intake form). No audio or PHI ever touches a server. This is the rare case where local isn't a nice-to-have — it's the compliance unlock. Validated end-to-end: a LoRA'd Llama 3.2 1B lifted ROUGE-1 from 0.35→0.50 and cut hallucinations 85→35 on 1,500 synthetic pairs, running fully in-browser.

RealismValidated preprint · buildable now
ModelMoonshine ASR + LoRA Llama 3.2 1B / Gemma 270M
Why local winsDecisive — regulated data can't touch a cloud API
Solo difficultyLow–med
2
🌐 Gaussian Splatting + per-frame neural inference
The genuinely futuristic one — neural rendering with zero server compute

New 2025–26 work (Visionary, WebSplatter) fuses real-time 3D/4D Gaussian Splatting rendering with per-frame ONNX inference, all client-side: neural avatars, dynamic scenes, and style/enhancement networks via a standardized "Gaussian Generator" ONNX contract. WebSplatter hits ~105 FPS on an RTX 3070 and runs multi-million-splat scenes on an iPhone 15 Pro — "click-to-run" in a browser tab, no backend.

RealismFrontier · recent preprints, live demos
ModelSmall MLPs driving splats (ONNX)
Why local winsRender + infer in one GPU pipeline; streaming is bandwidth-prohibitive
Solo difficultyHigh
3
🎙️ Sub-200ms voice loops for hostile environments
Where cloud latency or egress is a non-starter

The niche isn't transcription (commodity) — it's environments where cloud simply can't go: courtroom record, factory-floor voice commands, offline field/wilderness wearables, aircraft & maritime. Moonshine streams at ~107ms end-of-speech latency, fully on-device, and runs even on a Raspberry Pi. Real-time and offline and private, simultaneously.

RealismShipping today
ModelMoonshine 26–245M (ONNX)
Why local winsLatency + offline + privacy together
Solo difficultyLow
4
👓 Tiny-VLM field & accessibility apps
Offline visual Q&A and scene description

SmolVLM-256M runs in-browser at ~80 tok/s using <1GB GPU RAM. That makes offline equipment-inspection assistants for technicians with no signal, real-time scene description for blind users (the camera feed stays private), and offline museum/field-guide apps all viable on a phone.

RealismBuildable now
ModelSmolVLM 256M/500M · Moondream · Florence-2
Why local winsOffline + camera-feed privacy
Solo difficultyLow–med
5
🔒 Private RAG over regulated documents
Search the corpus you're legally forbidden to upload

On-device embeddings (mxbai-embed-xsmall) + a sub-1B generator = Q&A over M&A data rooms, ITAR-controlled docs, patient records, journalist source material. The commodity version is boring; the regulated-corpus version is the moat — fully composable in Transformers.js today.

RealismComposable today
Modelmxbai-embed-xsmall + Qwen2.5-0.5B
Why local winsThe documents legally can't be sent to a cloud
Solo difficultyLow

06 — Proof point

The clinical scribe, by the numbers

A LoRA-fine-tuned Llama 3.2 1B, trained on 1,500 synthetic pairs, running entirely in-browser — the result that anchors the whole "local wins for regulated verticals" thesis.

ROUGE-1 (note quality) — higher is better
0.346
Base
0.496
Fine-tuned
▲ +43% quality
Hallucinations (per eval set) — lower is better
85
Base
35
Fine-tuned
▼ −59% hallucinations
Build cost: PEFT/LoRA on 1,500 synthetic pairs · minutes on a free Colab T4 · single conversion script to a ~1.24GB q4f16 ONNX build that runs via Transformers.js + WebGPU. Caveat: preprint, small synthetic set, residual hallucinations — feasibility is shown, not clinical-grade accuracy.
Source: In-browser clinical note generation (arXiv 2507.03033)

07 — Honest limits

What the verifier pushed back on

⚠ Caveats that survived
  • "Private" has an asterisk. Your data stays local, but model weights still download once from a CDN (200–600 MB).
  • Speed is GPU-dependent. Several "real-time" splatting figures are ~10–16 FPS on mobile, below 30 FPS.
  • Hard ceilings. ~4GB/tab memory cap, main-thread blocking without Web Workers, and encoder-decoder models still fail on some browsers. Safari/iOS parity is improving but uneven.
  • Sub-1B LLMs are narrow. Great at structured tasks; not open-ended agentic reasoning. Pick the task accordingly.
✕ Claims refuted & removed (0–3 votes)
  • "Local execution keeps data completely private and works offline — the decisive argument."  Real but not universal; weights still download.
  • "Transformers.js v3 WebGPU is up to 100× faster than WASM."  Not substantiated — real gains are ~3–15×.

Open questions worth chasing


08 — Sources

Primary & supporting references

23 sources fetched across 5 angles; the load-bearing ones below carried verified claims.