On-device AI · WebGPU

Small models, big edge: novel use cases for in-browser inference

A fan-out, fact-checked survey of where sub-2-billion-parameter neural networks (trained by one person in Python, exported to ONNX, and run on WebGPU) create a decisive advantage over the cloud.

105 research agents · 23 sources fetched · 25 claims adversarially verified · June 7, 2026

01 — Method

How this was researched

A deterministic multi-agent harness: decompose into angles, fan out parallel web searches, dedup & fetch sources, then verify every load-bearing claim with a 3-vote adversarial panel before synthesis.

Scope

5 angles

Capability frontier · verticals · emerging modalities · build feasibility · contrarian limits

Parallel sweep

One search agent per angle, blind to the others

Fetch

Extract claims

23 sources → 112 falsifiable claims

Verify

3-vote refute

A claim dies only if ≥2 of 3 skeptics refute it

Synthesize

Rank by confidence

Merge dupes, cite, surface caveats

112 claims extracted → sent to verification → confirmed → refuted & killed → findings after merge
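The verification rule described above ("a claim dies only if ≥2 of 3 skeptics refute it") can be sketched as a small threshold check. This is an illustrative sketch, not the harness's actual code: the function name `claim_survives` and the boolean vote encoding are assumptions made for the example.

```python
from typing import List


def claim_survives(refute_votes: List[bool], kill_threshold: int = 2) -> bool:
    """Return True if fewer than `kill_threshold` skeptics refuted the claim.

    Each entry in `refute_votes` is True when that skeptic voted to refute.
    Hypothetical helper: the name and vote encoding are illustrative assumptions.
    """
    return sum(refute_votes) < kill_threshold


# A single dissenting skeptic is not enough to kill a claim.
print(claim_survives([True, False, False]))  # True
# Two refutations out of three kill it.
print(claim_survives([True, True, False]))   # False
```

The asymmetry is deliberate in this reading of the rule: one skeptic alone cannot remove a finding, so a claim is dropped only when a majority of the panel agrees it fails.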

02 — The headline

The stack is real, the hype isn't

As of mid-2026, the full loop you'd want — fine-tune a sub-2B model in Python → export to ONNX → run it entirely in the browser on WebGPU — is genuinely buildable, and cheap. The advantages that survived verification are narrower than the marketing, but they're concrete: on-device data privacy, sub-200ms real-time loops, zero server cost, and offline operation.

The cleanest proof point is clinical. A team LoRA-fine-tuned Llama 3.2 1B on just 1,500 synthetic pairs to generate structured medical notes fully in the browser — training in minutes on a free Colab T4. Low data, low compute, decisive privacy win. That single example is the entire thesis in miniature.

But the verifier also killed two popular claims: that local execution is unconditionally private (model weights still download once from a CDN), and the oft-repeated “100× faster than WASM” figure (real-world WebGPU gains are more like 3–15×). Honest framing matters here.

03 — What fits

The in-browser model landscape

Every model below has been demonstrated running client-side on WebGPU. Bars show parameter count — all comfortably under the 2B ceiling, most far under it.

Models proven to run in-browser on WebGPU

By parameter count (millions). 4-bit quantization shrinks the weight footprint by roughly 8× relative to 32-bit floats.

Moonshine Tiny (streaming ASR): 26 M
Moonshine Medium (streaming ASR): 245 M
SmolVLM-256M (vision-language): 256 M
Gemma 3 270M (LLM · <300MB q4): 270 M
Qwen2.5-0.5B (LLM · 4-bit): 500 M
Llama 3.2 1B (LLM · ~1.24GB q4f16): 1,000 M

Scale check: SmolVLM-256M beats the 300× larger Idefics-80B on aggregate VLM benchmarks while using <1 GB of GPU memory and decoding at ~80 tok/s on an M4 Max. Capability per parameter is the whole game.

Sources: SmolVLM (arXiv 2504.05299) · Transformers.js v3 · Gemma 3 270M on-device (Google) · Moonshine
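The relationship between parameter count and download size can be approximated with back-of-envelope arithmetic. The sketch below assumes every weight is stored at the same bit width and ignores quantization scales, tokenizer files, and any tensors kept at higher precision, so it gives a lower bound rather than the size of a real build. The helper name `weight_footprint_mb` is hypothetical; the parameter counts are the nominal chart values.

```python
# Nominal parameter counts (millions), taken from the chart in this section.
MODELS = {
    "Moonshine Tiny": 26,
    "Moonshine Medium": 245,
    "SmolVLM-256M": 256,
    "Gemma 3 270M": 270,
    "Qwen2.5-0.5B": 500,
    "Llama 3.2 1B": 1000,
}


def weight_footprint_mb(params_millions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in MB (1 MB = 10**6 bytes).

    params_millions * 1e6 weights * bits / 8 bytes, divided by 1e6 bytes per MB.
    Assumes uniform precision; real builds carry extra overhead.
    """
    return params_millions * bits_per_weight / 8


for name, params in MODELS.items():
    fp32 = weight_footprint_mb(params, 32)
    q4 = weight_footprint_mb(params, 4)
    print(f"{name:18s} fp32 ~{fp32:6.0f} MB   4-bit ~{q4:5.0f} MB   ({fp32 / q4:.0f}x smaller)")
```

For Gemma 3 270M the estimate (about 135 MB at 4 bits) is consistent with the "<300MB q4" figure above. The ~1.24GB q4f16 Llama 3.2 1B build is well above the nominal 500 MB estimate, which suggests that build is not uniformly 4-bit; that reading is an inference from the numbers, not something the sources state.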

Why real-time voice goes local: the latency gap

End-of-speech latency, on a MacBook Pro. On-device streaming vs. a batch cloud-class model.

Moonshine Medium (streaming, on-device): ~107 ms

Whisper Large v3 (batch): 11,286 ms

A ~105× difference — the sliver at top is the on-device model. Caveat (per verification): this is a somewhat apples-to-oranges comparison (a batch model judged on end-of-speech latency), but the core finding — sub-200ms, fully on-device, no account or API key — is well-supported.

Source: Moonshine (moonshine-ai) · Transformers.js v3
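The ratio quoted above follows directly from the two latency figures. A minimal worked check, using only the numbers reported in this section; the 200 ms budget is the "sub-200ms" threshold this survey uses, and the helper names are illustrative assumptions.

```python
ON_DEVICE_MS = 107    # Moonshine Medium, streaming, end-of-speech latency
BATCH_MS = 11_286     # Whisper Large v3, batch
BUDGET_MS = 200       # the "sub-200ms" real-time threshold used in this survey


def latency_ratio(slow_ms: float, fast_ms: float) -> float:
    """How many times longer the slow path takes than the fast one."""
    return slow_ms / fast_ms


def within_budget(latency_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    """True when a response lands inside the real-time budget."""
    return latency_ms < budget_ms


print(f"ratio: ~{latency_ratio(BATCH_MS, ON_DEVICE_MS):.0f}x")
print("on-device within budget:", within_budget(ON_DEVICE_MS))
print("batch within budget:", within_budget(BATCH_MS))
```

The ratio measures end-of-speech latency only. As the caveat above notes, a batch model is not designed for that metric, so the budget check is the more meaningful of the two results.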

04 — The map

Five ideas, plotted

Solo-builder feasibility against novelty & defensibility of the local angle. Bubble size ≈ potential impact. Top-right is the sweet spot; top-left is the frontier.

1 · Vertical scribe — best novelty-to-feasibility ratio

2 · Splatting + ONNX — frontier; high novelty, harder build

3–5 · Voice / VLM / RAG — high feasibility, solid niches

05 — The ideas

Five buildable use cases

Ranked by how novel and defensible the local angle is — not by how easy. Each carries an honest read on realism, model class, the local moat, and solo difficulty.

🩺 Privacy-decisive vertical scribes

Clinical, legal & HR note generation that never leaves the device

A regulated professional dictates → on-device ASR → a small fine-tuned LLM emits a structured note (SOAP note, legal memo, intake form). No audio or PHI ever touches a server. This is the rare case where local isn't a nice-to-have — it's the compliance unlock. Validated end-to-end: a LoRA'd Llama 3.2 1B lifted ROUGE-1 from 0.35→0.50 and cut hallucinations 85→35 on 1,500 synthetic pairs, running fully in-browser.

Realism: Validated preprint · buildable now

Model: Moonshine ASR + LoRA Llama 3.2 1B / Gemma 270M

Why local wins: Decisive, since regulated data can't touch a cloud API

Solo difficulty: Low–med

🌐 Gaussian Splatting + per-frame neural inference

The genuinely futuristic one — neural rendering with zero server compute

New 2025–26 work (Visionary, WebSplatter) fuses real-time 3D/4D Gaussian Splatting rendering with per-frame ONNX inference, all client-side: neural avatars, dynamic scenes, and style/enhancement networks via a standardized "Gaussian Generator" ONNX contract. WebSplatter hits ~105 FPS on an RTX 3070 and runs multi-million-splat scenes on an iPhone 15 Pro — "click-to-run" in a browser tab, no backend.

Realism: Frontier · recent preprints, live demos

Model: Small MLPs driving splats (ONNX)

Why local wins: Render + infer in one GPU pipeline; streaming is bandwidth-prohibitive

Solo difficulty: High

🎙️ Sub-200ms voice loops for hostile environments

Where cloud latency or egress is a non-starter

The niche isn't transcription (commodity) — it's environments where cloud simply can't go: courtroom record, factory-floor voice commands, offline field/wilderness wearables, aircraft & maritime. Moonshine streams at ~107ms end-of-speech latency, fully on-device, and runs even on a Raspberry Pi. Real-time and offline and private, simultaneously.

Realism: Shipping today

Model: Moonshine 26–245M (ONNX)

Why local wins: Latency + offline + privacy together

Solo difficulty: Low

👓 Tiny-VLM field & accessibility apps

Offline visual Q&A and scene description

SmolVLM-256M runs in-browser at ~80 tok/s using <1GB GPU RAM. That makes offline equipment-inspection assistants for technicians with no signal, real-time scene description for blind users (the camera feed stays private), and offline museum/field-guide apps all viable on a phone.

Realism: Buildable now

Model: SmolVLM 256M/500M · Moondream · Florence-2

Why local wins: Offline + camera-feed privacy

Solo difficulty: Low–med

🔒 Private RAG over regulated documents

Search the corpus you're legally forbidden to upload

On-device embeddings (mxbai-embed-xsmall) + a sub-1B generator = Q&A over M&A data rooms, ITAR-controlled docs, patient records, journalist source material. The commodity version is boring; the regulated-corpus version is the moat — fully composable in Transformers.js today.

Realism: Composable today

Model: mxbai-embed-xsmall + Qwen2.5-0.5B

Why local wins: The documents legally can't be sent to a cloud

Solo difficulty: Low
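The retrieval half of such a pipeline reduces to nearest-neighbour search over embedding vectors. The sketch below shows that step with cosine similarity in plain Python. The three-dimensional vectors and file names are hand-made stand-ins for illustration only; in a real build the vectors would come from an embedding model such as mxbai-embed-xsmall, and the function names here are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:k]


# Toy "embeddings": invented values, not the output of any real model.
DOCS = {
    "nda.txt": [1.0, 0.0, 0.0],
    "term_sheet.txt": [0.7, 0.7, 0.0],
    "lunch_menu.txt": [0.0, 0.0, 1.0],
}
QUERY = [1.0, 0.2, 0.0]

print(top_k(QUERY, DOCS, k=2))
```

The retrieved passages would then be placed in the prompt of the small generator. Because both the index and the query vector live in memory on the device, nothing about the corpus needs to leave the machine at query time.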

06 — Proof point

The clinical scribe, by the numbers

A LoRA-fine-tuned Llama 3.2 1B, trained on 1,500 synthetic pairs, running entirely in-browser — the result that anchors the whole "local wins for regulated verticals" thesis.

ROUGE-1 (note quality), higher is better

Base: 0.346
Fine-tuned: 0.496

▲ +43% quality

Hallucinations (per eval set), lower is better

Base: 85
Fine-tuned: 35

▼ −59% hallucinations

Build cost: PEFT/LoRA on 1,500 synthetic pairs · minutes on a free Colab T4 · single conversion script to a ~1.24GB q4f16 ONNX build that runs via Transformers.js + WebGPU. Caveat: preprint, small synthetic set, residual hallucinations — feasibility is shown, not clinical-grade accuracy.

Source: In-browser clinical note generation (arXiv 2507.03033)
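For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated note and a reference note. The sketch below is a simplified version that splits on whitespace and reports the F1 form; the preprint's exact tokenization, stemming, and ROUGE variant are not stated here, so treat this as an approximation rather than the paper's evaluation code. The second helper reproduces the percentage changes quoted in this section.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased, whitespace-split tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each token counted at most min(cand, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before


print(f"ROUGE-1 change:       {relative_change(0.346, 0.496):+.0%}")
print(f"hallucination change: {relative_change(85, 35):+.0%}")
```

Both headline percentages are relative to the base model: 0.346 to 0.496 is a gain of about 43%, and 85 to 35 is a reduction of about 59%.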

07 — Honest limits

What the verifier pushed back on

⚠ Caveats that survived

"Private" has an asterisk. Your data stays local, but model weights still download once from a CDN (200–600 MB for the smaller models, and ~1.24 GB for the q4f16 Llama 3.2 1B build cited above).
Speed is GPU-dependent. Several "real-time" splatting figures are only ~10–16 FPS on mobile, well short of a 30 FPS target.
Hard ceilings. ~4GB/tab memory cap, main-thread blocking without Web Workers, and encoder-decoder models still fail on some browsers. Safari/iOS parity is improving but uneven.
Sub-1B LLMs are narrow. Great at structured tasks; not open-ended agentic reasoning. Pick the task accordingly.

✕ Claims refuted & removed (0–3 votes)

"Local execution keeps data completely private and works offline — the decisive argument." Real but not universal; weights still download.
"Transformers.js v3 WebGPU is up to 100× faster than WASM." Not substantiated — real gains are ~3–15×.

Open questions worth chasing

Where exactly do sub-1B in-browser LLMs cross the usefulness threshold for agentic tool-use, versus the structured-text tasks where they shine today?
How production-robust is cross-browser WebGPU in mid-2026 — Safari/iOS parity, integrated-GPU VRAM stability, the 4GB/tab cap for 1–2B models?
For splatting + per-frame ONNX: what's achievable on median (not flagship) mobile, and is there a solo-builder training path or only pre-built pipelines?
Beyond clinical notes, which privacy-decisive verticals (legal, finance, field inspection, defense/IoT sensor fusion) have actually shipped — not just feasibility preprints?

08 — Sources

Primary & supporting references

23 sources fetched across 5 angles; the load-bearing ones below carried verified claims.

primary · SmolVLM: small but capable VLMs (arXiv 2504.05299)
primary · Moonshine: fast on-device streaming ASR
primary · Transformers.js v3: WebGPU support (Hugging Face)
primary · Fine-tuning Gemma 3 270M for on-device (Google)
primary · In-browser clinical note generation (arXiv 2507.03033)
primary · Visionary: WebGPU + per-frame ONNX (arXiv 2512.08478)
primary · WebSplatter: cross-device Gaussian Splatting (arXiv 2602.03207)
primary · Shakti compact edge LLMs (arXiv 2503.01933)
primary · Transformers.js documentation
secondary · Moondream 2: tiny VLM (Roboflow)
secondary · WebGPU + ONNX Runtime Web RAG with Phi-3 (Microsoft)
blog · Running SmolVLM in-browser (PyImageSearch)
blog · Building a browser-based RAG system with WebGPU
blog · Browser-native LLM inference: the WebGPU engineering