A fan-out, fact-checked survey of where neural networks under 2 billion parameters (trained by one person in Python, exported to ONNX, and run on WebGPU) create a decisive advantage over the cloud.
A deterministic multi-agent harness: decompose the question into angles, fan out parallel web searches, deduplicate and fetch sources, then verify every load-bearing claim with a 3-vote adversarial panel before synthesis.
As of mid-2026, the full loop you'd want — fine-tune a sub-2B model in Python → export to ONNX → run it entirely in the browser on WebGPU — is genuinely buildable, and cheap. The advantages that survived verification are narrower than the marketing, but they're concrete: on-device data privacy, sub-200ms real-time loops, zero server cost, and offline operation.
The cleanest proof point is clinical. A team LoRA-fine-tuned Llama 3.2 1B on just 1,500 synthetic pairs to generate structured medical notes fully in the browser — training in minutes on a free Colab T4. Low data, low compute, decisive privacy win. That single example is the entire thesis in miniature.
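A rough way to see why a LoRA fine-tune of this size fits on a free T4: LoRA freezes each adapted weight matrix and trains only two low-rank factors, so a `d_out × d_in` matrix at rank `r` adds `r × (d_in + d_out)` trainable parameters. The sketch below applies that formula. The rank, the number of adapted projections, and the layer dimensions are illustrative assumptions, not the configuration the team used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank); the base weight stays frozen.
    """
    return rank * (d_in + d_out)


# Hypothetical setup (assumed for illustration, not taken from the source):
# adapt two square 2048x2048 projections per layer, 16 layers, rank 8.
HIDDEN = 2048
LAYERS = 16
RANK = 8
ADAPTED_PER_LAYER = 2

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total_trainable = per_matrix * ADAPTED_PER_LAYER * LAYERS
base_params = 1_000_000_000  # nominal "1B" base model

fraction = total_trainable / base_params
print(f"trainable LoRA params: {total_trainable:,}")
print(f"share of base model:   {fraction:.4%}")
```

Under these assumed settings the adapter is a tiny fraction of the base model, which is the main reason optimizer state and gradients stay small enough for a single consumer-class GPU.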
But the verifier also killed two popular claims: that local execution is unconditionally private (model weights still download once from a CDN), and that WebGPU is “100× faster than WASM” (real-world WebGPU gains are more like 3–15×). Honest framing matters here.
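The size of that one-time weight download can be estimated from parameter count and quantization width. The sketch below is back-of-the-envelope arithmetic only: the parameter counts are nominal, the bit widths are assumed options, and real ONNX files carry extra metadata and sometimes unquantized tensors, so actual files will differ.

```python
def weight_download_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate weight payload, ignoring metadata and any unquantized tensors."""
    return n_params * bits_per_weight // 8


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Nominal model sizes (assumed round numbers) at three assumed bit widths.
for label, n_params in [("256M", 256_000_000), ("1B", 1_000_000_000), ("2B", 2_000_000_000)]:
    sizes = {bits: gib(weight_download_bytes(n_params, bits)) for bits in (16, 8, 4)}
    print(label, {bits: f"{size:.2f} GiB" for bits, size in sizes.items()})
```

The point of the estimate is that quantization width matters as much as parameter count for the first-load experience. After that first fetch, the weights can be cached by the browser.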
Every model below has been demonstrated running client-side on WebGPU. Bars show parameter count — all comfortably under the 2B ceiling, most far under it.
Solo-builder feasibility against novelty & defensibility of the local angle. Bubble size ≈ potential impact. Top-right is the sweet spot; top-left is the frontier.
Ranked by how novel and defensible the local angle is, not by how easy it is to build. Each carries an honest read on realism, model class, the local moat, and solo difficulty.
A regulated professional dictates → on-device ASR → a small fine-tuned LLM emits a structured note (SOAP note, legal memo, intake form). No audio or PHI ever touches a server. This is the rare case where local isn't a nice-to-have — it's the compliance unlock. Validated end-to-end: a LoRA'd Llama 3.2 1B lifted ROUGE-1 from 0.35→0.50 and cut hallucinations 85→35 on 1,500 synthetic pairs, running fully in-browser.
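For readers unfamiliar with the metric quoted above, ROUGE-1 scores the unigram overlap between a generated text and a reference. The sketch below is a simplified version using lowercase whitespace tokenization. Standard ROUGE tooling tokenizes differently and may stem words, and the source does not say whether its figures are recall or F1, so treat this as an illustration of the idea rather than the study's scoring script. The example strings are made up.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example strings, not data from the study.
reference = "patient reports mild chest pain for two days"
candidate = "patient reports chest pain for three days"
scores = rouge1(candidate, reference)
print(scores)

# Relative changes implied by the figures quoted above.
rouge_gain = (0.50 - 0.35) / 0.35
hallucination_drop = (85 - 35) / 85
print(f"ROUGE-1 relative gain: {rouge_gain:.0%}")
print(f"hallucination relative drop: {hallucination_drop:.0%}")
```

The last two lines only restate the quoted before/after figures as relative changes, in whatever unit the study counted hallucinations.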
New 2025–26 work (Visionary, WebSplatter) fuses real-time 3D/4D Gaussian Splatting rendering with per-frame ONNX inference, all client-side: neural avatars, dynamic scenes, and style/enhancement networks via a standardized "Gaussian Generator" ONNX contract. WebSplatter hits ~105 FPS on an RTX 3070 and runs multi-million-splat scenes on an iPhone 15 Pro — "click-to-run" in a browser tab, no backend.
The niche isn't transcription (commodity) — it's environments where cloud simply can't go: courtroom record, factory-floor voice commands, offline field/wilderness wearables, aircraft & maritime. Moonshine streams at ~107ms end-of-speech latency, fully on-device, and runs even on a Raspberry Pi. Real-time and offline and private, simultaneously.
SmolVLM-256M runs in-browser at ~80 tok/s using <1GB GPU RAM. That makes offline equipment-inspection assistants for technicians with no signal, real-time scene description for blind users (the camera feed stays private), and offline museum/field-guide apps all viable on a phone.
On-device embeddings (mxbai-embed-xsmall) + a sub-1B generator = Q&A over M&A data rooms, ITAR-controlled docs, patient records, journalist source material. The commodity version is boring; the regulated-corpus version is the moat — fully composable in Transformers.js today.
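The retrieval half of that pipeline reduces to ranking document chunks by similarity to the query embedding and pasting the best ones into the generator's prompt. A minimal sketch of that step follows, assuming cosine similarity as the ranking function. The three-dimensional vectors, the document snippets, and the prompt wording are all hypothetical stand-ins. In a real app the vectors would come from the embedding model, and the prompt would go to the local generator.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
corpus = [
    ("Clause 4.2: indemnification cap", [0.9, 0.1, 0.0]),
    ("Schedule B: employee roster", [0.0, 0.2, 0.9]),
    ("Clause 7.1: limitation of liability", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "What is the liability cap?"

hits = top_k(query_vec, corpus, k=2)
context = "\n".join(text for _, text in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is the liability cap?"
print(prompt)
```

Because both the index and the query embedding live in the page's memory, no document text has to leave the device for this step.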
A LoRA-fine-tuned Llama 3.2 1B, trained on 1,500 synthetic pairs, running entirely in-browser — the result that anchors the whole "local wins for regulated verticals" thesis.
23 sources fetched across 5 angles; the load-bearing ones below carried verified claims.