Running AI Models Locally in the Browser with WebGPU and WebAssembly
There's a particular moment that keeps happening in frontend conversations. Someone proposes an AI feature (autocomplete, real-time image classification, a local chat assistant), and the first instinct is to reach for an API. Spin up a backend, pick a hosted model, and wire up the calls. But then reality arrives: latency, cost per inference, data leaving the user's device, and a growing server bill that scales with every user interaction.
The uncomfortable truth is that for a specific and growing class of AI tasks, you don't need a server at all. The user's browser can run the model on their own GPU and produce results locally. No round-trip. No cost per call. No data exposure.
This tech article is a practical guide to making that happen. We'll cover why WebGPU changed the equation, how WASM fills the gaps where GPU access isn't available, and how to wire up actual inference code that works in production today.
A note before you dive in: The person behind this guide is a professional, just not a technical one. A copywriter put it together using Claude, ChatGPT, and Antigravity, from theory to full working code, and then real engineers gave it a proper review. So you're in good hands. Read freely, judge kindly.
WebGPU, WASM, and the rise of browser-native AI
Typical API-based AI features carry hidden costs that compound as you scale:
- Latency – network round-trip times plus throttling under load.
- Cost per inference – grows linearly with every user interaction.
- Compliance and trust – user data leaves the device. For medical records, legal documents, or anything users expect to stay private, this isn't just a latency problem — it's a liability.
- Extra backend surface area – infrastructure that exists only to forward prompts to a hosted model.
Modern browsers can run models locally, on the user's GPU via WebGPU or on CPU via WASM, with model assets cached after the first download. No per-call cost, no round-trip, and sensitive data stays on-device.
WebGPU vs WebGL: why ML needed a new GPU API
For most of the web's history, running meaningful computation in the browser meant JavaScript and nothing else. When machine learning started gaining traction, the question of browser inference came up naturally, but the honest answer was: the tools weren't there yet.
WebGL existed, but it was designed for rendering triangles, not tensor operations.
WebAssembly arrived in 2017 and opened the door to running compiled C++ and Rust in the browser, which made lightweight models viable. But without real GPU access, anything beyond small CNNs was painfully slow. You could classify an image, maybe. Run a language model? Not really.
The shift happened gradually and then all at once. Browsers started shipping WebGPU, a modern compute API that gives JavaScript programs actual access to the GPU. Not a graphics wrapper, but a proper compute interface with shader pipelines designed for workloads like matrix multiplication. Simultaneously, the ML ecosystem moved toward smaller, aggressively quantized models: Phi-3.5 Mini at 3.8B parameters fits in 2GB of VRAM in int4. Whisper Small transcribes audio in real time on a mid-range laptop. The models came down to meet the browser's constraints; at the same time, the browser's capabilities went up to meet the models.
The result is a stack that genuinely works in 2025. Libraries like Transformers.js, WebLLM, and ONNX Runtime Web abstract the complexity, but understanding what's underneath (WebGPU for GPU inference, WASM as the fallback compute layer) matters when things don't work, when you're debugging performance, or when you're deciding which models are even worth attempting.
```wgsl
// WebGPU compute shader for a simple matrix multiply
// WGSL - the shader language WebGPU uses
@group(0) @binding(0) var<storage, read> A : array<f32>;
@group(0) @binding(1) var<storage, read> B : array<f32>;
@group(0) @binding(2) var<storage, read_write> C : array<f32>;

struct Dims { M: u32, N: u32, K: u32 }
@group(0) @binding(3) var<uniform> dims : Dims;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.x;
  let col = gid.y;
  if (row >= dims.M || col >= dims.N) { return; }
  var sum = 0.0;
  for (var k = 0u; k < dims.K; k++) {
    sum += A[row * dims.K + k] * B[k * dims.N + col];
  }
  C[row * dims.N + col] = sum;
}
```

You wouldn't write this shader manually; libraries handle it, but it illustrates the key point. Each workgroup runs 256 threads simultaneously (16×16), each computing one output cell of the result matrix. That's the compute model that makes neural network layers fast. WebGL couldn't express this directly. WebGPU was built for it.
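The dispatch side of that shader is just ceil-division over the output matrix: one thread per output cell, one workgroup per 16×16 tile. A sketch of the arithmetic (`workgroupGrid` is an illustrative helper; `pass.dispatchWorkgroups(...)` is the real WebGPU call that would consume it):

```typescript
// How many 16×16 workgroups are needed to cover an M×N output matrix.
// Edge tiles are partially filled; the shader's bounds check handles that.
function workgroupGrid(M: number, N: number, size = 16): [number, number] {
  return [Math.ceil(M / size), Math.ceil(N / size)];
}

// A 1000×1000 multiply needs a 63×63 grid of workgroups, so the
// command encoder would call pass.dispatchWorkgroups(63, 63).
workgroupGrid(1000, 1000); // → [63, 63]
```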
WASM + WebGPU architecture for private-first frontend AI
The usual conversation about browser AI focuses on capability: can it run? The more interesting question for production applications is privacy: where does the data go? If you're building a tool that processes medical records, legal documents, or anything a user would reasonably expect to stay on their device, sending data to a hosted model API isn't just a latency problem – it's a compliance and trust problem.
Browser-native AI flips the data flow entirely. The model file is downloaded once and cached locally. After that, every inference runs on the user's hardware: their CPU via WASM, or their GPU via WebGPU. Nothing leaves the device.
```
User's browser
│
├── Your app code (React / Vanilla / whatever)
│
├── ML library (Transformers.js / WebLLM / ONNX Runtime)
│     │
│     ├── WebGPU backend ──► User's GPU (fast, requires support)
│     └── WASM backend   ──► User's CPU (slower, works everywhere)
│
└── Cache API / IndexedDB
      └── Model files (cached after first download, never re-sent)
```

The architecture has two runtime paths. WebGPU is the primary path – direct GPU compute, 3-10x faster than the WASM alternative. WASM is the fallback: a compiled ML runtime (usually a port of ONNX Runtime or llama.cpp) running on the CPU. Modern WASM with SIMD extensions can process 4-8 values per instruction, which makes small and medium models viable on a CPU without feeling unusable.
Most production setups use both. You detect WebGPU availability at runtime, load the best backend, and fall back gracefully. The user experience is identical; only the inference speed differs.
One infrastructure detail that trips people up: WebGPU and multi-threaded WASM both require SharedArrayBuffer, which the browser only enables under Cross-Origin Isolation. That means two HTTP headers that must be present on every response from your origin:
```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

Without these headers, SharedArrayBuffer is unavailable, WebGPU initialization will either fail or be severely limited, and multi-threaded WASM silently falls back to single-threaded mode. This is not obvious when debugging locally, because Chrome DevTools doesn't always surface the restriction clearly. The simplest check:
```js
console.log('Cross-Origin Isolated:', window.crossOriginIsolated);
// Must be true. If false, check your server headers.
```

For Vite, add the headers and configure the worker format in vite.config.ts:
```typescript
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
  worker: {
    format: 'es',
  },
});
```

The `worker.format: 'es'` setting ensures Web Workers are bundled as ES modules, required for Transformers.js imports to work correctly inside worker files.
For Nginx in production:
```nginx
add_header Cross-Origin-Opener-Policy "same-origin" always;
add_header Cross-Origin-Embedder-Policy "require-corp" always;
```

The `always` flag is important – without it, Nginx only adds the headers on a limited set of success and redirect status codes, which means error pages won't be isolated, and some browsers will refuse to share the isolation context.
COEP Can Block Cross-Origin Model Downloads
This is the part that bites people in production. With Cross-Origin-Embedder-Policy: require-corp, the browser blocks any cross-origin resource that doesn't include compatible CORS/CORP headers in its response. That includes model weight files fetched from CDNs or directly from Hugging Face.
If you fetch models from huggingface.co or another third-party origin, you need those responses to include Access-Control-Allow-Origin and ideally Cross-Origin-Resource-Policy: cross-origin. Some CDNs set these headers; many don't.
The most reliable production path: self-host model assets on your own origin. Download the ONNX weights, serve them from the same domain as your app, and the COEP restriction becomes a non-issue. This also gives you control over caching headers, versioning, and availability, which matters when your model files are hundreds of megabytes.
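If you go the self-hosting route, Transformers.js can be pointed away from the Hub entirely. A configuration sketch (`env.allowRemoteModels` and `env.localModelPath` are real Transformers.js settings; the `/models/` path is an assumption about your own static hosting layout):

```typescript
// Serve ONNX weights from your own origin so COEP never blocks the fetch.
import { env } from '@huggingface/transformers';

env.allowRemoteModels = false;   // never fall back to huggingface.co
env.localModelPath = '/models/'; // same-origin directory holding the weights
```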
WebGPU test: checking support and running your first model
Before writing inference code, it's worth understanding what you're actually checking for. "WebGPU is supported" isn't binary. The API might be available, but the adapter might be software-only. The adapter might be real hardware, but have very limited buffer sizes. These differences matter for model selection.
Detecting WebGPU (avoid obsolete APIs)
Here's a detection utility that tells you not just whether WebGPU exists, but what kind of GPU you're actually working with:
```typescript
// src/utils/webgpu.ts
export interface GPUCapabilities {
  available: boolean;
  backend?: string;
  vendor?: string;
  device?: string;
  maxBufferSize?: number;
  maxStorageBinding?: number;
  isSoftware?: boolean;
}

export async function detectWebGPU(): Promise<GPUCapabilities> {
  if (!('gpu' in navigator)) return { available: false };

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'high-performance',
  });
  if (!adapter) return { available: false };

  const device = await adapter.requestDevice();
  const info = (adapter as any).info ?? {};
  const vendor = info.vendor ?? undefined;
  const deviceName = info.device ?? undefined;
  const backend = info.backend ?? undefined;
  const isSoftware =
    typeof vendor === 'string' && vendor.toLowerCase().includes('software');

  return {
    available: true,
    vendor,
    device: deviceName,
    backend,
    maxBufferSize: device.limits.maxBufferSize,
    maxStorageBinding: device.limits.maxStorageBufferBindingSize,
    isSoftware,
  };
}
```

Don't use adapter.requestAdapterInfo() – it was removed from the spec. Earlier WebGPU drafts exposed adapter metadata through this async method, but current implementations have moved it to a synchronous adapter.info property. The code above uses (adapter as any).info with a fallback to handle both cases and avoid type errors with @webgpu/types definitions that may lag behind browser implementations.
Picking a backend + dtype
With capabilities in hand, you can make an informed decision about which backend and model size to use. A pragmatic rule set: if real WebGPU is available, use webgpu + q4f16 (good balance of speed and quality). Otherwise, fall back to wasm + q8 if SIMD is supported, or fp32 as the safest default.
```typescript
export async function chooseBackend(): Promise<{ backend: 'webgpu' | 'wasm'; dtype: any }> {
  const gpu = await detectWebGPU();
  if (gpu.available && !gpu.isSoftware) {
    return { backend: 'webgpu', dtype: 'q4f16' };
  }

  // Minimal WASM module that only validates if the engine supports SIMD
  const simdAvailable = WebAssembly.validate(
    new Uint8Array([
      0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3,
      2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11
    ])
  );
  return { backend: 'wasm', dtype: simdAvailable ? 'q8' : 'fp32' };
}
```

The SIMD check deserves explanation. That opaque byte array is a minimal valid WebAssembly module that uses SIMD instructions. If WebAssembly.validate() returns true, the browser supports SIMD, and you can use INT8 quantized models on the CPU path – roughly 2-4x faster than FP32 WASM without SIMD. All modern browsers support it, but older devices and some embedded WebViews don't.
Running your first model with Transformers.js
Transformers.js is the easiest entry point – it's the Hugging Face Transformers library ported to JavaScript, with WebGPU support, pulling models directly from the HF Hub. The following runs sentiment classification end-to-end:
```typescript
// src/classifier.ts
import { pipeline, env } from '@huggingface/transformers';

if (env.backends.onnx.wasm) {
  env.backends.onnx.wasm.proxy = false;
}

// Cache the pipeline promise so the model loads only once per session.
let classifierPromise: Promise<any> | null = null;

export async function classify(text: string) {
  classifierPromise ??= pipeline(
    'text-classification',
    'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
    { device: 'webgpu', dtype: 'q8' }
  );
  const classifier = await classifierPromise;
  return classifier(text);
}

// First call takes ~3-5 seconds (model load + first inference)
// Subsequent calls: ~20ms on GPU, ~120ms on CPU WASM
const result = await classify("The documentation for this library is actually quite good.");
// [{ label: 'POSITIVE', score: 0.9987 }]
```

Note the guard around env.backends.onnx.wasm: in some bundler configurations, the ONNX WASM backend object may not be initialized at import time, so accessing .proxy directly can throw. The conditional check prevents a runtime error without changing behavior – when the backend is available, proxy mode is disabled so the model runs directly in the current context rather than spawning an additional worker. The pipeline promise is also memoized: without that, every call to classify() would pay the pipeline setup cost again, and the fast subsequent-call numbers above would never materialize.
Text generation in a Web Worker
For anything heavier (text generation, local chat), you'll want the model running in a Web Worker to avoid blocking the main thread. The pattern is straightforward: put generation in a Worker and stream tokens back.
```typescript
// src/worker.ts
import { pipeline, TextStreamer, env } from '@huggingface/transformers';

if (env.backends.onnx.wasm) {
  env.backends.onnx.wasm.proxy = false;
}

let generator: any = null;

self.onmessage = async ({ data }) => {
  if (data.type === 'load') {
    try {
      generator = await pipeline('text-generation', data.model, {
        device: 'webgpu',
        dtype: 'q4',
        progress_callback: (info: any) => {
          if (info.status === 'downloading') {
            self.postMessage({
              type: 'progress',
              file: info.file,
              progress: Math.round((info.loaded / info.total) * 100),
            });
          }
        },
      });
      self.postMessage({ type: 'ready' });
    } catch (e) {
      self.postMessage({ type: 'error', error: e });
    }
    return;
  }

  if (data.type === 'generate') {
    if (!generator) {
      self.postMessage({ type: 'error', error: 'Model not loaded' });
      return;
    }
    const streamer = new TextStreamer(generator.tokenizer, {
      skip_prompt: true,
      callback_function: (token: string) => self.postMessage({ type: 'token', token }),
    });
    await generator(data.prompt as string, {
      max_new_tokens: 512,
      temperature: 0.7,
      streamer,
    });
    self.postMessage({ type: 'done' });
  }
};
```

The load handler is wrapped in try/catch – model loading can fail for dozens of reasons (network timeout, unsupported operator, GPU out of memory), and an unhandled rejection inside a worker is silent unless you explicitly catch and report it. The generate handler checks whether the model is actually loaded before attempting inference, and posts a typed error message back to the main thread instead of crashing.
On the main thread side:
```typescript
// src/main.ts (excerpt)
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' });

worker.onmessage = (e) => {
  const { type, token, progress, file, error } = e.data;
  if (type === 'progress') {
    updateLoadingUI(`Downloading ${file}: ${progress}%`);
  } else if (type === 'ready') {
    enableGenerateButton();
  } else if (type === 'token') {
    appendToOutput(token);
  } else if (type === 'done') {
    finalizeResponse();
  } else if (type === 'error') {
    showError(error);
  }
};

worker.postMessage({ type: 'load', model: 'Xenova/gpt2' });
```

At this point, you have a working pipeline: the model downloads once, caches in the browser, and runs locally via WebGPU. If WebGPU isn't available, Transformers.js falls back to WASM automatically. The user gets inference without ever pinging your server.
Optimizing models for the browser: performance, quantization, and limits
The honest constraint of browser-native AI is memory. A browser tab doesn't get the full GPU. The WebGPU adapter typically has access to 1-4GB of VRAM on a desktop, less on mobile. This means model selection and quantization aren't optional concerns. They determine whether your feature works at all.
Quantization
Quantization is the primary lever. A model stored in FP32 takes four bytes per parameter. Move to INT8 (q8), and it's one byte – 4x smaller with minimal quality loss. Move to INT4 (q4), and it's half a byte – 8x smaller, with some degradation on complex reasoning tasks, but typically fine for classification, summarization, and conversational tasks.
The table below shows weight-only memory for a 3B parameter model. Total runtime memory also includes KV cache and intermediate buffers, so plan for headroom beyond these numbers.
| FORMAT | MEMORY (3B PARAMS) | QUALITY TRADEOFF | BEST FOR |
|---|---|---|---|
| fp32 | ~12 GB | Reference | Not practical in a browser |
| fp16 | ~6 GB | Minimal loss | Small models on good GPUs |
| q8 | ~3 GB | Very small loss | Classification, encoding |
| q4 | ~1.5 GB | Noticeable in reasoning | Chat, generation |
| q4f16 | ~1.7 GB | Good balance | Production LLM inference |
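The memory column is simple bytes-per-parameter arithmetic. A sketch, counting 3B as 3×10⁹ parameters and reporting decimal gigabytes (actual runtime usage will be higher once the KV cache and buffers are included):

```typescript
// Weight-only footprint: parameters × bytes per parameter.
const bytesPerParam = { fp32: 4, fp16: 2, q8: 1, q4: 0.5 } as const;

function weightGB(params: number, format: keyof typeof bytesPerParam): number {
  return (params * bytesPerParam[format]) / 1e9; // decimal GB, not GiB
}

weightGB(3e9, 'fp32'); // → 12
weightGB(3e9, 'q4');   // → 1.5
```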
Mixed-precision quantization (lower precision for weights, higher for activations) gives you the best of both worlds. Transformers.js and WebLLM both support per-component dtype settings:
```typescript
import { pipeline } from '@huggingface/transformers';

const generator = await pipeline('text-generation', 'Xenova/Qwen2.5-1.5B-Instruct', {
  device: 'webgpu',
  dtype: {
    embed_tokens: 'fp16', // embeddings are sensitive -- keep precision
    lm_head: 'fp32',      // output layer -- keep full precision for output quality
    default: 'q4',        // all other weights -- aggressively quantize
  },
});
```

Caching models: avoid a first-load disaster
Model caching deserves the same attention as the inference itself. On first load, the user is downloading hundreds of megabytes. On every subsequent visit, they should hit the local cache. Transformers.js uses the Cache API automatically, but you need to build the surrounding UX, and you need to handle several things before it becomes a production incident:
- A download + warmup UX – show progress, don't leave users staring at a spinner.
- A quota check – verify storage before attempting a large download.
- An eviction story – browsers can clear cached data under storage pressure, and users should know when a re-download is happening.
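For the eviction concern, you can also ask the browser to mark the origin's storage as persistent. It's strictly best-effort, and the helper below is a sketch (the guard makes it a safe no-op outside a browser context):

```typescript
// Ask the browser to exempt this origin's storage from automatic eviction.
// Browsers may grant silently, prompt the user, or refuse based on
// engagement heuristics -- treat a false result as "eviction still possible".
export async function requestPersistence(): Promise<boolean> {
  const storage = (globalThis as any).navigator?.storage;
  if (!storage?.persist) return false;      // not a browser, or API missing
  if (await storage.persisted()) return true; // already persistent
  return storage.persist();
}
```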
```typescript
// src/utils/storage.ts
export async function checkStorageBeforeLoad(modelSizeMB: number): Promise<boolean> {
  const estimate = await navigator.storage.estimate();
  const availableMB = ((estimate.quota ?? 0) - (estimate.usage ?? 0)) / 1024 / 1024;
  if (availableMB < modelSizeMB * 1.3) {
    console.warn(`Not enough storage. Need ~${modelSizeMB}MB, have ${Math.round(availableMB)}MB`);
    return false;
  }
  return true;
}
```

When models update across versions, old cached weights can pile up. A cleanup function prevents the cache from growing unbounded:
```typescript
async function clearOldModelVersions(currentCacheName: string) {
  const keys = await caches.keys();
  await Promise.all(
    keys
      .filter(k => k.startsWith('transformers-cache-') && k !== currentCacheName)
      .map(k => caches.delete(k))
  );
}
```

Performance floor
The performance floor is worth being explicit about. On a MacBook Pro M2 with WebGPU, Phi-3.5 Mini in q4 generates around 25–35 tokens per second. On a mid-range Windows laptop with an integrated Intel GPU, the same model drops to 8–12 tokens per second. On an older iPhone via WASM (no WebGPU), it's 3–6 tokens per second. These are real numbers, not benchmarks from ideal conditions. Your UX needs to be designed for the slow case, not optimized for the fast one.
The practical implication: stream tokens as they're generated, never wait for the full response. A user watching text appear at 8 tok/s feels like real-time. A user staring at a spinner for 10 seconds, then seeing the full response dump, does not.
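One way to keep that streaming smooth is to coalesce tokens between UI updates, so a fast model doesn't trigger a DOM write per token. A minimal, framework-free sketch (the names are illustrative; in practice you'd call flush from requestAnimationFrame or a short timer):

```typescript
// Accumulate streamed tokens and emit them in batches via the sink callback.
function makeTokenBatcher(sink: (chunk: string) => void) {
  let buffer = '';
  return {
    push(token: string) { buffer += token; }, // called per 'token' worker message
    flush() {                                 // called once per UI tick
      if (buffer) { sink(buffer); buffer = ''; }
    },
  };
}
```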
What's next for WebGPU-powered frontend AI (2026 and beyond)
The trajectory here is unusually clear. Three things are happening simultaneously, and they compound.
First, model sizes are continuing to shrink without quality collapse. Phi-4 Mini, Gemma 3 2B, Qwen 2.5-0.5B – the research community is getting genuinely good results from models small enough to run comfortably in 1–2GB of browser memory. A year ago, running a useful language model locally required 7B parameters. Today, you can get surprisingly capable responses from a 500M parameter model if it's been distilled well. The floor keeps dropping.
Second, WebGPU support is broadening. Chrome and Edge are solid. Firefox shipped WebGPU by default in 2025, starting with Windows. Safari is behind but catching up – Apple has strong incentives to make WebGPU work well, given their hardware. The "check if it works first" section of this writeup exists because we're still in the phase where you can't assume support. That phase will be over within two years.
Third, the tooling layer is maturing. WebLLM already implements PagedAttention and FlashAttention in WGSL shaders, which means GPU memory is used more efficiently, and longer context windows become feasible. ONNX Runtime Web is adding support for more operator types, expanding the range of models you can run without custom porting work. The gap between "what works on a server" and "what works in a browser" is closing.
What this means practically: the use cases that make sense today (local chat assistants, real-time audio transcription, image classification, text embedding) are going to expand. Multimodal models with vision capabilities are coming to Transformers.js. Diffusion models are already running in the browser via ONNX, just slowly. Code completion at the IDE level might make sense browser-side before long if model quality holds.
The constraint that isn't going away is memory. Browser tabs have limits that desktop applications don't. A 70B model is not going to run locally in a tab in any foreseeable future. But the relevant question isn't whether you can run GPT-4 in the browser. It's whether you can run a model good enough for your specific task. For an increasing number of tasks, the answer in 2026 is yes.
For frontend developers, the decision framework is simpler than it sounds. If your feature handles sensitive data, or if you're serving users who have consistent GPU access, or if API costs are already a concern at your scale, start with browser inference. The tools are stable, the models are capable, and the privacy story is genuine. For everything else, the hosted API path still makes sense: simpler, more predictable, no first-load penalty.
The interesting middle ground is hybrid: run a small local model for low-latency tasks like autocomplete and classification, escalate to a hosted model only for complex generation or reasoning. That architecture gives you the latency benefits of local inference without betting your product on browser AI for every use case.
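That hybrid split can be made explicit in code. A sketch of a router (the task names and the local/hosted boundary are assumptions, not a prescription):

```typescript
type Task = 'autocomplete' | 'classification' | 'chat' | 'complex-reasoning';

// Route latency-sensitive, low-stakes tasks to the local model;
// escalate heavy generation and reasoning to a hosted API.
function routeTask(task: Task, localModelReady: boolean): 'local' | 'hosted' {
  if (!localModelReady) return 'hosted'; // model still downloading, or no WebGPU
  return task === 'autocomplete' || task === 'classification' ? 'local' : 'hosted';
}
```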
The browser is a serious computing environment now. That changes what's possible to build without a server, and it's worth understanding the stack well enough to use it deliberately.
Production checklist
Before shipping browser-native inference to users, verify:
- COOP/COEP enabled – `window.crossOriginIsolated === true` in both dev and production.
- Model hosting strategy compatible with COEP – if fetching from a third-party CDN, confirm CORS/CORP headers. In most cases, self-hosting on your own origin is the safest path.
- WebGPU detection uses `adapter.info` – not the obsolete `requestAdapterInfo()`.
- Generation runs in a Worker – with streaming output and error handling.
- Quantization defaults chosen per backend – `q4f16` for WebGPU, `q8` for WASM+SIMD, `fp32` as a last resort.
- Cache and storage UX – quota check before download, progress indication during first load, awareness that browsers can evict cached data under storage pressure.

