Transformers.js v4: WebGPU-Powered AI Now Runs Locally in Browsers and Node.js

Hugging Face releases Transformers.js v4 with a complete WebGPU runtime rewrite, enabling 100% local AI inference in browsers, Node.js, and Deno with up to 4x speedups.
Author: Robo AI Digest
Published: 2026-02-11 08:00

Running state-of-the-art AI models locally just got a major upgrade. Hugging Face has released Transformers.js v4, featuring a complete WebGPU runtime rewrite that delivers dramatically better performance while running models entirely in the browser or server-side JavaScript environments.

After nearly a year of development, this major release represents the most significant overhaul of the library since its inception. The new architecture leverages ONNX Runtime’s WebGPU support, enabling hardware-accelerated inference across browsers, Node.js, and Deno from the same codebase.

Why It Matters

The shift to WebGPU isn’t just technical jargon — it fundamentally changes what’s possible with client-side AI:

  • Offline-first: Full offline support with local WASM caching after initial download
  • Cross-platform: Single codebase runs in browsers, Node.js, Bun, and Deno
  • Performance gains: Up to 4x speedup for BERT embedding models using optimized operators
  • Larger models: Support for models exceeding 8B parameters (GPT-OSS 20B tested at ~60 tokens/sec on an M4 Max)

For developers, this means deploying sophisticated AI features without relying on backend API calls or worrying about server costs.
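
In practice, that can be as little as one pipeline call. Below is a minimal sketch: the model name is only an example, and `pickDevice` and `embed` are hypothetical helpers written for this post, not part of the library.

```javascript
// Prefer WebGPU when the environment exposes it; otherwise fall back to WASM.
function pickDevice() {
  return typeof navigator !== 'undefined' && 'gpu' in navigator ? 'webgpu' : 'wasm';
}

// Compute sentence embeddings entirely client-side, with no backend API call.
async function embed(texts) {
  const { pipeline } = await import('@huggingface/transformers');
  const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    device: pickDevice(),
  });
  return extractor(texts, { pooling: 'mean', normalize: true });
}
```

The same file runs unchanged in browsers, Node.js, and Deno; after the first download, model weights are served from the local cache.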

The Technical Core

The v4 release introduces several architectural improvements:

New WebGPU Runtime

The entire runtime was rewritten in C++ in close collaboration with the ONNX Runtime team. This enables support for custom operators such as GroupQueryAttention, MatMulNBits, and QMoE, which power modern LLM architectures.

Repository Restructuring

Transformers.js has evolved from a single package into a monorepo using pnpm workspaces. This allows shipping focused sub-packages without the overhead of maintaining separate repositories.

Standalone Tokenizers

The tokenization logic is now available as a separate @huggingface/tokenizers library: just 8.8kB gzipped with zero dependencies.
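
As a sketch of what lightweight, local tokenization enables: counting tokens without sending text to a server. This uses the main package's AutoTokenizer (the standalone library's exact API may differ); `countTokens` is a hypothetical helper and the model name is only an example.

```javascript
// Count tokens for a string without any server round-trip.
async function countTokens(text, model = 'Xenova/bert-base-uncased') {
  const { AutoTokenizer } = await import('@huggingface/transformers');
  const tokenizer = await AutoTokenizer.from_pretrained(model);
  const { input_ids } = await tokenizer(text);
  return input_ids.dims.at(-1); // sequence length, including special tokens
}
```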

Build System Migration

Moving from Webpack to esbuild reduced build times from 2 seconds to 200 milliseconds, while bundle sizes decreased by 10% (transformers.web.js is now 53% smaller).

New Model Support

Version 4 adds support for cutting-edge architectures including GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Yitu-LLM. Together, these bring support for modeling techniques such as:

  • Mamba (state-space models)
  • Multi-head Latent Attention (MLA)
  • Mixture of Experts (MoE)
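
Running one of these LLMs follows the same pipeline pattern as any other task. A sketch of local chat-style generation; the model name, generation options, and the `chat` helper are illustrative examples, not taken from the release notes.

```javascript
// Run a small chat model locally with WebGPU acceleration.
async function chat(prompt) {
  const { pipeline } = await import('@huggingface/transformers');
  const generator = await pipeline(
    'text-generation',
    'onnx-community/Qwen2.5-0.5B-Instruct', // example model
    { device: 'webgpu' },
  );
  const messages = [{ role: 'user', content: prompt }];
  const output = await generator(messages, { max_new_tokens: 128 });
  // The pipeline returns the full message history; the last entry is the reply.
  return output[0].generated_text.at(-1).content;
}
```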

Key Takeaways

  1. Local-first AI: Run SOTA models completely offline in browsers or Node.js
  2. 4x faster inference: WebGPU + optimized ONNX operators deliver significant speedups
  3. Cross-runtime compatibility: Same code works across all JavaScript environments
  4. Expanded model support: New architectures including MoE and state-space models
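
The local-first takeaway comes down to a few settings on the library's env module. A sketch, assuming you self-host the model files; the `/models/` path and the `configureOffline` helper are examples invented for this post.

```javascript
// Configure Transformers.js to resolve models from a self-hosted directory
// instead of the Hugging Face Hub.
function configureOffline(settings) {
  settings.allowRemoteModels = false;   // never download from the Hub
  settings.localModelPath = '/models/'; // serve weights from your own origin
  settings.useBrowserCache = true;      // cache them via the browser's Cache API
  return settings;
}

// In an application:
//   const { env } = await import('@huggingface/transformers');
//   configureOffline(env);
```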

Install the preview with npm i @huggingface/transformers@next and explore examples in the Transformers.js repository.