Small Language Models: Edge AI Innovation From AI21

While most of the AI world is racing to build ever-bigger language models like OpenAI’s GPT-5 and Anthropic’s Claude Sonnet 4.5, the Israeli AI startup AI21 is taking a different path.

AI21 has just unveiled Jamba Reasoning 3B, a 3-billion-parameter model. This compact, open-source model can handle massive context windows of 250,000 tokens (meaning that it can “remember” and reason over much more text than typical language models) and can run at high speed, even on consumer devices. The launch highlights a growing shift: smaller, more efficient models could shape the future of AI just as much as raw scale.

“We believe in a more decentralized future for AI, one where not everything runs in massive data centers,” says Ori Goshen, Co-CEO of AI21, in an interview with IEEE Spectrum. “Large models will still play a role, but small, powerful models running on devices will have a significant impact” on both the future and the economics of AI, he says. Jamba is built for developers who want to create edge-AI applications and specialized systems that run efficiently on-device.

AI21’s Jamba Reasoning 3B is designed to handle long sequences of text and challenging tasks like math, coding, and logical reasoning, all while running with impressive speed on everyday devices like laptops and mobile phones. Jamba Reasoning 3B can also work in a hybrid setup: Simple jobs are handled locally by the device, while heavier problems get sent to powerful cloud servers. According to AI21, this smarter routing could dramatically cut AI infrastructure costs for certain workloads, potentially by an order of magnitude.
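As a rough illustration of the hybrid setup described above, the sketch below routes each task either to the device or to the cloud. It is not AI21's routing logic, which the article does not describe. The `Task` fields, the difficulty score, and the `LOCAL_DIFFICULTY_LIMIT` threshold are hypothetical; only the 250,000-token context window comes from the article.

```python
from dataclasses import dataclass

# Context window of the on-device model, as reported in the article.
LOCAL_CONTEXT_LIMIT = 250_000
# Hypothetical threshold: tasks scored above this are treated as "heavy".
LOCAL_DIFFICULTY_LIMIT = 0.6


@dataclass
class Task:
    prompt_tokens: int  # length of the input in tokens
    difficulty: float   # assumed score from 0.0 (trivial) to 1.0 (very hard)


def route(task: Task) -> str:
    """Return "local" for simple jobs and "cloud" for heavier ones."""
    if task.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"  # input does not fit in the local context window
    if task.difficulty > LOCAL_DIFFICULTY_LIMIT:
        return "cloud"  # judged too hard for the small model
    return "local"


def local_share(tasks: list[Task]) -> float:
    """Fraction of tasks that stay on the device."""
    if not tasks:
        return 0.0
    return sum(route(t) == "local" for t in tasks) / len(tasks)
```

In a setup like this, any savings would depend on how large a share of the workload stays local, which is why the claimed cost reduction applies only to certain workloads.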

A Small but Mighty LLM

With 3 billion parameters, Jamba Reasoning 3B is tiny by today’s AI standards. Models like GPT-5 or Claude run well past 100 billion parameters, and even smaller models, such as Llama 3 (8B) or Mistral (7B), are more than twice the size of AI21’s model, Goshen notes.

That compact size makes it more remarkable that AI21’s model can handle a context window of 250,000 tokens on consumer devices. Some proprietary models, like GPT-5, offer even longer context windows, but Jamba sets a new high-water mark among open-source models. The previous open-model record of 128,000 tokens was held by Meta’s Llama 3.2 (3B), Microsoft’s Phi-4 Mini, and DeepSeek R1, which are all much larger models. Jamba Reasoning 3B can process more than 17 tokens per second even when working at full capacity, that is, with extremely long inputs that use its full 250,000-token context window. Many other models slow down or struggle once their input length exceeds 100,000 tokens.

Goshen explains that the model is built on an architecture called Jamba, which combines two types of neural network designs: transformer layers, familiar from other large language models, and Mamba layers, which are designed to be more memory-efficient. This hybrid design enables the model to handle long documents, large codebases, and other extensive inputs directly on a laptop or phone, using about one-tenth the memory of traditional transformers. Goshen says the model runs much faster than traditional transformers because it relies less on a memory component called the KV cache, which can slow down processing as inputs get longer.
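To see why leaning less on the KV cache matters at long context lengths, here is a back-of-the-envelope estimate of KV cache size. The formula is the standard one for attention layers (one key and one value vector per token, per KV head, per layer). The layer counts, head counts, and head dimension are hypothetical and are not AI21's published configuration. The sketch also ignores the fixed-size state that Mamba layers keep, which does not grow with input length.

```python
def kv_cache_bytes(
    n_attention_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes per value, e.g. 16-bit floats
) -> int:
    """Approximate KV cache size for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_attention_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shapes, chosen only for illustration.
SEQ_LEN = 250_000  # full context window cited in the article
N_KV_HEADS = 8
HEAD_DIM = 128

# A pure transformer in which all 32 layers use attention.
pure_transformer = kv_cache_bytes(32, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

# A hybrid in which only 4 of the 32 layers use attention.
hybrid = kv_cache_bytes(4, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"pure transformer: {pure_transformer / 1e9:.1f} GB")
print(f"hybrid:           {hybrid / 1e9:.1f} GB")
print(f"ratio:            {pure_transformer / hybrid:.0f}x")
```

Because the cache grows linearly with both sequence length and the number of attention layers, replacing most attention layers with layers that keep a fixed-size state shrinks it proportionally. The exact ratio for Jamba depends on its real layer mix, which this sketch does not attempt to reproduce.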

Why Small LLMs Are Needed

The model’s hybrid architecture gives it an advantage in both speed and memory efficiency, even with very long inputs, confirms a software engineer who works in the LLM industry. The engineer requested anonymity because they’re not authorized to comment on other companies’ models. As more users run generative AI locally on laptops, models need to handle long context lengths quickly without consuming too much memory. At 3 billion parameters, Jamba meets these requirements, says the engineer, making it a model that’s optimized for on-device use.

Jamba Reasoning 3B is open source under the permissive Apache 2.0 license and available on popular platforms such as Hugging Face and LM Studio. The release also comes with instructions for fine-tuning the model through an open-source reinforcement-learning platform (called VERL), making it easier and more affordable for developers to adapt the model for their own tasks.

“Jamba Reasoning 3B marks the beginning of a family of small, efficient reasoning models,” Goshen said. “Scaling down enables decentralization, personalization, and cost efficiency. Instead of relying on expensive GPUs in data centers, individuals and enterprises can run their own models on devices. That unlocks new economics and broader accessibility.”
