Liquid AI’s 350M Models Bring Multilingual Search to Laptops

Liquid AI released two compact retrieval models on June 18, 2026, giving developers a choice between fast dense embeddings and more precise token-level retrieval.

LFM2.5 Embedding 350M converts each document into one vector, making it suitable for low-cost semantic search and large indexes. Its companion, LFM2.5-ColBERT-350M, preserves a separate representation for each token and uses late interaction to compare queries with documents more precisely.

The models matter to developers building multilingual RAG systems, product search, support assistants, local document search, and edge applications. Liquid reports NanoBEIR multilingual NDCG@10 scores of 0.577 for the embedding model and 0.605 for ColBERT, but those benchmark results were produced by the model provider and have not yet been independently replicated.

What Is LFM2.5 Embedding 350M?

LFM2.5-Embedding-350M is a dense bi-encoder built on Liquid AI’s LFM2.5-350M backbone.

A bi-encoder processes queries and documents independently. Each document becomes one fixed-size vector, and each query becomes another vector. Retrieval then ranks documents by cosine similarity between the query vector and each document vector, which is the same as a dot product once the vectors are normalized to unit length.
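
As a rough illustration of that ranking step, the sketch below scores toy vectors with cosine similarity in plain Python. The function names and the 4-dimensional vectors are hypothetical stand-ins; real embeddings from the model would have 1,024 dimensions, and a production system would normally delegate this work to a vector database.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, cosine_score) pairs, best match first."""
    q = normalize(query_vec)
    scored = [(i, dot(q, normalize(d))) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (assumption: illustrative values, not model output).
docs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
]
query = [1.0, 0.0, 0.0, 0.0]

ranking = rank_documents(query, docs)
print([doc_id for doc_id, _ in ranking])  # [0, 2, 1]
```

Document vectors can be normalized once at indexing time, so only the query needs normalizing at search time.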

Liquid’s model produces a 1,024-dimensional CLS vector for each query or document. It supports documents up to 512 tokens and uses separate query and document prompts.

This approach is attractive because document vectors can be calculated once and stored in a standard vector database. At search time, only the query needs to be embedded.

The result is a small, fast, and operationally simple index.

How LFM2.5-ColBERT-350M Works

LFM2.5-ColBERT-350M uses the same general 350M-parameter backbone but represents text differently.

Instead of collapsing an entire passage into one vector, it creates a 128-dimensional vector for each token. A query is also represented at token level.

The model then scores documents with MaxSim late interaction. For every query token, the system finds the most similar token in the document and keeps that similarity; the document's score is the sum of these per-token maxima.

This helps preserve fine-grained matches that may disappear inside one dense document vector.
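
A minimal sketch of MaxSim scoring, assuming token vectors are already available as lists of floats and that similarity is a dot product over unit-length vectors. The function names and 2-dimensional toy vectors are hypothetical; the real model emits 128-dimensional token vectors.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best match among the document tokens,
    then sum those maxima. Assumes doc_tokens is non-empty."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy unit vectors (assumption: illustrative values, not model output).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a > score_b)  # True
```

Because each query token is matched independently, a document that contains a strong match for every query term outranks one that is only loosely similar overall.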

For example, a user searching for “battery overheating after firmware update” may benefit when the retriever separately recognizes the significance of “battery,” “overheating,” and “firmware update,” rather than compressing the entire sentence into one representation. Liquid recommends ColBERT when ranking quality and generalization matter more than index size.

Dense Embeddings vs ColBERT Late Interaction

Feature	LFM2.5-Embedding-350M	LFM2.5-ColBERT-350M
Representation	One vector per document	One vector per token
Vector size	1,024 dimensions	128 dimensions per token
Similarity method	Cosine similarity	MaxSim
Index footprint	Smaller	Larger
Search architecture	Standard vector retrieval	Multi-vector late interaction
Main advantage	Speed and storage efficiency	Higher retrieval precision
Best fit	Large catalogs, local search, first-stage retrieval	Reranking, high-value RAG, detailed matching

The storage difference can be substantial.

A dense system stores one vector for every chunk. A ColBERT system may store dozens or hundreds of token vectors for the same chunk. Even though each ColBERT vector has fewer dimensions, the total index can still become much larger.

Figure: Dense embedding retrieval compared with ColBERT token-level late interaction. Dense retrieval favors compact indexes; ColBERT preserves finer query-document matches.

The exact storage requirement depends on average document length, quantization, compression, token filtering, and index format. Liquid does not publish one universal storage multiplier because deployment choices vary.
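
To get a feel for the gap, the following back-of-the-envelope sketch counts raw vector bytes only. The corpus size, the 200-token average chunk length, and 2-byte (16-bit) values are assumptions chosen for illustration; index overhead, compression, and token pruning are ignored, so real deployments will differ.

```python
def dense_bytes(num_chunks, dims=1024, bytes_per_value=2):
    """Raw storage for one vector per chunk."""
    return num_chunks * dims * bytes_per_value

def colbert_bytes(num_chunks, avg_tokens_per_chunk, dims=128, bytes_per_value=2):
    """Raw storage for one vector per token in every chunk."""
    return num_chunks * avg_tokens_per_chunk * dims * bytes_per_value

num_chunks = 1_000_000   # hypothetical corpus size
avg_tokens = 200         # hypothetical average chunk length in tokens

dense = dense_bytes(num_chunks)                # 2,048,000,000 bytes (about 2 GB)
colbert = colbert_bytes(num_chunks, avg_tokens)  # 51,200,000,000 bytes (about 51 GB)
ratio = colbert / dense                        # 25.0

print(f"dense: {dense / 1e9:.1f} GB, ColBERT: {colbert / 1e9:.1f} GB, ratio: {ratio:.1f}x")
```

Under these assumptions the multi-vector index breaks even with the dense one at only 8 tokens per chunk (8 × 128 = 1,024 values), which is why quantization and token filtering matter so much for ColBERT deployments.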

What Is Genuinely New?

Liquid AI previously released LFM2-ColBERT-350M. The newer LFM2.5 release changes more than the name.

The company says the new models use the newer LFM2.5 checkpoint, increase language coverage, add explicit multilingual and cross-lingual training, and introduce a dense companion model built with the same architecture and training recipe.

Both are also the first bidirectional members of the LFM family.

Generative language models normally process text causally, meaning each token sees only earlier tokens. Retrieval models benefit from bidirectional attention because every token can use context from both sides.
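
The difference can be pictured as a visibility mask over token positions. The sketch below illustrates the general idea only and says nothing about how Liquid implemented it; the function names are hypothetical.

```python
def causal_mask(n):
    """Row i marks which positions token i may attend to: itself and earlier tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every position, before and after it."""
    return [[1] * n for _ in range(n)]

tokens = ["battery", "overheating", "after", "update"]
causal = causal_mask(len(tokens))
full = bidirectional_mask(len(tokens))

for name, row_c, row_f in zip(tokens, causal, full):
    print(f"{name:12s} causal={row_c} bidirectional={row_f}")
```

In the causal case the first token sees only itself, so its representation cannot reflect what follows; with the full mask every token representation can draw on the whole passage.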

Liquid adapted the LFM2.5 backbone into a full-context encoder, then trained the models through:

Large-scale English contrastive pretraining
Multilingual and cross-lingual distillation
Fine-tuning with hard-mined negatives

The models support Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish.

Benchmark Audit

Liquid evaluated the models on NanoBEIR Multilingual Extended and MKQA-11.

Benchmark	Metric	Embedding model	ColBERT model	Reported difference	Evaluation owner	Independent verification
NanoBEIR Multilingual Extended	Average NDCG@10	0.577	0.605	ColBERT +0.028	Liquid AI	Not yet independently reproduced
MKQA-11	Average Recall@20	0.691	0.694	ColBERT +0.003	Liquid AI	Not yet independently reproduced

NDCG@10 measures how effectively a system places relevant documents near the top of the first ten results. NanoBEIR is a reduced version of the broader BEIR suite, with a maximum of 50 queries per included benchmark. This makes evaluation faster, but also means results may be more sensitive to small dataset changes than a larger production test.
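
For readers who want to compute the metric on their own data, here is a minimal sketch of NDCG@k using the common linear-gain formulation. Some evaluation toolkits use an exponential gain instead, and this is not taken from Liquid's evaluation code; the function names and the example judgments are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """ranked_relevances: relevance grade of each returned result, in ranked order.
    all_relevances: grades of every judged relevant document for the query,
    used to build the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg else 0.0

# Two relevant documents exist; the system returned them at ranks 2 and 4.
late_hits = ndcg_at_k([0, 1, 0, 1], [1, 1])
# The same two documents returned at ranks 1 and 2.
perfect = ndcg_at_k([1, 1, 0, 0], [1, 1])

print(round(late_hits, 3), round(perfect, 3))  # 0.651 1.0
```

The example shows why the metric rewards ordering and not just presence: both result lists contain the same relevant documents, but the later placement scores about 0.65 against 1.0.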

The reported NanoBEIR gain is modest: 0.605 versus 0.577 is an absolute increase of 0.028, or roughly 4.9% relative to the dense model’s score. The MKQA-11 gap of 0.003 is smaller still.

The comparison does not include a full accounting of index size, retrieval throughput, memory usage, or cost at equivalent recall.

Important missing details include:

Production index sizes for the same corpus
Query throughput at high concurrency
Performance after vector quantization
Recall and latency at equivalent candidate counts
Results on long documents beyond 512 tokens
Independent multilingual benchmark submissions
Accuracy on domain-specific enterprise data

Deployment Through GGUF and llama.cpp

Liquid released GGUF versions of both retrieval models for local execution through llama.cpp. The official model cards include multiple quantization choices.

For ColBERT, the GGUF model can be launched as an embedding server:

llama-server \
  -hf LiquidAI/LFM2.5-ColBERT-350M-GGUF \
  --embeddings

The application must still compute MaxSim scores over the returned token vectors. ColBERT is not a normal chat model, and serving it through an OpenAI-compatible endpoint does not automatically provide a complete vector index or retrieval engine.

Liquid reports BF16 query-embedding latency of 7.3 milliseconds for the dense model and 8.1 milliseconds for ColBERT on an M4 Max through llama.cpp. Query embedding plus cached-document MaxSim reportedly reached a median of 8.2 milliseconds for ColBERT. These are company-run measurements on a specified short query and document setup, not independent hardware benchmarks.

Which Model Is Better for RAG?

The answer depends on the retrieval stage.

Use LFM2.5-Embedding-350M when:

The corpus contains millions of chunks
Index cost matters
Latency must remain predictable
A standard vector database is already in place
Retrieval is the first stage of a larger pipeline
The application runs on a laptop or edge device

Use LFM2.5-ColBERT-350M when:

Fine-grained lexical and semantic matches matter
Wrong retrieval results are expensive
Multilingual queries must match documents in another language
The system can afford a multi-vector index
ColBERT is used to rerank a smaller candidate set
Tool or API selection requires precise matching

A practical hybrid architecture may use the dense model to retrieve the top 50 or 100 candidates, then use ColBERT to rerank them.

Figure: Hybrid RAG workflow using dense retrieval and ColBERT reranking. A hybrid pipeline can combine compact retrieval with more precise reranking.

That approach limits the storage and scoring burden while preserving some of the late-interaction quality advantage.
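
The two-stage idea can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the in-memory dictionaries `dense_index` and `token_index`, the function names, and the toy vectors are all hypothetical, and a real system would use a vector database for the first stage.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def hybrid_search(query_vec, query_tokens, dense_index, token_index,
                  first_stage_k=50, final_k=5):
    """Stage 1: cheap dense retrieval. Stage 2: MaxSim rerank of the survivors."""
    candidates = sorted(
        dense_index,
        key=lambda doc_id: dot(query_vec, dense_index[doc_id]),
        reverse=True,
    )[:first_stage_k]
    reranked = sorted(
        candidates,
        key=lambda doc_id: maxsim(query_tokens, token_index[doc_id]),
        reverse=True,
    )
    return reranked[:final_k]

# Toy data (assumption: illustrative values, not model output).
dense_index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
token_index = {
    "a": [[0.6, 0.8]],
    "b": [[1.0, 0.0], [0.0, 1.0]],
    "c": [[1.0, 0.0], [0.0, 1.0]],
}

result = hybrid_search(
    query_vec=[1.0, 0.0],
    query_tokens=[[1.0, 0.0], [0.0, 1.0]],
    dense_index=dense_index,
    token_index=token_index,
    first_stage_k=2,
    final_k=2,
)
print(result)  # ['b', 'a']
```

In this toy run the reranker promotes "b" over "a", but "c" never reaches the second stage because the dense retriever dropped it. That is the main trade-off of the design: the reranker can only reorder what the first stage returns, so the candidate count should be tuned against recall.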

Why This Matters

Retrieval quality often limits the performance of a RAG application more than the final language model.

If the system retrieves the wrong document, a strong generator may still produce a confident but incorrect answer.

Small retrieval models also create new deployment options. A company can keep document embeddings, search traffic, and potentially sensitive queries on local hardware rather than sending them to a hosted embedding API.

This may reduce cloud cost and improve privacy, although local deployment still requires secure storage, access control, logging, and protection of the index itself.

Limitations and Risks

The models are designed primarily for short-context retrieval. The model cards specify a 512-token document length and a 32-token query length for ColBERT. Long reports must therefore be split into chunks. Poor chunking can erase the quality advantage of the retriever.
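
One simple way to respect the limit is a sliding window with overlap, sketched below. The 64-token overlap and the function name are assumptions, and the whitespace split is only a stand-in: real token counts must come from the model's own tokenizer, which usually produces more tokens than words.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of at most max_tokens,
    with consecutive windows sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Stand-in for a long report: 1,000 whitespace-separated "tokens".
report = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(report.split())

print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 104]
```

Fixed windows are the bluntest option. Splitting on headings or paragraph boundaries first, and falling back to windows only for oversized sections, tends to keep related sentences together.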

ColBERT also adds operational complexity:

More vectors must be stored
Index construction takes longer
MaxSim scoring requires specialized retrieval logic
Database support is less universal
Updates and deletions are more expensive
Debugging ranking behavior is harder

The models support only 11 languages. Performance may also vary across legal, medical, scientific, or highly specialized corpora.

Teams should evaluate on their own queries rather than selecting a model only from the provider’s average benchmark score.
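
A small in-house evaluation does not need much machinery. The sketch below computes mean Recall@k from a set of queries with hand-labeled relevant documents; it uses a common definition of recall that may not match the exact MKQA-11 protocol, and the function names, query IDs, and document IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(results, judgments, k=20):
    """Average recall over every judged query.
    results: query id -> ranked list of document ids returned by the retriever.
    judgments: query id -> set of document ids labeled relevant."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical retriever output and labels for two queries.
results = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]}
judgments = {"q1": {"d1", "d4"}, "q2": {"d2"}}

score = mean_recall_at_k(results, judgments, k=2)
print(score)  # 0.75
```

Running the same judged queries through both models, with identical chunking, gives a like-for-like comparison on the data that actually matters to the application.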

Simple Explanation for Beginners

A dense embedding model turns an entire paragraph into one summary point on a map.

ColBERT keeps a separate point for many individual words.

The summary point is cheaper to store and faster to search.

The word-level map takes more space, but it can notice detailed matches that the summary misses.

Conclusion: LFM2.5 Embedding 350M

LFM2.5 Embedding 350M gives developers a compact option for multilingual semantic search, while LFM2.5-ColBERT-350M offers stronger reported retrieval quality through token-level late interaction.

Liquid’s NanoBEIR results favor ColBERT, but the accuracy gain must be weighed against index size, scoring complexity, and operational cost.

For most large RAG deployments, the dense model is the simpler first-stage retriever. ColBERT is more attractive when retrieval mistakes are costly or as a second-stage reranker over a smaller candidate set.

The release is notable not because one model replaces the other, but because Liquid AI provides two deployment points on the same speed-versus-quality curve.

Final Takeaways

Liquid AI released both models on June 18, 2026.
Each model contains about 350 million parameters.
The dense model stores one 1,024-dimensional vector per document.
ColBERT stores one 128-dimensional vector per token.
Liquid reports NanoBEIR NDCG@10 scores of 0.577 and 0.605.
The benchmark results are provider-reported.
ColBERT offers higher reported quality but requires a larger index.
Both support 11 languages and cross-lingual retrieval.
GGUF versions can run locally through llama.cpp.
Dense retrieval is simpler for large-scale first-stage search.
ColBERT is well suited to precision-focused retrieval and reranking.

FAQ: LFM2.5 Embedding 350M

What is LFM2.5 Embedding 350M?

It is a 350M-parameter multilingual dense retrieval model from Liquid AI. It converts each query or document into one 1,024-dimensional vector for cosine-similarity search.

How is ColBERT different from dense embeddings?

Dense embeddings compress a document into one vector. ColBERT creates vectors for individual tokens and scores query-document matches using MaxSim, which preserves more detailed interactions.

Which Liquid AI retrieval model should I use?

Use the embedding model when storage, speed, and standard vector-database compatibility matter most. Use ColBERT when higher retrieval precision justifies a larger, more complex index.

Can LFM2.5 Embedding run locally?

Yes. Liquid provides GGUF variants for llama.cpp, allowing local execution on supported CPUs, laptops, and edge hardware.

What languages do the models support?

They support English, Spanish, German, French, Italian, Portuguese, Arabic, Swedish, Norwegian, Japanese, and Korean.

Is LFM2.5 ColBERT better for RAG?

It may retrieve more relevant passages, especially for precise or multilingual matching, but it requires more storage and specialized MaxSim scoring. Its value should be tested on the target dataset.
