Embedding models convert text into vectors, and cosine similarity between those vectors tells you how related two texts are. But what does a score of 0.4 actually mean? Is that good? What does a paragraph score against its own article? What about an irrelevant question - where does it land?
I wanted to build intuition for these numbers, so I measured them.
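Throughout, "similarity" means plain cosine similarity - the normalized dot product of two embedding vectors. A minimal helper (mine, not from any library) for reference:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: normalized dot product of two vectors, in [-1, 1]."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```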
## The Experiment
I took the Wikipedia article on fly fishing and measured cosine similarity between the full article and chunks of different sizes - random paragraphs, sentences, and individual words from it. I also compared the article against relevant questions, irrelevant questions, and random English words. I ran this on two embedding models: OpenAI text-embedding-3-large and Amazon Titan Embeddings v2.
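The actual script is longer, but the core loop looks roughly like this. A simplified sketch, assuming the `cosine_similarity` helper above, LlamaIndex's `OpenAIEmbedding` wrapper, and deliberately naive splitting (the file path is a placeholder):

```python
import random
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

article = open("fly_fishing.txt").read()[:8192]  # placeholder path
article_vec = embed_model.get_text_embedding(article)

# Naive splitting - good enough for a rough measurement.
pools = {
    "paragraph": article.split("\n\n"),
    "sentence": article.split(". "),
    "word": article.split(),
}

for kind, pool in pools.items():
    sample = random.sample(pool, 10)  # 10 random chunks per size
    scores = [
        cosine_similarity(article_vec, embed_model.get_text_embedding(chunk))
        for chunk in sample
    ]
    print(f"{kind}: {sum(scores) / len(scores):.2f}")
```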
## Chunk Size vs. Similarity

No surprises in the ranking - bigger chunks score higher:
| Chunk type | OpenAI | Titan |
|---|---|---|
| Word from article | 0.10 | 0.05 |
| Sentence from article | 0.40 | 0.32 |
| Paragraph from article | 0.66 | 0.55 |
What’s worth noting is the magnitude. A paragraph taken verbatim from the article captures only about two-thirds of its semantic signal (0.66 on OpenAI). A sentence - about a third. A single word - practically nothing.
## Where Queries Land on That Scale

This is where it gets interesting. Here’s how queries compare to those chunk baselines:
| Input | OpenAI | Titan |
|---|---|---|
| Relevant question | 0.44 | 0.38 |
| Random word (from article) | 0.10 | 0.05 |
| Irrelevant question | 0.08 | 0.015 |
| Random English word | ~0.10 | ~0.05 |
A few things stand out:
A relevant question scores about the same as a random sentence from the article (0.44 vs 0.40 on OpenAI, 0.38 vs 0.32 on Titan). This makes intuitive sense - a question about fly fishing carries roughly as much topical signal as a random sentence from the article itself.
Whether a word appears in the article or not doesn’t matter. A random English word scores the same as a word pulled directly from the article (~0.10 on OpenAI, ~0.05 on Titan). At the single-word level, there’s no meaningful similarity signal - it’s all noise.
An irrelevant question scores worse than a random word (0.08 vs 0.10 on OpenAI, 0.015 vs 0.05 on Titan). A coherent sentence about the wrong topic gets actively pushed away from the article’s embedding, while a random word just floats at baseline. Being specifically off-topic is worse than being meaningless.
## What This Means for RAG
Calibrate thresholds against your specific model. Both models agree on the hierarchy, but the absolute numbers differ. A threshold that works for OpenAI won’t transfer to Titan without re-measuring.
Test with irrelevant queries. Most people benchmark RAG only with queries they expect to match. The data shows that the gap between “relevant” and “noise” is where your threshold needs to live - and you can’t know where that is without measuring both sides.
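A calibration pass can be as simple as scoring a handful of known-relevant and known-irrelevant queries against your documents and placing the threshold in the gap between them. A sketch under the same assumptions as above (the question lists and the midpoint rule are illustrative, not a recommendation):

```python
# Illustrative question sets - substitute queries from your own domain.
relevant = [
    "What flies work best for trout?",
    "How do you cast a fly rod?",
]
irrelevant = [
    "How do I file my taxes online?",
    "What year did the Berlin Wall fall?",
]

def avg_score(queries: list[str]) -> float:
    return sum(
        cosine_similarity(article_vec, embed_model.get_query_embedding(q))
        for q in queries
    ) / len(queries)

rel, irr = avg_score(relevant), avg_score(irrelevant)
threshold = (rel + irr) / 2  # midpoint of the gap; tune to your tolerance
print(f"relevant={rel:.2f}  irrelevant={irr:.2f}  threshold={threshold:.2f}")

# Re-run this per model: the hierarchy transfers, the numbers don't.
```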
## Methodology
- Source text: Wikipedia article on fly fishing, truncated to 8,192 characters
- Chunk sampling: 10 random chunks of each type (paragraph, sentence, word), with scores averaged across 5 runs
- Similarity metric: Cosine similarity via LlamaIndex
- Models: OpenAI text-embedding-3-large, Amazon amazon.titan-embed-text-v2:0
- Relevant questions: LLM-generated from the article content
- Irrelevant questions: Manually curated on unrelated topics
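For the Titan side, the setup is the same through LlamaIndex’s Bedrock wrapper. A hedged sketch - the import paths and the `model_name` argument match the llama-index versions I’ve used, but they shift between releases, so verify against your install:

```python
# LlamaIndex also ships its own similarity helper; DEFAULT mode is cosine.
from llama_index.core.base.embeddings.base import SimilarityMode, similarity
from llama_index.embeddings.bedrock import BedrockEmbedding

titan = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")

a = titan.get_text_embedding("Fly fishing is an angling method using a light lure.")
b = titan.get_query_embedding("What flies work best for trout?")
print(similarity(a, b, mode=SimilarityMode.DEFAULT))
```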
