Notebook: embeddings_workshop.ipynb
We'll try to understand the underlying concepts & technologies
So you can use these tools in your own work
What happens when we "embed" text?
Text is converted into numerical representations (vectors) that can be compared
These vectors are called embeddings
Because we're "embedding" language
Into numerical space
While trying to preserve semantic relationships
To create these vectors we use specialized "embedding models"
These models are trained for this task specifically
Anchor: "The cat sat on the mat"
Positive: "A feline rested on the rug" → Pull closer
Negative: "Stocks rose 3% today" → Push apart
Maximize similarity for positive pairs, minimize for negative
Positive pairs: paraphrases, similar sentences, consecutive sentences
Negative pairs: random unrelated sentences
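The pull/push idea can be sketched as a triplet-style loss. This is a toy illustration: the 3-dimensional vectors and the `margin` value are made up, and the slides do not say which loss all-MiniLM-L6-v2 was actually trained with, so treat this as one generic formulation rather than the model's recipe.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is at least `margin` more similar than the negative."""
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - (sim_pos - sim_neg))

# Hypothetical toy embeddings (real ones come from the model being trained)
anchor = np.array([1.0, 0.2, 0.0])    # "The cat sat on the mat"
positive = np.array([0.9, 0.3, 0.1])  # "A feline rested on the rug"
negative = np.array([0.0, 0.1, 1.0])  # "Stocks rose 3% today"

loss_good = triplet_loss(anchor, positive, negative)  # positive already closer
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
print(loss_good, loss_bad)
```

Training nudges the model's weights so that the loss shrinks across many such triplets: a loss of zero means nothing to fix, a large loss means the vectors need to move.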
There's more to it
But we're not here to train our own embedding model
Let's see what this looks like using all-MiniLM-L6-v2
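from sentence_transformers import SentenceTransformer
# Load the pretrained model (downloaded on first use)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence = "The sun is shining"
# encode() returns a NumPy vector with 384 values for this model
embedding = model.encode(sentence)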
"The sun is shining" becomes...
5.42898139e-04 1.02551401e-01 8.36703405e-02 9.39253941e-02
2.07145736e-02 -1.12772360e-02 9.93586630e-02 -4.44610193e-02
3.98165248e-02 -2.13668533e-02 2.09684577e-02 -3.66757251e-02
7.11042061e-03 4.15845886e-02 1.01847239e-01 7.34366402e-02
1.43970475e-02 4.65401914e-03 9.70691442e-03 -4.16579135e-02
-2.30006799e-02 -2.27442160e-02 -2.73266062e-02 -1.22460043e-02
-8.25940259e-03 6.72213510e-02 -2.77701933e-02 2.63383351e-02
-6.12662174e-02 -1.37788551e-02 2.81139649e-02 5.81778679e-03
... etc ... ...
384 dimensions
Great. Now that we have our numerical representation, how do we compare it to others?
A vector can be thought of as a list of numbers, or a coordinate or direction in a space
Embedding models are trained so that semantically similar texts get vectors pointing in similar directions
Why compare directions and not distance between coordinates?
Because high-dimensional geometry is weird
As dimensions grow, Euclidean distances bunch together: the nearest and farthest points end up almost equally far away
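The claim about Euclidean distance can be checked numerically. A small sketch, assuming uniformly random points rather than real embeddings: it measures how much farther the farthest point is than the nearest one, relative to the nearest, as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print(f"{dim:>4} dims: relative contrast = {value:.2f}")
```

The contrast should shrink as the dimension goes up, which is why "nearest" by Euclidean distance becomes less informative.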
So most of the time, we compare vectors by their angles, by calculating cosine similarity
Let's see what that looks like
384 dimensions is hard to visualize though...
So let's imagine this in 2 dimensions
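In 2D, cosine similarity is easy to check by hand. A minimal sketch with hand-picked vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(angle) = (a . b) / (|a| * |b|), ranging from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])   # points along the x-axis
b = np.array([3.0, 0.0])   # same direction, three times longer
c = np.array([0.0, 2.0])   # perpendicular to a
d = np.array([-1.0, 0.0])  # opposite direction

print(cosine_similarity(a, b))  # 1.0: length does not matter
print(cosine_similarity(a, c))  # 0.0: perpendicular
print(cosine_similarity(a, d))  # -1.0: opposite
```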
So to come back to "The sun is shining"...
| Text | Similarity |
|---|---|
| Bring an umbrella | 0.320 |
| The weather is great | 0.418 |
| It's daytime | 0.622 |
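A ranking like this can be produced by scoring every candidate against the query and sorting. A minimal sketch, assuming the embeddings are already stacked in a NumPy array; `rank_by_similarity` is a hypothetical helper, and the 3-dimensional vectors are placeholders that will not reproduce the scores in the table.

```python
import numpy as np

def rank_by_similarity(query_vec, candidate_vecs, texts):
    """Return (text, cosine similarity) pairs, most similar first."""
    # Normalizing to unit length turns the dot product into cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return [(texts[i], float(scores[i])) for i in order]

# Placeholder data: in practice each row would be one embedded text
texts = ["candidate A", "candidate B", "candidate C"]
candidates = np.array([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.5, 0.5, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

ranking = rank_by_similarity(query, candidates, texts)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

With real data, `candidate_vecs` would presumably come from calling `model.encode(...)` on a list of texts, which returns one row per text.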
Great, we can find similar things!
What can we do with that?
Well, it would be great to visualize this space
But we can't visualize 384 dimensions
To see what our embedding space looks like
We need to reduce the dimensions
Take our vectors and project them into 2D
While preserving the structure of the data
This is called dimensionality reduction
And there are algorithms for that
We'll use UMAP
(Uniform Manifold Approximation and Projection)
UMAP works in two steps
Step 1: Build a weighted neighborhood graph
Connect each point to its k nearest neighbors
Closer neighbors get a stronger connection
Step 2: Arrange in 2D using forces
Stronger connections pull harder
The system finds equilibrium
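Step 1 can be sketched in plain NumPy. This is a deliberate simplification, not UMAP's actual implementation: `knn_graph` is a hypothetical helper, and the weighting rule (nearest neighbor gets weight 1, the rest decay with distance) only mimics the idea that closer neighbors connect more strongly.

```python
import numpy as np

def knn_graph(points, k=3):
    """Return (neighbor indices, connection weights), each of shape (n, k)."""
    # Pairwise Euclidean distances; a point is never its own neighbor
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)

    neighbors = np.argsort(dists, axis=1)[:, :k]
    neighbor_dists = np.take_along_axis(dists, neighbors, axis=1)

    # Simplified weighting: UMAP calibrates the decay per point; here the
    # scale is just each point's mean distance to its k neighbors.
    nearest = neighbor_dists[:, :1]
    scale = neighbor_dists.mean(axis=1, keepdims=True)
    weights = np.exp(-(neighbor_dists - nearest) / scale)
    return neighbors, weights

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 5))  # random stand-in data
neighbors, weights = knn_graph(points, k=3)
print(neighbors[0], weights[0].round(2))
```

Step 2 then treats these weights as spring strengths when laying the points out in 2D.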
Two parameters to know:
n_neighbors: how many neighbors to consider?
min_dist: how tightly should points pack together?
Now we can plot our documents
And actually see what's going on
When we plot them, we can see what groups together
Similar documents land in the same neighborhood
Would be cool if we could find these groupings automatically
Without telling the algorithm how many to expect
This is called clustering
There are multiple ways to do it
We'll take a look at HDBSCAN
(Hierarchical Density-Based Spatial Clustering of Applications with Noise)
The core idea: density
Where points are tightly packed → clusters
Sparse areas → noise
But what counts as "dense enough"?
Different clusters can have different densities
No single threshold works
HDBSCAN finds each cluster at its own natural density
Sweep the threshold and watch what's stable
The clusters that survive across many thresholds are the real ones
Two parameters to know:
min_cluster_size: how big must a group be?
min_samples: how surrounded must a point be?
What can we do with clusters?
You made it through the theory
Language model ❤️ markdown