Welcome

Notebook: embeddings_workshop.ipynb

Slides: resolveworks.github.io/dataharvest2026

Repo: github.com/resolveworks/dataharvest2026

Rui Barros

Navigating Text in High Dimensions

Who are we?

I'm Johan

I'm Ada

What we'll cover today

  • Text embeddings
  • Reducing dimensionality with UMAP
  • Clustering with HDBSCAN

We'll try to understand
the underlying concepts & technologies

So you can use these tools in your own work

So what is embedding?

What happens when we "embed" text?

Embedding converts text into numerical representations (vectors) that can be compared

These vectors are called embeddings

Because we're "embedding" language

Into numerical space

While trying to preserve semantic relationships

To create these vectors we use specialized
"embedding models"

These models are trained for this task specifically


Anchor: "The cat sat on the mat" 
Positive: "A feline rested on the rug" → Pull closer 
Negative: "Stocks rose 3% today"       → Push apart
          

Maximize similarity for positive pairs, minimize for negative

Positive pairs: paraphrases, similar sentences, consecutive sentences

Negative pairs: random unrelated sentences
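One common way to write down the "pull closer, push apart" idea is a triplet margin loss. This is a minimal sketch on hand-picked 2D toy vectors: the margin of 0.5 and the function names are assumptions made for illustration, and real embedding models may be trained with a different contrastive objective.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is more similar than the negative by at least `margin`."""
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - sim_pos + sim_neg)

# Toy vectors standing in for embeddings (assumed values, not model output)
anchor = (1.0, 0.0)     # "The cat sat on the mat"
positive = (0.9, 0.1)   # "A feline rested on the rug"
negative = (-1.0, 0.2)  # "Stocks rose 3% today"

good_loss = triplet_loss(anchor, positive, negative)
bad_loss = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
print(good_loss, bad_loss)
```

Training nudges the model's weights to shrink this loss: a loss of zero means the pairs are already arranged the way we want, while a large loss means the positive still sits further away than the negative.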

There's more to it

But we're not here to train our own embedding model

Let's see what this looks like using all-MiniLM-L6-v2


from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentence = "The sun is shining"
embedding = model.encode(sentence)
            

"The sun is shining" becomes...

 5.42898139e-04  1.02551401e-01  8.36703405e-02  9.39253941e-02
 2.07145736e-02 -1.12772360e-02  9.93586630e-02 -4.44610193e-02
 3.98165248e-02 -2.13668533e-02  2.09684577e-02 -3.66757251e-02
 7.11042061e-03  4.15845886e-02  1.01847239e-01  7.34366402e-02
 1.43970475e-02  4.65401914e-03  9.70691442e-03 -4.16579135e-02
-2.30006799e-02 -2.27442160e-02 -2.73266062e-02 -1.22460043e-02
-8.25940259e-03  6.72213510e-02 -2.77701933e-02  2.63383351e-02
-6.12662174e-02 -1.37788551e-02  2.81139649e-02  5.81778679e-03
... etc ... ...

384 dimensions
            

Great. Now that we have our numerical representation, how do we compare it to others?

A vector can be thought of as a list of numbers, or a coordinate or direction in a space

Embedding models are trained so that texts with similar meaning get vectors that point in similar directions

Why compare directions and not distance between coordinates?

Because high-dimensional geometry is weird

As dimensions grow, points tend to become almost equidistant under Euclidean distance
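The idea that distances become less informative in high dimensions can be checked with a quick simulation. This is a rough sketch using random Gaussian points rather than real embeddings; the helper name `relative_spread` and the sample sizes are assumptions.

```python
import math
import random

def relative_spread(dim, n_points=200, seed=0):
    """How much further the farthest point is than the nearest one, relative to the nearest."""
    rng = random.Random(seed)
    points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 384):
    print(dim, round(relative_spread(dim), 2))
```

In 2 dimensions the farthest point is many times further away than the nearest one. In 384 dimensions the two distances are close to each other, so "nearest" carries much less information.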

So most of the time, we compare vectors by their angles, by calculating cosine similarity

Let's see what that looks like

384 dimensions is hard to visualize though...

So let's imagine this in 2 dimensions

[Figure: Vector A (0.15, 0.72) and Vector B (0.51, 0.43) plotted in 2D, both axes from −1.0 to 1.0. θ = 38.1°, similarity = 0.787, where cos(θ) = A·B / (|A|·|B|)]
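The numbers in the 2D picture can be reproduced by hand. Here is a small sketch in plain Python; the function name `cosine_similarity` is ours, and the two vectors are the ones from the figure.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = A·B / (|A|·|B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vector_a = (0.15, 0.72)
vector_b = (0.51, 0.43)

similarity = cosine_similarity(vector_a, vector_b)
angle = math.degrees(math.acos(similarity))
print(f"similarity = {similarity:.3f}, angle = {angle:.1f}°")
# prints: similarity = 0.787, angle = 38.1°
```

The same function works unchanged on 384-dimensional vectors, since it only relies on the dot product and the vector lengths.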

So to come back to "The sun is shining"...

Text                    Similarity
Bring an umbrella       0.320
The weather is great    0.418
It's daytime            0.622

Great, we can find similar things!

What can we do with that?

Well, it would be great to visualize this space

But we can't visualize 384 dimensions

To see what our embedding space looks like

We need to reduce the dimensions

Take our vectors and project them into 2D

While preserving the structure of the data

This is called dimensionality reduction

And there are algorithms for that

We'll use UMAP

(Uniform Manifold Approximation and Projection)

UMAP works in two steps

Step 1: Build a weighted neighborhood graph

Connect each point to its k nearest neighbors

Closer neighbors get a stronger connection
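Step 1 can be sketched in a few lines of plain Python. This is a simplified toy, not UMAP's actual implementation: it uses a fixed decay scale of 1 (UMAP tunes a scale per point) and skips the step where the graph is made symmetric. The function name `knn_graph` and the sample points are assumptions.

```python
import math

def knn_graph(points, k):
    """Return {i: [(j, weight), ...]} linking each point to its k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        rho = nearest[0][0]  # distance to the closest neighbor
        # Weight decays with distance beyond the closest neighbor (fixed scale of 1)
        graph[i] = [(j, math.exp(-(d - rho))) for d, j in nearest]
    return graph

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (3.0, 3.0), (3.1, 3.0)]
graph = knn_graph(points, k=2)

for node, edges in graph.items():
    print(node, [(j, round(w, 2)) for j, w in edges])
```

Each point's closest neighbor gets weight 1, and further neighbors get progressively weaker connections.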

Step 2: Arrange in 2D using forces

  • Attract: connected points
  • Repel: all points

Stronger connections pull harder

The system finds equilibrium

Two parameters to know:

n_neighbors: how many neighbors to consider?

  • Low → local structure
  • High → global structure

min_dist: how tightly should points pack together?

Now we can plot our documents

And actually see what's going on

When we plot them, we can see what groups together

Similar documents land in the same neighborhood

It would be cool if we could find these groupings automatically

Without telling the algorithm how many to expect

This is called clustering

There are multiple ways to do it

We'll take a look at HDBSCAN

(Hierarchical Density-Based Spatial Clustering of Applications with Noise)

The core idea: density

Where points are tightly packed → clusters

Sparse areas → noise

But what counts as "dense enough"?

Different clusters can have different densities

No single threshold works

  • Strict threshold: sparse cluster missed
  • Loose threshold: dense cluster swallowed
  • HDBSCAN: both found
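The single-threshold problem can be shown with a tiny 1D toy. This is not HDBSCAN, just a hypothetical helper (`threshold_clusters`) that links neighboring points whose gap is at most `eps` and treats groups smaller than `min_size` as noise; the data is made up for the illustration.

```python
def threshold_clusters(points, eps, min_size=3):
    """Group sorted 1D points whose neighboring gap is <= eps; drop small groups as noise."""
    points = sorted(points)
    groups, current = [], [points[0]]
    for prev, nxt in zip(points, points[1:]):
        if nxt - prev <= eps:
            current.append(nxt)
        else:
            groups.append(current)
            current = [nxt]
    groups.append(current)
    return [g for g in groups if len(g) >= min_size]

dense_a = [0, 1, 2, 3, 4]              # tightly packed
dense_b = [9, 10, 11, 12, 13]          # tightly packed, close to dense_a
sparse_c = [100, 110, 120, 130, 140]   # spread out, far from the others
data = dense_a + dense_b + sparse_c

strict = threshold_clusters(data, eps=2)
loose = threshold_clusters(data, eps=10)

print("strict:", strict)  # finds the two dense clusters, misses the sparse one
print("loose:", loose)    # finds the sparse cluster, merges the two dense ones
```

Neither setting recovers all three groups, which is the motivation for letting each cluster be found at its own density.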

HDBSCAN finds each cluster at its own natural density

Sweep the threshold and watch what's stable

  • Strict: 3 clusters emerge
  • Moderate: same 3 clusters
  • Loose: everything merges

The clusters that survive across many thresholds are the ones we treat as real

[Figure legend: clusters A, B, C, and noise]

Two parameters to know:

min_cluster_size: how big must a group be?

min_samples: how surrounded must a point be?

What can we do with clusters?

  • Topic discovery: what are people talking about?
  • Anomaly detection: what doesn't fit any group?
  • Organization: automatically sort documents

That's it!

You made it through the theory

Notebook time! 🎉

embeddings_workshop.ipynb

Tools to play with

  1. sbert.net
    Sentence Transformers — the Python library for embeddings
  2. huggingface.co/spaces/mteb/leaderboard
    MTEB Leaderboard — compare embedding models
  3. projector.tensorflow.org
    Embedding Projector — interactive 3D visualization
  4. umap-learn.readthedocs.io
    UMAP — dimensionality reduction for visualization

Parsing documents

Language model ❤️ markdown

  1. github.com/docling-project/docling
    IBM Research — tables, figures, complex layouts
  2. github.com/microsoft/markitdown
    Microsoft — Office, PDF, audio, images → markdown
  3. pymupdf4llm.readthedocs.io
    Fast PDF → markdown, no GPU needed