Visualizing GloVe Word Embeddings
by: Eric S. Téllez
This example creates a visualization of GloVe word embeddings, using the Embeddings.jl package to fetch them.
Note: this example needs a lot of computing power, so you may want to set the environment variable `JULIA_NUM_THREADS=auto` before running Julia.
```julia
using SimilaritySearch, SimSearchManifoldLearning, TextSearch, CodecZlib, JSON, DataFrames, PlotlyLight, StatsBase, LinearAlgebra, Markdown, Embeddings, Random
using Downloads: download
```
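You can verify that the setting took effect from the running session; a quick sanity check (not part of the original example):

```julia
using Base.Threads

# the index construction and UMAP fitting below parallelize across threads,
# so this should report more than 1
@show nthreads()
```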
```julia
emb, vocab = let E = load_embeddings(GloVe{:en}, 2) # you can change it for any of the available embeddings in `Embeddings`
    emb, vocab = E.embeddings[:, 1:50_000], E.vocab[1:50_000]
    for c in eachcol(emb)
        normalize!(c)  # (1)
    end
    Float16.(emb), vocab  # (2)
end

dist = Dist.CastF32.NormCosine()  # (3)
vocab2id = Dict(w => i for (i, w) in enumerate(vocab))  # (4)
```
1. Normalizes all vectors to a unitary norm; this allows us to use the dot product as similarity (see point 3).
2. Speed can be improved through memory bandwidth by using less memory per vector; using `Float16` as the in-memory representation is a good idea even if your computer doesn't support 16-bit floating point arithmetic natively.
3. Since we have unit-norm vectors, we can simplify the cosine distance to \(1 - \operatorname{dot}(\cdot, \cdot)\); note that we are using `Float16` and the module `Dist.CastF32`, which selects a distance function that casts numbers to `Float32` just before performing arithmetic operations (a quick numeric check follows this list).
4. Inverse map from words to their identifiers in `vocab`.
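Point 3 can be checked numerically: for unit-norm vectors, the full cosine distance and the simplified \(1 - \operatorname{dot}\) form coincide. A minimal sketch (not part of the original example):

```julia
using LinearAlgebra

u = normalize(rand(Float32, 300))
v = normalize(rand(Float32, 300))

# full cosine distance vs. the simplified form for unit-norm vectors
full = 1 - dot(u, v) / (norm(u) * norm(v))
simplified = 1 - dot(u, v)
@assert full ≈ simplified
```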
Now we can create the index.
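The code block for this step was not preserved in this copy; the following is a minimal sketch of the construction, assuming the `SearchGraph`/`SearchGraphContext` API of SimilaritySearch.jl (the exact names and keyword arguments are my assumption; check the package documentation). The numbered comments match the notes below:

```julia
db = MatrixDatabase(emb)
# (1) search context: caches and hyperparameters; MinRecall(0.99) requests a very
#     high quality build (assumed API)
ctx = SearchGraphContext(hyperparameters_callback = OptimizeParameters(MinRecall(0.99)))
index = SearchGraph(; dist, db)
index!(index, ctx)                           # (2) actual indexing procedure
optimize_index!(index, ctx, MinRecall(0.9))  # (3) trade quality and speed (assumed call)
```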
1. Defines the index and the search context (caches and hyperparameters); in particular, we ask for a very high quality build with `MinRecall(0.99)`; high quality constructions yield faster queries due to the underlying graph structure.
2. The actual indexing procedure using the given search context.
3. Optimizes the index to trade quality for speed.
Searching
Our index can solve queries over the entire dataset, for instance, solving synonym queries as nearest neighbor queries.
```julia
function search_and_render(index, ctx, vocab, q, res, k, qword)
    res = reuse!(res, k)  # recycle the result set for k neighbors
    @time search(index, ctx, q, res)
    L = [
        """## result list for _$(qword)_ """,
        """| nn | word | wordID | dist |""",
        """|----|------|--------|------|"""
    ]

    for (j, p) in enumerate(viewitems(res))
        push!(L, """| $j | $(vocab[p.id]) | $(p.id) | $(round(p.dist, digits=3)) |""")
    end

    Markdown.parse(join(L, "\n"))
end
```
```julia
display(md"## Search examples (random)")
res = knnqueue(ctx, 12)

for word in ["hat", "bat", "car", "dinosaur"]
    #for qid in rand(1:length(vocab))
    qid = vocab2id[word]
    search_and_render(index, ctx, vocab, index[qid], res, maxlength(res), vocab[qid]) |> display
    #end
end
```

```
  0.000055 seconds (3 allocations: 64 bytes)
  0.000040 seconds (3 allocations: 64 bytes)
  0.000054 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)
```
Search examples (random)
result list for hat
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | hat | 5626 | 0.0 |
| 2 | hats | 11439 | 0.288 |
| 3 | shirt | 5099 | 0.309 |
| 4 | wears | 9318 | 0.347 |
| 5 | outfit | 9363 | 0.351 |
| 6 | trick | 6922 | 0.357 |
| 7 | boots | 8847 | 0.36 |
| 8 | wore | 5052 | 0.363 |
| 9 | jacket | 8401 | 0.366 |
| 10 | wearing | 2759 | 0.368 |
| 11 | scarf | 19468 | 0.379 |
| 12 | coat | 6629 | 0.389 |
result list for bat
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | bat | 4926 | -0.0 |
| 2 | bats | 8090 | 0.292 |
| 3 | balls | 4438 | 0.319 |
| 4 | pitch | 3099 | 0.339 |
| 5 | batting | 5278 | 0.358 |
| 6 | wicket | 5874 | 0.36 |
| 7 | ball | 1084 | 0.361 |
| 8 | toss | 8220 | 0.378 |
| 9 | innings | 2207 | 0.392 |
| 10 | pitches | 7935 | 0.4 |
| 11 | catch | 3162 | 0.421 |
| 12 | swinging | 12682 | 0.426 |
result list for car
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | car | 570 | -0.0 |
| 2 | vehicle | 1908 | 0.137 |
| 3 | truck | 2576 | 0.14 |
| 4 | cars | 1278 | 0.163 |
| 5 | driver | 1926 | 0.181 |
| 6 | driving | 2032 | 0.219 |
| 7 | motorcycle | 7214 | 0.245 |
| 8 | vehicles | 1635 | 0.254 |
| 9 | parked | 8838 | 0.254 |
| 10 | bus | 1709 | 0.263 |
| 11 | suv | 14016 | 0.286 |
| 12 | pickup | 6948 | 0.304 |
result list for dinosaur
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | dinosaur | 13309 | 0.0 |
| 2 | dinosaurs | 14728 | 0.243 |
| 3 | fossils | 13023 | 0.267 |
| 4 | fossilized | 40434 | 0.286 |
| 5 | mammal | 21720 | 0.326 |
| 6 | fossil | 9045 | 0.334 |
| 7 | reptile | 30126 | 0.344 |
| 8 | skeletons | 22379 | 0.36 |
| 9 | prehistoric | 15412 | 0.376 |
| 10 | bones | 6392 | 0.377 |
| 11 | jurassic | 19639 | 0.395 |
| 12 | paleontologists | 41659 | 0.415 |
Interestingly, we can use this kind of model for analogy resolution, i.e., \(a\) is to \(b\) as \(c\) is to \(d\); in plain words, father is to man as mother is to woman. The concepts learned by the embeddings are approximately linear, so we can pose analogy resolution as solving "father is to man as ? is to woman". The answer is obtained with a simple vector arithmetic operation: \[a - b + d \rightarrow c\]
This can be interpreted as taking the concept \(a\), removing the concept \(b\), and adding the concept \(d\); the resulting vector lies close to \(c\), which we recover with a nearest neighbor search.
```julia
function analogy(a, b, d, k)
    c = index[vocab2id[a]] - index[vocab2id[b]] + index[vocab2id[d]]  # (1)
    normalize!(c)                                                     # (2)
    search_and_render(index, ctx, vocab, c, res, k, "_$(a)_ - _$(b)_ + _$(d)_") |> display  # (3)
end

analogy("father", "man", "woman", 10)  # (4)
analogy("fireman", "man", "woman", 10)
analogy("policeman", "man", "woman", 10)
analogy("mississippi", "usa", "france", 10)
```
1. Vector operations that state the analogy.
2. Normalizes the resulting vector.
3. Searches the index for the nearest neighbors of the \(c\) vector.
4. Different analogies to solve, using the 10 nearest neighbors.
```
  0.000058 seconds (3 allocations: 64 bytes)
  0.000040 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)
```
result list for father - man + woman
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | mother | 809 | 0.098 |
| 2 | daughter | 1132 | 0.132 |
| 3 | wife | 703 | 0.146 |
| 4 | father | 630 | 0.148 |
| 5 | husband | 1328 | 0.172 |
| 6 | grandmother | 7401 | 0.189 |
| 7 | sister | 2004 | 0.213 |
| 8 | married | 1168 | 0.214 |
| 9 | niece | 14268 | 0.234 |
| 10 | woman | 788 | 0.234 |
result list for fireman - man + woman
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | fireman | 27345 | 0.157 |
| 2 | firefighter | 15812 | 0.303 |
| 3 | paramedic | 33841 | 0.394 |
| 4 | rescuer | 44915 | 0.439 |
| 5 | janitor | 32488 | 0.476 |
| 6 | lifeguard | 38623 | 0.476 |
| 7 | welder | 49430 | 0.487 |
| 8 | schoolteacher | 22298 | 0.505 |
| 9 | pensioner | 41032 | 0.529 |
| 10 | volunteer | 5359 | 0.53 |
result list for policeman - man + woman
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | policeman | 6857 | 0.144 |
| 2 | wounding | 6118 | 0.285 |
| 3 | policemen | 4984 | 0.295 |
| 4 | wounded | 1392 | 0.331 |
| 5 | injuring | 6494 | 0.341 |
| 6 | soldier | 2482 | 0.353 |
| 7 | bystander | 29838 | 0.363 |
| 8 | fatally | 12321 | 0.368 |
| 9 | stabbed | 9973 | 0.375 |
| 10 | protester | 18152 | 0.379 |
result list for mississippi - usa + france
| nn | word | wordID | dist |
|---|---|---|---|
| 1 | mississippi | 4050 | 0.381 |
| 2 | louisiana | 3844 | 0.43 |
| 3 | rhine | 13957 | 0.488 |
| 4 | coast | 955 | 0.506 |
| 5 | brittany | 15877 | 0.506 |
| 6 | southern | 483 | 0.506 |
| 7 | northern | 530 | 0.514 |
| 8 | normandy | 13625 | 0.518 |
| 9 | canal | 4370 | 0.519 |
| 10 | river | 621 | 0.52 |
UMAP Visualization
Computing UMAP projections
```julia
e2, e3 = let min_dist = 0.3f0,
        k = 15,
        n_epochs = 75,
        neg_sample_rate = 3,
        tol = 1e-3,
        layout = SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, index; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end
```
Computing visualization
```julia
# maps each row of V onto the 0-255 range, to be used as an RGB channel
function normcolors!(V)
    min_, max_ = extrema(V)
    s = 255.0 / (max_ - min_)
    V .= (V .- min_) .* s
    V .= round.(V; digits=0)
    V .= clamp.(V, 0, 255)
end

# use the 3D projection as the color of each point in the 2D plot
normcolors!(@view e3[1, :])
normcolors!(@view e3[2, :])
normcolors!(@view e3[3, :])

colors = [
    "rgba($(Int(c[1])), $(Int(c[2])), $(Int(c[3])), 0.3)"
    for c in eachcol(e3)
]

text = [replace(w, r"\W" => "") for w in vocab] # avoids issues on plotly hovers

data = [Config(;
    x = view(e2, 1, :),
    y = view(e2, 2, :),
    mode = "markers",
    marker = (
        color = colors,
        size = 4,
        line = (width = 0,)
    ),
    text,
    type = "scatter"
)]

layout = Config(
    width = 600,
    height = 600,
    xaxis = (visible = false, showgrid = false, zeroline = false),
    yaxis = (visible = false, showgrid = false, zeroline = false),
    hovermode = "closest",
    plot_bgcolor = "white"
)

Plot(data, layout)
```
Final notes
This example shows how to index and search dense vector databases, in particular GloVe word embeddings under the cosine distance. The low dimensional projections are computed with SimSearchManifoldLearning; note that SimilaritySearch is also used to compute the all-\(k\) nearest neighbors needed by the UMAP model. This notebook should be run with several threads to reduce running times.
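For reference, the all-\(k\) nearest neighbor computation that feeds UMAP can also be run directly through `allknn`; a minimal sketch, assuming the context-taking signature of recent SimilaritySearch versions:

```julia
# computes the k nearest neighbors of every indexed element (multithreaded);
# knns and dists have one column per database element
knns, dists = allknn(index, ctx, 15)
```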
Environment and dependencies
```
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib

Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [5ae59095] Colors v0.13.1
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [916415d5] Images v0.26.2
  [b20bd276] InvertedFiles v0.9.2
⌅ [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
⌅ [2913bbd2] StatsBase v0.33.21
  [7f6f6c8a] TextSearch v0.20.0
Info: Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be
upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading.
To see why, use `status --outdated`.
```