Visualizing Twitter Messages with Emojis

by: Eric S. Téllez

This example creates a visualization of Glove word embeddings using Embeddings.jl package to fetch them.

Note: This example needs a lot of computing power; therefore you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.

using SimilaritySearch, SimSearchManifoldLearning, TextSearch, CodecZlib, JSON, DataFrames, Plots, StatsBase, LinearAlgebra, Markdown, Embeddings, Random
using Downloads: download

emb, vocab = let emb = load_embeddings(GloVe{:en}, 2)  # you can change with any of the available embeddings in `Embeddings`
    for c in eachcol(emb.embeddings)
1        normalize!(c)
    end

2    Float16.(emb.embeddings), emb.vocab
end

3dist = NormalizedCosine_asf32()
4vocab2id = Dict(w => i for (i, w) in enumerate(vocab))

1: Normalizes all vectors to have a unitary norm; this allow us to use the dot product as similarity (see point 3)
2: The speed can be improved through memory’s bandwidth using less memory per vector; using Float16 as memory representation is a good idea even if your computer doesn’t support 16-bit floating point arithmetic natively.
3: Since we have unitary norm vectors we can simplify the cosine distance (i.e., \(1 - dot(\cdot, \cdot)\)); note that we are using Float16 and the suffix _asf32 will select a distance function that converts numbers to Float32 just before performing arithmetic operations.
4: Inverse map from words to identifiers in vocab.

Now we can create the index

1index = SearchGraph(; dist, db=MatrixDatabase(emb))
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
2index!(index, ctx)
3optimize_index!(index, ctx, MinRecall(0.9))

1: Defines the index and the search context (caches and hyperparameters); particularly, we use a very high quality build MinRecall(0.99); high quality constructions yield to faster queries due to the underlying graph structure.
2: Actual indexing procedure using the given search context.
3: Optimizing the index to trade quality and speed.

Searching

Our index can solve queries over the entire dataset, for instance, solving synonym queries as nearest neighbor queries.

function search_and_render(index, ctx, vocab, q, res, k, qword)
    res = reuse!(res, k)
    @time search(index, ctx, q, res)

    L = [
        """## result list for _$(qword)_ """,
        """| nn | word | wordID | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(res)
        push!(L, """| $j | $(vocab[p.id]) | $(p.id) | $(round(p.weight, digits=3)) |""")     
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples (random)")

res = KnnResult(12)
for word in ["hat", "bat", "car", "dinosaur"]
    #for qid in rand(1:length(vocab))
    qid = vocab2id[word]
    search_and_render(index, ctx, vocab, index[qid], res, maxlength(res), vocab[qid]) |> display
    #end
end

  0.000249 seconds (2 allocations: 32 bytes)
  0.000123 seconds (2 allocations: 32 bytes)
  0.000093 seconds (2 allocations: 32 bytes)
  0.000177 seconds (2 allocations: 32 bytes)

Search examples (random)

result list for hat

nn	word	wordID	dist
1	hat	5626	0.0
2	hats	11439	0.288
3	shirt	5099	0.309
4	wears	9318	0.347
5	outfit	9363	0.351
6	trick	6922	0.357
7	boots	8847	0.36
8	wore	5052	0.363
9	jacket	8401	0.366
10	wearing	2759	0.368
11	scarf	19468	0.379
12	coat	6629	0.389

result list for bat

nn	word	wordID	dist
1	bat	4926	-0.0
2	bats	8090	0.292
3	balls	4438	0.319
4	pitch	3099	0.339
5	batting	5278	0.358
6	wicket	5874	0.36
7	ball	1084	0.361
8	toss	8220	0.378
9	innings	2207	0.392
10	pitches	7935	0.4
11	batsman	8275	0.406
12	catch	3162	0.421

result list for car

nn	word	wordID	dist
1	car	570	-0.0
2	vehicle	1908	0.137
3	truck	2576	0.14
4	cars	1278	0.163
5	driver	1926	0.181
6	driving	2032	0.219
7	motorcycle	7214	0.245
8	vehicles	1635	0.254
9	parked	8838	0.254
10	bus	1709	0.263
11	taxi	7019	0.284
12	suv	14016	0.286

result list for dinosaur

nn	word	wordID	dist
1	dinosaur	13309	0.0
2	dinosaurs	14728	0.243
3	sauropod	77302	0.265
4	fossils	13023	0.267
5	theropod	66200	0.273
6	fossilized	40434	0.286
7	mammal	21720	0.326
8	fossil	9045	0.334
9	reptile	30126	0.344
10	skeletons	22379	0.36
11	hominid	55275	0.37
12	footprints	25905	0.376

Interestingle, we can use this kind of models for analogy resolution, i.e., \(a\) is a \(b\) like \(c\) is a \(d\), or more wordly, father is a man like mother is a woman. The concepts learned by the embeddings are linear, and then we can state the analogy resolution, that is, solve father is a main like ? is a woman. It can be obtained with a simple vector arithmetic operations: \[a - b + d \rightarrow c\]

This can be interpreted as having a concept \(a\) remove the concept \(b\) and adds the concept \(c\); the resulting vector

function analogy(a, b, d, k)
1    c = index[vocab2id[a]] - index[vocab2id[b]] + index[vocab2id[d]]
2    normalize!(c)
3    search_and_render(index, ctx, vocab, c, res, k, "_$(a)_ - _$(b)_ + _$(d)_") |> display
end

4analogy("father", "man", "woman", 10)
analogy("fireman", "man", "woman", 10)
analogy("policeman", "man", "woman", 10)
analogy("mississippi", "usa", "france", 10)

1: Vector operations to state the analogy.
2: Normalize the vector.
3: Search the index to find similar queries to \(c\) vector.
4: Different analogies to solve; using 10nn.

  0.000179 seconds (2 allocations: 32 bytes)
  0.000201 seconds (2 allocations: 32 bytes)
  0.000106 seconds (2 allocations: 32 bytes)
  0.000235 seconds (2 allocations: 32 bytes)

result list for **father_ - man + _woman**

nn	word	wordID	dist
1	mother	809	0.098
2	daughter	1132	0.132
3	wife	703	0.146
4	father	630	0.148
5	husband	1328	0.172
6	grandmother	7401	0.189
7	sister	2004	0.213
8	married	1168	0.214
9	niece	14268	0.234
10	son	631	0.239

result list for **fireman_ - man + _woman**

nn	word	wordID	dist
1	fireman	27345	0.157
2	firefighter	15812	0.303
3	paramedic	33841	0.394
4	rescuer	44915	0.439
5	passerby	53776	0.459
6	janitor	32488	0.476
7	lifeguard	38623	0.476
8	welder	49430	0.487
9	steelworker	91104	0.491
10	schoolteacher	22298	0.505

result list for **policeman_ - man + _woman**

nn	word	wordID	dist
1	policeman	6857	0.144
2	wounding	6118	0.285
3	policemen	4984	0.295
4	passerby	53776	0.306
5	wounded	1392	0.331
6	injuring	6494	0.341
7	soldier	2482	0.353
8	bystander	29838	0.363
9	fatally	12321	0.368
10	stabbed	9973	0.375

result list for **mississippi_ - usa + _france**

nn	word	wordID	dist
1	france	388	0.468
2	rhine	13957	0.488
3	coast	955	0.506
4	brittany	15877	0.506
5	southern	483	0.506
6	northern	530	0.514
7	normandy	13625	0.518
8	canal	4370	0.519
9	river	621	0.52
10	gulf	1666	0.529

UMAP Visualization

function normcolors(V)
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)
    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
end

plot!()

e2, e3 = let min_dist=0.5f0,
             k=12,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=RandomLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, index; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end

Final notes

This example shows how to index and search dense vector databases, in particular GloVe word embeddings using the cosine distance. Low dimensional projections are made with SimSearchManifoldLearning, note that SimilaritySearch is also used for computing the all \(k\) nearest neighbors needed by the UMAP model. Note that this notebook should be ran with several threads to reduce time costs.

Environment and dependencies

Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  JULIA_PROJECT = .
  JULIA_NUM_THREADS = auto
  JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.7.0
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
  [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.6
  [eb30cadb] MLDatasets v0.7.18
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.40.11
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
  [053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.7
  [7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`