Visualizing GloVe Word Embeddings

by: Eric S. Téllez

This example builds a visualization of GloVe word embeddings, using the Embeddings.jl package to fetch them.

Note: this example needs significant computing power, so you may want to set the environment variable `JULIA_NUM_THREADS=auto` before starting Julia.

using SimilaritySearch, SimSearchManifoldLearning, TextSearch, CodecZlib, JSON, DataFrames, PlotlyLight, StatsBase, LinearAlgebra, Markdown, Embeddings, Random
using Downloads: download
emb, vocab = let E = load_embeddings(GloVe{:en}, 2)  # you can change this to any of the embeddings available in `Embeddings`
    emb, vocab = E.embeddings[:, 1:50_000], E.vocab[1:50_000]
    for c in eachcol(emb)
        normalize!(c)  # (1)
    end

    Float16.(emb), vocab  # (2)
end

dist = Dist.CastF32.NormCosine()  # (3)
vocab2id = Dict(w => i for (i, w) in enumerate(vocab))  # (4)
1. Normalizes all vectors to unit norm; this allows us to use the dot product as a similarity (see point 3).
2. Speed can be improved by reducing the pressure on memory bandwidth: using Float16 as the storage representation is a good idea even if your CPU doesn't support 16-bit floating-point arithmetic natively.
3. Since the vectors have unit norm, we can simplify the cosine distance (i.e., \(1 - dot(\cdot, \cdot)\)); note that we store Float16 and use the `Dist.CastF32` module, which selects a distance function that casts numbers to Float32 just before performing arithmetic operations.
4. Inverse map from words to their identifiers in `vocab`.
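The simplification in point 3 is easy to check numerically. A small sketch, assuming nothing beyond the standard library (the tolerances are illustrative, not guarantees):

```julia
using LinearAlgebra

# Two random vectors, normalized to unit length (as done for `emb` above)
u = normalize(randn(Float32, 50))
v = normalize(randn(Float32, 50))

# For unit vectors, the cosine distance 1 - u⋅v/(‖u‖‖v‖) reduces to 1 - u⋅v
full = 1 - dot(u, v) / (norm(u) * norm(v))
fast = 1 - dot(u, v)
@assert isapprox(full, fast; atol=1e-5)

# Storing in Float16 and casting back to Float32 before the arithmetic keeps
# the error small; this is the trick the Float32-casting distance exploits
u16, v16 = Float16.(u), Float16.(v)
fast16 = 1 - dot(Float32.(u16), Float32.(v16))
@assert abs(fast - fast16) < 1e-2
```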

Now we can create the index

index = SearchGraph(; dist, db=MatrixDatabase(emb))  # (1)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
index!(index, ctx)  # (2)
optimize_index!(index, ctx, MinRecall(0.9))  # (3)
1. Defines the index and the search context (caches and hyperparameters); in particular, we request a very high-quality build with MinRecall(0.99). High-quality constructions yield faster queries due to the underlying graph structure.
2. The actual indexing procedure, using the given search context.
3. Optimizes the index to trade quality for speed.
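The SearchGraph returns approximate neighbors quickly. For intuition, the exact computation it approximates can be sketched in plain Julia as a brute-force scan (toy data; this is not the SimilaritySearch API):

```julia
using LinearAlgebra

# Toy database of normalized column vectors, mimicking `emb` above
X = reduce(hcat, [normalize(randn(Float32, 8)) for _ in 1:100])

# Exact k-nearest neighbors under the normalized cosine distance 1 - q⋅x
function bruteforce_knn(X, q, k)
    d = [1 - dot(q, c) for c in eachcol(X)]  # distance to every column
    p = sortperm(d)[1:k]                     # indices of the k smallest
    p, d[p]
end

ids, dists = bruteforce_knn(X, X[:, 7], 3)
@assert ids[1] == 7          # a point is its own nearest neighbor
@assert abs(dists[1]) < 1e-4 # with distance (numerically) zero
@assert issorted(dists)
```

A SearchGraph avoids this full scan by navigating a neighborhood graph, which is why build quality affects query speed.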

Searching

Our index can solve queries over the entire dataset, for instance, solving synonym queries as nearest neighbor queries.

function search_and_render(index, ctx, vocab, q, res, k, qword)
    res = reuse!(res, k)
    @time search(index, ctx, q, res)

    L = [
        """## result list for _$(qword)_ """,
        """| nn | word | wordID | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(viewitems(res))
        push!(L, """| $j | $(vocab[p.id]) | $(p.id) | $(round(p.dist, digits=3)) |""")     
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples")

res = knnqueue(ctx, 12)
for word in ["hat", "bat", "car", "dinosaur"]
    #for qid in rand(1:length(vocab))
    qid = vocab2id[word]
    search_and_render(index, ctx, vocab, index[qid], res, maxlength(res), vocab[qid]) |> display
    #end
end
  0.000055 seconds (3 allocations: 64 bytes)
  0.000040 seconds (3 allocations: 64 bytes)
  0.000054 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)

Search examples

result list for hat

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | hat | 5626 | 0.0 |
| 2 | hats | 11439 | 0.288 |
| 3 | shirt | 5099 | 0.309 |
| 4 | wears | 9318 | 0.347 |
| 5 | outfit | 9363 | 0.351 |
| 6 | trick | 6922 | 0.357 |
| 7 | boots | 8847 | 0.36 |
| 8 | wore | 5052 | 0.363 |
| 9 | jacket | 8401 | 0.366 |
| 10 | wearing | 2759 | 0.368 |
| 11 | scarf | 19468 | 0.379 |
| 12 | coat | 6629 | 0.389 |

result list for bat

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | bat | 4926 | -0.0 |
| 2 | bats | 8090 | 0.292 |
| 3 | balls | 4438 | 0.319 |
| 4 | pitch | 3099 | 0.339 |
| 5 | batting | 5278 | 0.358 |
| 6 | wicket | 5874 | 0.36 |
| 7 | ball | 1084 | 0.361 |
| 8 | toss | 8220 | 0.378 |
| 9 | innings | 2207 | 0.392 |
| 10 | pitches | 7935 | 0.4 |
| 11 | catch | 3162 | 0.421 |
| 12 | swinging | 12682 | 0.426 |

result list for car

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | car | 570 | -0.0 |
| 2 | vehicle | 1908 | 0.137 |
| 3 | truck | 2576 | 0.14 |
| 4 | cars | 1278 | 0.163 |
| 5 | driver | 1926 | 0.181 |
| 6 | driving | 2032 | 0.219 |
| 7 | motorcycle | 7214 | 0.245 |
| 8 | vehicles | 1635 | 0.254 |
| 9 | parked | 8838 | 0.254 |
| 10 | bus | 1709 | 0.263 |
| 11 | suv | 14016 | 0.286 |
| 12 | pickup | 6948 | 0.304 |

result list for dinosaur

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | dinosaur | 13309 | 0.0 |
| 2 | dinosaurs | 14728 | 0.243 |
| 3 | fossils | 13023 | 0.267 |
| 4 | fossilized | 40434 | 0.286 |
| 5 | mammal | 21720 | 0.326 |
| 6 | fossil | 9045 | 0.334 |
| 7 | reptile | 30126 | 0.344 |
| 8 | skeletons | 22379 | 0.36 |
| 9 | prehistoric | 15412 | 0.376 |
| 10 | bones | 6392 | 0.377 |
| 11 | jurassic | 19639 | 0.395 |
| 12 | paleontologists | 41659 | 0.415 |

Interestingly, we can use this kind of model for analogy resolution, i.e., \(a\) is to \(b\) as \(c\) is to \(d\); in plain words, father is to man as mother is to woman. The concepts learned by the embeddings are roughly linear, so we can state analogy resolution as solving father is to man as ? is to woman. The answer can be obtained with simple vector arithmetic: \[a - b + d \rightarrow c\]

This can be interpreted as taking the concept \(a\), removing the concept \(b\), and adding the concept \(d\); the resulting vector lies close to the concept \(c\).
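On toy vectors the arithmetic is easy to verify. A sketch with hand-crafted 2-D vectors (the axes are illustrative only, not real GloVe entries):

```julia
using LinearAlgebra

# Hypothetical embeddings on two made-up axes: (gender, parenthood)
vecs = Dict(
    "man"    => Float32[ 1, 0],
    "woman"  => Float32[-1, 0],
    "father" => Float32[ 1, 1],
    "mother" => Float32[-1, 1],
)

# father - man + woman lands exactly on mother in this toy setup
c = vecs["father"] - vecs["man"] + vecs["woman"]
@assert c == vecs["mother"]

# After normalization (as `analogy` below does), the nearest word
# under the cosine distance is "mother"
normalize!(c)
nearest = argmin(Dict(w => 1 - dot(c, normalize(v)) for (w, v) in vecs))
@assert nearest == "mother"
```

Real embeddings are noisier, so the query word itself often appears among the nearest neighbors, as the results below show.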

function analogy(a, b, d, k)
    c = index[vocab2id[a]] - index[vocab2id[b]] + index[vocab2id[d]]  # (1)
    normalize!(c)  # (2)
    search_and_render(index, ctx, vocab, c, res, k, "_$(a)_ - _$(b)_ + _$(d)_") |> display  # (3)
end

analogy("father", "man", "woman", 10)  # (4)
analogy("fireman", "man", "woman", 10)
analogy("policeman", "man", "woman", 10)
analogy("mississippi", "usa", "france", 10)
analogy("mississippi", "usa", "france", 10)
1. Vector operations that state the analogy.
2. Normalizes the resulting vector.
3. Searches the index for words similar to the \(c\) vector.
4. Different analogies to solve, each using the 10 nearest neighbors.
  0.000058 seconds (3 allocations: 64 bytes)
  0.000040 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)
  0.000048 seconds (3 allocations: 64 bytes)

result list for father - man + woman

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | mother | 809 | 0.098 |
| 2 | daughter | 1132 | 0.132 |
| 3 | wife | 703 | 0.146 |
| 4 | father | 630 | 0.148 |
| 5 | husband | 1328 | 0.172 |
| 6 | grandmother | 7401 | 0.189 |
| 7 | sister | 2004 | 0.213 |
| 8 | married | 1168 | 0.214 |
| 9 | niece | 14268 | 0.234 |
| 10 | woman | 788 | 0.234 |

result list for fireman - man + woman

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | fireman | 27345 | 0.157 |
| 2 | firefighter | 15812 | 0.303 |
| 3 | paramedic | 33841 | 0.394 |
| 4 | rescuer | 44915 | 0.439 |
| 5 | janitor | 32488 | 0.476 |
| 6 | lifeguard | 38623 | 0.476 |
| 7 | welder | 49430 | 0.487 |
| 8 | schoolteacher | 22298 | 0.505 |
| 9 | pensioner | 41032 | 0.529 |
| 10 | volunteer | 5359 | 0.53 |

result list for policeman - man + woman

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | policeman | 6857 | 0.144 |
| 2 | wounding | 6118 | 0.285 |
| 3 | policemen | 4984 | 0.295 |
| 4 | wounded | 1392 | 0.331 |
| 5 | injuring | 6494 | 0.341 |
| 6 | soldier | 2482 | 0.353 |
| 7 | bystander | 29838 | 0.363 |
| 8 | fatally | 12321 | 0.368 |
| 9 | stabbed | 9973 | 0.375 |
| 10 | protester | 18152 | 0.379 |

result list for mississippi - usa + france

| nn | word | wordID | dist |
|----|------|--------|------|
| 1 | mississippi | 4050 | 0.381 |
| 2 | louisiana | 3844 | 0.43 |
| 3 | rhine | 13957 | 0.488 |
| 4 | coast | 955 | 0.506 |
| 5 | brittany | 15877 | 0.506 |
| 6 | southern | 483 | 0.506 |
| 7 | northern | 530 | 0.514 |
| 8 | normandy | 13625 | 0.518 |
| 9 | canal | 4370 | 0.519 |
| 10 | river | 621 | 0.52 |

UMAP Visualization

Computing UMAP projections

e2, e3 = let min_dist=0.3f0,
             k=15,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, index; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end    

Computing the visualization

function normcolors!(V)
    min_, max_ = extrema(V)
    s = 255.0 / (max_ - min_)
    V .= (V .- min_) .* s
    V .= round.(V; digits=0)
    V .= clamp.(V, 0, 255)
end

normcolors!(@view e3[1, :])
normcolors!(@view e3[2, :])
normcolors!(@view e3[3, :])

colors = [
    "rgba($(Int(c[1])), $(Int(c[2])), $(Int(c[3])), 0.3)" 
    for c in eachcol(e3)
]
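As a standalone sanity check of the scaling and color formatting above (the function is restated here so the snippet runs on its own; the input vector is a toy example):

```julia
# Same min-max scaling as `normcolors!` above: map values into the 0-255 range
function normcolors!(V)
    min_, max_ = extrema(V)
    s = 255.0 / (max_ - min_)
    V .= (V .- min_) .* s
    V .= round.(V; digits=0)
    V .= clamp.(V, 0, 255)
end

V = [-2.0, 0.0, 3.0]               # toy values
normcolors!(V)
@assert V == [0.0, 102.0, 255.0]   # extremes hit the endpoints; (0+2)*51 = 102

# ... and the corresponding semi-transparent plotly color string
c = "rgba($(Int(V[1])), $(Int(V[2])), $(Int(V[3])), 0.3)"
@assert c == "rgba(0, 102, 255, 0.3)"
```

Mapping the three UMAP coordinates to RGB channels this way makes nearby points in the 3-D embedding share similar colors in the 2-D plot.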

text = [replace(w, r"\W" => "") for w in vocab] # avoids issues on plotly hovers


data = [Config(;
    x = view(e2, 1, :),
    y = view(e2, 2, :),
    mode = "markers",
    marker = (
        color = colors, 
        size = 4, 
        line = (width = 0,)
    ),
    text,
    type = "scatter"
)]

layout = Config(
    width = 600,
    height = 600,
    xaxis = (visible = false, showgrid = false, zeroline = false),
    yaxis = (visible = false, showgrid = false, zeroline = false),
    hovermode = "closest",
    plot_bgcolor = "white"
)

Plot(data, layout)

Final notes

This example shows how to index and search dense vector databases, in particular GloVe word embeddings under the cosine distance. Low-dimensional projections are computed with SimSearchManifoldLearning; note that SimilaritySearch is also used to compute the all-\(k\) nearest neighbors needed by the UMAP model. This notebook should be run with several threads to reduce running time.

Environment and dependencies

Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib
Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [5ae59095] Colors v0.13.1
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [916415d5] Images v0.26.2
  [b20bd276] InvertedFiles v0.9.2
 [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
 [ca7969ec] PlotlyLight v0.11.0
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
 [2913bbd2] StatsBase v0.33.21
  [7f6f6c8a] TextSearch v0.20.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`