Prime numbers

by: Eric S. Téllez

This demonstration is about prime numbers and the similarity of integers based on their prime factors. It is a well-known demonstration of SimSearchManifoldLearning.jl. This notebook does not require downloading any dataset.

Note: This example needs a lot of computing power; therefore, you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
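For instance, from a shell (the --project=. flag matches the JULIA_PROJECT setting shown in the environment section at the end of this notebook):

JULIA_NUM_THREADS=auto julia --project=.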

using SimilaritySearch, SimSearchManifoldLearning, Primes, Plots, StatsBase, LinearAlgebra, Markdown, Random

Now we can define our dataset. The sets of prime factors are computed with the Primes package. Note that we use a VectorDatabase to represent the dataset.

n = 100_000

# F[i] holds the sorted distinct prime factors of the integer i;
# 1 has no prime factors, so it is encoded as [1]
F = Vector{Vector{Int32}}(undef, n+1)

function encode_factors(num)
    # factor(Set, num) returns the set of distinct prime factors (Primes package)
    sort!(Int32[f for f in factor(Set, num)])
end

F[1] = Int32[1]

for i in 2:n+1
    F[i] = encode_factors(i)
end

db = VectorDatabase(F)
#dist = JaccardDistance()  # Other distances from SimilaritySearch
#dist = IntersectionDissimilarity()
#dist = CosineDistanceSet()
dist = DiceDistance()
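A quick sanity check of the representation and of the selected distance; evaluate is the distance-evaluation entry point of SimilaritySearch, and the expected values in the comments follow from the Dice dissimilarity, 1 - 2|A ∩ B| / (|A| + |B|):

encode_factors(12)                                      # Int32[2, 3]
encode_factors(97)                                      # Int32[97], 97 is prime
evaluate(dist, encode_factors(10), encode_factors(15))  # {2,5} vs {3,5} -> 0.5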

We use ordered Int32 arrays of distinct prime factors to represent each integer. The cell above selects the Dice distance for this representation; the commented lines show other set dissimilarities available in SimilaritySearch. While other representations may perform faster, this one is quite straightforward and also demonstrates how user-defined distance functions can be plugged in.
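For instance, here is a minimal sketch of a user-defined distance, assuming the SemiMetric/evaluate convention that SimilaritySearch follows; MyDiceDistance is a hypothetical name that mimics the built-in DiceDistance on sorted arrays:

import SimilaritySearch: evaluate, SemiMetric

struct MyDiceDistance <: SemiMetric end

function evaluate(::MyDiceDistance, u, v)
    # merge-style scan over two sorted arrays counting common elements
    c = 0
    i = j = 1
    while i <= length(u) && j <= length(v)
        if u[i] == v[j]
            c += 1; i += 1; j += 1
        elseif u[i] < v[j]
            i += 1
        else
            j += 1
        end
    end
    1.0 - 2c / (length(u) + length(v))  # Dice dissimilarity on sets
end

An instance, MyDiceDistance(), could then be passed as dist when building the index, exactly like the built-in distances above.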

Index construction

Note that the sets of prime factors become pretty large for some large n, and this implies challenges for metric indexes (since it is related to the intrinsic dimension of the dataset). We used a kernel that starts 64 threads; it indexes n = 100,000 items in a few seconds, but it can take much longer with a single thread and larger values of n. The index is used by the visualization algorithm (UMAP) to construct an all-k-nearest-neighbor graph, which can be a quite costly procedure.

G = SearchGraph(; db, dist)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.95)))
index!(G, ctx)
optimize_index!(G, ctx, MinRecall(0.9))
In the cell above:

1. The first two lines define the index and the search context (caches and hyperparameters); in particular, we request a very high-quality build with MinRecall(0.95). High-quality constructions yield faster queries due to the underlying graph structure.
2. index! runs the actual indexing procedure using the given search context.
3. optimize_index! tunes the index to trade off quality and speed.
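A hedged sketch of how one could estimate the empirical recall of the approximate index on a small query sample; it assumes that searchbatch, ExhaustiveSearch, and macrorecall accept these arguments as in recent SimilaritySearch releases, and sample_ids is an ad-hoc name:

# compare the approximate index against a brute-force baseline (assumed API)
sample_ids = rand(1:length(F), 64)
Q = VectorDatabase(F[sample_ids])
k = 12
goldknns, _ = searchbatch(ExhaustiveSearch(; db, dist), ctx, Q, k)  # exact answers
knns, _ = searchbatch(G, ctx, Q, k)                                 # approximate answers
@show macrorecall(goldknns, knns)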

Searching

Our index can solve queries over the entire dataset, for instance, finding the integers whose factorizations are most similar to a given one as nearest neighbor queries.

function search_and_render(index, ctx, qnum, qenc, res, k)
    res = reuse!(res, k)  # recycle the result container for k neighbors
    @time search(index, ctx, qenc, res)

    # render the result set as a markdown table
    L = [
        """## result list for num: $(qnum), factors: $qenc""",
        """| nn | num | factors | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(viewitems(res))
        push!(L, """| $j | $(p.id) | $(index[p.id]) | $(round(p.weight, digits=3)) |""")
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples (random)")

res = knnqueue(ctx, 12)
for qnum in [42, 51, 1000, 1492, 3192]
    qenc = encode_factors(qnum)
    search_and_render(G, ctx, qnum, qenc, res, maxlength(res)) |> display
end
  0.000385 seconds (3 allocations: 64 bytes)
  0.000585 seconds (3 allocations: 64 bytes)
  0.000355 seconds (3 allocations: 64 bytes)
  0.000289 seconds (3 allocations: 64 bytes)
  0.000176 seconds (3 allocations: 64 bytes)

Search examples (random)

result list for num: 42, factors: Int32[2, 3, 7]

| nn | num | factors | dist |
|----|-------|----------------|------|
| 1 | 756 | Int32[2, 3, 7] | 0.0 |
| 2 | 42 | Int32[2, 3, 7] | 0.0 |
| 3 | 84 | Int32[2, 3, 7] | 0.0 |
| 4 | 294 | Int32[2, 3, 7] | 0.0 |
| 5 | 1176 | Int32[2, 3, 7] | 0.0 |
| 6 | 2058 | Int32[2, 3, 7] | 0.0 |
| 7 | 2352 | Int32[2, 3, 7] | 0.0 |
| 8 | 4536 | Int32[2, 3, 7] | 0.0 |
| 9 | 5292 | Int32[2, 3, 7] | 0.0 |
| 10 | 7056 | Int32[2, 3, 7] | 0.0 |
| 11 | 8064 | Int32[2, 3, 7] | 0.0 |
| 12 | 15876 | Int32[2, 3, 7] | 0.0 |

result list for num: 51, factors: Int32[3, 17]

| nn | num | factors | dist |
|----|-------|--------------|------|
| 1 | 51 | Int32[3, 17] | 0.0 |
| 2 | 153 | Int32[3, 17] | 0.0 |
| 3 | 459 | Int32[3, 17] | 0.0 |
| 4 | 867 | Int32[3, 17] | 0.0 |
| 5 | 1377 | Int32[3, 17] | 0.0 |
| 6 | 2601 | Int32[3, 17] | 0.0 |
| 7 | 4131 | Int32[3, 17] | 0.0 |
| 8 | 7803 | Int32[3, 17] | 0.0 |
| 9 | 12393 | Int32[3, 17] | 0.0 |
| 10 | 14739 | Int32[3, 17] | 0.0 |
| 11 | 23409 | Int32[3, 17] | 0.0 |
| 12 | 44217 | Int32[3, 17] | 0.0 |

result list for num: 1000, factors: Int32[2, 5]

| nn | num | factors | dist |
|----|------|-------------|------|
| 1 | 10 | Int32[2, 5] | 0.0 |
| 2 | 20 | Int32[2, 5] | 0.0 |
| 3 | 40 | Int32[2, 5] | 0.0 |
| 4 | 50 | Int32[2, 5] | 0.0 |
| 5 | 80 | Int32[2, 5] | 0.0 |
| 6 | 100 | Int32[2, 5] | 0.0 |
| 7 | 160 | Int32[2, 5] | 0.0 |
| 8 | 200 | Int32[2, 5] | 0.0 |
| 9 | 250 | Int32[2, 5] | 0.0 |
| 10 | 320 | Int32[2, 5] | 0.0 |
| 11 | 400 | Int32[2, 5] | 0.0 |
| 12 | 500 | Int32[2, 5] | 0.0 |

result list for num: 1492, factors: Int32[2, 373]

| nn | num | factors | dist |
|----|-------|-------------------|------|
| 1 | 95488 | Int32[2, 373] | 0.0 |
| 2 | 2238 | Int32[2, 3, 373] | 0.2 |
| 3 | 4476 | Int32[2, 3, 373] | 0.2 |
| 4 | 6714 | Int32[2, 3, 373] | 0.2 |
| 5 | 8952 | Int32[2, 3, 373] | 0.2 |
| 6 | 13428 | Int32[2, 3, 373] | 0.2 |
| 7 | 17904 | Int32[2, 3, 373] | 0.2 |
| 8 | 20142 | Int32[2, 3, 373] | 0.2 |
| 9 | 26856 | Int32[2, 3, 373] | 0.2 |
| 10 | 35808 | Int32[2, 3, 373] | 0.2 |
| 11 | 40284 | Int32[2, 3, 373] | 0.2 |
| 12 | 53712 | Int32[2, 3, 373] | 0.2 |

result list for num: 3192, factors: Int32[2, 3, 7, 19]

| nn | num | factors | dist |
|----|-------|---------------------|------|
| 1 | 798 | Int32[2, 3, 7, 19] | 0.0 |
| 2 | 1596 | Int32[2, 3, 7, 19] | 0.0 |
| 3 | 2394 | Int32[2, 3, 7, 19] | 0.0 |
| 4 | 3192 | Int32[2, 3, 7, 19] | 0.0 |
| 5 | 4788 | Int32[2, 3, 7, 19] | 0.0 |
| 6 | 5586 | Int32[2, 3, 7, 19] | 0.0 |
| 7 | 6384 | Int32[2, 3, 7, 19] | 0.0 |
| 8 | 7182 | Int32[2, 3, 7, 19] | 0.0 |
| 9 | 9576 | Int32[2, 3, 7, 19] | 0.0 |
| 10 | 11172 | Int32[2, 3, 7, 19] | 0.0 |
| 11 | 19152 | Int32[2, 3, 7, 19] | 0.0 |
| 12 | 21546 | Int32[2, 3, 7, 19] | 0.0 |
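
The same calls also work outside the helper. For instance, since 2024 = 2^3 · 11 · 23, the following looks for numbers sharing the prime factors 2, 11, and 23; res1 is an ad-hoc name, while knnqueue, search, and viewitems are used exactly as in the helper above:

res1 = knnqueue(ctx, 5)
search(G, ctx, encode_factors(2024), res1)
for p in viewitems(res1)
    println(p.id, " => ", G[p.id], "  dist = ", round(p.weight, digits=3))
end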

Visualizing with UMAP projection

The layout keyword controls how the embedding is initialized; a random initialization could yield lower-quality embeddings but is much faster than alternatives such as the spectral layout, which we use below. Note that we pass the SearchGraph G directly. We also use a second call to compute a 3D embedding, which serves as a kind of color embedding; there we pass U2 to avoid recomputing several of the involved structures.

e2, e3 = let min_dist=0.5f0,
             k=20,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, G; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end

function normcolors(V)
    # rescale a color channel to the unit interval [0, 1]
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

# map the 3D embedding to RGB channels
normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

# 2D scatter of the embedding, colored by the normalized 3D embedding
let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)
    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
end

plot!()
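
If you want to keep the figure, Plots can write the current plot to disk; the filename here is arbitrary:

savefig("umap-primes.png")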

Environment and dependencies

Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = .
  JULIA_LOAD_PATH = @:@stdlib
Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.8.0
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.8.1
  [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.18
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.40.20
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.3.1
  [053f045d] SimilaritySearch v0.13.0
⌅ [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.7
  [7f6f6c8a] TextSearch v0.19.6
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`