Demonstrations
Usage demonstrations of `SimilaritySearch` with synthetic and real-world data
Introduction
Our demonstrations are Pluto and Jupyter notebooks that can be used to replicate the examples and to use SimilaritySearch interactively. To make the demonstrations more attractive, we also make intensive use of visualizations based on non-linear dimensionality reduction. These kinds of algorithms take the k nearest neighbors of a database as input to produce the low-dimensional embedding. In particular, we use the SimSearchManifoldLearning package, which provides an implementation of UMAP and also defines the necessary functions to interoperate with the ManifoldLearning package.
We provide two kinds of examples:
Pluto: reactive notebooks that can run locally or online.
Jupyter: notebooks that are less reactive, but they are great for viewing directly on GitHub without running them.
List of examples
We separate the examples by kind of data, since some of the datasets are quite large and require a lot of computing power. We also show how to connect SimilaritySearch with other packages that require solving k nearest neighbor queries.
Indexing and visualizing synthetic and easily generated data
Synthetic 8D: A tutorial-like Jupyter notebook that shows how to create an index on synthetic data and search it. Synthetic 8-dimensional dataset under L2.
Synthetic 2D: A tutorial-like Jupyter notebook that works on a 2D synthetic dataset and shows how the index behaves in regions of the database with different density. Synthetic 2-dimensional dataset under L2.
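As a point of reference for what these tutorials compute, the following is a minimal sketch of exhaustive k nearest neighbor search under L2 on synthetic 8-dimensional data. It uses only the Julia standard library; `bruteforce_knn` is a hypothetical helper written for illustration and is not part of the SimilaritySearch API. An index answers the same kind of query without scanning every object.

```julia
using Random          # stdlib: reproducible synthetic data
using LinearAlgebra   # stdlib: norm

# Exhaustive k nearest neighbor search under the L2 (Euclidean) distance.
# `db` stores one object per column; `q` is a single query vector.
# Hypothetical helper for illustration, not part of SimilaritySearch.
function bruteforce_knn(db::AbstractMatrix, q::AbstractVector, k::Integer)
    dists = [norm(view(db, :, i) - q) for i in axes(db, 2)]
    ids = partialsortperm(dists, 1:k)   # indices of the k smallest distances
    return ids, dists[ids]
end

rng = MersenneTwister(0)
db = rand(rng, Float32, 8, 1000)   # 1000 synthetic 8-dimensional points
q = db[:, 42]                      # query with an object stored in the database

ids, dists = bruteforce_knn(db, q, 5)
println(ids)     # the first identifier is the query object itself
println(dists)   # distances in increasing order, starting at zero
```

The cost of this scan grows linearly with the database size for every query, which is the motivation for building an index on larger datasets.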
Indexing and visualizing real high dimensional datasets
All Pluto notebooks can work on mybinder and run without installing anything on your computer. However, some examples use large datasets and some require high computational resources (as many threads as you have), so they may run slowly on cloud computing services.
Integers as prime factors: A tutorial-like Jupyter notebook that produces a UMAP visualization of integers represented by their prime factors. It uses UMAP 2D and 3D projections. Very high dimension, based on the number of factors of the integers; uses different user-defined distances.
Prime gaps: Visualization of sequences of prime gaps, useful for searching for patterns in this infinite source of objects. It uses 2D and 3D projections.
Search and UMAP projection Prime Gaps demo. It generates the dataset.
The end of Primes (using ManifoldLearning) contains a prime-gap visualization with Isomap.
WIT: This example shows how to navigate, query, and visualize a small subset of the WIT dataset using CLIP embeddings (vision & language). ~300K 512-dimensional vectors using the cosine distance.
GloVe: Navigate and visualize semantic representations (GloVe word embeddings); it can also solve analogies. The vocabulary consists of 400K tokens represented as 100-dimensional vectors under the cosine distance.
Jupyter-based GloVe demo, SimilaritySearch v0.10.
Pluto-based GloVe demo, SimilaritySearch v0.8.
MNIST: Navigation and visualization of the MNIST dataset of handwritten digits. It uses images directly as objects (28x28 matrices).
Jupyter-based MNIST demo, SimilaritySearch v0.10.
Pluto-based MNIST demo, SimilaritySearch v0.8.
Pluto-based MNIST animated projections, SimilaritySearch v0.8.
Wiktionary: Pluto notebook to navigate and query the Wiktionary vocabulary using Levenshtein distance (~1M words)
Jupyter-based Wiktionary demo, SimilaritySearch v0.10.
Pluto-based Wiktionary demo, SimilaritySearch v0.8.
Tweets: Pluto notebook to visualize a collection of Spanish Twitter messages with emojis using bag-of-words representations. 50K items.
Search and UMAP projection Emojispace demo.
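Two of the distances named in the list above, the cosine distance (WIT, GloVe) and the Levenshtein distance (Wiktionary), can be sketched in plain Julia as follows. These are illustrative definitions using only the standard library; the function names `cosine_distance` and `levenshtein` are assumptions for this sketch and are not claimed to match the implementations used in the notebooks.

```julia
using LinearAlgebra   # stdlib: dot, norm

# Cosine distance: 1 - cos(angle between u and v).
cosine_distance(u, v) = 1 - dot(u, v) / (norm(u) * norm(v))

# Levenshtein (edit) distance: minimum number of single-character insertions,
# deletions, and substitutions needed to turn `a` into `b`.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)     # vectors of Char, safe for Unicode strings
    prev = collect(0:length(t))       # distances from the empty prefix of `a`
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

println(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
println(levenshtein("kitten", "sitting"))
```

The Levenshtein sketch keeps only two rows of the dynamic programming table, so memory use is proportional to the length of the second string.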
TODO: Cites and references
Interoperating with other packages
Working with ManifoldLearning. This Pluto notebook implements the necessary structs and functions to solve knn queries for ManifoldLearning algorithms. We use two datasets: the first corresponds to the S-curve and the second to prime gaps treated as time series.
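To make the prime-gap data concrete, here is a minimal sketch that generates prime gaps and cuts them into fixed-length windows. Representing the series as sliding windows, and the helper names `primes_upto`, `prime_gaps`, and `windows`, are assumptions made for illustration; the notebook may build its dataset differently.

```julia
# Sieve of Eratosthenes: all primes up to and including n (n >= 1).
function primes_upto(n::Integer)
    is_prime = trues(n)
    is_prime[1] = false
    for p in 2:isqrt(n)
        if is_prime[p]
            for m in p*p:p:n          # mark multiples of p as composite
                is_prime[m] = false
            end
        end
    end
    return findall(is_prime)
end

# Gaps between consecutive primes.
prime_gaps(n::Integer) = diff(primes_upto(n))

# Sliding windows of length w over a sequence, one window per column.
# Treating each window as one object is an assumption of this sketch.
function windows(seq::AbstractVector, w::Integer)
    return reduce(hcat, [seq[i:i+w-1] for i in 1:length(seq)-w+1])
end

gaps = prime_gaps(50)
println(gaps)
X = windows(gaps, 4)
println(size(X))
```

Each column of `X` can then be treated as a vector and compared with a distance such as L2.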
Search demos and UMAP visualization
The demos are Pluto and [Jupyter](https://jupyter.org/) notebooks. From the repo's root, run the following commands.
$ JULIA_NUM_THREADS=auto julia --project=.
...
julia> using Pluto
...
julia> Pluto.run(notebook="WIT/wit-demo.jl")
...
or
$ JULIA_NUM_THREADS=auto jupyter-lab .
Please recall that Julia compiles a package the first time you load it. Pluto notebooks also save their own environments and can therefore use package versions different from those listed in the repo environment, which means packages are installed and compiled the first time a notebook runs. This strategy improves reproducibility at the cost of longer loading times. Jupyter notebooks also contain the necessary package-manager instructions to improve reproducibility.
Note: The Pluto interface also allows loading notebooks, so you don't need to exit and re-run Julia to explore other examples.
Visualization
Most visualizations are made with UMAP models using the SimSearchManifoldLearning package. These can be expensive, so it is recommended to run the notebooks with all available threads.
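A quick way to confirm that a session actually started with multiple threads is to query `Threads.nthreads()` from inside a notebook or the REPL. The loop below is only a small illustration of a threaded computation and is unrelated to how UMAP is parallelized internally.

```julia
# Number of threads available to this Julia session.
println("Julia is running with ", Threads.nthreads(), " thread(s)")

# Tiny multithreaded loop: each iteration writes only its own slot,
# so there is no data race between threads.
squares = zeros(Int, 16)
Threads.@threads for i in eachindex(squares)
    squares[i] = i^2
end
println(squares)
```

If the reported value is 1, the session was probably started without `JULIA_NUM_THREADS` set (or without the `--threads` option).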
Initializing the environment
SimilaritySearch.jl is written in the Julia language, so you need to install Julia first in order to run the notebooks. After that, install Pluto and/or IJulia (for Jupyter notebooks). If you need more information about how to install and use these notebooks, please see their respective sites.
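As a minimal sketch, both notebook front ends can be installed from the Julia REPL with the built-in package manager. This installs them into whichever environment is currently active, which is an assumption about your setup; the respective project sites remain the authoritative instructions.

```julia
import Pkg

# Install the notebook front ends into the active environment.
# Keep only the one you need if you plan to use a single notebook type.
Pkg.add(["Pluto", "IJulia"])
```

After installation, `using Pluto` followed by `Pluto.run()` starts the Pluto interface, as in the commands shown earlier.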