Demonstrations

Usage demonstrations of `SimilaritySearch` with synthetic and real-world data

Introduction

Our demonstrations are Pluto and Jupyter notebooks that can be used to reproduce the examples and explore SimilaritySearch interactively. To make the demonstrations more attractive, we also make extensive use of visualizations based on non-linear dimensionality reduction. These kinds of algorithms take the k nearest neighbors of the objects in a database as input to produce the low-dimensional embedding. In particular, we use the SimSearchManifoldLearning package, which provides an implementation of UMAP and also defines the functions needed to interoperate with the ManifoldLearning package.

We provide two kinds of examples:

  • Pluto reactive notebooks that can run locally or online.

  • Jupyter notebooks, which are less reactive but can be viewed directly on GitHub without running them.

List of examples

We group the examples by the kind of data, since some of the datasets are quite large and require considerable computing power. We also show how to connect SimilaritySearch with other packages that need to solve k nearest neighbor queries.

Indexing and visualizing synthetic and easily generated data

  • Synthetic 8D: A tutorial-like Jupyter notebook that shows how to create an index on synthetic data and search it. Synthetic 8-dimensional dataset under L2.

  • Synthetic 2D: A tutorial-like Jupyter notebook that works on a 2D synthetic dataset and also shows how the index behaves in regions of the database with different densities. Synthetic 2-dimensional dataset under L2.
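
For orientation, the sketch below shows what a k nearest neighbor query under L2 returns, using an exhaustive scan over a synthetic 8-dimensional dataset. It is plain Julia and does not use the SimilaritySearch API; the function name `exhaustive_knn` and the dataset sizes are assumptions made for illustration. An index built in the notebooks is meant to answer the same kind of query without comparing against every object.

```julia
using LinearAlgebra  # norm
using Random         # seeded generator, so the example is reproducible

# Exact k nearest neighbor search under the L2 distance by exhaustive scan.
# `db` and `queries` store one object per column; assumes k <= size(db, 2).
# `exhaustive_knn` is a hypothetical helper, not part of SimilaritySearch.
function exhaustive_knn(db::AbstractMatrix, queries::AbstractMatrix, k::Integer)
    n, m = size(db, 2), size(queries, 2)
    ids = Matrix{Int}(undef, k, m)        # neighbor identifiers, one column per query
    dists = Matrix{Float64}(undef, k, m)  # matching distances, sorted increasingly
    for j in 1:m
        q = view(queries, :, j)
        d = [norm(view(db, :, i) - q) for i in 1:n]  # distance to every object
        nearest = partialsortperm(d, 1:k)            # indices of the k smallest
        ids[:, j] = nearest
        dists[:, j] = d[nearest]
    end
    ids, dists
end

rng = MersenneTwister(0)
db = rand(rng, Float32, 8, 1_000)  # 1000 synthetic 8-dimensional vectors
queries = db[:, 1:5]               # query with objects taken from the database
ids, dists = exhaustive_knn(db, queries, 3)
```

Because the queries are copies of the first five database objects, each query's nearest neighbor is the object itself at distance zero.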

Indexing and visualizing real high dimensional datasets

All Pluto notebooks can run on mybinder without installing anything on your computer. However, some examples use large datasets and some require substantial computational resources (as many threads as you have); they may run slowly on cloud computing services.

  • Integers as prime factors: A tutorial-like Jupyter notebook that produces a UMAP visualization of integers represented by their prime factors. It uses UMAP 2D and 3D projections. Very high dimension, based on the number of prime factors under the n integers; different user-defined distances.

  • Prime gaps: Visualizes sequences of prime gaps to search for patterns in this infinite source of objects. It uses 2D and 3D projections.

  • WIT: This example shows how to navigate, query, and visualize a small subset of the WIT dataset using CLIP embeddings (vision & language). ~300K 512-dimensional vectors using the cosine distance.

    • Jupyter-based WIT demo, SimilaritySearch v0.10.

    • Pluto-based WIT demo, SimilaritySearch v0.8.

  • Glove: Navigate and visualize semantic representations (GloVe word embeddings); it can also solve analogies. The vocabulary consists of 400K tokens represented as 100-dimensional vectors under the cosine distance.

  • MNIST: Navigation and visualization of the MNIST dataset of handwritten digits. It uses images directly as objects (28x28 matrices).

  • Wiktionary: Pluto notebook to navigate and query the Wiktionary vocabulary using the Levenshtein distance (~1M words).

  • Tweets: Pluto notebook to visualize a collection of Spanish-language Twitter messages with emojis using bag-of-words representations. 50K items.
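
Two of the distances named in the list above, cosine and Levenshtein, can be sketched in plain Julia as follows. This is only an illustration of what the distances compute; the function names `cosine_distance` and `levenshtein` are assumptions, and the notebooks presumably rely on the implementations shipped with the packages rather than on code like this.

```julia
using LinearAlgebra  # dot, norm

# Cosine distance: 1 - cos(angle between u and v).
# It is 0 for vectors pointing in the same direction and 1 for orthogonal ones.
# Assumes neither vector is all zeros.
cosine_distance(u, v) = 1 - dot(u, v) / (norm(u) * norm(v))

# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn `a` into `b`.
# Dynamic programming over two rows of the edit matrix.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))  # row for the empty prefix of `a`
    curr = similar(prev)
    for i in eachindex(s)
        curr[1] = i              # deleting the first i characters of `a`
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution or match
        end
        prev, curr = curr, prev  # reuse the two rows
    end
    prev[end]
end

d_cos = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
d_lev = levenshtein("kitten", "sitting")
```

Both functions take two objects and return a non-negative number, which is the shape a user-defined distance generally needs in order to be compared across objects.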

TODO: Cites and references

Interoperating with other packages

  • Working with ManifoldLearning. This Pluto notebook implements the structs and functions needed to solve knn queries for ManifoldLearning algorithms. It uses two datasets: the first is the scurve dataset and the second is prime gaps treated as time series.

Search demos and UMAP visualization

The demos are Pluto and [Jupyter](https://jupyter.org/) notebooks. From the repository root, run the following commands.

$ JULIA_NUM_THREADS=auto julia --project=.
...

julia> using Pluto
...
julia> Pluto.run(notebook="WIT/wit-demo.jl")
...

or

$ JULIA_NUM_THREADS=auto jupyter-lab .

Please recall that Julia compiles a package the first time you load it. Pluto notebooks also store their own environments, so they may use package versions different from those listed in the repo environment; as a result, packages are installed and compiled the first time a notebook runs. This strategy is intended to improve reproducibility at the cost of longer loading times. Jupyter notebooks also contain the necessary package-manager instructions to improve reproducibility.

Note: The Pluto interface also allows loading notebooks, so you don't need to exit and re-run Pluto to explore other examples.

Visualization

Most visualizations are made with UMAP models using the SimSearchManifoldLearning package. Fitting these models can be expensive, so we recommend running the notebooks with all available threads.
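
A quick way to confirm how many threads a session actually has is sketched below. It uses only Julia's built-in `Threads` module; the variable names and the toy loop are assumptions for illustration and are unrelated to the UMAP code in the notebooks.

```julia
# Number of threads available to this Julia session; this is 1 unless Julia
# was started with more, e.g. through the JULIA_NUM_THREADS variable.
available = Threads.nthreads()
println("Julia is running with $available thread(s)")

# Toy parallel loop: each iteration writes to its own slot, so it is safe
# to split the iterations among the available threads.
squares = zeros(Int, 100)
Threads.@threads for i in 1:100
    squares[i] = i^2
end
```

If the reported number is 1, the notebooks still run, but the more expensive visualizations are likely to take noticeably longer.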

Initializing the environment

SimilaritySearch.jl is written in the Julia language, so you need to install Julia first in order to run the notebooks. After that, install Pluto and/or IJulia (for Jupyter notebooks). If you need more information about how to install and use these notebooks, please see their respective sites.
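
As a minimal sketch, and assuming Julia is already installed and has network access, both notebook front ends can be added from the Julia REPL with the standard package manager. Which of the two you need depends on the notebooks you plan to open; the official installation guides remain the authoritative reference.

```julia
using Pkg

# Install the notebook front ends into the active environment.
# Pluto is needed for the Pluto notebooks, IJulia for the Jupyter ones.
Pkg.add(["Pluto", "IJulia"])
```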

CC BY-SA 4.0 Eric S. Tellez . Last modified: June 05, 2023. Website built with Franklin.jl and the Julia programming language.