Datasets
Usage demonstrations of `SimilaritySearch` with synthetic and real world data
Datasets
Real world
- Glove. From the Glove site, it corresponds to the 100d, 6B-token embeddings. We use the `Embeddings.jl` package to simplify downloading and loading it (see the loading sketch after this list).
- MNIST. From the MNIST site, it corresponds to 60k 28x28 handwritten digits. We use the `MLDatasets.jl` package to simplify downloading and loading it.
- Wiktionary. Taken from the Wiktionary site; we select English terms from the English Wiktionary.
- WIT-300K. From CLIP and WIT, we downloaded the first 300K annotated images of the Spanish Wikipedia and took their CLIP embeddings. Available from the demo subfolder.
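For instance, Glove and MNIST can be loaded and wrapped as `SimilaritySearch` databases roughly as follows. This is a minimal sketch, not the demos' exact code: it assumes MLDatasets ≥ 0.7 and that file index 2 of `Embeddings.jl`'s English GloVe files is the 6B/100d variant (check the `Embeddings.jl` documentation if your listing differs).

```julia
using Embeddings, MLDatasets, SimilaritySearch

# GloVe: `glove.embeddings` holds one word vector per column
glove = load_embeddings(GloVe{:en}, 2)            # assumed to be the 6B/100d file
glove_db = MatrixDatabase(glove.embeddings)

# MNIST: flatten each 28x28 image into a 784-dimensional column vector
mnist = MNIST(split=:train).features              # 28x28x60000 Float32 array
mnist_db = MatrixDatabase(reshape(mnist, 28 * 28, :))
```

Either database can then be indexed, for example with a `SearchGraph`, as in the HDF5 sketch further below.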
Partitions
The versions used in the demonstrations are not split into train and test partitions, but those used in the paper are. If you want to reproduce the same results, please use the datasets from ann-benchmarks and its repository.
For WIT and Twitter-2M, please use the following HDF5 files; they follow a structure similar to that of the ann-benchmarks files (see the loading sketch below).
Twitter-2M. These word embeddings correspond to the model labeled ALL-2M on the Regional Spanish Models site, partitioned for use as a similarity search benchmark.
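A minimal sketch for reading one of these HDF5 files and searching it with `SimilaritySearch`, assuming the usual ann-benchmarks layout (`train`, `test`, and `neighbors` datasets inside the file). The file name and the distance function below are placeholders to adapt to the benchmark at hand, and the API names follow the package's README-style interface, which may vary slightly across versions.

```julia
using HDF5, SimilaritySearch

h5open("benchmark.h5") do f                       # placeholder file name
    # HDF5.jl reads the row-major Python datasets transposed, so each column
    # should be one vector; double check the orientation of your file.
    train = MatrixDatabase(read(f, "train"))      # database vectors
    queries = MatrixDatabase(read(f, "test"))     # query vectors
    gold = read(f, "neighbors")                   # gold-standard neighbor ids

    dist = NormalizedCosineDistance()             # or SqL2Distance() for Euclidean benchmarks
    G = SearchGraph(; dist, db=train)
    index!(G)
    knns, dists = searchbatch(G, queries, 10)     # 10 nearest neighbors per query

    # Compare `knns` against `gold` to estimate recall; ann-benchmarks files
    # store 0-based identifiers, while Julia indices are 1-based.
end
```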