Datasets
Usage demonstrations of `SimilaritySearch` with synthetic and real world data
Datasets
Real world
- Glove. From the Glove site, it corresponds to the 100d, 6B-token embeddings. We use the `Embeddings.jl` package to simplify downloading and loading it (see the loading sketch after this list).
- MNIST. From the MNIST site, it corresponds to 60k 28x28 handwritten digits. We use the `MLDatasets.jl` package to simplify downloading and loading it.
- Wiktionary. Taken from the Wiktionary site; we select English terms from the English Wiktionary.
- WIT-300K. From CLIP and WIT, we downloaded the first 300K annotated images of the Spanish Wikipedia and took their CLIP embeddings. Available from the demo subfolder.
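For instance, Glove and MNIST can be loaded and wrapped as `SimilaritySearch` databases roughly as follows. This is a minimal sketch, not the demos' exact code: it assumes MLDatasets ≥ 0.7 and that file index 2 of `Embeddings.jl`'s English GloVe files is the 6B/100d variant (check the `Embeddings.jl` documentation if your listing differs).

```julia
using Embeddings, MLDatasets, SimilaritySearch

# GloVe: `glove.embeddings` holds one word vector per column
glove = load_embeddings(GloVe{:en}, 2)            # assumed to be the 6B/100d file
glove_db = MatrixDatabase(glove.embeddings)

# MNIST: flatten each 28x28 image into a 784-dimensional column vector
mnist = MNIST(split=:train).features              # 28x28x60000 Float32 array
mnist_db = MatrixDatabase(reshape(mnist, 28 * 28, :))
```

Either database can then be indexed, for example with a `SearchGraph`, as in the HDF5 sketch further below.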
Partitions
The versions used in the demonstrations are not split into train and test partitions, but those used in the paper are. If you want to reproduce the same results, please use the datasets from ann-benchmarks and its repository.
For WIT and Twitter-2M, please use the following HDF5 files; they follow a structure similar to that of the ann-benchmarks files (see the loading sketch below).
Twitter-2M. These word embeddings correspond to the model labeled ALL-2M on the Regional Spanish Models site, partitioned for use as a similarity search benchmark.
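A minimal sketch for reading one of these HDF5 files and searching it with `SimilaritySearch`, assuming the usual ann-benchmarks layout (`train`, `test`, and `neighbors` datasets inside the file). The file name and the distance function below are placeholders to adapt to the benchmark at hand, and the API names follow the package's README-style interface, which may vary slightly across versions.

```julia
using HDF5, SimilaritySearch

h5open("benchmark.h5") do f                       # placeholder file name
    # HDF5.jl reads the row-major Python datasets transposed, so each column
    # should be one vector; double check the orientation of your file.
    train = MatrixDatabase(read(f, "train"))      # database vectors
    queries = MatrixDatabase(read(f, "test"))     # query vectors
    gold = read(f, "neighbors")                   # gold-standard neighbor ids

    dist = NormalizedCosineDistance()             # or SqL2Distance() for Euclidean benchmarks
    G = SearchGraph(; dist, db=train)
    index!(G)
    knns, dists = searchbatch(G, queries, 10)     # 10 nearest neighbors per query

    # Compare `knns` against `gold` to estimate recall; ann-benchmarks files
    # store 0-based identifiers, while Julia indices are 1-based.
end
```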