Software packages

  1. Similarity Search - Nearest Neighbor Search
    1. SimilaritySearch.jl
    2. SpatialAccessTrees.jl
    3. InvertedFiles.jl
    4. SlicedSearch.jl: Turbo Scan
    5. TextSearch.jl
    6. NATIX
  2. Text classification
    1. EvoMSA
    2. MicroTC
    3. B4MSA
    4. TextClassification.jl
  3. Others
    1. SearchModels.jl (optimization)
    2. KCenters.jl (k centers and clustering)
    3. SimSearchManifoldLearning.jl (non-linear dimensional reduction)
    4. Intersections.jl (intersection algorithms)
    5. LevelDB.jl (key-value database)
    6. SnowballStemmer.jl (stemmer)
    7. Text_models
    8. BILMA


Similarity Search - Nearest Neighbor Search


SimilaritySearch.jl

SimilaritySearch.jl is a library for nearest neighbor search. In particular, it contains the implementation of SearchGraph, a fast and flexible search index that works with any metric function. Most of its functions and structures are designed to support multithreading.

The package provides the following indexes:

ParallelExhaustiveSearch: A brute-force search index where each query is solved using all available threads.
ExhaustiveSearch: A brute-force search index where each query is solved using a single thread.
SearchGraph: An approximate search index with parallel construction.

The main functions are:

search: Solves a single query.
searchbatch: Solves a set of queries.
allknn: Computes the nearest neighbors for all elements in an index.
neardup: Removes near-duplicates from a metric dataset.
closestpair: Computes the closest pair in a metric dataset.
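To make the brute-force idea concrete, here is a minimal pure-Python sketch of exhaustive k-nearest-neighbor search and of an all-pairs variant. It is not the SimilaritySearch.jl API: the names `knn_search`, `all_knn`, and `euclidean` are hypothetical, and the sketch is single-threaded.

```python
import heapq
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_search(db, query, k, dist=euclidean):
    """Exhaustive search: evaluate dist against every item in db.

    Returns up to k (distance, index) pairs sorted by increasing distance.
    """
    scored = ((dist(item, query), i) for i, item in enumerate(db))
    return heapq.nsmallest(k, scored)

def all_knn(db, k, dist=euclidean):
    """k nearest neighbors of every item in db, excluding the item itself."""
    result = []
    for i, item in enumerate(db):
        scored = ((dist(other, item), j) for j, other in enumerate(db) if j != i)
        result.append(heapq.nsmallest(k, scored))
    return result

# Hypothetical toy dataset, used only for illustration.
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
nearest = knn_search(db, (0.9, 0.1), k=2)
neighbors = all_knn(db, k=1)
```

Each query costs one distance evaluation per database item, which is the baseline that approximate indexes such as SearchGraph aim to beat.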



SpatialAccessTrees.jl

Spatial access trees are a family of metric trees with excellent performance on low- and medium-dimensional datasets.

The package supports strategies that trade accuracy for search time in spatial access trees.



InvertedFiles.jl

This package implements inverted files, also known as inverted indexes: data structures that represent a large sparse matrix, specially organized to compute some distance functions and fetch the k nearest neighbors. They are mainly used for full-text search and other search tasks where data can be formulated as large sparse vectors. In particular, the package implements three types of inverted files:

WeightedInvertedFile: An inverted file for sparse vectors; it can solve nearest neighbor queries using the normalized cosine distance.
BinaryInvertedFile: An inverted file for sparse binary data; it can solve nearest neighbor queries using the Jaccard, Dice, and cosine distances, and also the intersection dissimilarity measure.
KnrIndex: An approximate similarity search index based on inverted files. It supports general metric spaces.

These structs integrate with the SimilaritySearch environment, so you can use them as drop-in replacements for other indexes. In particular, inverted files are well known for their scalability when properly set up.
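As an illustration of how an inverted file answers cosine queries over sparse vectors, the following pure-Python sketch stores one posting list per term and accumulates dot products only for documents that share a term with the query. It is a conceptual sketch, not the InvertedFiles.jl API; the function names and the toy vectors are assumptions.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit Euclidean norm.

    Assumes the vector has at least one nonzero weight.
    """
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def build_inverted_file(vectors):
    """Map each term to its posting list of (doc_id, normalized weight)."""
    postings = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in normalize(vec).items():
            postings[term].append((doc_id, weight))
    return postings

def search(postings, query, k):
    """Return up to k (cosine distance, doc_id) pairs, closest first.

    Only documents sharing at least one term with the query are scored.
    """
    scores = defaultdict(float)
    for term, q_weight in normalize(query).items():
        for doc_id, d_weight in postings.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    ranked = sorted((1.0 - s, doc_id) for doc_id, s in scores.items())
    return ranked[:k]

# Hypothetical toy collection of sparse vectors.
docs = [{"cat": 1.0, "dog": 1.0}, {"cat": 1.0}, {"fish": 2.0}]
index = build_inverted_file(docs)
hits = search(index, {"cat": 1.0}, k=2)
```

The work per query is proportional to the total length of the posting lists for the query terms, not to the size of the collection, which is where the scalability of inverted files comes from.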


SlicedSearch.jl: Turbo Scan

Implements Turbo Scan (TS), an innovative k-nearest neighbor search solution designed for high-dimensional data and rare workloads where the cost of building an index cannot be amortized over time. There are numerous scenarios where the overhead of constructing an index cannot be justified because only a limited number of queries are performed on the dataset.



TextSearch.jl

TextSearch.jl is a package for creating vector representations of text and searching them, mostly independently of the language. It provides the following concepts:

TextConfig: Defines preprocessing pipelines.
Vocabulary: Text vocabularies.
VectorModel: Text vectorizers (sparse vectors) using global and local weighting schemes.
BM25InvertedFile: A full-text inverted file using the BM25 score.

It is intended to be used with SimilaritySearch.jl and with the InvertedFiles package. The BM25InvertedFile allows searching without using other packages.
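To show what a BM25-scored inverted file does, here is a small pure-Python sketch. It is not the TextSearch.jl API: the class `BM25Index`, the whitespace tokenizer, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Toy preprocessing: lowercase and split on whitespace."""
    return text.lower().split()

class BM25Index:
    """Hypothetical minimal inverted file scored with BM25."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = []
        self.postings = defaultdict(list)  # term -> [(doc_id, term frequency)]
        for doc_id, text in enumerate(docs):
            tokens = tokenize(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term].append((doc_id, tf))
        self.avg_len = sum(self.doc_len) / len(self.doc_len)

    def idf(self, term):
        """Smoothed inverse document frequency (always positive)."""
        n = len(self.postings.get(term, ()))
        total = len(self.doc_len)
        return math.log(1.0 + (total - n + 0.5) / (n + 0.5))

    def search(self, query, k):
        """Return up to k (score, doc_id) pairs, best score first."""
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, ()):
                # Length normalization penalizes longer-than-average documents.
                norm = 1.0 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
        return sorted(((s, d) for d, s in scores.items()), reverse=True)[:k]

# Hypothetical toy corpus.
corpus = [
    "the quick brown fox",
    "the lazy dog",
    "quick quick fox jumps over the lazy dog",
]
bm25 = BM25Index(corpus)
results = bm25.search("quick fox", k=3)
```

In this toy corpus the short first document ranks above the longer third one even though the latter repeats "quick", because BM25 saturates term frequency and normalizes by document length.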



NATIX

The natix library contains several exact and approximate nearest neighbor search algorithms, mostly developed during my PhD project. It also contains several intersection and t-threshold algorithms and compact data structures. The focus of the library was the exploration of compact and compressed indexes for approximate nearest neighbor search.

It contains the canonical implementation of CNAPP, a compressed inverted file for approximate metric search.


Text classification


EvoMSA

A Multilingual Evolutionary Approach for Sentiment Analysis.



MicroTC

μTC is an automated text categorization framework based on hyperparameter optimization.



B4MSA

A Simple Approach to Multilingual Polarity Classification in Twitter.



TextClassification.jl

A Julia package for creating text classifiers based on full model selection, mostly based on MicroTC.



Others


SearchModels.jl (optimization)

Provides a generic tool for minimizing model errors using stochastic search, which is often used when the problem has no notion of a derivative. Problems of this kind rely on extensive exploration of combinatorial spaces, guided by an error function.

It is the core of auto-tuning and of optimization in discrete domains in other packages.
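A minimal sketch of derivative-free search over a discrete configuration space might look as follows. It combines random sampling with a one-parameter-at-a-time refinement; this is an illustrative strategy, not necessarily the one SearchModels.jl uses, and the configuration space, the error function, and all names are hypothetical.

```python
import random

def random_search(space, error, n_samples=16, seed=0):
    """Sample random configurations and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_cfg, best_err = None, float("inf")
    for _ in range(n_samples):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        err = error(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

def local_refine(space, error, cfg):
    """Change one parameter at a time for as long as the error strictly drops."""
    best_cfg, best_err = dict(cfg), error(cfg)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            for value in values:
                candidate = {**best_cfg, name: value}
                err = error(candidate)
                if err < best_err:
                    best_cfg, best_err, improved = candidate, err, True
    return best_cfg, best_err

# Hypothetical configuration space for a text model.
space = {
    "ngram": [1, 2, 3, 4, 5],
    "weighting": ["tf", "tfidf", "bm25"],
    "lowercase": [True, False],
}

def error(cfg):
    """Stand-in for a validation error; a real one would train and evaluate."""
    return (
        abs(cfg["ngram"] - 3) * 0.1
        + (0.0 if cfg["weighting"] == "tfidf" else 0.2)
        + (0.0 if cfg["lowercase"] else 0.05)
    )

start_cfg, start_err = random_search(space, error)
best_cfg, best_err = local_refine(space, error, start_cfg)
```

The only requirement on `error` is that it can be evaluated for any configuration; no gradient is needed, which is what makes this family of methods suitable for combinatorial spaces.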


KCenters.jl (k centers and clustering)

A package that implements some algorithms for solving the k-centers problem and integrates with SimilaritySearch.
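For readers unfamiliar with the k-centers problem (choose k items so that the largest distance from any item to its closest center is as small as possible), the sketch below implements the classic greedy farthest-first traversal, a well-known 2-approximation. Whether KCenters.jl uses this exact algorithm is not stated here; the function name and the toy data are assumptions.

```python
def farthest_first(db, k, dist, start=0):
    """Greedy k-centers: repeatedly add the item farthest from the chosen centers.

    Returns (center indices, covering radius).
    """
    centers = [start]
    # nearest[i] is the distance from item i to its closest chosen center.
    nearest = [dist(x, db[start]) for x in db]
    while len(centers) < k:
        far = max(range(len(db)), key=nearest.__getitem__)
        centers.append(far)
        for i, x in enumerate(db):
            nearest[i] = min(nearest[i], dist(x, db[far]))
    return centers, max(nearest)

# Hypothetical one-dimensional dataset with three visible groups.
points = [0, 1, 2, 10, 11, 20]
centers, radius = farthest_first(points, k=3, dist=lambda a, b: abs(a - b))
```

Because the algorithm only needs a distance function, it applies to any metric space, and the assignment of items to their nearest center is itself a nearest neighbor query.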


SimSearchManifoldLearning.jl (non-linear dimensional reduction)

A package that implements the UMAP algorithm for computing non-linear dimensional projections. It uses the SimilaritySearch package to speed up the construction of the knn graph and predictions. It also implements the methods needed to use SimilaritySearch with external manifold learning methods, like those defined in the ManifoldLearning package.


Intersections.jl (intersection algorithms)

Several intersection algorithms that work in the comparison model; they are the basis for InvertedFiles.jl.
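Two standard comparison-based ways to intersect sorted lists are sketched below: a linear merge, and a galloping (doubling) search that pays off when one list is much shorter than the other. These are textbook versions written for illustration, not the Intersections.jl implementations, and the function names are hypothetical.

```python
from bisect import bisect_left

def merge_intersection(a, b):
    """Intersect two sorted lists by advancing two cursors in lockstep."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out

def galloping_intersection(small, large):
    """Intersect sorted lists by galloping through the larger one."""
    out = []
    lo = 0
    for x in small:
        # Double the step until large[lo + step] >= x or the list ends.
        step = 1
        while lo + step < len(large) and large[lo + step] < x:
            step *= 2
        # Binary search for x inside the bracket found by galloping.
        hi = min(lo + step + 1, len(large))
        lo = bisect_left(large, x, lo, hi)
        if lo == len(large):
            break
        if large[lo] == x:
            out.append(x)
            lo += 1
    return out

# Hypothetical posting lists (sorted document identifiers).
a = [1, 3, 4, 7, 9, 12]
b = [2, 3, 5, 7, 8, 9, 10, 12, 15]
merged = merge_intersection(a, b)
galloped = galloping_intersection(a, b)
```

In an inverted file, each query term contributes a sorted posting list, so answering a conjunctive query reduces to intersecting those lists.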


LevelDB.jl (key-value database)

A LevelDB wrapper, forked from the LevelDB.jl package. This version has new key and value types, prefix-driven fetches, and uses a newer leveldb through LevelDB_jll (a BinaryBuilder-based binary, also created for this purpose).

Submitted for merging into the original LevelDB.jl.


SnowballStemmer.jl (stemmer)

A wrapper for the libstemmer library, extracted from TextAnalysis. Unmaintained.



Text_models

Simplifies access to lexical information in Arabic, English, Spanish, and Russian, and to traveler mobility information for more than 200 regions of the world. Organized by day and by region, the information is extracted from messages in the public Twitter stream. The idea is to simplify access to this information for research groups that make use of it. This library has been used for event detection (e.g., natural disasters) and for the COVID-19 Mobility Index, and it is also used by other research groups, such as INEGI for its Indicador Oportuno de la Actividad Económica Variable Movilidad Twitter. The project is documented and includes practical usage examples.


BILMA

Bert in Latinamerica. A set of language models (BERT-based, built with Keras) for regional varieties of Spanish.


CC BY-SA 4.0 Eric S. Téllez. Last modified: August 31, 2023. Website built with Franklin.jl and the Julia programming language.