Software packages

  1. Similarity Search - Nearest Neighbor Search
    1. SimilaritySearch.jl
    2. SpatialAccessTrees.jl
    3. InvertedFiles.jl
    4. SlicedSearch.jl: Turbo Scan
    5. TextSearch.jl
    6. NATIX
  2. Text classification
    1. EvoMSA
    2. MicroTC
    3. B4MSA
    4. TextClassification.jl
  3. Others
    1. SearchModels.jl (optimization)
    2. KCenters.jl (k centers and clustering)
    3. SimSearchManifoldLearning.jl (non-linear dimensional reduction)
    4. Intersections.jl (intersection algorithms)
    5. LevelDB.jl (key-value database)
    6. SnowballStemmer.jl (stemmer)
    7. Text_models
    8. BILMA

SimilaritySearch.jl

SimilaritySearch.jl is a library for nearest neighbor search. In particular, it contains the implementation for SearchGraph, a fast and flexible search index using any metric function. It is designed to support multithreading in most of its functions and structures.

The package provides the following indexes:

- ParallelExhaustiveSearch: A brute-force search index where each query is solved using all available threads.
- ExhaustiveSearch: A brute-force search index where each query is solved using a single thread.
- SearchGraph: An approximate search index with parallel construction.

The main functions are:

| function | description |
|---|---|
| search | Solves a single query. |
| searchbatch | Solves a set of queries. |
| allknn | Computes the nearest neighbors for all elements in an index. |
| neardup | Removes near-duplicates from a metric dataset. |
| closestpair | Computes the closest pair in a metric dataset. |
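To make the operations listed above concrete, here is a minimal brute-force sketch in Python. It is not the SimilaritySearch.jl API: the function names only mirror the list above, and the signatures, the Euclidean default, and the return shapes are assumptions chosen for illustration.

```python
import heapq
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance; any metric function could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(db, query, k, dist=euclidean):
    """Exhaustive k-NN search: (distance, index) pairs, closest first."""
    return heapq.nsmallest(k, ((dist(query, obj), i) for i, obj in enumerate(db)))

def searchbatch(db, queries, k, dist=euclidean):
    """Solve a set of queries, one result list per query."""
    return [search(db, q, k, dist) for q in queries]

def allknn(db, k, dist=euclidean):
    """k nearest neighbors of every element, excluding the element itself."""
    result = []
    for i, obj in enumerate(db):
        hits = search(db, obj, k + 1, dist)
        result.append([(d, j) for d, j in hits if j != i][:k])
    return result

def closestpair(db, dist=euclidean):
    """Closest pair as (distance, i, j), by checking every pair."""
    return min((dist(db[i], db[j]), i, j) for i, j in combinations(range(len(db)), 2))
```

Every function here costs a full scan per query (quadratic for `allknn` and `closestpair`), which is the baseline that an approximate index such as SearchGraph is meant to improve on.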


SpatialAccessTrees.jl

Spatial access trees are a family of metric trees having excellent performance on low and medium dimensional datasets.

The package supports strategies that trade accuracy for search time in spatial access trees.


InvertedFiles.jl

This package implements inverted files, also known as inverted indexes: data structures that represent a large sparse matrix, organized specifically to compute certain distance functions and retrieve the k nearest neighbors. They are mainly used for full-text search and other search tasks where data can be expressed as large sparse vectors. In particular, the package implements three types of inverted files:

| type | description |
|---|---|
| WeightedInvertedFile | Inverted file for sparse vectors; it solves nearest neighbor queries using the normalized cosine distance. |
| BinaryInvertedFile | Inverted file for sparse binary data; it solves nearest neighbor queries using the Jaccard, Dice, and cosine distances, as well as the intersection dissimilarity measure. |
| KnrIndex | An approximate similarity search index based on inverted files. It supports general metric spaces. |

These structs integrate with the SimilaritySearch environment, so they can be used as drop-in replacements for other indexes. In particular, inverted files are well known for their scalability when properly configured.
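The idea behind an inverted file over weighted sparse vectors can be sketched in a few lines of Python. This is an illustration of the general technique, not the package's implementation: the class name, the method names, and the dict-based vector format are hypothetical, and vectors are assumed to be non-empty.

```python
import heapq
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (token -> weight) to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

class WeightedInvertedFileSketch:
    """Hypothetical minimal inverted file for cosine distance."""

    def __init__(self):
        self.postings = defaultdict(list)  # token -> [(doc_id, weight), ...]
        self.n = 0

    def append(self, vec):
        """Index one sparse vector and return its identifier."""
        doc_id = self.n
        for token, weight in normalize(vec).items():
            self.postings[token].append((doc_id, weight))
        self.n += 1
        return doc_id

    def search(self, query, k):
        """Return up to k (cosine_distance, doc_id) pairs, closest first.

        Only posting lists of the query's tokens are visited, so documents
        that share no token with the query are never touched.
        """
        scores = defaultdict(float)
        for token, qw in normalize(query).items():
            for doc_id, dw in self.postings.get(token, ()):
                scores[doc_id] += qw * dw  # accumulate the dot product
        return heapq.nsmallest(k, ((1.0 - s, doc_id) for doc_id, s in scores.items()))
```

Because the vectors are normalized at insertion time, the accumulated dot product is the cosine similarity, and the work per query depends on the length of the visited posting lists rather than on the size of the collection.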


SlicedSearch.jl: Turbo Scan

Implements Turbo Scan (TS), an innovative k-nearest neighbor search solution designed for high-dimensional data and for rare workloads where the cost of building an index cannot be amortized over time. We recognize numerous scenarios where the overhead of constructing an index cannot be justified because only a limited number of queries are performed on the dataset.


TextSearch.jl

TextSearch.jl is a package for creating vector representations of text and searching them, mostly independently of the language. It provides the following concepts:

| type | description |
|---|---|
| TextConfig | Defines preprocessing pipelines. |
| Vocabulary | Text vocabularies. |
| VectorModel | Text vectorizers (sparse vectors) using global and local weighting schemes. |
| BM25InvertedFile | A full-text inverted file using the BM25 score. |

It is intended to be used with SimilaritySearch.jl and with the InvertedFiles package. The BM25InvertedFile allows searching without using other packages.
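As an illustration of full-text search with BM25 over an inverted file, the following Python sketch uses a common formulation of the score. The class name, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions for this example; the source does not state which variant BM25InvertedFile uses.

```python
import math
from collections import Counter, defaultdict

class BM25Sketch:
    """Hypothetical minimal BM25 index over pre-tokenized documents."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = [len(d) for d in docs]
        self.avg_len = sum(self.doc_len) / len(docs)
        self.postings = defaultdict(list)  # token -> [(doc_id, term_frequency), ...]
        for doc_id, tokens in enumerate(docs):
            for token, tf in Counter(tokens).items():
                self.postings[token].append((doc_id, tf))

    def idf(self, token):
        """Smoothed inverse document frequency (always positive)."""
        n, df = len(self.doc_len), len(self.postings.get(token, ()))
        return math.log(1.0 + (n - df + 0.5) / (df + 0.5))

    def search(self, query_tokens, k):
        """Return up to k (doc_id, score) pairs, best score first."""
        scores = defaultdict(float)
        for token in set(query_tokens):
            idf = self.idf(token)
            for doc_id, tf in self.postings.get(token, ()):
                # Length normalization penalizes long documents.
                norm = 1.0 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

Tokenization is left to the caller here; in a complete pipeline that step would be governed by a preprocessing configuration and a vocabulary.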


NATIX

The natix library contains several exact and approximate nearest neighbor search algorithms, mostly developed during my PhD project. It also contains several intersection and t-threshold algorithms and compact data structures. The focus of the library was the exploration of compact and compressed indexes for approximate nearest neighbor search.

It contains the canonical implementation of the CNAPP, a compressed inverted file for approximate metric search.


Text classification

EvoMSA

A Multilingual Evolutionary Approach for Sentiment Analysis.


MicroTC

μTC is an automated text categorization framework based on hyperparameter optimization.


B4MSA

A Simple Approach to Multilingual Polarity Classification in Twitter


TextClassification.jl

A Julia package for creating text classifiers based on full model selection, mostly based on MicroTC.


Others

SearchModels.jl (optimization)

Provides a generic tool for minimizing model errors using stochastic search, which is often used when the problem has no notion of a derivative. Such problems rely on extensive exploration of combinatorial spaces guided by an error function.

It is the core of auto-tuning and discrete-domain optimization in other packages.
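A derivative-free stochastic search over a discrete configuration space can be sketched as follows in Python. This is a generic mutation-based hill climber written for illustration only; the function name, the dict-based space, and the acceptance rule are assumptions and do not describe the strategy SearchModels.jl actually implements.

```python
import random

def stochastic_search(space, error, iters=200, seed=0):
    """Minimize `error` over a discrete space given as {name: [values]}.

    Starts from a random configuration and repeatedly mutates one
    parameter, keeping the candidate when it is not worse.
    """
    rng = random.Random(seed)
    best = {name: rng.choice(values) for name, values in space.items()}
    best_err = error(best)
    for _ in range(iters):
        candidate = dict(best)
        name = rng.choice(list(space))            # pick one parameter to mutate
        candidate[name] = rng.choice(space[name])
        err = error(candidate)
        if err <= best_err:                       # no gradient, only comparisons
            best, best_err = candidate, err
    return best, best_err
```

A typical use would pass a configuration space such as `{"k": [1, 3, 5, 7, 9], "weighting": ["tf", "tfidf", "bm25"]}` together with an error function that trains and evaluates a model for a given configuration.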


KCenters.jl (k centers and clustering)

A package that implements several algorithms for the k-centers problem and integrates with SimilaritySearch.
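For readers unfamiliar with the k-centers problem, the classic greedy farthest-first traversal is a useful reference point. The Python sketch below illustrates that textbook algorithm; the source does not say which algorithms KCenters.jl provides, so the function name and signature are hypothetical, and `k` is assumed to be at most the number of points.

```python
import math

def euclidean(a, b):
    """Euclidean distance; any metric function could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_centers(points, k, dist=euclidean, first=0):
    """Greedy k-centers: returns (center_indices, covering_radius).

    Each new center is the point farthest from all centers chosen so far.
    """
    centers = [first]
    # nearest[i] is the distance from point i to its closest center.
    nearest = [dist(p, points[first]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[far]))
    return centers, max(nearest)
```

The returned radius is the largest distance from any point to its closest center, which is the quantity the k-centers problem minimizes.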


SimSearchManifoldLearning.jl (non-linear dimensional reduction)

A package that implements the UMAP algorithm for computing non-linear low-dimensional projections. It uses the SimilaritySearch package to speed up the construction of the k-nearest neighbor (kNN) graph and predictions. It also implements the methods needed to use SimilaritySearch with external manifold learning methods, like those defined in the ManifoldLearning package.


Intersections.jl (intersection algorithms)

Several intersection algorithms that work in the comparison model; they are the basis for InvertedFiles.jl.

Links: https://github.com/sadit/Intersections.jl
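Two well-known comparison-based ways to intersect sorted lists are a linear merge and a galloping (doubling) search. The Python sketch below shows both as an illustration of the problem; it is not taken from Intersections.jl, the function names are hypothetical, and both inputs are assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left

def merge_intersection(a, b):
    """Intersect two sorted lists by advancing two cursors; O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def galloping_intersection(small, large):
    """Intersect by locating each element of `small` in `large`.

    A doubling probe brackets the position, then a binary search finishes
    the lookup; this pays off when `small` is much shorter than `large`.
    """
    out = []
    lo = 0
    for x in small:
        bound = 1
        while lo + bound < len(large) and large[lo + bound] < x:
            bound *= 2
        hi = min(lo + bound + 1, len(large))
        lo = bisect_left(large, x, lo + bound // 2, hi)
        if lo < len(large) and large[lo] == x:
            out.append(x)
    return out
```

In an inverted file, posting lists play the role of these sorted lists, and their lengths are often very different, which is why adaptive strategies such as galloping matter.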

LevelDB.jl (key-value database)

A LevelDB wrapper, forked from the LevelDB.jl package. This version adds new key and value types and prefix-driven fetches, and uses a newer leveldb through LevelDB_jll (a BinaryBuilder-based binary, also created for this purpose).

Submitted for merging into the original LevelDB.jl.


SnowballStemmer.jl (stemmer)

A wrapper for the libstemmer library, extracted from TextAnalysis. Unmaintained.


Text_models

Simplifies access to lexical information in Arabic, English, Spanish, and Russian, and to traveler mobility information for more than 200 regions of the world. The information is organized by day and by region and is extracted from messages in Twitter's public stream. The goal is to make this information easy to access for research groups that use it. The library has been used for event detection (e.g., natural disasters) and for the COVID-19 Mobility Index, and it is also used by other research groups, such as INEGI for its Indicador Oportuno de la Actividad Económica Variable Movilidad Twitter. The project is documented and includes practical usage examples.

BILMA

BERT in Latin America. A set of BERT-based language models (built with Keras) for regional varieties of Spanish.


CC BY-SA 4.0 Eric S. Téllez. Last modified: August 31, 2023. Website built with Franklin.jl and the Julia programming language.