TextSearch API
Base.:*
— Method*(a::DVEC{Ti,Tv}, b::DVEC{Ti,Tv}) where {Ti,Tv<:Real}
*(a::DVEC{K, V}, b::F) where K where {V<:Real} where {F<:Real}
Computes the element-wise product of a and b
Base.:+
— Method+(a::DVEC{Ti,Tv}, b::DVEC{Ti,Tv}) where {Ti,Tv<:Real}
+(a::DVEC, b::Pair)
Computes the sum of a and b
Base.:-
— Method-(a::DVEC{Ti,Tv}, b::DVEC{Ti,Tv}) where {Ti,Tv<:Real}
Subtracts b from a
Base.:/
— Method/(a::DVEC{K, V}, b::F) where K where {V<:Real} where {F<:Real}
Computes the element-wise division of a and b
Base.sum
— MethodBase.sum(col::AbstractVector{<:DVEC})
Computes the sum of the given list of vectors
Base.zero
— Methodzero(::Type{DVEC{Ti,Tv}}) where {Ti,Tv}
Creates an empty DVEC vector
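The arithmetic entries above can be pictured with plain dictionaries. The sketch below assumes that a sparse vector behaves like a Dict from key to weight; the `sketch_*` names are hypothetical helpers written for illustration and are not part of TextSearch.

```julia
# Assumption for this sketch: a sparse vector is a Dict from key to weight.

# In-place sum: keys missing from `a` are treated as zero.
function sketch_add!(a::Dict{K,V}, b::Dict{K,V}) where {K,V<:Real}
    for (k, w) in b
        a[k] = get(a, k, zero(V)) + w
    end
    return a
end

# Non-mutating sum.
sketch_add(a::Dict, b::Dict) = sketch_add!(copy(a), b)

# Element-wise product: only keys present in both vectors survive.
function sketch_mul(a::Dict{K,V}, b::Dict{K,V}) where {K,V<:Real}
    out = Dict{K,V}()
    for (k, w) in a
        haskey(b, k) && (out[k] = w * b[k])
    end
    return out
end

# Scaling by a scalar touches every stored entry.
sketch_scale(a::Dict{K,V}, c::Real) where {K,V<:Real} = Dict(k => w * c for (k, w) in a)

# Sum of a list of vectors, starting from an empty one.
function sketch_sum(vs::AbstractVector{Dict{K,V}}) where {K,V<:Real}
    acc = Dict{K,V}()
    for v in vs
        sketch_add!(acc, v)
    end
    return acc
end

u = Dict(1 => 1.0, 2 => 2.0)
v = Dict(2 => 3.0, 3 => 4.0)
total = sketch_add(u, v)      # union of keys
overlap = sketch_mul(u, v)    # intersection of keys
```

Note that the sum keeps the union of the keys while the product keeps only their intersection, so products of sparse vectors tend to be much smaller than their operands.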
Distances.evaluate
— Methodevaluate(::AngleDistance, a::DVEC, b::DVEC)::Float64
Computes the angle between two DVEC sparse vectors
Distances.evaluate
— Methodevaluate(::CosineDistance, a::DVEC, b::DVEC)::Float64
Computes the cosine distance between two DVEC sparse vectors
Distances.evaluate
— Methodevaluate(::NormalizedAngleDistance, a::DVEC, b::DVEC)::Float64
Computes the angle between two DVEC sparse vectors. It assumes that all bags are normalized (see the normalize! function).
Distances.evaluate
— Methodevaluate(::NormalizedCosineDistance, a::DVEC, b::DVEC)::Float64
Computes the cosine distance between two DVEC sparse vectors. It assumes that the bags are normalized (see the normalize! function).
LinearAlgebra.dot
— Methoddot(a::DVEC, b::DVEC)::Float64
Computes the dot product for two DVEC vectors
LinearAlgebra.norm
— Methodnorm(a::DVEC)
Computes the norm of a DVEC vector
LinearAlgebra.normalize!
— Methodnormalize!(bow::DVEC)
In-place normalization of bow
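The relation between dot, norm, normalize! and the cosine and angle distances can be sketched with plain dictionaries. The code assumes a sparse vector is a Dict from token id to weight and uses the Euclidean norm; every `sketch_*` name is hypothetical and the library's own implementation may differ.

```julia
# Assumption for this sketch: a sparse vector is a Dict from token id to weight.
const SketchVec = Dict{Int,Float64}

# Dot product: only ids present in both vectors contribute.
function sketch_dot(a::SketchVec, b::SketchVec)
    small, large = length(a) <= length(b) ? (a, b) : (b, a)
    s = 0.0
    for (id, w) in small
        s += w * get(large, id, 0.0)
    end
    return s
end

# Euclidean norm.
sketch_norm(a::SketchVec) = sqrt(sketch_dot(a, a))

# In-place scaling to unit norm; an empty or all-zero vector is left untouched.
function sketch_normalize!(a::SketchVec)
    n = sketch_norm(a)
    n > 0 && map!(w -> w / n, values(a))
    return a
end

# General cosine distance: 1 - cos(a, b).
sketch_cosine(a, b) = 1.0 - sketch_dot(a, b) / (sketch_norm(a) * sketch_norm(b))

# If both vectors already have unit norm, the division can be skipped.
sketch_normalized_cosine(a, b) = 1.0 - sketch_dot(a, b)

# Angle in radians; clamp guards against rounding slightly outside [-1, 1].
sketch_normalized_angle(a, b) = acos(clamp(sketch_dot(a, b), -1.0, 1.0))

a = SketchVec(1 => 3.0, 2 => 4.0)
b = SketchVec(2 => 4.0, 3 => 3.0)
d_general = sketch_cosine(a, b)
sketch_normalize!(a); sketch_normalize!(b)
d_normalized = sketch_normalized_cosine(a, b)
```

Normalizing once up front and then using the normalized variants avoids recomputing two norms for every distance evaluation, which matters when the same vectors are compared many times.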
SimilaritySearch.restoreindex
— Methodloadindex(...; staticgraph=false, parent="/")
restoreindex(file, parent::String, index, meta, options::Dict; staticgraph=false)
Loads the inverted index, optionally making the posting lists static or dynamic
SimilaritySearch.search
— Methodsearch(acceptpostinglist::Function, idx::BM25InvertedFile, ctx::InvertedFileContext, qtext::AbstractString, res::KnnResult)
search(idx::BM25InvertedFile, ctx::InvertedFileContext, qtext::AbstractString, res::KnnResult)
Finds candidates for solving query Q using idx. It calls callback on each candidate (docID, dist).
SparseArrays.sparsevec
— Methodsparsevec(vec::DVEC{Ti,Tv}, m=0) where {Ti<:Integer,Tv<:Number}
Creates a sparse vector from a DVEC sparse vector
TextSearch.add!
— Methodadd!(a::DVEC{Ti,Tv}, b::DVEC{Ti,Tv}) where {Ti,Tv<:Real}
add!(a::DVEC{Ti,Tv}, b::AbstractSparseArray) where {Ti,Tv<:Real}
add!(a::DVEC{Ti,Tv}, b::Pair{Ti,Tv}) where {Ti,Tv<:Real}
Updates a to the sum a + b
TextSearch.approxvoc
— Functionapproxvoc(
voc::Vocabulary,
dist::SemiMetric=JaccardDistance();
maxdist::Real = 0.7,
textconfig=TextConfig(qlist=[3]),
doc_min_freq::Integer=1, # any hard vocabulary pruning is expected to be done in `voc`
doc_max_ratio::AbstractFloat=0.4 # popular tokens are likely to be trash
)
Vocabulary lookup that retrieves the nearest token under some set distance (see SimilaritySearch and InvertedFiles) using a character q-gram representation.
TextSearch.bagofwords!
— Methodbagofwords!(bow::BOW, voc::Vocabulary, tokenlist::TokenizedText)
bagofwords!(buff::TextSearchBuffer, voc::Vocabulary, text)
bagofwords(voc::Vocabulary, messages)
Creates a bag of words from the given text (a string or a list of strings). If bow is given, the bag is updated with the text. When config is given, the text is parsed according to it.
TextSearch.bagofwords!
— Methodbagofwords(voc::Vocabulary, messages::AbstractVector)
bagofwords!(buff, voc::Vocabulary, messages::AbstractVector)
Computes a bag of words from messages
TextSearch.bagofwords_corpus
— Methodbagofwords_corpus(voc::Vocabulary, corpus::AbstractVector; minbatch=0)
Computes a list of bags of words from a corpus
TextSearch.centroid
— Methodcentroid(cluster::AbstractVector{<:DVEC})
Computes a centroid of the given list of DVEC vectors
TextSearch.collocations
— Methodcollocations(q, buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Computes a kind of collocations of the given text
TextSearch.dvec
— Methoddvec(x::AbstractSparseVector)
Converts a sparse vector into a DVEC sparse vector
TextSearch.filter_tokens!
— Methodfilter_tokens!(voc::Vocabulary, text::TokenizedText)
Removes tokens from text array
TextSearch.filter_tokens!
— Methodfilter_tokens!(voc::Vocabulary, text::TokenizedText)
Removes tokens from a given tokenized text based on the valid vocabulary
TextSearch.filter_tokens
— Methodfilter_tokens(pred::Function, voc::Vocabulary)
Returns a reduced copy of the vocabulary, obtained by evaluating the pred function on each entry in voc
TextSearch.flush_collocation!
— Methodflush_collocations!(buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Pushes a collocation inside the buffer to the token list; it discards empty strings.
TextSearch.flush_nword!
— Methodflush_nword!(buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Pushes the nword inside the buffer to the token list; it discards empty strings.
TextSearch.flush_qgram!
— Methodflush_qgram!(buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Pushes the qgram inside the buffer to the token list; it discards empty strings.
TextSearch.flush_skipgram!
— Methodflush_skipgram!(buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Pushes the skipgram inside the buffer to the token list; it discards empty strings.
TextSearch.flush_unigram!
— Methodflush_unigram!(buff::TextSearchBuffer, tt::AbstractTokenTransformation)
Pushes the word inside the buffer to the token list; it discards empty strings.
TextSearch.merge_voc
— Methodmerge_voc(voc1::Vocabulary, voc2::Vocabulary[, ...])
merge_voc(pred::Function, voc1::Vocabulary, voc2::Vocabulary[, ...])
Merges two or more vocabularies into a new one. A predicate function can be used to filter token entries.
Note: All vocabularies should have been created with a compatible TextConfig to be able to work on them.
TextSearch.normalize_text
— Methodnormalize_text(config::TextConfig, text::AbstractString, output::Vector{Char})
Normalizes a given text using the specified transformations of config
TextSearch.nwords
— Methodnwords(q::Integer, buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
TextSearch.qgrams
— Methodqgrams(q::Integer, buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Computes character q-grams for the given input
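As an illustration of what a character q-gram is, the following minimal sketch slides a window of q characters over a string. The function `sketch_char_qgrams` is hypothetical; the library's qgrams works on a TextSearchBuffer and may mark or transform tokens in ways not shown here.

```julia
# Every window of `q` consecutive characters becomes one token.
function sketch_char_qgrams(text::AbstractString, q::Integer)
    q >= 1 || throw(ArgumentError("q must be positive"))
    chars = collect(text)                 # Vector{Char}, safe for non-ASCII text
    n = length(chars)
    n < q && return String[]              # text shorter than the window
    return [String(chars[i:i+q-1]) for i in 1:n-q+1]
end

grams = sketch_char_qgrams("hello", 3)
```

Working on a Vector{Char} rather than on byte indices keeps the windows aligned with characters when the text contains multi-byte symbols.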
TextSearch.skipgrams
— Methodskipgrams(q::Skipgram, buff::TextSearchBuffer, tt::AbstractTokenTransformation, mark_token_type)
Tokenizes using skipgrams
TextSearch.sparse_coo
— Methodsparse(cols::AbstractVector{S}, m=0; minweight=1e-9) where S<:DVEC{Ti,Tv} where {Ti<:Integer,Tv<:Number}
sparse_coo(cols::AbstractVector{S}, minweight=1e-9) where S<:DVEC{Ti,Tv} where {Ti<:Integer,Tv<:Number}
Creates a sparse matrix from an array of DVEC sparse vectors.
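One way to picture this conversion is through coordinate (COO) triplets passed to the standard SparseArrays constructor. The sketch assumes that each Dict-like vector becomes one column, that keys are row indices, and that m=0 means the number of rows is inferred; these are assumptions, and `sketch_sparse` is a hypothetical helper.

```julia
using SparseArrays

# Assumption: each Dict is one column; keys are row indices, values are weights.
function sketch_sparse(cols::AbstractVector{Dict{Int,Float64}}, m::Integer=0)
    rows = Int[]; colidx = Int[]; vals = Float64[]
    for (j, col) in enumerate(cols)
        for (i, w) in col
            push!(rows, i); push!(colidx, j); push!(vals, w)
        end
    end
    # With m == 0 the number of rows is taken from the largest key seen.
    nrows = m == 0 ? maximum(rows; init=0) : m
    return sparse(rows, colidx, vals, nrows, length(cols))
end

M = sketch_sparse([Dict(1 => 1.0, 3 => 2.0), Dict(2 => 5.0)])
```

Passing an explicit number of rows is useful when several matrices must share the same dimensions, for example a training and a test set built from one vocabulary.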
TextSearch.tokenize
— Methodtokenize(textconfig::TextConfig, text)
tokenize(copy_::Function, textconfig::TextConfig, text)
tokenize(textconfig::TextConfig, text, buff)
tokenize(copy_::Function, textconfig::TextConfig, text, buff)
Tokenizes text using the given configuration. The tokenize function makes heavy use of buffers, and when these buffers are shared it is mandatory to create a copy of the result (buff.tokens).
Change the default copy function to perform additional filtering of the tokens. You can also pass the identity function to avoid copying.
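The reason a copy is needed when buffers are shared can be shown with a toy tokenizer that reuses a single vector. `sketch_tokenize!` is a hypothetical stand-in that splits on whitespace; it only mirrors the role of the copy_ argument, not the library's tokenization.

```julia
# Hypothetical stand-in for a tokenizer that reuses one buffer across calls.
function sketch_tokenize!(copy_::Function, buff::Vector{String}, text::AbstractString)
    empty!(buff)                 # the previous result is overwritten here
    append!(buff, split(text))   # whitespace tokenization, for illustration only
    return copy_(buff)
end

buff = String[]

# With `identity` the caller receives the buffer itself, so a later call clobbers it.
first_alias = sketch_tokenize!(identity, buff, "to be")
second_alias = sketch_tokenize!(identity, buff, "or not to be")

# With `copy` each result is independent of the buffer.
first_copy = sketch_tokenize!(copy, buff, "to be")
second_copy = sketch_tokenize!(copy, buff, "or not to be")

# A custom function can copy and filter in one step.
no_short = sketch_tokenize!(toks -> filter(t -> length(t) > 2, toks), buff, "or not to be")
```

Passing identity is only safe when the result is consumed before the next call on the same buffer.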
TextSearch.tokenize_and_append!
— Methodtokenize_and_append!(voc::Vocabulary, corpus; minbatch=0)
Parses each document in the given corpus and appends each token to the vocabulary.
TextSearch.tokenize_corpus
— Methodtokenize_corpus(textconfig::TextConfig, arr; minbatch=0, verbose=true)
tokenize_corpus(copy_::Function, textconfig::TextConfig, arr; minbatch=0, verbose=true)
Tokenizes a list of texts. The copy_ function is passed to tokenize as its first argument.
TextSearch.transform_collocation
— Methodtransform_collocation(::AbstractTokenTransformation, tok)
Hook applied in the tokenization stage to change the input token tok if needed. Return nothing to ignore the tok occurrence (e.g., stop words).
TextSearch.transform_nword
— Methodtransform_nword(::AbstractTokenTransformation, tok)
Hook applied in the tokenization stage to change the input token tok if needed. For instance, it can be used to apply stemming or any other kind of normalization. Return nothing to ignore the tok occurrence (e.g., stop words).
TextSearch.transform_qgram
— Methodtransform_qgram(::AbstractTokenTransformation, tok)
Hook applied in the tokenization stage to change the input token tok if needed. For instance, it can be used to apply stemming or any other kind of normalization. Return nothing to ignore the tok occurrence (e.g., stop words).
TextSearch.transform_skipgram
— Methodtransform_skipgram(::AbstractTokenTransformation, tok)
Hook applied in the tokenization stage to change the input token tok if needed. For instance, it can be used to apply stemming or any other kind of normalization. Return nothing to ignore the tok occurrence (e.g., stop words).
TextSearch.transform_unigram
— Methodtransform_unigram(::AbstractTokenTransformation, tok)
Hook applied in the tokenization stage to change the input token tok if needed. For instance, it can be used to apply stemming or any other kind of normalization. Return nothing to ignore the tok occurrence (e.g., stop words).
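The hook pattern shared by the transform_* functions can be imitated with ordinary multiple dispatch. The sketch below defines its own hypothetical types and functions (`SketchTransformation`, `SketchDropStopWords`, `sketch_transform`, `sketch_apply_hook`) so that it runs without the package; with TextSearch one would presumably subtype AbstractTokenTransformation and add methods to the documented hooks instead.

```julia
# Hypothetical mirror of the hook mechanism, independent of TextSearch.
abstract type SketchTransformation end

struct SketchDropStopWords <: SketchTransformation
    stopwords::Set{String}
end

# Default hook: keep the token unchanged.
sketch_transform(::SketchTransformation, tok) = tok

# Specialized hook: returning `nothing` tells the tokenizer to skip the token.
sketch_transform(tt::SketchDropStopWords, tok) = tok in tt.stopwords ? nothing : tok

# A tokenizer calls the hook on every candidate token and keeps what survives.
function sketch_apply_hook(tt::SketchTransformation, tokens)
    out = String[]
    for tok in tokens
        t = sketch_transform(tt, tok)
        t === nothing || push!(out, t)
    end
    return out
end

kept = sketch_apply_hook(SketchDropStopWords(Set(["the", "a"])), ["the", "cat", "a", "hat"])
```

Because the hook returns either a token or nothing, the same mechanism covers both rewriting (stemming, normalization) and filtering.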
TextSearch.unigrams
— Methodunigrams(buff::TextSearchBuffer, tt::AbstractTokenTransformation)
Performs the word tokenization
TextSearch.update_voc!
— Methodupdate_voc!(voc::Vocabulary, another::Vocabulary)
update_voc!(pred::Function, voc::Vocabulary, another::Vocabulary)
Updates the voc vocabulary using another vocabulary. Optionally, a predicate can be given to filter token entries.
Note 1: corpuslen remains unchanged (the structure is immutable and a new Vocabulary should be created to update this field).
Note 2: Both voc and another should have been created with a compatible TextConfig to be able to work on them.
TextSearch.vectorize!
— Methodvectorize!(buff::TextSearchBuffer, model::VectorModel{G_,L_}, bow::BOW; normalize=true, minweight=1e-9) where {G_,L_}
Computes a weighted vector using the given bag of words and the specified weighting scheme.
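To make the roles of normalize and minweight concrete, here is a minimal sketch of weighting a bag of words. It assumes a term-frequency local weight and a textbook log(N/df) inverse document frequency as the global weight; the formulas actually used by the weighting types may differ, and `sketch_vectorize` with its arguments is hypothetical.

```julia
# bow: token id => count in the document; docfreq: token id => number of documents
# containing the token; ndocs: corpus size. All three are assumptions of the sketch.
function sketch_vectorize(bow::Dict{Int,Int}, docfreq::Dict{Int,Int}, ndocs::Integer;
                          normalize::Bool=true, minweight::Float64=1e-9)
    total = sum(values(bow); init=0)
    vec = Dict{Int,Float64}()
    for (tok, count) in bow
        df = get(docfreq, tok, 0)
        df == 0 && continue                       # out-of-vocabulary token
        w = (count / total) * log(ndocs / df)     # local weight times global weight
        w >= minweight && (vec[tok] = w)          # drop negligible weights
    end
    if normalize && !isempty(vec)
        n = sqrt(sum(abs2, values(vec)))
        map!(w -> w / n, values(vec))             # scale to unit Euclidean norm
    end
    return vec
end

bow = Dict(1 => 2, 2 => 1, 3 => 1)
docfreq = Dict(1 => 10, 2 => 100)      # token 3 is unknown, token 2 is in every document
unit = sketch_vectorize(bow, docfreq, 100)
raw = sketch_vectorize(bow, docfreq, 100; normalize=false)
```

In this example the token that appears in every document receives weight zero and is pruned by minweight, while the unknown token is skipped, so a single entry remains.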
TextSearch.vocab_from_small_collection
— MethodVocabulary(textconfig, corpus; minbatch=0)
Computes a vocabulary from a corpus using the TextConfig textconfig.
TextSearch.BM25InvertedFile
— TypeBM25InvertedFile(textconfig, corpus, db=nothing)
Fits the vocabulary and the BM25 score, and also creates the associated inverted file structure. NOTE: The corpus is not indexed, since a relatively small sample of documents is expected here, followed by an indexing stage on a larger corpus.
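For readers unfamiliar with BM25, the contribution of a single query term can be sketched with the textbook Okapi formula. The parameter names and the defaults k1=1.2 and b=0.75 are common choices taken as assumptions; the variant and parameters used by BM25InvertedFile are not documented here and may differ.

```julia
# Textbook Okapi BM25 score of one query term for one document.
# tf: term count in the document, df: documents containing the term,
# ndocs: corpus size, doclen / avglen: document length and average length.
function sketch_bm25_term(tf::Real, df::Real, ndocs::Real, doclen::Real, avglen::Real;
                          k1::Float64=1.2, b::Float64=0.75)
    idf = log(1 + (ndocs - df + 0.5) / (df + 0.5))   # rare terms weigh more
    lennorm = 1 - b + b * doclen / avglen            # penalty for long documents
    return idf * tf * (k1 + 1) / (tf + k1 * lennorm)
end

average_doc = sketch_bm25_term(3, 2, 10, 100, 100)
long_doc = sketch_bm25_term(3, 2, 10, 400, 100)
```

The score of a document for a full query is the sum of these contributions over the query terms it contains, which is why the statistics have to be fitted on a sample before indexing.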
TextSearch.BM25InvertedFile
— Typestruct BM25InvertedFile <: AbstractInvertedFile
Parameters
TextSearch.BinaryGlobalWeighting
— TypeBinaryGlobalWeighting()
The weight is 1 for known tokens, 0 for out of vocabulary tokens
TextSearch.BinaryLocalWeighting
— TypeBinaryLocalWeighting()
The weight is 1 for known tokens, 0 for out of vocabulary tokens
TextSearch.EntropyWeighting
— TypeEntropyWeighting(; smooth=0.0, lowerweight=0.0, weights=:balance)
Entropy weighting uses the empirical entropy of the vocabulary along classes to produce a notion of importance for each token
TextSearch.FreqWeighting
— TypeFreqWeighting()
Frequency weighting
TextSearch.GlobalWeighting
— TypeGlobalWeighting
Abstract type for global weighting
TextSearch.IdfWeighting
— TypeIdfWeighting()
Inverse document frequency weighting
TextSearch.LocalWeighting
— TypeLocalWeighting
Abstract type for local weighting
TextSearch.Skipgram
— TypeSkipgram(qsize, skip)
A skipgram is a kind of tokenization where qsize words having skip separation are used as a single token.
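A small sketch may help to visualize the definition. It assumes that skip counts the words left out between consecutive picked words and that the picked words are joined with a space; `sketch_skipgrams` is a hypothetical helper, and the library's token format may differ.

```julia
# Picks `qsize` words, leaving `skip` words out between consecutive picks.
function sketch_skipgrams(words::AbstractVector{<:AbstractString}, qsize::Integer, skip::Integer)
    step = skip + 1
    span = (qsize - 1) * step            # distance from the first to the last picked word
    out = String[]
    for i in 1:length(words) - span
        push!(out, join(words[i:step:i+span], ' '))
    end
    return out
end

sg = sketch_skipgrams(["a", "b", "c", "d", "e"], 2, 1)
```

With skip equal to zero the sketch degenerates to ordinary word n-grams of size qsize.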
TextSearch.TextConfig
— TypeTextConfig(;
del_diac::Bool=true,
del_dup::Bool=false,
del_punc::Bool=false,
group_num::Bool=true,
group_url::Bool=true,
group_usr::Bool=false,
group_emo::Bool=false,
lc::Bool=true,
collocations::Int8=0,
qlist::Vector=Int8[],
nlist::Vector=Int8[],
slist::Vector{Skipgram}=Skipgram[],
mark_token_type::Bool = true,
tt=IdentityTokenTransformation()
)
Defines a preprocessing and tokenization pipeline
- del_diac: indicates if diacritic symbols should be removed
- del_dup: indicates if duplicate contiguous symbols must be replaced with a single symbol
- del_punc: indicates if punctuation symbols must be removed
- group_num: indicates if numbers should be grouped as _num
- group_url: indicates if urls should be grouped as _url
- group_usr: indicates if users (@usr) should be grouped as _usr
- group_emo: indicates if emojis should be grouped as _emo
- lc: indicates if the text should be normalized to lower case
- collocations: window to expand collocations as tokens; please take into account that:
  - 0 => disables collocations
  - 1 => would compute single words (ignored in favor of the usual unigrams)
  - 2 => computes bigrams (not recommended, although not disabled)
  - 3 or more => typical values
- qlist: a list of character q-grams to use
- nlist: a list of word n-grams to use
- slist: a list of skip-gram tokenizers to use
- mark_token_type: when true, each token is marked with its type (qgram, skipgram, nword)
- tt: an AbstractTokenTransformation struct
Note: If qlist, nlist, and slist are all empty arrays, then it defaults to nlist=[1]
TextSearch.TextModel
— TypeModel
An abstract type that represents a weighting model
TextSearch.TfWeighting
— TypeTfWeighting()
Term frequency weighting
TextSearch.TpWeighting
— TypeTpWeighting()
Term probability weighting
TextSearch.VectorModel
— MethodVectorModel(ent::EntropyWeighting, lw::LocalWeighting, corpus::BOW, labels;
mindocs::Integer=1,
smooth::Float64=0.0,
weights=:balance,
comb::CombineWeighting=NormalizedEntropy(),
)
Creates a vector model using the input corpus.
TextSearch.Vocabulary
— MethodVocabulary(textconfig::TextConfig, n::Integer)
Creates a Vocabulary struct