Proximity searches
These kinds of proximity indices allow Cozo to perform fast searches for similar data. The HNSW index is a graph-based index that allows for fast approximate nearest neighbor searches. The MinHash-LSH index is a locality sensitive hash index that allows for fast approximate nearest neighbor searches. The FTS index is a full-text search index that allows for fast string matches.
HNSW (Hierarchical Navigable Small World) indices for vectors
Cozo supports vector proximity search using the HNSW (Hierarchical Navigable Small World) algorithm.
To use vector search, you first need to have a stored relation with vectors inside, for example:
:create table {k: String => v: <F32; 128>}Next you create a HNSW index on a table containing vectors. You use the following system operator to create the index:
::hnsw create table:index_name {
dim: 128,
m: 50,
dtype: F32,
fields: [v],
distance: L2,
ef_construction: 20,
filter: k != 'foo',
extend_candidates: false,
keep_pruned_connections: false,
}The parameters are as follows:
- The dimension
dimand the data typedtype(defaults to F32) has to match the dimensions of any vector you index. - The
fieldsparameter is a list of fields in the table that should be indexed. - The indexed fields must only contain vectors of the same dimension and data type, or
null, or a list of vectors of the same dimension and data type. - The
distanceparameter is the distance metric to use: the options areL2(default),CosineandIP. - The
mcontrols the maximal number of outgoing connections from each node in the graph. - The
ef_constructionparameter is the number of nearest neighbors to use when building the index: see the HNSW paper for details. - The
filterparameter, when given, is bound to the fields of the original relation and only those rows for which the expression evaluates totrueare indexed. - The
extend_candidatesparameter is a boolean (default false) that controls whether the index should extend the candidate list with the nearest neighbors of the nearest neighbors. - The
keep_pruned_connectionsparameter is a boolean (default false) that controls whether the index should keep pruned connections.
You can insert data as normally done into table. For vectors, use a list of
numbers and it will be verified to have the correct dimension and converted. If
you want to be more explicit, you can use the vec function.
After the index is created, you can use vector search inside normal queries in a similar manner to stored relations. For example:
?[dist, k, v] := ~table:index_name{ k, v |
query: q,
k: 2,
ef: 20,
bind_distance: dist,
bind_vector: bv,
bind_field: f,
bind_field_idx: fi,
filter: 1 + 1 == 2,
radius: 0.1
}, q = vec([200, 34])The ~ followed by the index name signifies a vector search. In the braces,
arguments before the vertical line are named bindings, with exactly the same
semantics as in normal stored relations with named fields (i.e. they may be
bound, or if they are unbound, the introduce fresh variables), and arguments
after the vertical line are query parameters.
There are three required parameters: query is an expression that evaluates to
a query vector of the expected type, and if it evaluates to a variable, the
variable must be bound inside the rule; k controls how many results to return,
and ef controls the number of neighbours to consider during the search
process.
Next, there are three bind parameters that can bind variables to data that are
only available in index or during the search process: distance binds the
distance between the query vector and the result vector; vector binds the
result vector; and field binds the field name of the result vector. The
field_idx parameter binds the index of the field in the fields list of the
index in case field resolves to a list of vectors, otherwise it is null. In
case any of the bind parameters are bound to existing variables, they act as
filters after k results are returned.
The parameter filter takes an expression that can only refer to the fields of
the original relation, and only those rows for which the expression evaluates to
true are returned, and this filtering results occurs during the search
process, so the algorithm will strive to return k results even if it must
filter out a larger number of rows. radius controls the largest distance any
return vector can have from the query vector, and this filtering process also
happens during the search.
The vector search can be used in any place where a stored relation may be used, even inside recursive rules (but be careful of non-termination).
As with normal indices, you can use the index relation as a read-only but otherwise normal relation in your query. You query the index directly by:
?[fr_k, to_k, dist] := *table:index_name {layer: 0, fr_k, to_k, dist}It is recommended to always specify layer, otherwise a full scan is required.
The schema for the above index is the following:
{
layer: Int,
fr_k: String?,
fr__field: Int?,
fr__field_idx: Int?,
to_k: String?,
to__field: Int?,
to__field_idx: Int?,
=>
dist: Float,
hash: Bytes,
ignore_link: Bool,
}Layer is the layer in the HNSW hierarchy of graphs, with 0 the most detailed
layer, -1 the layer more abstract than 0, -2 the even more abstract layer,
etc. There is also a special layer 1 containing at most one row with all other
keys set to null.
The fr_* and to_* fields mirror the indices of the indexed relation, and the
fr__* and to__* fields indicate which vectors inside the original rows this
edge connects.
dist is the distance between the two vectors when the row represents a link
between two different vectors, otherwise the link is a self-loop and dist
contains the degree of the node; hash is the hash of the vector, and
ignore_link is a boolean that indicates whether this link should be ignored
during the search process. The graph is guaranteed to be symmetric, but the
incoming and outgoing links may have different ignore_link values, and they
cannot both be true.
Walking the index graph at layer 0 amounts to probabilistically visiting "near" neigbours. More abstract layers are renormalized versions of the proximity graph and are harder to work with but are even more interesting theoretically.
To drop an HNSW index:
::hnsw drop table:index_nameMinHash-LSH for near-duplicate indexing of strings and lists
To use locality sensitive search on a relation containing string values, for example:
:create table {k: String => v: String?}You can create a MinHash-LSH index on the v field by:
::lsh create table:index_name {
extractor: v,
extract_filter: !is_null(v),
tokenizer: Simple,
filters: [],
n_perm: 200,
target_threshold: 0.7,
n_gram: 3,
false_positive_weight: 1.0,
false_negative_weight: 1.0,
}This creates a MinHash-LSH index on the v field of the table. The index
configuration includes the following parameters:
extractor: vspecifies that thevfield will be used as the feature extractor. This parameter takes an expression, which must evaluate to a string, a list of values to be indexed, ornull. If it evaluates tonull, then the row is not indexed.extract_filter: !is_null(v): this is superfluous in this case, but in more general situations you can use this to skip indexing rows based on arbitary logic.tokenizer: Simpleandfilters: []specifies the tokenizer to be used, see a later section for tokenizer.n_perm: 200sets the number of permutations for the MinHash LSH index. Higher values will result in more accurate results at the cost of increased CPU and storage usage.target_threshold: 0.7sets the target threshold for similarity comparisons when searching.n_gram: 3sets the size of the n-gram used for shingling.false_positive_weight: 1.0andfalse_negative_weight: 1.0set the weights for false positives and false negatives.
At search time:
?[k, v] := ~table:index_name {k, v |
query: $q,
k: 2,
filter: 1 + 1 == 2,
}This will look for the top 2 most similar values to the query q. The filter
parameter is evaluated on the bindings for the relation, and only those rows for
which the filter evaluates to true are returned, before restricting to k
results. The query parameter is a string, and will be subject to the same
tokenization process.
In addition to strings, you can index and search for list of arbitrary values.
In this case, the tokenizer, filters and n_gram parameters are ignored.
Again you can use the associated index relation as a normal relations in your
query. There are two now: table:index_name and table:index_name:inv. You can
use ::columns to look at their structure. In our case, the first is:
{
hash: Bytes,
src_k: String,
}and the second is:
{
k: String => minhash: List[Bytes]
}The first it more useful: it loosely groups together duplicates according to the indexing parameters.
To drop:
::lsh drop table:index_nameFull-text search (FTS)
Full-text search should be familiar. For the following relation:
:create table {k: String => v: String?}we can create an FTS index by:
::fts create table:index_name {
extractor: v,
extract_filter: !is_null(v),
tokenizer: Simple,
filters: [],
}This creates an FTS index on the v field of the table. The index configuration
includes the following parameters:
extractor: vspecifies that thevfield will be used as the feature extractor. This parameter takes an expression, which must evaluate to a string ornull. If it evaluates tonull, then the row is not indexed.extract_filter: !is_null(v): this is superfluous in this case, but in more general situations you can use this to skip indexing rows based on arbitary logic.tokenizer: Simpleandfilters: []specifies the tokenizer to be used, see a later section for tokenizer.
That's it. At query time:
?[s, k, v] := ~table:index_name {k, v |
query: $q,
k: 10,
filter: 1 + 1 == 2,
score_kind: 'bm25',
bind_score: s
}
:order -sThis query retrieves the top 10 results from the index index_name based on a
search query $q. The filter parameter can be used to filter the results
further based on additional conditions. The score_kind parameter specifies the
scoring method. The resulting scores are bound to the variable s. Finally, the
results are ordered in descending order of score (-s).
mnestic
mnestic 0.8.3 changed the default score_kind from 'tf_idf' to Okapi
'bm25' — a behavior change. The three options:
'bm25'(default) — term-frequency saturation (k1, default1.2) and document-length normalization (b, default0.75), both tunable as query params;OR-disjunction sums per-term contributions.avgdlis an O(1) durable counter, not a per-query scan.'tf_idf'— rawtf · idfwith no length normalization;ORtakes the max per-term score. Byte-identical to upstream CozoDB.'tf'— term frequency only.
Pass score_kind: 'tf_idf' (or 'tf') to keep the exact upstream scoring.
# BM25 with explicit tuning (k1 saturation, b length-normalization)
?[s, k, v] := ~table:index_name {k, v |
query: $q, k: 10, score_kind: 'bm25', k1: 1.2, b: 0.75, bind_score: s
}
:order -sThe search query must be a string and is processed by the same tokenizer as the
index. The tokenizer is specified by the tokenizer parameter, and the
filters parameter can be used to specify additional filters to be applied to
the tokens. There is a mini-language for parsing the query:
hello world,hello AND world,"hello" AND 'world': these all look for rows where both words occur.ANDis case sensitive.hello OR world: look for rows where either word occurs.hello NOT world: look for rows wherehellooccurs butworlddoes not.hell* wor*: look for rows having a word starting withhelland also a word starting withwor.NEAR/3(hello world bye): look for rows wherehello,world,byeare within 3 words of each other. You can writeNEAR(hello world bye)as a shorthand forNEAR/10(hello world bye).hello^2 OR world: look for rows wherehelloorworldoccurs, buthellohas twice of its usual weighting when scoring.- These can be combined and nested with parentheses (except that
NEARonly takes literals and prefixes):hello AND (world OR bye).
The index relation has the following schema:
{
word: String,
src_k: String,
=>
offset_from: List[Int],
offset_to: List[Int],
position: List[Int],
total_length: Int,
}Explanation of the fields:
word: the word that occurs in the document.src_k: the key of the document, the name and number varies according to the original relation schema.offset_from: the starting offsets of the word in the document.offset_to: the ending offsets of the word in the document.position: the position of the word in the document, counted as the position of entire tokens.total_length: the total number of tokens in the document.
To drop:
::fts drop table:index_nameText tokenization and filtering
Text tokenization and filtering are used in both the MinHash-LSH and FTS
indexes. The tokenizer is specified by the tokenizer parameter, and the
filters parameter can be used to specify additional filters to be applied to
the tokens.
CozoDB uses Tantivy's tokenizers and
filters (we incorporated their files directly in our source tree, as they are
not available as a library). Tokenizer is specified in the configuration as a
function call such as Ngram(9), or if you omit all arguments, Ngram is also
acceptable. The following tokenizers are available:
Raw: no tokenization, the entire string is treated as a single token.Simple: splits on whitespace and punctuation.Whitespace: splits on whitespace.Ngram(min_gram?, max_gram?, prefix_only?): splits into n-grams.min_gramis the minimum size of the n-gram (default 1),max_gramis the maximum size of the n-gram (default tomin_gram), andprefix_onlyis a boolean indicating whether to only generate prefixes of the n-grams (default false).Cangjie(kind?): this is a text segmenter for the Chinese language.kindcan be'default','all','search'or'unicode'.
After tokenization, multiple filters can be applied to the tokens. The following filters are available:
Lowercase: converts all tokens to lowercase.AlphaNumOnly: removes all tokens that are not alphanumeric.AsciiFolding: converts all tokens to ASCII (lossy), i.e.passégoes topasse.Stemmer(lang): use a language-specific stemmer. The following languages are available:'arabic','danish','dutch','english','finnish','french','german','greek','hungarian','italian','norwegian','portuguese','romanian','russian','spanish','swedish','tamil','turkish'. As an exmple, the English stemmer converts'running'to'run'.Stopwords(lang): filter out stopwords specific to the language. The stopwords come from the stopwords-iso project. Use the ISO 639-1 Code as specified on the project page. For example,Stopwords('en')for English will remove words such as'the','a','an', etc.
For English text, the recommended setup is Simple for the tokenizer and
[Lowercase, Stemmer('english'), Stopwords('en')] for the filters.
Adapted from the CozoDB documentation by Ziyang Hu and the Cozo Project Authors, used under CC‑BY‑SA‑4.0. Adaptations for mnestic are released under the same license. mnestic is an independent fork and is not affiliated with or endorsed by the original authors.