Tutorial
This tutorial will teach you the basics of using the database.
Note
This tutorial is adapted from the original CozoDB manual. The query language and semantics are inherited unchanged, so everything below applies equally to mnestic, the maintained fork. We keep the names "CozoDB"/"Cozo" where the text describes inherited behavior.
There are several ways you can run the queries in this tutorial:
- You can run the examples in your browser without installing anything: open the Cozo in Wasm demo and you are ready to go.
- You can download the appropriate
cozo-*executable for your operating system from the release page, uncompress it, rename it tocozo, and run the REPL by runningcozo replin a terminal. - If you are familiar with the Python data-science stack, you can run the examples in Jupyter Lab.
- There are many other ways, but the above ones are the easiest.
Introduction
Datalog is pattern matching for your data.
This sort of thing (Ruby):
config = {db: {user: 'admin', password: 'abc123'}}
case config
in db: {user:} # matches subhash and puts matched value in variable user
puts "Connect with user '#{user}'"
in connection: {username: }
puts "Connect with user '#{username}'"
else
puts "Unrecognized structure of config"
end
# Prints: "Connect with user 'admin'"Broadly speaking, you provide some values and variables that constitute a pattern for pulling apart some data structure. The matching simultaneously chooses an execution path and gets values of the variables that match out of the data structure.
First steps
Cozo is a relational database. Relations are similar to tables in SQL. The difference is that you can't have duplicate rows in a relation.
You can define and then retrieve all the rows from a relation like this:
?[] <- [['hello', 'world', 'Cozo!']]This is a rule. The <- separates the head of the rule on the left from the
body on the right. The [] <- acts like SELECT * in SQL. The above query in
SQL would look something like:
SELECT * FROM (VALUES('hello', 'world', 'sql')) as unnecessary_alias;(a hint here of Datalog's freedom from the endless ceremony that plagues SQL)
You can have multiple rows:
?[] <- [[1, 2, 3], ['a', 'b', 'c']]You can have the usual sorts of types:
?[] <- [[1.5, 2.5, 3, 4, 5.5],
['aA', 'bB', 'cC', 'dD', 'eE'],
[true, false, null, -1.4e-2, "A string with double quotes"]]The input literal representations are similar to those in JavaScript. In
particular, strings in double quotes are guaranteed to be interpreted in the
same way as in JSON. The output is JSON, but how it appears on your screen
depends on your setup. For example, with a Python setup, booleans are displayed
as True and False instead of in lowercase, since they are converted to the
native Python datatypes for display.
In the last example, you may have noticed that the returned order is not the same as the input order. This is because relations are always stored as trees, and trees are always sorted.
Another consequence of relations as trees is that you can have no duplicate rows:
?[] <- [[1], [2], [1], [2], [1]]Expressions
The next example shows the use of various functions, expressions and comments:
?[] <- [[
1 + 2, # addition
3 / 4, # division
5 == 6, # equality
7 > 8, # greater
true || false, # or
false && true, # and
lowercase('HELLO'), # function
rand_float(), # function taking no argument
union([1, 2, 3], [3, 4, 5], [5, 6, 7]), # variadic function
]]Cozo has a pretty robust set of built-in functions and operators.
Find values for these variables
More complex queries proceed by providing some expressions with values and variables, much like pattern matching in your favourite language.
Cozo tries to find all the sets of rows that match the given values, including values for the variables that also match across all the expressions. Some examples will make this clearer.
?[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]Here, we've just provided named variables for the values coming back.
Note that you have to do obvious things, like not having different numbers of columns within one relation. Typing can be strict or loose as you choose (so far, it's all been loose; we'll see examples of strict typing shortly).
For <- (called constant) rules, the number of bindings must match the actual
data (the arity), otherwise you will get an error:
?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]Now let's define rules that apply (use) other rules:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[a, b, c] := rule[a, b, c]You'll notice the two different connectors. The first, <-, is actually
syntactic sugar for <~, which is used for invoking functions on the right-hand
side (as well as Cozo's many built-in functions, you can write your own in the
host language). <- is invoking a Constant function that looks like
rule[first, second, third] <~ Constant(data: [[1, 2, 3], ['a', 'b', 'c']]).
<- and <~ rules are called constant rules.
The second rule is an inline rule, meaning it's not invoking anything external
to the database; it's just matching data. Note that because it is named ?, it
is the result of the query.
Note how in the second line, we match a, b, c to the first, second,
third in the first rule.
We don't have to use all the variables from the body in an inline rule:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule[a, b, c]Let's get a bit fancier:
?[c, b] := rule[a, b, c], is_num(a)
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]Here, the first rule had multiple bodies, separated by commas. These are
effectively joined by and. We see here how the variable a matches with a
value from the database, but also has to return true when passed to the
is_num function.
You can also bind constants to rule applications directly:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule['a', b, c]The pattern in the head on the second line matches only rows with an 'a' for
the first field value.
You can have multiple rules with the same head name, which is like joining
their bodies with OR:
important[a] := recent[a], unread[a], not deleted[a]
important[a] := flagged[a], not deleted[a]This means a is important if it isn't deleted, and it's either flagged, or
recent and unread. You can also use and, or, not and parentheses, although
the style above is more idiomatic.
We can just directly set a value as part of the query, using =:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b, d] := rule[a, b, c], is_num(a), d = a + b + 2*cThe unification = unifies with a single value. Use in to unify with each
value within a list in turn:
?[x, y] := x in [1, 2, 3], y in ['x', 'y']Having multiple rule applications in the body generates every combination of the bindings:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]
?[l1, l2] := r1[a, l1],
r2[b, l2]This is a cross join in SQL.
We've now seen enough of the basics to start looking at some more complex operations.
Joins, made easy
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]
?[l1, l2] := r1[a, l1],
r2[a, l2] # reused `a`A join is as simple as using the same variable in different relations. Here, we
are asking for a single value of a that matches a row in each of the sources at
the same time.
Compare the above with the SQL for the same operation:
WITH
r1 AS SELECT * FROM (VALUES (1, 'a'), (2, 'b')) as t(c1, c2),
r2 AS SELECT * FROM (VALUES(2, 'B'), (3, 'C')) as u(d1, d2)
SELECT
r1.c2, r2.d2 FROM r1 JOIN r2 ON (r1.c1 = r2.d1);An outer join is done by considering the cases separately:
a[x, y] <- [[1, 2], [3, 4]]
b[y, z] <- [[2, 3], [2, 4]]
?[x, y, z] := a[x, y], b[y, z]
?[x, y, z] := a[x, y], not b[y, _], z = nullNote that we still only have a single output ? rule here; it is split into two
parts, effectively joining the right sides with or as we discussed previously.
Stored relations
The relations we've seen so far were created on the fly as constant rules, and only persist for the duration of the query.
You can of course also persist data. To do this, you first declare the relation:
:create stored {c1, c2}This creates a relation called stored with two columns that will take any
type. Here is a fully typed relation:
:create dept_info {
company_name: String,
department_name: String,
=>
head_count: Int default 0,
address: String,
}This should be fairly obvious, except for the => in the middle. This symbol
separates the columns in the primary key (before the =>) from the other
columns. If the => isn't present, all of the columns are considered part of
the primary key.
?[a, b, c] <- [[1, 'a', 'A'],
[2, 'b', 'B'],
[3, 'c', 'C'],
[4, 'd', 'D']]
:create fd {a, b => c}?[a, b, c] := *fd[a, b, c]Note how we create rows in fd as we define it.
In Cozo terms, the fields after the => are functionally dependent on the
ones before. Those before determine identity, which lets the :put operation
function as an upsert (update-or-insert).
?[a, b, c] <- [[3, 'c', 'CCCCCCC']]
:put fd {a, b => c}?[a, b, c] := *fd[a, b, c]Note
The name of a stored relation should not start with an underscore _. You
will get no error if you use such a name, but the relation won't be persisted —
it is actually an ephemeral relation.
You can see the stored relations in your system by running a system op:
::relationsNote that we have regular operations like :create that start with a :, and
system operations with a ::.
We can investigate the columns of the stored relation:
::columns storedStored relations can be accessed by using an asterisk * before the name:
?[a, b] := *stored[a, b]Unlike relations defined inline, the columns of stored relations have fixed names. You can take advantage of this by selectively referring to columns by name, especially if you have a lot of columns:
?[a, b] := *stored{l2: b, l1: a}If the binding is the same as the column name, you can omit the binding:
?[l2] := *stored{l2}Use :rm to remove data:
?[l1, l2] <- [['e', 'E']]
:rm stored {l1, l2}?[l1, l2] := *stored[l1, l2]Use ::remove (double colon!) to get rid of a stored relation:
::remove stored::relationsWe've now seen a variety of Cozo queries. They all have one or more rules, one of
which will have the name ? which determines which variables are the result of
the query.
A query can also have one or more operations such as :put, which are executed
after the query and can use its results. These operations might also change the
query in some way, by limiting the number of results, sorting, paginating and so
on.
Stored relations can have triggers. These are discussed in detail in Stored relations & transactions.
Before continuing, let's remove the stored relation we introduced:
::remove fdCommand blocks
Cozo executes the entire script you send to it in a single transaction. You can do multiple query-command blocks by wrapping each block in curly braces. The blocks are executed in sequence and either all of them succeed or none of them do and the state rolls back. The result returned to a client is whatever is in the last block. Like this:
{?[a] <- [[1], [2], [3]]; :replace test {a}}
{?[a] <- []; :replace test2 {a}}
%swap test test2
%return testGraphs
Now let's consider a graph stored as a relation:
?[loving, loved] <- [['alice', 'eve'],
['bob', 'alice'],
['eve', 'alice'],
['eve', 'bob'],
['eve', 'charlie'],
['charlie', 'eve'],
['david', 'george'],
['george', 'george']]
:replace love {loving, loved}The graph we have created reads like "Alice loves Eve, Bob loves Alice", "nobody
loves David, David loves George, but George only loves himself", and so on. Here
we used :replace instead of :create. The difference is that if love already
exists, it will be wiped and replaced with the new data given.
We can investigate competing interests:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
*love['bob', loved_by_b_e]It feels natural here to use the or keyword:
?[loved_by_b_e] := *love['eve', loved_by_b_e] or *love['bob', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'Compare this with multiple rule definitions under the same name:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'
?[loved_by_b_e] := *love['bob', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'When you have multiple definitions of the same inline rule, the rule heads must be compatible. Only inline rules can have multiple definitions.
Negation
Negation of expressions should be familiar:
?[loved] := *love[person, loved], !ends_with(person, 'e')Rule applications can also be negated, not with the ! operator, but with the
not keyword:
?[loved_by_e_not_b] := *love['eve', loved_by_e_not_b],
not *love['bob', loved_by_e_not_b]There are two sets of logical operations in Cozo, one set that acts on the level of expressions, and another set that acts within the expression (we say they work on atoms):
- For atoms:
,orand(conjunction),or(disjunction),not(negation) - For expressions:
&&(conjunction),||(disjunction),!(negation)
The difference between , and and is operator precedence: and has higher
precedence than or, whereas , has lower precedence than or.
There is a safety rule for negation:
?[not_loved_by_b] := not *love['bob', not_loved_by_b]This query is forbidden because the resulting relation is infinite. For example,
'gold' should be in the result, since according to the facts stored in the
database, Bob has no interest in gold.
To make our query finite, we have to explicitly give our query a closed world:
the_population[p] := *love[p, _a]
the_population[p] := *love[_a, p]
?[not_loved_by_b] := the_population[not_loved_by_b],
not *love['bob', not_loved_by_b]Recursion
The single greatest advantage of Datalog over SQL is that recursive/graph queries are much simpler.
Inline rules can apply other rules, and can have multiple definitions. If you combine these two, you get recursions:
alice_love_chain[person] := *love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person],
*love[in_person, person]
?[chained] := alice_love_chain[chained]You may object that you only need to be able to apply other rules to have recursion, without multiple definitions. Technically correct, but the resulting queries are not useful:
alice_love_chain[person] := alice_love_chain[in_person],
*love[in_person, person]
?[chained] := alice_love_chain[chained]Similar to the negation case, if there is no way to deduce a fact from the given facts, then the fact itself is considered false. You need multiple definitions to "bootstrap" the query.
Aggregation
Aggregations are usually used to compute statistics. In Cozo, aggregations are applied in the head of inline rules:
?[person, count(loved_by)] := *love[loved_by, person]The usual sum, mean, etc. are all available. Here is the
full list of aggregations for you to play with.
Query options
We have seen query options like :create, :put, :rm for manipulating stored
relations. There are also query options for controlling what is returned:
?[loving, loved] := *love{ loving, loved }
:limit 1Next we want the result to be sorted by loved in descending order, then
loving in ascending order, and skip the first row:
?[loving, loved] := *love{ loving, loved }
:order -loved, loving
:offset 1Putting - in front of variables in the :order clause denotes reverse order.
Nothing or + denotes ascending order.
The full list of query options is explained in the Queries chapter.
Fixed rules
We've seen how the <- syntax for constant rules is syntax sugar. The full
syntax is:
?[] <~ Constant(data: [['hello', 'world', 'Cozo!']])Here we are using the fixed rule Constant, which takes one option named
data. The curly tail <~ denotes a fixed rule.
Fixed rules take input relations as arguments, apply custom logic to them and produce an output relation. They're functions that return relations, defined outside the Cozo engine itself (either built-in or loaded from a library or defined by the host language).
The Constant fixed rule takes zero input relations.
If you are using Cozo in the browser, Constant is the only fixed rule you can
use. In all other cases your Cozo will have the graph algorithms module enabled,
and all graph algorithms are implemented as fixed rules. As an example, let's
find out who is most popular in the love graph by using the
PageRank algorithm:
?[person, page_rank] <~ PageRank(*love[])
:order -page_rankHere the input relation is the stored relation *love (as noted above, you will
receive an error if you run this in the WASM implementation).
Each fixed rule is different: here is the full list.
::remove loveTime travel
A very useful capability of Cozo is the ability to time travel in a stored
relation. Usually when you :put into a relation, the old value is overwritten;
and when you :rm, the row is completely gone. But the old values can be of great
value.
In this case, time travel is the solution: instead of storing current facts, the stored relation stores the complete history of facts, at least all the available history as history itself unfolds.
If you believe that you don't want this functionality at all, you can skip to the next section. At Cozo we adopt the "zero-cost, zero-mental-overhead abstraction" philosophy: if you don't use a functionality, you don't pay the performance or cognitive overhead.
Let's have a simple example: storing the head of state of countries. First we have to create a new stored relation:
:create hos {state: String, year: Validity => hos: String}hos is shorthand for "head of state". The only thing different about this
relation is that we are giving year the type Validity. You can think of
validity as a list of two elements, the first element being an integer and the
second element being a boolean. The integer indicates the "time" of the fact
recorded by the row; the boolean, if true, indicates that the new version of
the row holds starting from the indicated time; if the boolean is false, the
row is deleted at this time.
Note that the integer only identifies a temporal sequence to Cozo. What sequence you use is up to you. UNIX Epoch is the default, as we will see.
Now let's insert some data:
?[state, year, hos] <- [['US', [2001, true], 'Bush'],
['US', [2005, true], 'Bush'],
['US', [2009, true], 'Obama'],
['US', [2013, true], 'Obama'],
['US', [2017, true], 'Trump'],
['US', [2021, true], 'Biden']]
:put hos {state, year => hos}It is OK to assert a still-valid fact again, as we have done above. You can use this relation like a normal relation:
?[state, year, hos] := *hos{state, year, hos}The curious thing is that it is sorted descendingly by year. Validity sorts descendingly.
For any stored relation that has type Validity at the last slot of the key,
the time-travel capability is enabled. Say you have forgotten who the president
of the US was in 2019. Easy:
?[hos, year] := *hos{state: 'US', year, hos @ 2019}In your answer you also got the year that this fact was last known to be true.
You also don't know about the year 2099:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}That certainly doesn't look right. Let's fix it by retracting facts on or after 2025, and only inserting them back when we have sure facts:
?[state, year, hos] <- [['US', [2025, false], '']]
:put hos {state, year => hos}As we have hinted, you retract facts by putting a retraction. Now let's run the query again:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}Since the database does not contain facts on or after 2025, your query returns empty.
This functionality is flexible: you can mix different periods in the same query:
?[hos2018, hos2010] := *hos{state: 'US', hos: hos2018 @ 2018},
*hos{state: 'US', hos: hos2010 @ 2010}As the relation hos is just a normal relation, you can still :rm facts from
it, in which case the facts are irretrievably gone. Whether that's desirable is
up to you: the database gives you the choice of how you want to use it, and trusts
that you know how to use it correctly for your use case.
First let's create a new stored relation to store people's moods:
:create mood {name: String, at: Validity => mood: String}I want to record my mood now:
?[name, at, mood] <- [['me', 'ASSERT', 'curious']]
:put mood {name, at => mood}Instead of giving a list of two elements as we have done above, we have simply
used the string ASSERT, and the system will know that we mean an assertion of a
current fact, with the value used being the current UNIX Epoch.
?[name, at, mood] := *mood{name, at, mood}?[name, time, mood] := *mood{name, at, mood},
time = format_timestamp(at)To query for current facts, use the string NOW in the validity specification:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
time = format_timestamp(at)You can put in facts with manual timestamps as before, so it is possible that your database contains facts about the future. Let's do just that. Instead of giving a mysterious string of numerals, you can use a string for the timestamp:
?[name, at, mood] <- [['me', '2030-01-01T00:00:00.000+00:00', 'hopeful']]
:put mood {name, at => mood}Since this is in the future, it shouldn't affect NOW:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
time = format_timestamp(at)In this case, there is also END for the validity specification, meaning to
extract facts at the end of time:
?[name, time, mood] := *mood{name, at, mood @ 'END'},
time = format_timestamp(at)Retraction at the current timestamp can be done with the string RETRACT:
?[name, at, mood] <- [['me', 'RETRACT', '']]
:put mood {name, at => mood}Retraction placed in the future can also be done with stringy timestamps by
prefixing with ~:
?[name, at, mood] <- [['me', '~9999-01-01T00:00:00.000+00:00', 'who cares']]
:put mood {name, at => mood}Now let's look at the complete history:
?[name, time, is_assert, mood] := *mood{name, at, mood},
time = format_timestamp(at),
is_assert = to_bool(at)This time-travel facility is much faster than what you get if you try to implement it directly with Datalog. Some further technical details of time travel are discussed in its own chapter, Time travel.
::remove mood, hosExtended example: the air routes dataset
Now that you have a basic understanding of using the various constructs of Cozo, let's deal with a small real-world dataset, with about 3700 nodes and 57000 edges.
The data we are going to use, and many examples that we will present, are adapted from the book Practical Gremlin. Gremlin is an imperative query language for graphs, a very different take compared to Datalog.
First, let's create the stored relations we want (wrapping queries in braces allows you to execute several queries together atomically):
{:create airport {
code: String
=>
icao: String,
desc: String,
region: String,
runways: Int,
longest: Float,
elev: Float,
country: String,
city: String,
lat: Float,
lon: Float
}}
{:create country {
code: String
=>
desc: String
}}
{:create continent {
code: String
=>
desc: String
}}
{:create contain { entity: String, contained: String }}
{:create route { fr: String, to: String => dist: Float }}The next command applies only if you are using Jupyter notebooks: it downloads a JSON file containing the data and imports it into the database. The commented-out line shows how to do the same thing with a local file. If you are using the Cozo WASM interface, click the "Import from URL" or the "Import from File" icon, and paste in the address.
%cozo_import_remote_file 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json'
# %cozo_import_local_file '../../cozo/cozo-core/tests/air-routes.json'If you are using the cozo repl command line, the command to use instead is
simply:
%import https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.jsonYou can replace the URL with the path to a local file as well. Other environments also have ways to do the same thing: refer to the respective documentation.
If you feel that the above is too much magic, we will show you the "hard way" of importing the same data at the end of this tutorial. For now let's move on.
Let's verify all the relations we want are there:
::relationsWhile we are at it, let's lock all these tables to prevent accidentally changing their contents:
::access_level read_only airport, contain, continent, country, routeMore information about what this does is explained in System ops.
Let's just look at some data. Start with airports:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}
:limit 5Airports with the most runways:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}
:order -runways
:limit 10How many airports are there in total?
?[count(code)] := *airport{code}Let's get a distribution of the initials of the airport codes:
?[count(initial), initial] := *airport{code}, initial = first(chars(code))
:order initialMore useful are the statistics of runways:
?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] :=
*airport{runways: r}Using country, we can find countries with no airports:
?[desc] := *country{code, desc}, not *airport{country: code}The route relation by itself is rather boring:
?[fr, to, dist] := *route{fr, to, dist}
:limit 10It just records the starting and ending airports of each route, together with the distance. This relation only becomes useful when used as a graph.
Airports with no routes:
?[code, desc] := *airport{code, desc}, not *route{fr: code}, not *route{to: code}Airports with the most outgoing routes:
route_count[fr, count(fr)] := *route{fr}
?[code, n] := route_count[code, n]
:sort -n
:limit 5How many routes are there from the European Union to the US?
routes[unique(r)] := *contain['EU', fr],
*route{fr, to},
*airport{code: to, country: 'US'},
r = [fr, to]
?[n] := routes[rs], n = length(rs)How many airports are there in the US with routes from the EU?
?[count_unique(to)] := *contain['EU', fr],
*route{fr, to},
*airport{code: to, country: 'US'}How many routes are there for each airport in London, UK?
?[code, count(code)] := *airport{code, city: 'London', region: 'GB-ENG'}, *route{fr: code}We need to specify the region, because there is another city called London, not in the UK.
How many airports are reachable from London, UK in two hops?
lon_uk_airports[code] := *airport{code, city: 'London', region: 'GB-ENG'}
one_hop[to] := lon_uk_airports[fr], *route{fr, to}, not lon_uk_airports[to];
?[count_unique(a3)] := one_hop[a2], *route{fr: a2, to: a3}, not lon_uk_airports[a3];What are the cities directly reachable from LGW (London Gatwick), but furthermost away?
?[city, dist] := *route{fr: 'LGW', to, dist},
*airport{code: to, city}
:order -dist
:limit 10What airports are within 0.1 degrees of the Greenwich meridian?
?[code, desc, lon, lat] := *airport{lon, lat, code, desc}, lon > -0.1, lon < 0.1Airports in a box drawn around London Heathrow, UK:
h_box[lon, lat] := *airport{code: 'LHR', lon, lat}
?[code, desc] := h_box[lhr_lon, lhr_lat], *airport{code, lon, lat, desc},
abs(lhr_lon - lon) < 1, abs(lhr_lat - lat) < 1For some spherical geometry: what is the angle subtended by SFO and NRT on the surface of the earth?
?[deg_diff] := *airport{code: 'SFO', lat: a_lat, lon: a_lon},
*airport{code: 'NRT', lat: b_lat, lon: b_lon},
deg_diff = rad_to_deg(haversine_deg_input(a_lat, a_lon, b_lat, b_lon))We mentioned before that aggregations in Cozo are powerful. They are powerful because they can be used in recursions (some restrictions apply).
Let's find the distance of the shortest route between two airports. One way is
to enumerate all the routes between the two airports, and then apply min
aggregation to the results. This cannot be implemented as stated, since the
routes may contain cycles and hence there can be an infinite number of routes
between two airports.
Instead, think recursively. If we already have all the shortest routes between
all nodes, we derive an equation satisfied by the shortest route: the shortest
route between a and b is either the distance of a direct route, or the sum of
the shortest distance from a to c and the distance of a direct route from c
to d. We apply our min aggregation to this recursive set instead.
Write it out and it works. For example, the shortest route between the airports
LHR and YPO:
shortest[b, min(dist)] := *route{fr: 'LHR', to: b, dist}
# Start with the airport 'LHR', retrieve a direct route from 'LHR' to b
shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
*route{fr: c, to: b, dist: d2}, # Retrieve a direct route from c to b
dist = d1 + d2 # Add the distances
?[dist] := shortest['YPO', dist] # Extract the answer for 'YPO'.
# We chose it since it is the hardest airport to get to from 'LHR'.There is a caveat when you try to write similar queries. Say you try to write it in the following way (don't try to run it):
shortest[a, b, min(dist)] := *route{fr: a, to: b, dist}
shortest[a, b, min(dist)] := shortest[a, c, d1],
*route{fr: c, to: b, dist: d2},
dist = d1 + d2
?[dist] := shortest['LHR', 'YPO', dist]You will find that the query does not complete in a reasonable amount of time, despite it being equivalent to the original query. Why?
In the changed query, you are asking the database to compute the all-pair
shortest path, and then extract the answer to a particular shortest path.
Normally Cozo would apply a technique called magic set rewrite so that only the
needed answer would be calculated. However, in the changed query the presence of
the aggregation operator min prevents that. In this case, applying the rewrite
to the variable a would still yield the correct answer, but rewriting in any
other way would give complete nonsense, and in the more general case with
recursive aggregations this is a can of worms.
So as explained in the chapter about Query execution,
magic set rewrites are only applied to rules without aggregations or recursions
for the moment, until we are sure of the exact conditions under which the
rewrites are safe. So for now at least the database executes the query as
written, computing the result of the shortest rule containing more than ten
million rows (to be exact, 3700 * 3700 = 13,690,000 rows) first!
The bottom line is: be mindful of the cardinality of the return sets of recursive rules.
A tour of graph algorithms
Now let's investigate the graph using some graph algorithms. As we have mentioned before, the Cozo running in browsers through WASM does not have the graph algorithms module enabled, so to run the following examples you will need to use some other implementation.
Since path-finding is such a common operation on graphs, Cozo has several fixed rules for that:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(*route[], starting[], goal[])Not only is it more efficient, but we also get a path for the shortest route.
Not content with the shortest path, the following calculates the ten shortest paths:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(*route[], starting[], goal[], k: 10)If efficiency is really important to you, you can use the A* algorithm with a good heuristic function:
code_lat_lon[code, lat, lon] := *airport{code, lat, lon}
starting[code, lat, lon] := code = 'LHR', *airport{code, lat, lon};
goal[code, lat, lon] := code = 'YPO', *airport{code, lat, lon};
?[] <~ ShortestPathAStar(*route[],
code_lat_lon[node, lat1, lon1],
starting[],
goal[goal, lat2, lon2],
heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963);There's a lot more setup required in this case: we need to retrieve the latitudes
and longitudes of airports and do processing on them first. The number 3963
above is the radius of the earth in miles.
The most important airports, by PageRank:
rank[code, score] <~ PageRank(*route[a, b])
?[code, desc, score] := rank[code, score], *airport{code, desc}
:limit 10;
:order -scoreThe following example takes a long time to run since it calculates the betweenness centrality: up to a few seconds, depending on your machine. Algorithms for calculating betweenness centrality have high complexity.
centrality[code, score] <~ BetweennessCentrality(*route[a, b])
?[code, desc, score] := centrality[code, score], *airport{code, desc}
:limit 10;
:order -scoreThese are the airports that, if disconnected from the network, cause the most disruption. As this example shows, some of the algorithms really struggle when you go beyond a small or medium sized dataset.
Community detection can collapse a graph into a supergraph. Here we store the result, since it has too many rows to display nicely:
community[detailed_cluster, code] <~ CommunityDetectionLouvain(*route[a, b])
?[code, cluster, detailed_cluster] := community[detailed_cluster, code], cluster = first(detailed_cluster)
:replace community {code => cluster, detailed_cluster}We can look at the supernodes containing specific nodes. For example, the supernode for London Gatwick consists of mainly UK and other European airports, as you would expect:
community[code] := *community{code: 'LGW', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5For JFK on the other hand, its supernode consists of mainly US airports:
community[code] := *community{code: 'JFK', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5But it does not always work according to geography. For example, Frankfurt airport is in Germany:
?[desc, country_desc] := *airport{code: 'FRA', desc, country: country_code}, *country{code: country_code, desc: country_desc}But its supernode:
community[code] := *community{code: 'FRA', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5Germany does not even appear in the top five. In fact, FRA is in the same
supernode as JFK. What matters is the connectivity in the graph, not the
geography. As another example:
community[code] := *community{code: 'SIN', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5You'd expect SIN to be a Chinese airport. Wrong:
?[desc, country_desc] := *airport{code: 'SIN', desc, country: country_code}, *country{code: country_code, desc: country_desc}Finally, let's collapse the route relation into super_route:
?[fr_cluster, to_cluster, count(dist), sum(dist)] := *route{fr, to, dist},
*community{code: fr, cluster: fr_cluster},
*community{code: to, cluster: to_cluster}
:replace super_route {fr_cluster, to_cluster => n_routes=count(dist), total_distance=sum(dist)}As expected, the "diagonals" where fr_cluster == to_cluster are larger in the
super_route graph:
?[fr_cluster, to_cluster, n_routes, total_distance] := *super_route{fr_cluster, to_cluster, n_routes, total_distance}, fr_cluster < 2Now the super graph is small enough that all analytics algorithms return instantly:
?[cluster, score] <~ PageRank(*super_route[])
:order -score
:limit 5You can now go on to investigate the supernodes, give real-world interpretations to them, etc. For example, a naïve interpretation of the above PageRank result is that North America is (still) the most prosperous part of the world, followed by East Asia in second place, South Asia in third place, and Europe in fourth place.
Importing the dataset the hard way
Previously, we imported the air-routes dataset by using Python under the hood to download a specially-crafted JSON file and feed it to the database. Here we learn how to achieve the same effect by letting Cozo fetch and import a series of CSV files, without Python's help.
Let's set the database magic up first:
::access_level normal airport, contain, continent, country, route::remove airport, contain, continent, country, route, community, super_routeNext, some parameters to make life easier (the lines commented out do the same thing by processing local files):
# %cozo_set AIR_ROUTES_NODES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-nodes.csv'
# %cozo_set AIR_ROUTES_EDGES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-edges.csv'
%cozo_set AIR_ROUTES_NODES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-nodes.csv'
%cozo_set AIR_ROUTES_EDGES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-edges.csv'First, import the airport relation:
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any', 'Any', 'Int?', 'Float?', 'Float?', 'Any', 'Any', 'Float?', 'Float?'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[code, icao, desc, region, runways, longest, elev, country, city, lat, lon] :=
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon],
label == 'airport'
:replace airport {
code: String
=>
icao: String,
desc: String,
region: String,
runways: Int,
longest: Float,
elev: Float,
country: String,
city: String,
lat: Float,
lon: Float
}The CsvReader utility downloads a CSV file from the internet and attempts to
parse its content into a relation. When we store the relation, we specified types
for the columns. The code column acts as a primary key for the airport stored
relation.
Next is country:
res[idx, label, typ, code, icao, desc] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[code, desc] :=
res[idx, label, typ, code, icao, desc],
label == 'country'
:replace country {
code: String
=>
desc: String
}continent:
res[idx, label, typ, code, icao, desc] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[idx, code, desc] :=
res[idx, label, typ, code, icao, desc],
label == 'continent'
:replace continent {
code: String
=>
desc: String
}We need to make a translation table for the indices the data use:
res[idx, label, typ, code] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[idx, code] :=
res[idx, label, typ, code],
:replace idx2code { idx => code }The contain relation contains information on the geographical inclusion of
entities:
res[] <~
CsvReader(types: ['Int', 'Int', 'Int', 'String'],
url: $AIR_ROUTES_EDGES_URL,
has_headers: true)
?[entity, contained] :=
res[idx, fr_i, to_i, typ],
typ == 'contains',
*idx2code[fr_i, entity],
*idx2code[to_i, contained]
:replace contain { entity: String, contained: String }Finally, the routes between the airports. This relation is much larger than the
rest and contains about 60k rows:
res[] <~
CsvReader(types: ['Int', 'Int', 'Int', 'String', 'Float?'],
url: $AIR_ROUTES_EDGES_URL,
has_headers: true)
?[fr, to, dist] :=
res[idx, fr_i, to_i, typ, dist],
typ == 'route',
*idx2code[fr_i, fr],
*idx2code[to_i, to]
:replace route { fr: String, to: String => dist: Float }We no longer need the idx2code relation:
::remove idx2codeLet's verify all the relations we want are there:
::relationsThat's it for the tutorial!
Adapted from the CozoDB documentation by Ziyang Hu and the Cozo Project Authors, used under CC‑BY‑SA‑4.0. Adaptations for mnestic are released under the same license. mnestic is an independent fork and is not affiliated with or endorsed by the original authors.