Optimizing queries

TypeDB’s optimizer

The built-in optimizer of TypeDB will decide the optimal order of evaluation of your query on a per-stage basis, while also respecting the boundaries of subqueries (created via or, not, or function calls). This gives user direct control over the evaluation order in their queries. For example, the query

match
  $x isa user;
  $x has username "john_1";

and the query pipeline

match
  $x isa user;
match
  $x has username "john_1";

are semantically equivalent (i.e., they yield the same results), but the first query will likely run much faster than the second query.

  1. In the first query, the optimizer will optimize evaluation order of the first (and only) match stage: it’ll decide that it is most efficient to first look up all entities with username equal to "john_1" and then filter the result to contain only user entities.

  2. The second query comprises two stages, to be executed one after the other: the database will first retrieve all users, thus constructing a very large set of answers. Only in the second stage to we filter this set to contain only users with username equal to "john_1".

In other words, the second query suffers from having large intermediate results.

Avoiding large intermediate results

A standard method for avoiding large intermediate results is to declare/enforce constraints as early as possible: this is also known as predicate-pushdown. The above queries provide an example of this principle.

In TypeQL predicate-pushdown is extended to the principle of data-predicate-locality: create data only at the stage where you need it, jointly with all applicable predicates. Pipelines give you fine-grained control to ensure that data is created where you need it. For example, a schematic pipeline of the form:

<match_data($a,$b)>
<processing($b)>
<processing($a,$b)>

could be rewritten to be of the form

<match_data($b)>
<processing($b)>
<match_data($a,$b)>
<processing($a,$b)>

This defers matching of data to be stored in the variables $a until after we have finished processing $b (assuming the latter is fully independent of results for $a).

Optimizing branches

Another source of suboptimality in queries comes from the usage of or-branches, and could be called pattern-pushdown. TypeDB performs minimal transformation of your or-branches, assuming that each disjunction is meant to be a branching point of your query, at which a set of nested subqueries is retrieved.

Thus, when working with branches, we should ensure that data is retrieved a minimal number of times. Schematically, this can be understood as follows: the query

match
  <root-pattern>;
  { <expensive-pattern1>; <cond1>; }
  or { <expensive-pattern1>; <cond2>; }
  or { <expensive-pattern2>; };

is equivalent to the pattern

match
  <root-pattern>;
  {
    <expensive-pattern1>;
    { <cond1>; } or { <cond2>; };
  } or {
    <expensive-pattern2>;
  };

We expect the second query to run faster, as it evaluates the <expensive-pattern1> only in one subquery (which itself has two further subqueries).

Optimizing negations

Negations, especially nested negations, can be extremely expensive query operations: indeed, to ensure that something does not exist, we usually must search the entire domain of possibilities. It is therefore paramount to reduce the “size of the domain” prior to using negation. For example, the query

match
  $x isa user;
  not {
    $y isa user;
    not { friendship ($x, $y) };
  };
limit 1;

does the following:

Find me a user $x for which there does not exist a user $y that is not friends with $x.

This can equivalently be achieved by the following query:

match
  $x isa user;
  friendship ($x, $y);
reduce $f_count = count groupby $x;
sort $f_count desc;
limit 1;
match $f_count == user_count(); # function returning user count

While the first query needs to consider data whose size is quadratic in the number of users, the second query considers data linear in the size of friendships.

Optimizing recursive functions

TypeDB uses tabling to break cycles when computing recursive functions. When computing a recursive function, a new table will be created for each combination of arguments the function is called with. Certain formulations of functions are more efficient than others under tabling:

#!test[schema]
#{{
define
relation edge, relates from_, relates to;
entity node, plays edge:from_, plays edge:to;
#}}

# intuitive, but actually inefficient
fun recurse_on_to($from: node) -> { node }:
match
  {
    $_ isa edge (from_: $from, to: $to);
  } or {
    $_ isa edge (from_: $from, to: $via);
    let $to in recurse_on_to($via);
  };
return { $to };

# counter-intuitive, but efficient!
fun recurse_on_from($from: node) -> { node }:
match
  {
    $_ isa edge (from_: $from, to: $to);
  } or {
    let $via in recurse_on_from($from);
    $_ isa edge (from_: $via, to: $to);
  };
return { $to };

recurse_on_to is an intuitive recursive formulation that works in most programming languages. For untabled execution, this would be the standard formulation. recurse_on_from is almost unintuitive as most programmers would expect it to suffer from infinite recursion. However it is the more efficient formulation!.

Although both formulations return the same set of answers, the difference in efficiency comes down to the number of tables we create & answers we store.

recurse_on_to is called once for each node reachable from the source node. Each call creates a table, holding one row per node reachable from that node. Across all tables, we’d have one row per nodes ($from, $to) where $to is reachable from $from.- In the worst case (a chain), this is quadratic in the number of reachable nodes.

On the other hand, recurse_on_from is always called for the same value of $from. Thus, it creates only one table, which holds a row for each node reachable from $from - this is linear in the number of reachable nodes.

Under the hood, the tabling mechanism breaks cycles by "suspending" calls which are already in the "stack" and marking them to be retried later. Here, each retry assigns the answers found in the previous iteration to $via. The next statement extends this by one edge and returns it (any returned answer is appended to the table). When no answers are added to the table the iteration stops.

Roadmap

While the above advice is timeless in any case, future versions of TypeDB may apply more aggressive query transformation techniques that will shift the burden of optimizing nested subqueries from the user further to the database.