1. Overview

In this tutorial, we’ll give an overview of what SirixDB is and its most important design goals.

Next, we’ll give a walk through a low level cursor-based transactional API.

2. SirixDB Features

SirixDB is a log-structured, temporal NoSQL document store, which stores evolutionary data. It never overwrites any data on-disk. Thus, we’re able to restore and query the full revision history of a resource in the database efficiently. SirixDB ensures, that a minimum of storage-overhead is created for each new revision.

Currently, SirixDB offers two built-in native data models, namely a binary XML store as well as a JSON store.

2.1. Design Goals

Some of the most important core principles and design goals are:

  • Concurrency – SirixDB contains very few locks and aims to be as suitable for multithreaded systems as possible
  • Asynchronous REST API – operations can happen independently; each transaction is bound to a specific revision and only one read-write transaction on a resource is permitted concurrently to N read-only transactions
  • Versioning/Revision history – SirixDB stores a revision history of every resource in the database while keeping storage-overhead to a minimum. Read and write performance is tunable. It depends on the versioning type, which we can specify for creating a resource
  • Data integrity – SirixDB, like ZFS, stores full checksums of the pages in the parent pages. That means that almost all data corruption can be detected upon reading in the future, as the SirixDB developers aim to partition and replicate databases in the future
  • Copy-on-write semantics – similarly to the file systems Btrfs and ZFS, SirixDB uses CoW semantics, meaning that SirixDB never overwrites data. Instead, database page fragments are copied and written to a new location
  • Per revision and per record versioning – SirixDB does not only version on a per-page, but also on a per-record basis. Thus, whenever we change a potentially small fraction
    of records in a data page, it does not have to copy the whole page and write it to a new location on a disk or flash drive. Instead, we can specify one of several versioning strategies known from backup systems or a sliding snapshot algorithm during the creation of a database resource. The versioning type we specify is used by SirixDB to version data pages
  • Guaranteed atomicity (without a WAL) – the system will never enter an inconsistent state (unless there is hardware failure), meaning that unexpected power-off won’t ever damage the system. This is accomplished without the overhead of a write-ahead-log (WAL)
  • Log-structured and SSD friendly – SirixDB batches writes and syncs everything sequentially to a flash drive during commits. It never overwrites committed data

We first want to introduce the low-level API exemplified with JSON data before switching our focus to higher levels in future articles. For instance an XQuery-API for querying both XML and JSON databases or an asynchronous, temporal RESTful API. We can basically use the same low-level API with subtle differences to store, traverse and compare XML resources as well.

In order to use SirixDB, we at least have to use Java 11.

3. Maven Dependency to Embed SirixDB

To follow the examples, we first have to include the sirix-core dependency, for instance, via Maven:

<dependency>
    <groupId>io.sirix</groupId>
    <artifactId>sirix-core</artifactId>
    <version>0.9.3</version>
</dependency>

Or via Gradle:

dependencies {
    compile 'io.sirix:sirix-core:0.9.3'
}

4. Tree-Encoding in SirixDB

A node in SirixDB references other nodes by a firstChild/leftSibling/rightSibling/parentNodeKey/nodeKey encoding:

The numbers in the figure are auto-generated unique, stable node IDs generated with a simple sequential number generator.

Every node may have a first child, a left sibling, a right sibling, and a parent node. Furthermore, SirixDB is able to store the number of children, the number of descendants and hashes of each node.

In the following sections, we’ll introduce the core low-level JSON API of SirixDB.

5. Create a Database With a Single Resource

First, we want to show how to create a database with a single resource. The resource is going to be imported from a JSON file and stored persistently in the internal, binary format of SirixDB:

var pathToJsonFile = Paths.get("jsonFile");
var databaseFile = Paths.get("database");

Databases.createJsonDatabase(new DatabaseConfiguration(databaseFile));

try (var database = Databases.openJsonDatabase(databaseFile)) {
    database.createResource(ResourceConfiguration.newBuilder("resource").build());

    try (var manager = database.openResourceManager("resource");
         var wtx = manager.beginNodeTrx()) {
        wtx.insertSubtreeAsFirstChild(JsonShredder.createFileReader(pathToJsonFile));
        wtx.commit();
    }
}

We first create a database. Then we open the database and create the first resource. Various options for creating a resource exist (see the official documentation).

We then open a single read-write transaction on the resource to import the JSON file. The transaction provides a cursor for navigation through moveToX methods. Furthermore, the transaction provides methods to insert, delete or modify nodes. Note that the XML API even provides methods for moving nodes in a resource and copying nodes from other XML resources.

To properly close the opened read-write transaction, the resource manager and the database we use Java’s try-with-resources statement.

We exemplified the creation of a database and resource on JSON data, but creating an XML database and resource is almost identical.

In the next section, we’ll open a resource in a database and show navigational axes and methods.

6. Open a Resource in a Database and Navigate

6.1. Preorder Navigation in a JSON Resource

To navigate through the tree structure, we’re able to reuse the read-write transaction after committing. In the following code we’ll, however, open the resource again and begin a read-only transaction on the most recent revision:

try (var database = Databases.openJsonDatabase(databaseFile);
     var manager = database.openResourceManager("resource");
     var rtx = manager.beginNodeReadOnlyTrx()) {
    
    new DescendantAxis(rtx, IncludeSelf.YES).forEach((unused) -> {
        switch (rtx.getKind()) {
            case OBJECT:
            case ARRAY:
                LOG.info(rtx.getDescendantCount());
                LOG.info(rtx.getChildCount());
                LOG.info(rtx.getHash());
                break;
            case OBJECT_KEY:
                LOG.info(rtx.getName());
                break;
            case STRING_VALUE:
            case BOOLEAN_VALUE:
            case NUMBER_VALUE:
            case NULL_VALUE:
                LOG.info(rtx.getValue());
                break;
            default:
        }
    });
}

We use the descendant axis to iterate over all nodes in preorder (depth-first). Hashes of nodes are built bottom-up for all nodes per default depending on the resource configuration.

Array nodes and Object nodes have no name and no value. We can use the same axis to iterate through XML resources, only the node types differ.

SirixDB offers a bunch of axes as for instance all XPath-axes to navigate through XML and JSON resources. Furthermore, it provides a LevelOrderAxis, a PostOrderAxis, a NestedAxis to chain axis and several ConcurrentAxis variants to fetch nodes concurrently and in parallel.

In the next section, we’ll show how to use the VisitorDescendantAxis, which iterates in preorder, guided by return types of a node visitor.

6.2. Visitor Descendant Axis

As it’s very common to define behavior based on the different node-types SirixDB uses the visitor pattern.

We can specify a visitor as a builder argument for a special axis called VisitorDescendantAxis. For each type of node, there’s an equivalent visit-method. For instance, for object key nodes it is the method VisitResult visit(ImmutableObjectKeyNode node).

Each method returns a value of type VisitResult. The only implementation of the VisitResult interface is the following enum:

public enum VisitResultType implements VisitResult {
    SKIPSIBLINGS,
    SKIPSUBTREE,
    CONTINUE,
    TERMINATE
}

The VisitorDescendantAxis iterates through the tree structure in preorder. It uses the VisitResultTypes to guide the traversal:

  • SKIPSIBLINGS means that the traversal should continue without visiting the right siblings of the current node the cursor points to
  • SKIPSUBTREE means to continue without visiting the descendants of this node
  • We use CONTINUE if the traversal should continue in preorder
  • We can also use TERMINATE to terminate the traversal immediately

The default implementation of each method in the Visitor interface returns VisitResultType.CONTINUE for each node type. Thus, we only have to implement the methods for the nodes, which we’re interested in. If we’ve implemented a class which implements the Visitor interface called MyVisitor we can use the VisitorDescendantAxis in the following way:

var axis = VisitorDescendantAxis.newBuilder(rtx)
  .includeSelf()
  .visitor(new MyVisitor())
  .build();

while (axis.hasNext()) axis.next();

The methods in MyVisitor are called for each node in the traversal. The parameter rtx is a read-only transaction. The traversal begins with the node the cursor currently points to.

6.3. Time Travel Axis


« 上一篇: Java GSS API 介绍
» 下一篇: Java Weekly, 第294期