Stable
This component is complete and can be used. Its interfaces will be supported until the function is removed from the project via the deprecation process. The function may continue to be extended, but extensions will preserve backward compatibility as far as possible. If backward compatibility must be broken, the change will be discussed and reviewed in the community, with a documented timeline.
XTDB Native OMRS Repository Connector¶
XTDB is a bi-temporal and dynamic relational database.
Fully conformant with all Egeria profiles
Last tested on release 3.9 of Egeria, using release 3.9 of the connector and release 1.21.0 of XTDB.
Profile | Result |
---|---|
Metadata sharing | CONFORMANT_FULL_SUPPORT |
Reference copies | CONFORMANT_FULL_SUPPORT |
Metadata maintenance | CONFORMANT_FULL_SUPPORT |
Dynamic types | UNKNOWN_STATUS |
Graph queries | CONFORMANT_FULL_SUPPORT |
Historical search | CONFORMANT_FULL_SUPPORT |
Entity proxies | CONFORMANT_FULL_SUPPORT |
Soft-delete and restore | CONFORMANT_FULL_SUPPORT |
Undo an update | CONFORMANT_FULL_SUPPORT |
Reidentify instance | CONFORMANT_FULL_SUPPORT |
Retype instance | CONFORMANT_FULL_SUPPORT |
Rehome instance | CONFORMANT_FULL_SUPPORT |
Entity search | CONFORMANT_FULL_SUPPORT |
Relationship search | CONFORMANT_FULL_SUPPORT |
Entity advanced search | CONFORMANT_FULL_SUPPORT |
Relationship advanced search | CONFORMANT_FULL_SUPPORT |
Additional notes
- The entity search tests can fail on a particular long-running query pattern unless Lucene is configured: typically where a query by value or attribute is made without any restriction on the type of instances against which the query should run. Configure the connector with Lucene to avoid these timeouts.
- The Dynamic types profile currently does not have any tests defined, so it will be `UNKNOWN_STATUS` for all repositories and connectors.
Navigate to the latest snapshot directory, and within that find the latest connector archive with the name: `egeria-connector-xtdb-{version}-jar-with-dependencies.jar`
The connector is hosted in its own repository at odpi/egeria-connector-xtdb, where the source code can be cloned and the connector built from source.
The XTDB OMRS repository connector enables the use of XTDB (formerly known as "Crux") as a native Egeria repository, including its own pluggable architecture to support a variety of underlying storage back-ends such as S3, RocksDB, Apache Kafka, LMDB, JDBC and more.
XTDB supports temporal graph queries to provide native support for storing historical information and answering temporal queries. The connector is also capable of running as a highly-available service, including self-managing write consistency across a distributed (multi-node) cluster. In addition, currently this is the highest-performance open source persistent repository for Egeria across all operations: read, write, update, search and purge.
How it works¶
The XTDB OMRS Repository Connector is a native repository connector running on a metadata access store.
XTDB is started as an embedded process within the connector. It can be configured to use any of the pluggable persistence layers supported by XTDB, and communication between the Java code of the connector and XTDB (which is implemented in Clojure) occurs directly through the XTDB Java API (not via REST).
The repository connector (and metadata collection) methods of the repository connector interface simply invoke XTDB's Java API to read and write information to the underlying XTDB node.
XTDB itself handles write transactions and persistence guarantees via its APIs, ensuring that all data is at least recorded into the transaction log and document store prior to any write method returning.
Synchronous by default, but configurable for asynchronous operation
By default, the repository connector further awaits confirmation that any write has been indexed (and is therefore available for read operations) prior to returning. However, it is also possible to configure the connector in an "ingest-optimized" mode that allows the indexing to occur asynchronously, which can significantly improve the speed of write operations.
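To illustrate this interaction (a minimal sketch only, not the connector's actual code; the in-memory node, document ID and properties below are placeholders), the XTDB 1.x Java API is used along these lines:

```java
import xtdb.api.IXtdb;
import xtdb.api.IXtdbDatasource;
import xtdb.api.TransactionInstant;
import xtdb.api.XtdbDocument;
import xtdb.api.tx.Transaction;
import java.time.Duration;

public class XtdbWriteReadSketch {
    public static void main(String[] args) {
        // Start an embedded, in-memory XTDB node (the connector instead starts
        // the node from the configured persistence modules).
        try (IXtdb node = IXtdb.startNode()) {
            // Build a document keyed by an ID -- loosely analogous to a metadata instance.
            XtdbDocument doc = XtdbDocument.build("example-guid", b -> b
                    .put("type", "GlossaryTerm")
                    .put("displayName", "Example"));

            // Submit the write: XTDB records it in the transaction log and
            // document store before this call returns.
            TransactionInstant tx = node.submitTx(Transaction.buildTx(t -> t.put(doc)));

            // Synchronous mode: additionally wait until the write has been indexed,
            // so it is immediately visible to read operations.
            node.awaitTx(tx, Duration.ofSeconds(30));

            // Read the document back from the indexed datasource.
            IXtdbDatasource db = node.db();
            System.out.println(db.entity("example-guid"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```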
Configuration¶
The following options are used to configure this connector, as part of the "configure the local repository" step when configuring a metadata server.
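For example, a minimal sketch of setting this connector as the local repository through Egeria's admin services (the platform URL, user ID and server name here are placeholders, and the empty `xtdbConfig` would be replaced with your chosen persistence configuration):

```bash
# Placeholder platform URL, user and server name -- adjust for your environment.
curl -k -X POST \
  "https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/xtdb/local-repository/mode/plugin-repository/connection" \
  -H "Content-Type: application/json" \
  --data '{
    "class": "Connection",
    "connectorType": {
      "class": "ConnectorType",
      "connectorProviderClassName": "org.odpi.egeria.connectors.juxt.xtdb.repositoryconnector.XtdbOMRSRepositoryConnectorProvider"
    },
    "configurationProperties": {
      "xtdbConfig": { }
    }
  }'
```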
Pluggable persistence¶
There are many options for configuring XTDB itself. A list of the available persistence modules, and deeper configuration options for each, can be found in XTDB's own documentation.
To enable persistence, there are two options:
- send the JSON document configuration outlined in XTDB's own documentation directly to the `xtdbConfig` key of the `configurationProperties` property of Egeria's connector configuration
- send a string to the `xtdbConfigEDN` key of the `configurationProperties` property of Egeria's connector configuration, giving the EDN form of configuration outlined in XTDB's own documentation
Both approaches are valid and should be equally functional, but occasionally a bug may crop up that makes one or the other more or less feasible for a particular configuration.
Example persistence using JSON configuration
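A minimal sketch of such a JSON configuration (the directory paths are illustrative assumptions; see XTDB's own documentation for all options), combining a Lucene full-text index with RocksDB persistence for the index store, document store and transaction log:

```json
{
  "class": "Connection",
  "connectorType": {
    "class": "ConnectorType",
    "connectorProviderClassName": "org.odpi.egeria.connectors.juxt.xtdb.repositoryconnector.XtdbOMRSRepositoryConnectorProvider"
  },
  "configurationProperties": {
    "xtdbConfig": {
      "xtdb.lucene/lucene-store": {
        "db-dir": "data/servers/xtdb/lucene"
      },
      "xtdb/index-store": {
        "kv-store": {
          "xtdb/module": "xtdb.rocksdb/->kv-store",
          "db-dir": "data/servers/xtdb/rdb-index"
        }
      },
      "xtdb/document-store": {
        "kv-store": {
          "xtdb/module": "xtdb.rocksdb/->kv-store",
          "db-dir": "data/servers/xtdb/rdb-docs"
        }
      },
      "xtdb/tx-log": {
        "kv-store": {
          "xtdb/module": "xtdb.rocksdb/->kv-store",
          "db-dir": "data/servers/xtdb/rdb-tx"
        }
      }
    }
  }
}
```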
- Everything in the `xtdbConfig` block should follow the JSON configuration style defined by XTDB.
- It is highly recommended to include such a Lucene entry, as it offers significant performance improvements for any text-based queries.

    Some of the Lucene configuration will be automatically injected

    When using the JSON-based configuration, some additional entries will be automatically injected into the Lucene configuration by Egeria: specifically the `indexer` and `analyzer` entries used to configure the Lucene index optimally for the OMRS-level search interfaces that Egeria defines. If you have defined your own `analyzer` or `indexer` in the configuration, these will be overridden by the connector's injection process; in other words, any custom configuration you attempt for `analyzer` or `indexer` will be ignored.

- RocksDB is used in the rest of the configuration to act as the persistence layer for XTDB's index store, document store, and transaction log.
Example persistence using EDN configuration
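A minimal sketch of an equivalent EDN string (again with placeholder directory paths; the JDBC document store shown here uses XTDB's PostgreSQL dialect module purely as an illustration, so verify the module keys and the `:db-spec` against XTDB's documentation for your release):

```clojure
{:xtdb/index-store {:kv-store {:xtdb/module xtdb.rocksdb/->kv-store
                               :db-dir "data/servers/xtdb/rdb-index"}}
 :xtdb.lucene/lucene-store {:db-dir "data/servers/xtdb/lucene"
                            :indexer {:xtdb/module xtdb.lucene.egeria/->egeria-indexer}
                            :analyzer {:xtdb/module xtdb.lucene.egeria/->ci-analyzer}}
 :xtdb/tx-log {:kv-store {:xtdb/module xtdb.rocksdb/->kv-store
                          :db-dir "data/servers/xtdb/rdb-tx"}}
 ;; JDBC-backed document store: per the ordering note below, keep any
 ;; JDBC-configured service at the very end of the EDN configuration.
 :xtdb.jdbc/connection-pool {:dialect {:xtdb/module xtdb.jdbc.psql/->dialect}
                             :db-spec {:jdbcUrl "jdbc:postgresql://host:5432/xtdb?user=xtdb&password=xtdb"}}
 :xtdb/document-store {:xtdb/module xtdb.jdbc/->document-store
                       :connection-pool :xtdb.jdbc/connection-pool}}
```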
The Lucene configuration will NOT be automatically injected
Unlike the JSON-based configuration, when using the EDN-based configuration the necessary Egeria components of the Lucene configuration will not be automatically injected. Therefore, make sure that your EDN configuration string includes the following keys and settings in the Lucene configuration, in addition to the `:db-dir`:

```clojure
{:xtdb.lucene/lucene-store {
    :db-dir "data/servers/xtdb/lucene"
    :indexer {:xtdb/module xtdb.lucene.egeria/->egeria-indexer} ;; (1)
    :analyzer {:xtdb/module xtdb.lucene.egeria/->ci-analyzer}}} ;; (2)
```
- The `:indexer` defines the custom indexing that must be done for efficient Lucene searches covering both case-sensitive and case-insensitive scenarios.
- The `:analyzer` defines the mechanism to use for catering to these different search techniques against the custom indexing.
These configure the Lucene index optimally for the OMRS-level search interfaces that Egeria defines.
Ordering of the EDN configuration is important
When configuring one of the components to use JDBC connectivity (such as the document store in the example above), the ordering of the configured services in the EDN configuration is important. Any service using a JDBC configuration should be placed at the very end of the EDN configuration. See issue #246 for details.
You may need to download additional dependencies
In general, the dependent libraries for most persistence options (other than JDBC) are included in the connector `.jar` file itself. For JDBC, you will need to download the appropriate driver for your specific data store and make this `.jar` file available in the same directory as the connector.
For example, when using PostgreSQL you will need `org.postgresql:postgresql`.
You can generally determine the additional dependencies you will need by looking at the `project.clj` file of the relevant XTDB module, specifically its `:dependencies` section. For example, sticking with JDBC, here is the `project.clj`:
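As an illustrative sketch only (not the actual file contents), the `:dependencies` section of the JDBC module's `project.clj` includes driver entries along these lines, with the versions referenced in the notes below:

```clojure
;; Illustrative sketch, not the actual project.clj: the real file lists further
;; dependencies, and the versions may differ in the XTDB release you use.
:dependencies [[org.postgresql/postgresql "42.2.18"]   ;; (1)
               [com.oracle.ojdbc/ojdbc8 "19.3.0.0"]]   ;; (2)
```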
- This indicates that version `42.2.18` of the `postgresql` artifact within the `org.postgresql` group is needed to connect to PostgreSQL.
- Similarly, this indicates that version `19.3.0.0` of the `ojdbc8` artifact within the `com.oracle.ojdbc` group is needed to connect to Oracle.
Connector options¶
There are currently two configuration options for the connector itself:
Option | Description |
---|---|
`luceneRegexes` | Controls whether the connector will interpret unquoted regexes as Lucene-compatible (`true`) or not (`false`): in the latter case ensuring that we fall back to full Java regex checking (which will be significantly slower). |
`syncIndex` | Controls whether the connector will wait for the XTDB indexes to be updated before returning from write operations (`true`) or only that the writes are guaranteed to be persisted (`false`). |
Example configuration showing the default settings
```json
{
  "class": "Connection",
  "connectorType": {
    "class": "ConnectorType",
    "connectorProviderClassName": "org.odpi.egeria.connectors.juxt.xtdb.repositoryconnector.XtdbOMRSRepositoryConnectorProvider"
  },
  "configurationProperties": {
    "xtdbConfig": { },
    "luceneRegexes": true,
    "syncIndex": true // (1)
  }
}
```
- Be careful when setting `syncIndex` to false.

    When `syncIndex` is set to false, all write operations will return null

    The `syncIndex` parameter is intended for mass ingestion use only. As of v3.2 the connector makes use of XTDB's transaction function capability to ensure that all write operations are ACID compliant, even when done asynchronously. This guarantees that:

    - the write operation will be persisted and durable when the write operation returns
    - the write operation will be applied sequentially relative to any other (concurrent) write operations

    However, asynchronous mode explicitly means that the write operation will not be immediately indexed (for reading). When running this way, the resulting state of a given write operation will not be known until some point in the future, so a reliable result cannot be returned to the caller while the repository is running in asynchronous mode. Rather than returning something that is potentially incorrect, these write operations therefore always return `null` when operating in asynchronous mode.

    You should therefore be careful that, when using asynchronous mode, you are not relying on any functionality that makes direct use of the results of write operations, as those results will always be `null` in this mode. (And of course, since the write is asynchronous, there will be some period of time during which a read of that same object will return either no results or an older, stale version until the write operation has been indexed.)
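A sketch of the corresponding ingest-optimized settings (identical to the default example above except for `syncIndex`; the persistence configuration is elided):

```json
{
  "class": "Connection",
  "connectorType": {
    "class": "ConnectorType",
    "connectorProviderClassName": "org.odpi.egeria.connectors.juxt.xtdb.repositoryconnector.XtdbOMRSRepositoryConnectorProvider"
  },
  "configurationProperties": {
    "xtdbConfig": { },
    "luceneRegexes": true,
    "syncIndex": false
  }
}
```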
High availability¶
A sample Helm chart is provided for configuring the XTDB connector for high availability. This chart starts up a number of different elements and configures them in a specific sequence.
Sample requires a pre-existing JDBC database
The sample chart relies on a pre-existing JDBC database somewhere to use as the document store, and at the moment assumes this will be of a PostgreSQL variety (that is the only driver it downloads). A quick setup would be to use EnterpriseDB's k8s operator to quickly start up a PostgreSQL cluster in your kubernetes cluster, which is what the following diagrams illustrate.[^1]
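For instance, a minimal sketch of such a cluster using EnterpriseDB's operator might look like the following (the API version, resource name, namespace and sizing here are assumptions to check against the operator's own documentation):

```yaml
# Hypothetical example of a small PostgreSQL cluster managed by EnterpriseDB's
# k8s operator; adjust names, namespace and sizing for your environment.
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: edb
  namespace: egeria
spec:
  instances: 3          # one primary (edb-rw) plus standby replicas
  storage:
    size: 1Gi
```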
Startup¶
When it is first deployed, the Helm chart starts a number of pods and services: for Egeria (purple), Kafka (red), execution of the Performance Test Suite and a pod used for configuration. (As mentioned above, it assumes a pre-existing JDBC database: a vanilla PostgreSQL cluster (grey) deployed and managed independently by EnterpriseDB's k8s operator.)
Each XTDB pod runs its own separate OMAG Server Platform, in its own JVM, and a script in the `init-and-report` pod will wait until all three pods' OMAG Server Platforms are running before proceeding to any of the following steps. (The `headless` service allows each pod to be directly addressed, without load-balancing, to do such a check.)
Configure¶
The next script creates a single configuration document via the `pts` pod, and deploys this common configuration to each of the pods (again using the `headless` service to directly address each one individually): each will have a separate `xtdb` server configured with the same XTDB connector (same metadata collection id).
When the `/instance` endpoint is called against each pod to start the connector, each will create a local index and an instance of the `IXtdb` interface: all pointing to the same golden stores (in this example, Kafka and EDB) where all persistence for XTDB is handled. All servers will refer to the single `xtdb` load-balancing service as their root URL.
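As an illustration only (the pod address and admin user below are placeholders; in the chart this call is made by a script that addresses each pod individually through the `headless` service), starting the connector on one pod amounts to a call such as:

```bash
# Placeholder pod address and admin user -- the server name "xtdb" matches the
# common configuration document deployed to every pod.
curl -k -X POST \
  "https://xtdb-0.egeria-headless:9443/open-metadata/admin-services/users/garygeeke/servers/xtdb/instance"
```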
Run¶
Now when we start the Performance Test Suite, all traffic to the technology under test is routed via this `xtdb` load-balancing service, which will round-robin each request it receives to the underlying pods running the XTDB native repository connector.
Kafka has a similar service, which handles load-balancing across its own pods for all write operations.
The underlying JDBC cluster may have a similar load-balancing service again (e.g. if the data store uses sharding), but also may not. In this example the `edb-rw` service layer is instead an abstraction of the primary data store (`edb-1`): all writes will go to this primary data store, while the others act as secondary / standby servers to which EnterpriseDB automatically replicates data from the primary. If the primary pod fails, EnterpriseDB can re-point the `edb-rw` service layer to one of these existing secondary stores (which is automatically promoted to primary by EnterpriseDB).
Outages¶
Should there be any outage (in the example above, an Egeria pod, a Kafka pod, and an EnterpriseDB pod all going down) the Kubernetes services will simply stop routing traffic to those pods and the overall service will continue uninterrupted.
Depending on how the underlying services are managed, they may also be able to self-heal:
- Kafka is deployed as a `StatefulSet` in kubernetes, so if any pod fails kubernetes will automatically attempt to start another in its place to keep the total number of replicas defined by the `StatefulSet` running at all times.
- EnterpriseDB in our example was deployed through an operator: this operator self-heals any individual pod failure, e.g. starting another standby server pointing at the same `PersistentVolumeClaim` as the failed pod (to pick up the data that was already replicated), switching the primary server to one of the standby servers if the primary server fails, and so on.
Limitations¶
There are a number of limitations to be aware of with the high availability configuration:
Must use a non-embedded XTDB back-end
Write operations will only be consistent when using a non-embedded XTDB back-end: e.g. Kafka, S3, or JDBC.
Read operations are eventually consistent
Since the indexes are local to each pod, read operations will be eventually consistent: the specific pod to which a query is routed may not yet have updated its embedded index with the results of the very latest write operations from some other pod.
(Note in particular that this has a knock-on impact to our test suites, which currently assume immediate consistency: expect various scenarios to fail if you decide to run them against an eventually-consistent HA configuration.)
Cannot yet be dynamically scaled
Currently, configuring Egeria requires making a number of REST API calls, which limits how dynamic we can be in adding pods to or removing pods from an already-running cluster (in particular: we cannot rely on a readiness probe to indicate a pod's readiness to process actual work, only its readiness to be configured). We hope to address this soon by allowing configuration and startup to be done without relying on REST calls, at which point we should be able to support dynamically adding and removing pods from the cluster.
[^1]: For other databases, modify the `JDBC_DRIVER_URL` value in the `configmap.yaml` of the chart to point to the location of the appropriate driver, and replace the use of the `bin/bootstrapConfig.sh` script in the `init-and-report.yaml` template with an inline script in that template (to specify the appropriate XTDB configuration and JDBC dialect to use for the document store).