
Integration Guide

Many tools and platforms today maintain a store of metadata. This metadata drives the behaviour of the technology. For example:

  • A database server maintains the structure of its databases in a schema catalog. This schema is used to structure queries and optimize the extraction of data.
  • Many tools that enable their users to work with data keep metadata about the known data sources. This helps their users find the data they need and work with it. The example below shows an analytics workbench that data scientists may use to build analytics models.

Third party technology metadata

Egeria's integration guide helps you to integrate third party technologies into the open metadata ecosystem. The technology may be an open source technology, a home-grown technology or a software product. This does not matter. The integration may be to import metadata into the open metadata ecosystem, or to export it to the third party technology, or both. Egeria's integration capabilities are extremely flexible. The only limitations to metadata exchange between a third party technology and the open metadata ecosystem derive from the capabilities of the external APIs and events provided by the third party technology.

Metadata supply chains

The open metadata ecosystem collects, links and disseminates metadata from many sources. However, it is designed in an iterative, agile manner, adding new use cases and capabilities over time.

Each stage of development considers a particular source of metadata and where it needs to be distributed to. Consider this scenario...

Database schema capture and distribution

There is a database server (Database Server 1) that is used to store application data that is of interest to other teams. An initiative is started to automatically capture the schemas of the databases on this database server. This schema information will be replicated to two destinations:

  • Another database server (Database Server 2) is used by a data science team as a source of data for their work. An ETL job runs every day to refresh the data in this second database with data from the first database. The data is anonymized by the ETL job, but the schema and data profile remain consistent. If the schema in the first database changes, the ETL job is updated at the same time. However, the schema in the second database is not updated because the team making the change do not have access to it. Nevertheless, it must be updated consistently before the ETL job runs; otherwise the job will fail.
  • The analytics tool that is also used by the data science team has a catalog of data sources to show the data science team what data is available. This needs to be kept consistent with the structure of the databases. The tool does provide a feature to refresh any data source schema in its catalog, but the team are often unaware of changes to their data sources, or simply forget to do it, and only discover the inconsistency when their models fail to run properly.

metadata supply chain scenario

The integration of these third party technologies with the open metadata ecosystem can be thought of as having four parts.

  1. Any changes to the database schema are extracted from Database Server 1 and published to the open metadata ecosystem.
  2. The new schema information from Database Server 1 is detected in the open metadata ecosystem and deployed to Database Server 2.
  3. The changes to Database Server 2's schema are detected and published to the open metadata ecosystem.
  4. The new schema information from Database Server 2 is detected in the open metadata ecosystem and distributed to the Analytics Workbench.

metadata supply chain integration points

Four integration steps to capture and distribute the database schema metadata from Database Server 1.

Here is another view of the process, but shown as a flow from left to right.

Metadata supply chain integration points

At each stage there is a trigger (typically the detection of a change), then the metadata is assembled and updated, and finally the result is made visible through the open metadata ecosystem.

Metadata Update Specification Pattern

A three-step specification pattern of Trigger, Maintain Metadata and Make Visible.
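Expressed as code, the pattern looks something like the following Java outline. This is a minimal sketch: the class and helper names (SchemaSynchronizerSketch, fetchLiveSchema and so on) are hypothetical placeholders rather than Egeria APIs, but the control flow follows the three steps.

import java.util.Objects;

/*
 * Minimal sketch of the Trigger / Maintain Metadata / Make Visible pattern.
 * All names are hypothetical placeholders, not Egeria APIs.
 */
public class SchemaSynchronizerSketch
{
    /* Trigger: invoked when a change event arrives, or on a timed refresh. */
    public void refresh(String databaseName)
    {
        /* Maintain Metadata: read the live schema and the catalogued schema ... */
        String liveSchema       = fetchLiveSchema(databaseName);
        String cataloguedSchema = fetchCataloguedSchema(databaseName);

        /* ... and act only when they have drifted apart. */
        if (! Objects.equals(liveSchema, cataloguedSchema))
        {
            /* Make Visible: publish the updated schema to the open metadata
             * ecosystem where all interested consumers can see it. */
            publishSchema(databaseName, liveSchema);
        }
    }

    private String fetchLiveSchema(String databaseName)       { return "";  /* query the database server */ }
    private String fetchCataloguedSchema(String databaseName) { return "";  /* query open metadata */ }
    private void   publishSchema(String databaseName, String schema) { /* update open metadata */ }
}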

The implementation of the three-step pattern for each part of the integration is located in an integration connector. Integration connectors are configurable components that are designed to work with a specific third party technology. There would be 4 configured integration connectors to support the scenario above; however, parts 1 and 3 share the same connector implementation: 2 instances of it, each configured to work with a different database server.
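For illustration, the two instances would share one connector provider class and differ only in the endpoints of their connections. The connection fragment below is a sketch: the provider class name and address are hypothetical placeholders, but the structure matches the connections shown in the status report later in this guide.

{
    "class": "Connection",
    "connectorType": {
        "class": "ConnectorType",
        "connectorProviderClassName": "org.example.DatabaseSchemaMonitorProvider"
    },
    "endpoint": {
        "class": "Endpoint",
        "address": "databaseServer1NetworkAddress"
    }
}

The connection for the part 3 instance would be identical except that its endpoint address points at Database Server 2.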

The integration connectors supplied with Egeria are described in the connector catalog. It is also possible to write your own integration connectors if the ones supplied by Egeria do not meet your needs.

Integration connectors run in the Integration Daemon. It is possible to have all 4 integration connectors running in the same integration daemon. Alternatively, they each may run in a different integration daemon - or any combination in between. The choice is determined by the organization of the teams that will operate the service. For example, if this metadata synchronization process was run by a centralized team, then all 4 integration connectors would probably run in the same integration daemon. If the work is decentralized, the integration connector for part 1 may be in an integration daemon operated by the same team that operates Database Server 1. The other integration connectors may run together in an integration daemon operated by the team that operates Database Server 2 and the Analytics Workbench.

The diagram below shows the decentralized option.

Decentralized deployment

This type of deployment choice keeps control of the metadata integration with the teams that own the third party technology, and so upgrades, back-ups and outages can be coordinated.

The implementation of the open metadata ecosystem that connects the integration daemons can also be centralized or decentralized. This next diagram shows two integration daemons connecting into a centralized metadata access store that provides the open metadata repository.

Centralized metadata store

Alternatively, each team could have their own metadata access store, giving them complete control over their metadata. The two metadata access stores are connected via an Open Metadata Repository Cohort (or just "cohort" for short). The cohort enables the two metadata access stores to operate as one logical metadata store.

Decentralized metadata stores with cohort

The behaviour of the integration daemons is unaffected by the deployment choice made for the metadata access stores.

Adding lineage

In the scenario above, data from Database Server 1 is extracted, anonymized and stored in Database Server 2 by an ETL job running in an ETL engine.

Role of the ETL engine

The data science team want to know the source of each of the databases they are working with. The metadata that describes the source of data is called lineage. Ideally it is captured by the ETL engine to ensure it is accurate.

ETL engines have a long history of capturing lineage, since it is a common requirement in regulated industries. The diagram below shows three choices for how an ETL engine may handle its lineage metadata.

  • In the first box on the left, the ETL engine has its own metadata repository and so it is integrated into the open metadata ecosystem via the integration daemon (in the same way as the database servers and the analytics workbench).
  • In the middle box, the ETL engine is producing lineage events that follow the OpenLineage standard. The integration daemon has native support for this standard, so the ETL engine can send these events directly to the integration daemon, which passes them on to any integration connector that is configured to receive them. An example event is shown below the diagram.
  • The final box on the right-hand side shows an ETL engine that is part of a suite of tools that share a metadata repository. These types of third party metadata repositories often have a wide variety of metadata that is good for many use cases. So, although it is possible to integrate them through the integration connectors running in the integration daemon, it is also possible to connect them directly into the cohort via a Repository Proxy. This is a more complex integration to perform. However, it has the benefit that the metadata stored in the third party metadata repository is logically part of the open metadata ecosystem and available through any of the open metadata and governance APIs without needing to copy its metadata into a metadata access store.

Choices when integrating lineage
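To give a feel for what flows in the middle option, here is a minimal OpenLineage run event. The structure follows the public OpenLineage specification; the run, job and dataset identifiers are invented for this scenario and the producer URL is a placeholder.

{
    "eventType": "COMPLETE",
    "eventTime": "2022-11-06T21:48:00.000Z",
    "run": {
        "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
    },
    "job": {
        "namespace": "etl-engine",
        "name": "database1-to-database2-daily-refresh"
    },
    "inputs": [
        {
            "namespace": "database-server-1",
            "name": "application-database-schema"
        }
    ],
    "outputs": [
        {
            "namespace": "database-server-2",
            "name": "anonymized-database-schema"
        }
    ],
    "producer": "https://example.org/etl-engine"
}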

Summary

In this guide you have seen that integration with the open metadata ecosystem is built up iteratively using integration connectors running in an integration daemon. Open metadata is stored in metadata access stores and shared across the open metadata ecosystem using a cohort. It is also possible to plug in a third party metadata repository using a repository proxy.

Complete integration solution

Inside the Integration Daemon

Recap: The Integration Daemon is an Egeria OMAG Server that sits at the edge of the open metadata ecosystem, synchronizing metadata with third party tools. It is connected to a Metadata Access Server that provides the APIs and events to interact with the open metadata ecosystem.

Integration Daemon

The integration can be:

  • Triggered by an event from a third party technology that indicates that metadata needs to be updated in the open metadata ecosystem to make it consistent with the third party technology's configuration.

  • Triggered at regular intervals so that the consistency of the open metadata ecosystem with the third party technology can be verified and, where necessary, corrected.

  • Triggered by a change in the open metadata ecosystem indicating that changes need to be replicated to the third party technology.

Running in the integration daemon are integration connectors that each support the API of a specific third party technology. The integration daemon starts and stops the integration connectors and provides them with access to the open metadata ecosystem APIs. Its action is controlled by configuration, so you can set it up to exchange metadata with a wide range of third party technologies.

An integration connector is specialized for a particular technology. The integration daemon provides specialized services focused on different types of technology, in order to simplify the work of the integration connector. These specialized services are called the Open Metadata Integration Services (OMISs). Each integration connector is paired with an OMIS, and the OMIS is in turn paired with a relevant Open Metadata Access Service (OMAS) running in a Metadata Access Server.

Inside Integration Daemon
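As a sketch of how a connector plugs into this structure, the outline below shows the touch points: the daemon supplies a context scoped to the paired OMIS, then drives the connector through its triggers. The class and method names here are hypothetical placeholders, not the real Egeria interfaces; the connector catalog documents the actual base classes for each OMIS.

/*
 * Hypothetical outline (not the real Egeria API) of an integration
 * connector and its pairing with an OMIS.
 */
public class FilesMonitorConnectorSketch
{
    private FilesContextSketch context;   /* context scoped to the paired OMIS */

    /* The integration daemon injects the OMIS context before starting
     * the connector. */
    public void setContext(FilesContextSketch context)
    {
        this.context = context;
    }

    /* Trigger 1: the daemon calls refresh() at the configured interval so
     * the connector can verify and correct the open metadata. */
    public void refresh()
    {
        context.maintainFileMetadata();
    }

    /* Trigger 2: events from the third party technology (or from the open
     * metadata ecosystem) arrive here and are turned into metadata updates. */
    public void processEvent(String event)
    {
        context.maintainFileMetadata();
    }

    /* Hypothetical stand-in for an OMIS context: it relays requests to the
     * paired OMAS running in the metadata access server. */
    static class FilesContextSketch
    {
        void maintainFileMetadata() { /* call the paired OMAS */ }
    }
}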

Configuring the integration daemon

The integration daemon's configuration contains a list of integration connectors that the integration daemon is to run. The configuration for each integration connector describes the connector implementation to use, how often to call it and the open metadata services that it needs.

The connector catalog lists the integration connectors that are part of Egeria. They are organized by technology type. In addition, you can write your own integration connectors for third party technologies currently not supported in the catalog. The catalog entry for each integration connector provides the information needed to configure it in an integration daemon. This includes details of the integration service it uses, where the implementation is located and the configuration options it supports. See the Data Files Monitor Integration Connector entry as an example.
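To make this concrete, configuring the Files Integrator OMIS with one of the connectors from the example in the next section would use a request of the following general shape against the integration daemon's admin services. Treat the URL and the request body field names as an illustration only; the exact format is described in the admin services documentation, and the {{accessServerPlatformURLRoot}} and {{accessServerName}} placeholders stand for the location of the paired Metadata Access Server.

POST {{serverURLRoot}}/open-metadata/admin-services/users/{{userId}}/servers/{{serverName}}/integration-services/files-integrator

{
    "class": "IntegrationServiceRequestBody",
    "omagserverPlatformRootURL": "{{accessServerPlatformURLRoot}}",
    "omagserverName": "{{accessServerName}}",
    "integrationConnectorConfigs": [
        {
            "class": "IntegrationConnectorConfig",
            "connectorName": "OakDeneLandingAreaFilesMonitor",
            "refreshTimeInterval": 10,
            "connection": {
                "class": "Connection",
                "connectorType": {
                    "class": "ConnectorType",
                    "connectorProviderClassName": "org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFilesMonitorIntegrationProvider"
                },
                "endpoint": {
                    "class": "Endpoint",
                    "address": "data/landing-area/hospitals/oak-dene/clinical-trials/drop-foot"
                }
            }
        }
    ]
}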


Validating your integration

Once the integration daemon is configured it can be started. It will read the configuration and start the integration connectors described within it.

Once it is running, the following command shows the status of the integration connectors:

GET {{serverURLRoot}}/servers/{{serverName}}/open-metadata/integration-daemon/users/{{userId}}/status
Here is an example of the result. The integration connectors are organized by the integration services that provide them with access to the open metadata APIs. In this case, there are three file-oriented integration connectors running under the Files Integrator OMIS.

{
    "class": "IntegrationDaemonStatusResponse",
    "relatedHTTPCode": 200,
    "integrationServiceSummaries": [
        {
            "integrationServiceId": 605,
            "integrationServiceFullName": "Files Integrator OMIS",
            "integrationServiceURLMarker": "files-integrator",
            "integrationServiceDescription": "Extract metadata about files stored in a file system or file manager.",
            "integrationServiceWiki": "https://egeria-project.org/services/omis/files-integrator/overview/",
            "integrationConnectorReports": [
                {
                    "connectorId": "7f4d641d-71cc-4df6-b7b9-aed29e3bdf7f",
                    "connectorName": "OakDeneLandingAreaFilesMonitor",
                    "connection": {
                        "class": "Connection",
                        "headerVersion": 0,
                        "connectorType": {
                            "class": "ConnectorType",
                            "headerVersion": 0,
                            "connectorProviderClassName": "org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFilesMonitorIntegrationProvider"
                        },
                        "endpoint": {
                            "class": "Endpoint",
                            "headerVersion": 0,
                            "address": "data/landing-area/hospitals/oak-dene/clinical-trials/drop-foot"
                        }
                    },
                    "connectorInstanceId": "6471c86d-0f2f-4cd1-ad90-d0f396a1d31e",
                    "connectorStatus": "RUNNING",
                    "lastStatusChange": "2022-11-06T21:48:30.032+00:00",
                    "lastRefreshTime": "2022-11-06T21:48:38.893+00:00",
                    "minMinutesBetweenRefresh": 10
                },
                {
                    "connectorId": "0b64de6f-6cb6-46ac-844a-012bd0b96a9d",
                    "connectorName": "OldMarketLandingAreaFilesMonitor",
                    "connection": {
                        "class": "Connection",
                        "headerVersion": 0,
                        "connectorType": {
                            "class": "ConnectorType",
                            "headerVersion": 0,
                            "connectorProviderClassName": "org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFilesMonitorIntegrationProvider"
                        },
                        "endpoint": {
                            "class": "Endpoint",
                            "headerVersion": 0,
                            "address": "data/landing-area/hospitals/old-market/clinical-trials/drop-foot"
                        }
                    },
                    "connectorInstanceId": "4ab6cbec-69fd-4c8f-bea4-c92b861bfe0b",
                    "connectorStatus": "FAILED",
                    "lastStatusChange": "2022-11-06T21:48:38.898+00:00",
                    "lastRefreshTime": "2022-11-06T21:48:38.898+00:00",
                    "minMinutesBetweenRefresh": 10,
                    "failingExceptionMessage": "BASIC-FILES-INTEGRATION-CONNECTORS-404-001 The directory named data/landing-area/hospitals/old-market/clinical-trials/drop-foot does not exist"
                },
                {
                    "connectorId": "51545de2-8776-453c-af8a-0a4dadad5c46",
                    "connectorName": "DropFootClinicalTrialResultsFolderMonitor",
                    "connection": {
                        "class": "Connection",
                        "headerVersion": 0,
                        "connectorType": {
                            "class": "ConnectorType",
                            "headerVersion": 0,
                            "connectorProviderClassName": "org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFolderMonitorIntegrationProvider"
                        },
                        "endpoint": {
                            "class": "Endpoint",
                            "headerVersion": 0,
                            "address": "data/data-lake/research/clinical-trials/drop-foot/weekly-measurements"
                        }
                    },
                    "connectorInstanceId": "c6452f6a-620d-49b2-9500-af260403feef",
                    "connectorStatus": "RUNNING",
                    "lastStatusChange": "2022-11-06T21:48:38.945+00:00",
                    "lastRefreshTime": "2022-11-06T21:48:39.000+00:00",
                    "minMinutesBetweenRefresh": 10
                }
            ]
        }
    ]
}

This table summarises the results shown above:

| Connector name | Status | Error message | Connector implementation |
|---|---|---|---|
| OakDeneLandingAreaFilesMonitor | RUNNING | | org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFilesMonitorIntegrationProvider |
| OldMarketLandingAreaFilesMonitor | FAILED | BASIC-FILES-INTEGRATION-CONNECTORS-404-001 The directory named data/landing-area/hospitals/old-market/clinical-trials/drop-foot does not exist | org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFilesMonitorIntegrationProvider |
| DropFootClinicalTrialResultsFolderMonitor | RUNNING | | org.odpi.openmetadata.adapters.connectors.integration.basicfiles.DataFolderMonitorIntegrationProvider |

Once any problems have been corrected, the integration service can be restarted, which will restart the integration connectors:

POST {{serverURLRoot}}/servers/{{serverName}}/open-metadata/integration-daemon/users/{{userId}}/integration-services/files-integrator/restart


