In development

A component that is in development means that the Egeria community is still building the function. The code is added continuously in small pieces to help the review and socialization process. It may not run, or do anything useful; it only promises not to break other functions. Expect to find git issues describing the end state.

Hive Metastore Repository Connector

This repository has been created to manage artifacts and issues relating to integration with Hive Metastore (HMS). This connector is based on the same polling pattern that the File sample OMRS connector uses.

Connector details

Important notice

The gradle JAR step includes some of the dependencies in the connector JAR, making it a semi-fat JAR. This ensures that additional dependencies are automatically deployed together with the connector.

Configuration

Repository Proxy Connector embedded configuration

Configure a Repository proxy with an embedded native repository

Configure the event mapper connector

Any open metadata repository that supports its own API must also implement an event mapper to ensure the Open Metadata Repository Services (OMRS) is notified when metadata is added to the repository without going through the open metadata APIs.

The event mapper is a connector that listens for proprietary events from the repository and converts them into calls to the OMRS. The OMRS then distributes this new metadata.

For the Hive Metastore (HMS) Repository Proxy Connector this Event mapper currently polls for Hive Metastore content. It may be enhanced in the future to also emit granular events to track the HMS metadata as it changes.

POST - configure event mapper

{{platformURLRoot}}/open-metadata/admin-services/users/{{adminUserId}}/servers/{{serverName}}/local-repository/event-mapper-details?connectorProvider={{fullyQualifiedJavaClassName}}&eventSource={{resourceName}}

The connectorProvider should be set to the fully-qualified Java class name for the connector provider, and the eventSource should give the details for how to access the events (for example, the hostname and port number of an Apache Kafka bootstrap server).
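For example, as a curl request. The platform URL, user id, server name, connector provider class name, and Thrift address below are illustrative placeholders, not values defined by this document:

```shell
# Hypothetical example: substitute your own platform URL, admin user,
# server name, connector provider class, and event source address.
curl -k -X POST \
  "https://localhost:9443/open-metadata/admin-services/users/garygeeke/servers/hmsproxy/local-repository/event-mapper-details?connectorProvider=com.example.hms.HMSOMRSEventMapperProvider&eventSource=thrift://localhost:9083"
```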

HMS connector configuration overview


The event mapper Endpoint address should be defined with the URL of the Thrift endpoint, like this:

"endpoint": {
"class": "Endpoint",
"address": "thrift://catalog.eu-de.dataengine.cloud.ibm.com:9083"
},

configurationProperties parameters

| Event mapper configuration parameter name | Default | Description |
|---|---|---|
| qualifiedNamePrefix | empty string | A prefix for the qualifiedName. This prefix is used on every entity that is created using this connector. |
| refreshTimeInterval | null | Poll interval in minutes. If null, only poll once at connector start time. |
| CatalogName | null | The HMS catalog name. |
| DatabaseName | null | The HMS database name. |
| sendPollEvents | true | Set this to true to send events to the cohort on every poll. |
| endpointAddress | null | URL to access the data that this metadata describes. |
| cacheIntoCachingRepository | true | Set this to false to not cache the metadata content. |
| securedProperties | null | If securedProperties need to be sent on the Connection entity, specify them as a JSON object with string properties. |
| includeDeployedSchema | false | When set to true, a DeployedDatabaseSchema entity is created between the Database and the database schema. Set this option if you know that members of the cohort expect to see a DeployedDatabaseSchema. |
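Putting these options together, the connector's configurationProperties section might look like the following sketch (all values are illustrative, not defaults):

```json
"configurationProperties": {
    "qualifiedNamePrefix": "data-engine::",
    "refreshTimeInterval": 60,
    "CatalogName": "hive",
    "DatabaseName": "default",
    "sendPollEvents": true,
    "cacheIntoCachingRepository": true,
    "endpointAddress": "thrift://catalog.example.com:9083",
    "includeDeployedSchema": false
}
```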

Setting CatalogName and DatabaseName

The setting of these two parameters dictates the scope of metadata that is ingested from HMS. For Hive, the default catalog name is hive and the default database name is default.

| CatalogName | DatabaseName | Scope of HMS content to be ingested |
|---|---|---|
| Name of a catalog | Name of a database | The HMS tables under the named database in the named catalog |
| null | null | All tables in all databases under all catalogs. If the HMS implementation does not support the getCatalogs API then an error is issued and the connector stops |
| null | Name of a database | All tables under the named database in the default catalog |
| Name of a catalog | null | The HMS tables under all the databases in the named catalog |
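The scoping rules in the table above can be sketched as a small decision function. This is an illustration only; ScopeResolver and describeScope are hypothetical names, not part of the connector's real API:

```java
// Hypothetical sketch of the CatalogName/DatabaseName scoping rules.
// A null argument means the corresponding configuration property is not set.
public class ScopeResolver {

    public static String describeScope(String catalogName, String databaseName) {
        if (catalogName != null && databaseName != null) {
            return "Tables under database '" + databaseName
                    + "' in catalog '" + catalogName + "'";
        }
        if (catalogName == null && databaseName == null) {
            // Requires the getCatalogs API; the connector stops with an
            // error if the HMS implementation does not support it.
            return "All tables in all databases under all catalogs";
        }
        if (catalogName == null) {
            return "Tables under database '" + databaseName + "' in the default catalog";
        }
        return "Tables under all databases in catalog '" + catalogName + "'";
    }
}
```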

Using with the IBM Cloud® Data Engine service.

To use this connector with the IBM Cloud® Data Engine service, the code needs to be recompiled to bring in the IBM HMS client library. To do this, specify the property ibmhms in the gradle build; on the command line:

./gradlew build -Pibmhms

The following additional security parameters need to be specified in the configurationProperties as the IBM Hive-compatible client uses a secure API to talk to the IBM Cloud®.

| Event mapper configuration parameter name | Default | Description |
|---|---|---|
| MetadataStoreUserId | null | The Data Engine service crn |
| MetadataStorePassword | null | The API key |
| useSSL | false | Set to true |
| CatalogName | null | Set to "spark" |
| DatabaseName | null | Set to "default" |

Using with Hive Metastore version 4.

At the time of writing the latest HMS version 4 client is 4.0.0-alpha-2. This client is required to communicate with the HMS server 4.0.0-alpha-2. To build with this HMS 4.0.0-alpha-2 client specify property hmsv4 in the gradle build; on the command line specify:

./gradlew build -Phmsv4

The logic to extract the columns for the HMS Table.

The IBM Cloud® Data Engine has HMS tables that describe IBM Cloud® object storage data. In this case the HMS table is an external table, and the columns are described in HMS table parameters in an Apache Spark™ format.

For external tables (currently only verified using the IBM Cloud® Data Engine service), the HMS connector:

1. Checks the table parameter spark.sql.sources.schema.numParts. If specified, this is the number of Spark SQL sources schema parts that are present. The schema parts are spark.sql.sources.schema.part.0 ... spark.sql.sources.schema.part.n, where n is one less than the number of parts. The HMS connector stitches these parts together to form valid JSON and then extracts the column information from that JSON. This is a Spark v2 format.
2. If spark.sql.sources.schema.numParts is not specified, then the HMS connector extracts the columns from spark.sql.sources.schema. This is a Spark v3 format.

For all other cases, and for external tables without spark.sql.sources.schema.numParts and spark.sql.sources.schema, the code looks in the HMS table storage descriptor cols property to get the columns.
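The part-stitching step can be illustrated with a short sketch. SparkSchemaStitcher and stitchSchema are hypothetical names; the real connector logic may differ:

```java
import java.util.Map;

// Illustrative sketch of reassembling a Spark schema from HMS table
// parameters, as described above. Not the connector's actual code.
public class SparkSchemaStitcher {

    public static String stitchSchema(Map<String, String> tableParameters) {
        String numPartsValue = tableParameters.get("spark.sql.sources.schema.numParts");
        if (numPartsValue == null) {
            // Spark v3 format: the whole schema is in a single parameter
            // (may be null if neither form is present).
            return tableParameters.get("spark.sql.sources.schema");
        }
        // Spark v2 format: the schema JSON is split across numbered parts
        // part.0 ... part.(numParts - 1); concatenating them restores it.
        int numParts = Integer.parseInt(numPartsValue);
        StringBuilder json = new StringBuilder();
        for (int i = 0; i < numParts; i++) {
            json.append(tableParameters.get("spark.sql.sources.schema.part." + i));
        }
        return json.toString();
    }
}
```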

We can imagine that other scenarios might require the above logic to be amended; for example, data stored in HDFS and exposed as a managed table using Spark data sources. If new scenarios occur that can be tested by the community, then the above code can be extended.

Design

Components

The high-level architecture of the connector is shown in the Caching Repository proxy components diagram.

It shows how the event mapper polling loop:

  • Gets the Hive metastore information from the Hive metastore.
  • Adds the appropriate reference entities and relationships to the repository connector.
  • Finds the entities and relationships per asset (Database).
  • Sends a batched event per asset.
  • Waits for the length of time specified in the refreshTimeInterval configuration parameter.
  • Repeats.
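The steps above can be sketched as a loop. This is an illustration only: the pollOnce callback stands in for the real connector logic (fetch HMS content, update the repository, send a batched event per asset), and maxPolls exists purely so the sketch terminates; all names are hypothetical:

```java
// Hypothetical sketch of the event mapper polling loop; not the real code.
public class PollingLoopSketch {

    public static int runPollingLoop(int maxPolls,
                                     Long refreshTimeIntervalMinutes,
                                     Runnable pollOnce) {
        int polls = 0;
        while (polls < maxPolls) {
            pollOnce.run();   // fetch HMS content, update repository, send batched events
            polls++;
            if (refreshTimeIntervalMinutes == null) {
                // A null refreshTimeInterval means poll once at start, then stop.
                break;
            }
            try {
                // Wait for the configured number of minutes before repeating.
                Thread.sleep(refreshTimeIntervalMinutes * 60_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return polls;
    }
}
```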

Working with Hive Metastore and its APIs.

The Hive Metastore can be run as a standalone server. This standalone server jar file is also required for the client API. The HMS client API used is https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java; it uses the Thrift API to communicate with the Hive Metastore. At this time (July 2022), version 3.1.3 of the Hive Metastore has vulnerabilities; a number of excludes were required in the gradle build file to ensure the inappropriate vulnerable libraries are not present, as reported by Sonarscan and Lift.

HMS Client calls used:

| HMS Client call | Description |
|---|---|
| client.getCatalogs() | Get all the catalog names |
| client.getAllDatabases() | Get all the database names under the default catalog |
| client.getAllDatabases(&lt;catName&gt;) | Get all the database names under the named catalog |
| client.getTables(&lt;catName&gt;, &lt;dbName&gt;, "*") | Get all the tables from the catalog with name &lt;catName&gt; and database with name &lt;dbName&gt; |
| client.getTable(&lt;catName&gt;, &lt;dbName&gt;, &lt;tableName&gt;) | Get the table details for table named &lt;tableName&gt; in catalog &lt;catName&gt;; the returned table contains the column details |

Hive Metastore mapping to Egeria OMRS open types

Egeria has an open type called Database; we map this to the Hive database. Note that at Hive 3 there is a higher-level concept called a catalog that holds databases.

Entity Types

| HMS concept | Description | Egeria open Entity type | Comments |
|---|---|---|---|
| Catalog | Higher-level container within Hive | Not modeled | The getCatalogs API is not always present in all HMS implementations |
| Database | Lives within a Catalog | Database | |
| n/a | n/a | Connection | Represents the connection to the instance data |
| n/a | n/a | ConnectionType | The type of the connection |
| n/a | n/a | Endpoint | Where the endpoint information is stored |
| n/a | n/a | DeployedDatabaseSchema | Deployed schema |
| n/a | n/a | RelationalDBSchemaType | Database schema type |
| Table | Lives within a Database | RelationalTable | Relational table |
| Column | Lives within a Table | RelationalColumn | Relational column |

Relationship Types

| Egeria open Relationship type | Comments |
|---|---|
| ConnectionEndpoint | Relationship between Connection and Endpoint |
| ConnectionConnectorType | Relationship between Connection and ConnectorType |
| ConnectionToAsset | Relationship between Connection and Asset |
| AssetSchemaType | Relationship between Database (the asset) and schema type |
| AttributeForSchema | Relationship between the RelationalTable and RelationalColumn |
| DataContentForDataSet | Relationship between DeployedDatabaseSchema and RelationalDBSchemaType |

Classification Types

| HMS concept | Description | Egeria open Classification type | Comments |
|---|---|---|---|
| Hive table type | If this is VIRTUAL_VIEW then this is a view rather than a table | CalculatedValue | The RelationalTable is classified with CalculatedValue for a view |
| fieldSchema type (for columns) | The type of the Hive column (e.g. string) | TypeEmbeddedAttribute | Contains the type of the column |
| n/a (for tables) | n/a | TypeEmbeddedAttribute | The type of the table |

Using with the Hive Metastore listener

The Hive metastore listener runs in the HMS process as an HMS listener. When HMS tables are added, dropped, or altered, appropriate Egeria events are issued. It sends these granular events so that the Egeria cohort members are kept up to date as the HMS metadata changes.

The HMS connector needs to have run prior to the listener, so that the Egeria cohort already has existing entities that the HMS tables can be associated with. The listener needs to be configured as the same cohort member as the connector.

| HMS configuration parameter | Description |
|---|---|
| EgeriaListener.metadataCollectionId | The metadata collection id of the HMS connector |
| EgeriaListener.serverName | The server name of the HMS connector |
| EgeriaListener.organisationName | Organisation name to match the HMS connector |
| EgeriaListener.qualifiedNamePrefix | Qualified name prefix to match the HMS connector |
| EgeriaListener.events.kafka.topicname | The cohort topic name that the HMS connector writes to |
| EgeriaListener.events.kafka.clientId | The client ID of the HMS connector |
| EgeriaListener.events.kafka.bootstrapServerurl | The bootstrap server URL for the Kafka defined for the HMS connector |
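As a sketch, these parameters might be supplied as custom properties in the metastore's configuration. The placement in hive-site.xml is an assumption, and all values below are illustrative:

```xml
<!-- Illustrative values only; property names are from the table above. -->
<property>
  <name>EgeriaListener.metadataCollectionId</name>
  <value>2c7f9a2e-0000-0000-0000-000000000000</value>
</property>
<property>
  <name>EgeriaListener.events.kafka.topicname</name>
  <value>egeria.omag.openmetadata.repositoryservices.cohort.myCohort.OMRSTopic</value>
</property>
<property>
  <name>EgeriaListener.events.kafka.bootstrapServerurl</name>
  <value>localhost:9092</value>
</property>
```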



License: CC BY 4.0, Copyright Contributors to the ODPi Egeria project.