Skip to content

Metadata discovery and stewardship

Metadata discovery is an automated process that extracts metadata about a digital resource. This metadata may be:

  • embedded within the asset (for example a digital photograph has embedded metadata), or
  • managed by the platform that is hosting the asset (for example, a relational database platform maintains schema information about the data store in its databases), or
  • determined by analysing the content of the asset (for example a quality tool may analyse the data content to determine the types and range of values it contains and, maybe from that analysis, determine a quality score for the data).

Some metadata discovery may occur when the digital resource is first catalogued as an asset. Integrated cataloguing typically automates the creation the basic asset entry, its connection and optionally, its schema. This is sometimes called technical metadata.

Cataloguing database with integrated cataloguing

For example, the schema of a database may be catalogued through the Data Manager OMAS API. This schema may have been automatically extracted by an integration connector hosted in Egeria's Database Integrator OMIS.

The survey action services build on this initial cataloguing. They use advanced analysis to inspect the content of a digital resource to derive new insights that can augment or validate their catalog entry.

The results of this analysis is added to a survey report linked off of the asset for the digital resource.

The analysis results documented in the survey report report can either be automatically applied to the asset's catalog entry or it can go through a stewardship process where a subject-matter expert confirms the findings (or not).

Discovery and stewardship are the most advanced form of automation for asset cataloging. Egeria provides the server runtime environment and component framework to allow third parties to create survey action services and governance action implementations. It has only simple implementations of these components, mostly for demonstration purposes. This is an area where vendors and other open source projects are expected to provide additional value.

Survey action services

An survey action service is a component that performs analysis of the contents of a digital resource on request. The aim of the survey action service is to enable a detailed picture of the properties of a resource to be built up.

Each time a survey action service runs, it creates a new survey report linked off of the digital resource's Asset metadata element that records the results of the analysis.

Asset with survey reports

Each time an survey action service runs to analyse a digital resource, a new survey report is created and attached to the resource's asset. If the survey action service is run regularly, it is possible to track how the contents are changing over time.

The survey report contains one or more sets of related properties that the survey action service has discovered about the resource, its metadata, structure and/or content. These are stored in a set of annotations linked off of the survey report.

An survey action service is designed to run at regular intervals to gather a detailed perspective on the contents of the digital resource and how they are changing over time. Each time it runs, it is given access to the results of previously run survey-action services, along with a review of these findings made by individuals responsible for the digital resource (such as stewards, owners, custodians).

Operation of an survey action service

Operation of a survey action service

  1. Each time a survey action service runs, Egeria creates a survey report to describe the status and results of the survey action service's execution. The survey action service is passed a survey context that provides access to metadata.
  2. The survey context is able to supply metadata about the asset and create a connector to the digital resource using the connection information linked to the asset. The survey action service uses the connector to access the digital resource's contents in order to perform the analysis.
  3. The survey action service creates annotations to record the results of its analysis. It adds them to the survey context which stores them in open metadata attached to the survey report.
  4. The annotations can be reviewed and commented on through an external stewardship process. This means choices from, for example, a list of potential options proposed by the survey action services, can be verified and the best one selected by an individual expert. The resulting choices are added to annotation reviews attached to the appropriate annotations.
  5. The next time the survey action service runs, a new survey report is created to link new attachments.
  6. The survey context provides access to the existing attachments for that asset along with any annotation reviews. The survey action services is able to link its new annotations to the existing annotations as an annotation extension. This means that the stewards can see the history associated with the new information.
Runtime for an survey action service

Survey action services are packaged into Survey Action Engines that run in the Survey Action OMES hosted in an Engine Host.

The metadata repository interface for metadata discovery tools is implemented by the Asset Owner OMAS that runs in a Metadata Access Server.

A survey action service may be triggered via an Engine Action, a governance action type or as part of a governance action process.

Survey Action Service

Inside the survey report

The survey report structures the annotations in two ways:

  • Annotations that describe a characteristic of the whole digital resource.
  • Annotations that describe a characteristic of a single data field within the digital resource.

The annotations for the data fields are linked off of the schema attributes created by schema extraction.

Survey report structure

Survey actions

Survey actions can be used for the following types of analysis.

Schema extraction

For digital resources that include structured data, schema extraction documents the data fields present in the digital resource and if the schema is attached to the asset, it will attempt to match the data fields it finds to its schema attributes.

Schema extraction uses the schema analysis annotation. It is linked directly off of the survey report. The schema of the data in the digital resource is defined in a SchemaType linked from the digital resource's asset using the AssetSchemaType relationship. This may be established before the survey action service runs, or may be derived by the survey action service itself.

Resource profiling

Profiling analysis looks at the data values in the resource and summarizes their characteristics. There are three types of annotations used in resource profiling.

  • Resource Profile Annotation - Capture the characteristics of the data values stored in a specific resource.
  • Resource Profile Log Annotation - Capture the named of the log files where profile characteristics are captured. This is used when the profile results are too large to store in open metadata.
  • Fingerprint Annotation - Capture the characteristics of the resource and express it as a single value.

Resource profiling

For structured data, resource profiling needs to run after schema extraction to allow the resource profiling annotations that refer to a specific data field to be linked from the appropriate data field entity.

Data class discovery

Data class discovery captures the analysis on how close a data field matches the specification defined in a data class.

Data class discovery

The recommendation for a specific data class are stored in a data class annotation linked off of the appropriate data field. Data class discovery needs to run after schema extraction. It often builds on the information provided by resource profiling.

Subsequent stewardship - either automated or with human assistance - can confirm the correct assignment using the DataClassAssignment relationship.

Semantic discovery

Semantic discovery is attempting to define the meaning of the data values in the asset. The result is a recommended glossary term stored as a semantic annotation.

Semantic discovery

These annotations are the metadata discovery equivalent of the Informal Tag shown in 0150 - Feedback in Area 1. It typically takes confirmation by a subject-matter expert to convert this into a Semantic Assignment. Semantic discovery needs to run after schema extraction. It often builds on the information provided by data profiling and data class discovery.

Classification discovery

Classification discovery adds recommendations for new classifications that should either be added to the asset, or to a schema attribute in the asset. It uses the classification annotation to describe the classification and its properties. If the classification is for the asset, the classification annotation is linked off of the discovery analysis report. If it is for a specific schema attribute, it is linked off of the corresponding data field.

Classification discovery

Calculating quality scores

Quality scores describe how well the data values, typically in a data field, conform to a specification. For example, do the values match a list of valid values. This type of annotation is often used within a data quality program to provide assessments of the data for different purposes.

Quality Scores

Relationship discovery

Relationship discovery identifies relationships between different resources (or data fields), such as two columns that have a foreign key relationship.

It is possible to create the relationship as a relationship annotation or attach a relationship advice to the discovery analysis report.

Relationship discovery

Capturing measurements

The measure annotations capture a snapshot of the physical dimensions and activity levels at a particular moment in time. For example, it may calculate the size of the resource or the number of users accessing it.

Data source measurements

Requesting stewardship action

A RequestForAction entity (RfA) is used when a survey action service performs a test on the data (such as a quality rule) or has discovered an anomaly in the data landscape compared to its metadata that potentially needs a steward or a curator's action.

Request for action

The Stewardship Action OMAS is designed to respond to the requests for actions (RfAs).

Working with external engines

Survey action services may directly implement the analysis function or may invoke an external service to create the annotations.

Initiating stewardship

Stewardship is initiated either through the creation of a Request for Action annotation or when the discovery analysis report's status changes to COMPLETE.

Raise an issue or comment below