Metadata discovery and stewardship¶
Metadata discovery is an automated process that extracts metadata about a digital resource. This metadata may be:
- embedded within the asset (for example, a digital photograph has embedded metadata), or
- managed by the platform that is hosting the asset (for example, a relational database platform maintains schema information about the data stored in its databases), or
- determined by analysing the content of the asset (for example, a quality tool may analyse the data content to determine the types and range of values it contains and, from that analysis, determine a quality score for the data).
Some metadata discovery may occur when the digital resource is first catalogued as an asset. Integrated cataloguing typically automates the creation of the basic asset entry, its connection and, optionally, its schema. This is sometimes called technical metadata.
Cataloguing database with integrated cataloguing
For example, the schema of a database may be catalogued through the Data Manager OMAS API. This schema may have been automatically extracted by an integration connector hosted in Egeria's Database Integrator OMIS.
The open discovery services build on this initial cataloguing. They use advanced analysis to inspect the content of a digital resource and derive new insights that can augment or validate its catalog entry.
The results of this analysis are added to a discovery analysis report linked off of the asset for the digital resource.
The analysis results documented in the discovery analysis report can either be applied automatically to the asset's catalog entry, or they can go through a stewardship process where a subject-matter expert confirms the findings (or not).
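The choice between automatic application and stewardship can be pictured with a small sketch. It is illustrative only: `Finding`, `CatalogEntry`, `route_findings` and the confidence threshold are hypothetical names and assumptions made for this example, not part of Egeria's APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One result from a discovery analysis report (hypothetical structure)."""
    property_name: str
    proposed_value: str
    confidence: float  # 0.0 to 1.0; assumed to be supplied by the discovery service


@dataclass
class CatalogEntry:
    """Stand-in for the asset's catalog entry."""
    properties: dict = field(default_factory=dict)


def route_findings(findings, entry, auto_apply_threshold=0.9):
    """Apply confident findings directly; return the rest for steward review."""
    review_queue = []
    for finding in findings:
        if finding.confidence >= auto_apply_threshold:
            # Automatic path: update the catalog entry without human review.
            entry.properties[finding.property_name] = finding.proposed_value
        else:
            # Stewardship path: a subject-matter expert confirms or rejects it.
            review_queue.append(finding)
    return review_queue
```

In this sketch the threshold is the only policy knob. A real deployment might instead route by annotation type or by the sensitivity of the asset.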
Discovery and stewardship are the most advanced form of automation for asset cataloging. Egeria provides the server runtime environment and component framework to allow third parties to create discovery services and governance action implementations. It has only simple implementations of these components, mostly for demonstration purposes. This is an area where vendors and other open source projects are expected to provide additional value.
Open discovery services¶
An open discovery service is a component that performs specific analysis of the contents of a digital resource on request. The aim of the open discovery service is to enable a detailed picture of the properties of a resource to be built up.
Each time an open discovery service runs to analyse a digital resource, a new discovery analysis report is created and attached to the resource's asset. If the open discovery service is run regularly, it is possible to track how the contents are changing over time.
The discovery analysis report contains one or more sets of related properties that the discovery service has discovered about the resource, its metadata, structure and/or content. These are stored in a set of discovery annotations linked off of the discovery analysis report.
An open discovery service is designed to run at regular intervals to gather a detailed perspective on the contents of the digital resource and how they are changing over time. Each time it runs, it is given access to the results of previously run open discovery services, along with a review of these findings made by individuals responsible for the digital resource (such as stewards, owners, custodians).
Operation of an open discovery service
- Each time an open discovery service runs, Egeria creates a discovery analysis report to describe the status and results of the open discovery service's execution. The open discovery service is passed a discovery context that provides access to metadata.
- The discovery context is able to supply metadata about the asset and create a connector to the digital resource using the connection information linked to the asset. The discovery service uses the connector to access the digital resource's contents in order to perform the analysis.
- The discovery service creates discovery annotations to record the results of its analysis. It adds them to the discovery context which stores them in open metadata attached to the discovery analysis report.
- The discovery annotations can be reviewed and commented on through an external stewardship process. This means choices from, for example, a list of potential options proposed by the discovery services, can be verified and the best one selected by an individual expert. The resulting choices are added to annotation reviews attached to the appropriate annotations.
- The next time the open discovery service runs, a new discovery analysis report is created to link new attachments.
- The discovery context provides access to the existing attachments for that asset along with any annotation reviews. The discovery service is able to link its new annotations to the existing annotations as an annotation extension. This means that the stewards can see the history associated with the new information.
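The steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Egeria's Java implementation: `DiscoveryContext`, `DiscoveryReport`, `Annotation`, `run_discovery` and `row_count_service` are hypothetical names, and a plain callable stands in for the connector to the digital resource.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    annotation_type: str
    summary: str
    extends: Optional["Annotation"] = None        # annotation extension link
    reviews: list = field(default_factory=list)   # annotation reviews from stewards


@dataclass
class DiscoveryReport:
    asset_name: str
    annotations: list = field(default_factory=list)


class DiscoveryContext:
    """Gives a discovery service access to the resource and to earlier results."""

    def __init__(self, asset_name, read_resource, previous_reports):
        self.asset_name = asset_name
        self._read_resource = read_resource        # stands in for the connector
        self.previous_reports = previous_reports
        self.report = DiscoveryReport(asset_name)  # a new report for every run

    def read_resource(self):
        return self._read_resource()

    def existing_annotations(self):
        return [a for r in self.previous_reports for a in r.annotations]

    def add_annotation(self, annotation):
        self.report.annotations.append(annotation)


def row_count_service(context):
    """A toy discovery service that records how many rows the resource holds."""
    rows = context.read_resource()
    earlier = [a for a in context.existing_annotations()
               if a.annotation_type == "row-count"]
    context.add_annotation(Annotation(
        "row-count", f"{len(rows)} rows",
        extends=earlier[-1] if earlier else None))


def run_discovery(service, asset_name, read_resource, history):
    """Create the context, run the service, and keep the report for later runs."""
    context = DiscoveryContext(asset_name, read_resource, list(history))
    service(context)
    history.append(context.report)
    return context.report
```

Running the service twice against the same `history` list produces two reports. The second run's annotation points back at the first through `extends`, so any review a steward attached to the earlier annotation stays reachable from the new one.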
Runtime for an open discovery service
Open discovery pipelines¶
Many common functions are used repeatedly during the discovery process.
An open discovery pipeline is a specialized implementation of an open discovery service that runs a set of open discovery services against a single digital resource. The implementation of the open discovery pipeline determines the order in which these open discovery services are run.
Each open discovery service in the pipeline is able to access the results of the open discovery services that have run before it through the discovery context. The combined results of the open discovery pipeline are grouped into a single discovery analysis report linked off of the asset.
The aim of the open discovery pipeline is to enable reusable open discovery service implementations to be choreographed together for different types of digital resource.
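As a rough illustration of this choreography, the sketch below runs a list of services in order and collects their output in one shared report. The function names and the tuple-based annotations are assumptions made for the example; a real pipeline would use the discovery context rather than passing the earlier results as an argument.

```python
def run_pipeline(services, resource_rows):
    """Run each service in order against one resource, sharing a single report."""
    report = []  # combined annotations for the whole pipeline
    for service in services:
        # Each service sees the annotations produced by the services before it.
        report.extend(service(resource_rows, tuple(report)))
    return report


def extract_fields(rows, earlier):
    """Hypothetical first stage: record one data field per column name."""
    return [("data-field", name) for name in rows[0].keys()]


def count_nulls(rows, earlier):
    """Hypothetical second stage: relies on the data fields found earlier."""
    fields = [value for kind, value in earlier if kind == "data-field"]
    return [("null-count", (f, sum(1 for r in rows if r[f] is None)))
            for f in fields]
```

The order matters: if `count_nulls` runs before `extract_fields`, it finds no data fields to work with and contributes nothing to the report.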
Inside the discovery analysis report¶
The discovery analysis report structures the annotations in two ways:
- Annotations that describe a characteristic of the whole digital resource.
- Annotations that describe a characteristic of a single data field within the digital resource.
The annotations for the data fields are linked off of the data fields created by schema extraction.
Open discovery can be used for the following types of analysis.
For digital resources that include structured data, schema extraction documents the data fields present in the digital resource. If a schema is attached to the asset, it attempts to match the data fields it finds to the schema attributes.
Schema extraction uses the schema analysis annotation. It is linked directly off of the discovery analysis report.
Data field entities, one for each data field in the digital resource, are then linked together to show the structure of the data in the digital resource and this structure is linked off of the schema analysis annotation.
The schema of the data in the digital resource is defined in a SchemaType linked from the digital resource's asset using the AssetSchemaType relationship. This may be established before the open discovery service runs, or may be derived by a governance action once the open discovery service has run.
If the schema is defined, the open discovery service that creates the data fields may maintain relationships between the schema and the data fields:
- The SchemaTypeDefinition links the schema analysis annotation to the top-level schema type.
- The SchemaAttributeDefinition links a data field to its corresponding schema attribute.
Alternatively, these relationships can be established by a governance action that is processing the results of the schema extraction. They are useful for consumers of the asset to be able to navigate to the specific data field annotations from the schema.
Where a digital resource has a fixed structure that does not support repeating fields, such as a relational database, the schema extraction can use the schema to create the data fields since the result will always be one-to-one (assuming the schema is being actively maintained).
However, if there are repeating groups in the digital resource's data fields then the schema extraction needs to work off of the data in the digital resource.
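A minimal sketch of extraction driven by the data rather than the schema is shown below. It assumes records arrive as nested Python dictionaries and lists, and the path notation (`orders[].sku`) is an invention of this example rather than an Egeria convention.

```python
def extract_data_fields(record, prefix=""):
    """Return the data field paths found in one record.

    Recurses into nested groups (dicts) and repeating groups (lists of dicts).
    """
    fields = set()
    for name, value in record.items():
        path = f"{prefix}{name}"
        fields.add(path)
        if isinstance(value, dict):
            fields |= extract_data_fields(value, path + ".")
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    fields |= extract_data_fields(item, path + "[].")
    return fields


def match_to_schema(data_fields, schema_attributes):
    """Split discovered fields into those the schema knows and those it does not."""
    matched = sorted(data_fields & schema_attributes)
    unmatched = sorted(data_fields - schema_attributes)
    return matched, unmatched
```

Fields that end up in `unmatched` are the interesting ones for stewardship: they exist in the data but have no schema attribute to link to.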
Profiling analysis looks at the data values in the resource and summarizes their characteristics. There are three types of annotations used in data profiling.
- Data Profile Annotation - Capture the characteristics of the data values stored in a specific data field in a data source.
- Data Profile Log Annotation - Capture the names of the log files that hold the profile characteristics of the data values stored in a specific data field. This is used when the profile results are too large to store in open metadata.
- Fingerprint Annotation - Capture the characteristics of the data values stored in a specific data field or the whole digital resource and express it as a single value.
For structured data, data profiling needs to run after schema extraction to allow the data profiling annotations that refer to a specific data field to be linked from the appropriate data field entity.
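To make the difference between a profile and a fingerprint concrete, here is a small sketch. The chosen statistics and the use of SHA-256 are assumptions for illustration; the source does not prescribe which characteristics a profile holds or how a fingerprint is computed. The `min` and `max` entries assume the non-null values in a field are mutually comparable.

```python
import hashlib
from collections import Counter


def profile_field(values):
    """Summarise the values of one data field (a data profile, in miniature)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def fingerprint_field(values):
    """Express the values of one data field as a single value (a hash)."""
    digest = hashlib.sha256()
    for v in values:
        digest.update(repr(v).encode("utf-8"))
        digest.update(b"\x00")  # separator so that [12] and [1, 2] differ
    return digest.hexdigest()
```

Comparing fingerprints from two runs is a cheap way to tell whether a field's contents changed at all before paying for a full profile.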
Data class discovery¶
Data class discovery captures the analysis of how closely a data field matches the specification defined in a data class.
The recommendations for a specific data class are stored in a data class annotation linked off of the appropriate data field. Data class discovery needs to run after schema extraction. It often builds on the information provided by data profiling.
Subsequent stewardship - either automated or with human assistance - can confirm the correct assignment using the DataClassAssignment relationship.
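One simple way to score how closely a field matches a data class is to count the values that satisfy the class's pattern. The sketch below assumes data classes are defined by regular expressions; the two patterns, the `DATA_CLASSES` table and the threshold are hypothetical and deliberately loose.

```python
import re

# Hypothetical data class specifications, keyed by data class name.
DATA_CLASSES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "iso-date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def recommend_data_classes(values, threshold=0.8):
    """Return (data class, match score) pairs for classes that fit the values."""
    recommendations = []
    for name, pattern in DATA_CLASSES.items():
        matches = sum(1 for v in values if pattern.fullmatch(str(v)))
        score = matches / len(values) if values else 0.0
        if score >= threshold:
            recommendations.append((name, score))
    # Best match first, so a steward sees the strongest recommendation on top.
    return sorted(recommendations, key=lambda r: r[1], reverse=True)
```

The returned pairs correspond to recommendations only. Confirming one of them is the stewardship step described above.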
Semantic discovery attempts to define the meaning of the data values in the asset. The result is a recommended glossary term stored as a semantic annotation.
These annotations are the metadata discovery equivalent of the Informal Tag shown in 0150 - Feedback in Area 1. It typically takes confirmation by a subject-matter expert to convert this into a Semantic Assignment. Semantic discovery needs to run after schema extraction. It often builds on the information provided by data profiling and data class discovery.
Classification discovery adds recommendations for new classifications that should either be added to the asset, or to a schema attribute in the asset. It uses the classification annotation to describe the classification and its properties. If the classification is for the asset, the classification annotation is linked off of the discovery analysis report. If it is for a specific schema attribute, it is linked off of the corresponding data field.
Calculating quality scores¶
Quality scores describe how well the data values, typically in a data field, conform to a specification. For example, do the values match a list of valid values? This type of annotation is often used within a data quality program to provide assessments of the data for different purposes.
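The valid values example can be written as a short function. The percentage scale and the name `valid_values_score` are assumptions of this sketch; quality tools differ in how they express a score.

```python
def valid_values_score(values, valid_values):
    """Return the percentage of values that appear in the valid values list.

    Returns None when there are no values to assess.
    """
    if not values:
        return None
    valid = sum(1 for v in values if v in valid_values)
    return round(100 * valid / len(values), 1)
```

For instance, three valid country codes out of four gives a score of 75.0 for that data field.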
Relationship discovery identifies relationships between different resources (or data fields), such as two columns that have a foreign key relationship.
The relationship can be recorded as a relationship annotation, or relationship advice can be attached to the discovery analysis report.
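A common heuristic for spotting a possible foreign key is value inclusion: every non-null value in one column also appears in a column of another table whose values are unique. The sketch below applies that heuristic to in-memory columns. It is an assumption of this example, not a description of how any particular discovery service works, and it produces candidates for review rather than confirmed relationships.

```python
def foreign_key_candidates(tables):
    """Find (child column, parent column) pairs that look like foreign keys.

    tables maps a table name to a dict of column name -> list of values.
    """
    candidates = []
    for child_table, child_cols in tables.items():
        for child_col, child_values in child_cols.items():
            child_set = {v for v in child_values if v is not None}
            if not child_set:
                continue
            for parent_table, parent_cols in tables.items():
                if parent_table == child_table:
                    continue
                for parent_col, parent_values in parent_cols.items():
                    # The parent column must be unique to act as a key ...
                    is_unique = len(set(parent_values)) == len(parent_values)
                    # ... and must contain every non-null child value.
                    if is_unique and child_set <= set(parent_values):
                        candidates.append((f"{child_table}.{child_col}",
                                           f"{parent_table}.{parent_col}"))
    return candidates
```

Value inclusion can produce false positives on small tables, for example when two unrelated columns both hold small integers, which is one reason to record the result as advice and let a steward decide.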
Measurement annotations capture a snapshot of the physical dimensions and activity levels of a digital resource at a particular moment in time. For example, a measurement annotation may record the size of the data source or the number of users accessing it.
Requesting stewardship action¶
A RequestForAction (RfA) entity is used when an open discovery service performs a test on the data (such as a quality rule) or discovers an anomaly in the data landscape, compared to its metadata, that potentially needs action from a steward or curator.
The Stewardship Action OMAS is designed to respond to the requests for actions (RfAs).
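As an illustration of the anomaly case, the sketch below compares the columns recorded in the catalog with the columns actually found in the resource and raises one request per difference. The `RequestForAction` dataclass here is a hypothetical stand-in with invented fields and action names; it does not reflect the properties of the open metadata type of the same name.

```python
from dataclasses import dataclass


@dataclass
class RequestForAction:
    """Hypothetical stand-in for a request for action (RfA)."""
    asset_name: str
    action_requested: str
    detail: str


def compare_with_catalog(asset_name, catalogued_columns, actual_columns):
    """Raise a request for each difference between the metadata and the data."""
    catalogued, actual = set(catalogued_columns), set(actual_columns)
    requests = []
    for column in sorted(catalogued - actual):
        requests.append(RequestForAction(
            asset_name, "review-missing-column",
            f"{column} is catalogued but absent from the resource"))
    for column in sorted(actual - catalogued):
        requests.append(RequestForAction(
            asset_name, "catalogue-new-column",
            f"{column} is in the resource but not catalogued"))
    return requests
```

An empty result means the data landscape and its metadata agree, so nothing is passed on for stewardship.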
Working with external engines¶
Open discovery services may directly implement the analysis function or may invoke an external service to create the annotations.
Stewardship is initiated either through the creation of a Request for Action annotation or when the discovery analysis report's status changes to COMPLETE.