Metadata Governance Day
As metadata is shared and linked, the gaps and inconsistencies in it are exposed. In the dojo you will learn how to set up a variety of features in Egeria to actively govern and maintain your metadata environment.
Metadata Governance Dojo starts here
The importance of metadata governance (45 mins)
The importance of metadata governance
Data, and the metadata that describes it, enables individuals and automated processes to make decisions. As trust grows in the availability, accuracy, timeliness and completeness of the data/metadata, its use increases and your organization sees greater value.
Trust is hard to build and easy to destroy. Maintaining trust begins with authoritative sources of data/metadata that are actively managed and distributed along well known information supply networks. This flow needs to be transparent and reliable - that is, explicitly defined and verifiable through monitoring, testing and remediation.
The content of the data/metadata needs to follow standards that ensure clarity both in meaning and in how it should be used and managed. Its completeness and quality need to be appropriate for the organization's uses. These uses will change over time.
Finally, the ecosystem that supplies and uses this data/metadata must evolve and adapt to the changing and growing needs of the organization because trust is required not just for today's operation but also into the future.
You can make your own choices on how to build trust in your data/metadata. Egeria provides both features and practices built from industry experiences and best practices that help in the maintenance of data/metadata. In this dojo we will cover these features and practices, enabling you to select which are appropriate to your organization and when/where to consider using them.
Types of metadata
Metadata is often described as data about data. However, this definition does not fully convey the breadth and depth of information that is needed to govern your digital operations.
The most commonly collected metadata is technical metadata that describes the way something is implemented. For example, technical metadata includes databases and their database schema (table and column definitions), APIs and their interface specification, events and message schemas, applications, virtual containers, and computers.
Technical metadata is the easiest type of metadata to maintain since many technologies provide APIs/events to query the technical metadata for the digital resources they manage.
You need to either gather this metadata whenever new resources are deployed into production, monitor for events that indicate that the metadata has changed, or periodically call the metadata APIs to update the metadata catalog.
Collecting and maintaining technical metadata builds an inventory of your digital resources that can be used to count each type of digital resource and to act as a list to work through when regular maintenance is required. It also helps people locate specific types of digital resources.
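As an illustration of the periodic-refresh option described above, here is a minimal sketch that compares the latest technical metadata snapshot with what is already cataloged. The `Catalog` class, its `refresh` method and the table names are hypothetical and exist only for this example; they are not part of Egeria's API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical in-memory catalog: resource name -> tuple of column names."""
    entries: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def refresh(self, current: dict[str, tuple[str, ...]]) -> dict[str, list[str]]:
        """Align the catalog with the latest snapshot and report what changed."""
        changes: dict[str, list[str]] = {"added": [], "updated": [], "removed": []}
        for name, columns in current.items():
            if name not in self.entries:
                changes["added"].append(name)
            elif self.entries[name] != columns:
                changes["updated"].append(name)
        for name in self.entries:
            if name not in current:
                changes["removed"].append(name)
        self.entries = dict(current)
        return changes


# Each call stands in for one scheduled poll of a technology's metadata API.
catalog = Catalog()
first = catalog.refresh({"patients": ("id", "name")})
second = catalog.refresh({"patients": ("id", "name", "dob"), "visits": ("id", "date")})
third = catalog.refresh({"visits": ("id", "date")})
```

The change report is what makes the inventory useful for maintenance: each poll yields an explicit list of resources to review rather than a silently overwritten catalog.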
Types of metadata repository
Many metadata repositories are data catalogs. They focus on gathering and organizing information about data sources. Each data source
Designing your metadata supply chains (45 mins)
- Home and reference copies - metadata provenance
- Types of standard definitions and governance classifications
Working with templates (30 mins)
When a new resource is catalogued, the catalog entry of a similar resource can be used as a template to set up the asset for the new resource. This means that the new asset can contain governance metadata attachments, not just the technical metadata extracted from the digital resource.
Templated cataloguing is useful in situations where new resources of the same kind are regularly catalogued.
Peter Profile is responsible for cataloguing the weekly measurements supplied by the various hospitals as part of a clinical trial. These measurements are supplied with certain terms and conditions (also known as a license) that Coco Pharmaceuticals must not only adhere to, but also prove that it is adhering to. For that reason, when the measurements are catalogued, the asset for the measurements data set is linked to the license as well as other elements that help to ensure that the measurements data sets are appropriately used and governed.
Figure 1 shows Peter making calls to Egeria to catalog the first set of measurements received for the clinical trial. This includes an asset to represent the data set, which is linked to the license; a connection that allows a data scientist to connect to the data set and access the data; and a schema showing the structure of the data in the data set. The data fields identified in the schema each link to the glossary term that describes the meaning of the data stored in the field. There are also two classifications on the asset:
- AssetZoneMembership - The governance zones that the asset is a member of. This controls who can access the asset and its related metadata elements such as the connection and the schema.
- Ownership - The owner of the data set. This is the person who is accountable for ensuring that Coco Pharmaceuticals adheres to the license.
Figure 1: In week 1, Peter manually creates the asset and links it to the governance elements needed to ensure the data set is used and protected as laid out in the license.
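The role of the AssetZoneMembership classification in controlling visibility can be sketched as a set intersection. This is a simplified illustration under a stated assumption: an asset is treated as visible when it shares at least one zone with the zones a caller is permitted to see. The function name and the zone names are hypothetical, and the exact rules Egeria applies may differ.

```python
def is_visible(asset_zones: set[str], caller_zones: set[str]) -> bool:
    """Assumed rule: visible if the asset shares at least one zone with the caller."""
    return bool(asset_zones & caller_zones)


# Hypothetical zone names, for illustration only.
measurements_zones = {"clinical-trials"}

researcher_can_see = is_visible(measurements_zones, {"clinical-trials", "research"})
marketing_can_see = is_visible(measurements_zones, {"marketing"})
```

Under this assumption, moving an asset between zones changes who can see it without changing any per-user permissions.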
Without templating, Peter would need to issue the same sequence of requests to catalog each of the weekly results from each of the hospitals. This is a lot of work for Peter, particularly as the number of clinical trials and participating hospitals rises. He may also make a mistake and forget one of the steps in the cataloguing process.
What if the catalog entry for the Week 1 measurements could be used as a template for cataloguing the subsequent weeks' measurements as shown in figure 2?
Figure 2: For subsequent weeks, the week 1 entry could be used as a template. The result is an asset for each data set with a connection and a schema, along with the ownership and zone membership classifications. All of the assets are linked to the license and the data fields in each schema are linked to the correct glossary terms.
This is the idea behind templated cataloguing. A template is defined that captures the common settings for a set of digital resources, and this template is then used when cataloguing the resources.
Figure 3 shows a set of templates used by Coco Pharmaceuticals when cataloguing their digital landscape. There are different templates for different types of digital resources. Each would include the classifications and relationships that are relevant for the resources that they catalog. They are decorated with the Template classification to identify that they do not represent a real digital resource and should be used as templates.
Figure 3: A set of templates defined to use when cataloguing digital resources
When a template is used in cataloguing a digital asset, the caller needs to supply the values that must be unique for the digital asset. This is typically the description and may also include the networkAddress for its connection's endpoint. These values override those in the template.
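The override behaviour can be sketched as a merge in which caller-supplied values take precedence over the template's values. This is a simplified illustration: the `apply_overrides` function, the flat property dictionary and the `fileType` property are assumptions made for the example, not Egeria's API or data model.

```python
def apply_overrides(template_values: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Start from the template's values and replace any that the caller supplies."""
    return {**template_values, **overrides}


# Hypothetical property values, for illustration only.
template_values = {
    "description": "Template for weekly clinical trial measurements",
    "networkAddress": "placeholder-address",
    "fileType": "csv",
}

week2 = apply_overrides(
    template_values,
    {"description": "Week 2 measurements", "networkAddress": "week2-address"},
)
```

Values the caller does not supply, such as `fileType` here, are inherited unchanged, and the template itself is left unmodified.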
Egeria uses the anchor classification to determine which elements linked to the template are duplicated and which elements are just linked to by the new catalog entry. In figure 2, for example, the connection and schema are anchored to the asset whilst the glossary terms and license are not. This means that copies of the connection and schema elements are made for the new catalog entry whilst the glossary terms and license just receive new relationships to the new catalog entry.
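The copy-versus-link rule described above can be sketched as follows. The `Element` and `Asset` classes, the `anchored` flag and the `create_from_template` function are hypothetical stand-ins for this example; they model only the rule that anchored elements are copied and other elements are shared.

```python
import itertools
from dataclasses import dataclass, field

_next_id = itertools.count(1)


@dataclass
class Element:
    name: str
    anchored: bool = False  # True when the element is anchored to its asset
    guid: int = field(default_factory=lambda: next(_next_id))


@dataclass
class Asset:
    name: str
    linked: list[Element] = field(default_factory=list)


def create_from_template(template: Asset, new_name: str) -> Asset:
    """Copy anchored elements; link to the existing instance of everything else."""
    new_asset = Asset(name=new_name)
    for element in template.linked:
        if element.anchored:
            # Anchored: the new asset gets its own copy with a new identifier.
            new_asset.linked.append(Element(element.name, anchored=True))
        else:
            # Not anchored: only a new relationship to the shared element.
            new_asset.linked.append(element)
    return new_asset


week1 = Asset("Week 1 measurements", [
    Element("connection", anchored=True),
    Element("schema", anchored=True),
    Element("license"),
    Element("glossary term"),
])
week2 = create_from_template(week1, "Week 2 measurements")
```

In this sketch, the week 2 asset has its own connection and schema, while both assets point at the same license and glossary term objects.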
Finally, when a template is used, it is linked to the resulting element with the SourcedFrom relationship. This makes it easier to identify the elements that need changing if the template needs to be corrected or enhanced at a later date.
Figure 4: The SourcedFrom relationship links a template to the elements that are created from it
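To show why recording the SourcedFrom link is useful, here is a minimal sketch of an index that answers the question "which elements were created from this template?". The `SourcedFromIndex` class and the template and asset names are hypothetical; Egeria stores this as a relationship in its metadata repositories rather than in a structure like this.

```python
from collections import defaultdict


class SourcedFromIndex:
    """Hypothetical index of SourcedFrom links: template name -> created element names."""

    def __init__(self) -> None:
        self._links = defaultdict(list)

    def record(self, template: str, element: str) -> None:
        """Note that the element was created from the template."""
        self._links[template].append(element)

    def created_from(self, template: str) -> list[str]:
        """Return the elements to revisit if the template is corrected."""
        return list(self._links.get(template, []))


index = SourcedFromIndex()
index.record("measurements-template", "Week 2 measurements")
index.record("measurements-template", "Week 3 measurements")

to_review = index.created_from("measurements-template")
unused = index.created_from("unused-template")
```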
- Using templates in APIs and connectors
Standard definitions (2 hours)
Creating glossaries (30 mins)
- Structure of the glossary and how to set up and distribute
- How to use the glossary in governance
Managing reference data (60 mins)
- What is reference data; how is it used
- How to represent reference data in metadata
- How to distribute and govern reference data
- How to use reference data for classification
Using open metadata archives for shared definitions (30 mins)
- Why archives are important
- How to use them
Securing your metadata (1.5 hours)
User directory (15 mins)
- Supporting users and groups
Metadata security connectors (30 mins)
- Levels of security control
Governance zones (45 mins)
- Designing and using zones for visibility
- Controlling the setting of zones
Automating metadata capture (1.5 hours)
Setting up an integration daemon (30 mins)
- The importance of automation
- How the integration daemon works
Configuring an integration connector (30 mins)
- Using the connector catalog
- Configuring connectors in the integration daemon
Validating your integration (30 mins)
- Reviewing diagnostics and resulting metadata
Using automated governance actions (3.5 hours)
Designing your governance processes (60 mins)
- What is a governance action process and governance action type
- Configuring governance services in governance engines
Setting up an engine host, governance engines and services (30 mins)
Using metadata discovery (60 mins)
- What is metadata discovery used for and how does it work
- Making use of the results
Monitoring your governance processes (60 mins)
- Using OpenLineage to capture the activity of the governance processes
Lineage preservation and use (1 hour)
- What is lineage and how is it used?
- How is it captured?
- Using the lineage warehouse
Linking metadata governance to your governance program (3 hours)
Governance definitions (30 mins)
- Governance domains and definitions
Governance by expectation (30 mins)
- Setting targets and measuring against them
Integration with DevOps (15 mins)
- Connecting governance activity together
Managing information about users, people and organizations (45 mins)
- Understanding person roles and the resources attached to them
- Synching user roles and organization structure
Incident management (30 mins)
- What is an incident
- Designing incident management
Stewardship (30 mins)
- Connecting stewards to the automated processing