
Metadata Governance Day

As metadata is shared and linked, the gaps and inconsistencies in it are exposed. In the dojo you will learn how to set up a variety of features in Egeria to actively govern and maintain your metadata environment.

The dojo sessions are colour-coded like ski runs:

Beginner session

Intermediate session

Advanced session

Expert session

As you progress through the dojo, the colours of the sessions show how advanced your knowledge is becoming. The later sections are not necessarily harder to understand, but they build on knowledge from earlier sections.

The schedule also includes estimated times needed to complete each session. Even though a dojo is styled as a "day of intense focus", there is a lot of information conveyed, and you may find it more profitable to only complete one session in each sitting. Whichever way you choose to tackle the dojo, have fun and good luck - and do take breaks whenever you need to!

Metadata Governance Dojo starts here
The importance of metadata governance (25 mins)

The importance of metadata governance

Data, and the metadata that describes it, enables individuals and automated processes to make decisions. As trust in the availability, accuracy, timeliness, usefulness and completeness of the data/metadata grows, its use increases and your organization sees greater value.

Building trust

Trust is hard to build and easy to destroy. Maintaining trust begins with authoritative sources of data/metadata that are actively managed and distributed along well known information supply chains. This flow needs to be transparent and reliable - that is, explicitly defined and verifiable through monitoring, inspection, testing and remediation. It needs to be tailored to meet the specific needs of its consumers.

Consider this story ...

A tale about trust

Sam bought a new house in a little village that was 10 miles from where he worked. He was keen to live sustainably and was delighted to hear from his new neighbours that there was a good bus service that many of them used to get to work.

Sam stopped by the bus stop to look up the bus times for the next day.

Next morning he left in plenty of time. He was surprised that no-one was there waiting for the bus. No bus came for nearly an hour, making him late for work. He asked the driver what had happened to the earlier bus. The driver said he was not aware of any problem.

Once at work he called the bus company. The clerk who answered the phone laughed and said that the timetable at the bus stop was out of date. The local council required them to display a timetable so the one that was there was from last year. It was ok because everyone knew the timetable.

Annoyed, Sam asked where he could get an up-to-date timetable for the bus. The clerk did not know but suggested he call head office and was nice enough to give him the telephone number. It took a number of calls to find the right person, who then willingly dictated the times to him. Sam asked how often the bus times changed and how he could keep up-to-date. He was told that changes occurred infrequently and when they did, the bus driver would announce them. Sam could also ring back to check for changes from time to time.

Is there any problem with the way that the bus company is operating? How confident do you think Sam is that the bus company will provide a reliable service to get him to work every day? What is the impact if Sam buys a car or uses a bike because he decides the bus company is not to be trusted?

Some thoughts on the case ...

Sam is likely to conclude that the bus company does not care about its customers - and is not focused on attracting new business by encouraging people to try new routes. This suggests it is doomed in the long run. If he misses the bus again because the bus times change then he is motivated to look for alternative solutions.

His perception of the quality of the bus service is as much affected by the quality of information about the buses as the actual frequency and reliability of the service itself.

If we translate this story to the digital world, organizations often provide shared services and/or data sets that other parts of the business rely on. It is not sufficient for these services/data sets to be useful and have high availability/quality. If potential consumers cannot find out about them, or existing consumers are disrupted by unexpected change, then the number of consumers will dwindle as they seek their own solutions, and the shared service loses relevance to the business.

The team providing the shared service should consider the documentation about it to be a key part of their deliverables. Capturing this documentation as metadata and publishing it to the open metadata ecosystem increases its findability and comprehensibility.

Meanwhile, back at the bus company ...

The managing director hears of Sam's experience and wants to understand how widespread a problem it is. She discovers that no-one is specifically responsible for keeping the timetables at the bus stops up-to-date, or for ensuring up-to-date timetables are available for download/pick up. She finds a few conscientious souls who update the bus timetable at the bus stops they personally use. However, across the network, the information supplied at the bus stops is misleading because it is out-of-date.

It is often the case that if something is not being done, either:

  • no-one is responsible for it, or
  • it is very low on the priorities of the person who is responsible for it, or
  • the person responsible for it does not have the resources to do the work.

Therefore, step one in solving such a problem is to appoint someone who is responsible for it, and ensure they have the motivation and resources to do the work.

In metadata governance, we refer to the person responsible for ensuring that metadata about a service or dataset is complete, accurate and up-to-date as the owner. The owner may not personally maintain the metadata since this may be automated, or be handled by others. However, if there are problems with the metadata, the owner is the person responsible for sorting it out.

Fixing the bus timetables

The managing director considered the inaccurate bus timetable information as a serious problem that urgently needed fixing. She discovered that the manager responsible for scheduling the bus drivers' shifts was the one who made changes to the timetable. These changes typically occurred whenever there was a persistent difficulty in assigning a bus driver to a route. Whenever this happened, he briefed the affected driver with the change and asked that they disseminate the information to their passengers.

The bus drivers' manager was given ownership of ensuring the bus timetables at each bus stop were accurate, since he was responsible for the action that triggered a need to update them. He also understood the scope of any change.

The bus drivers' manager was given a new assistant. Part of the assistant's role was to update the master timetables when necessary, arrange for the downloadable timetables to reflect the new version, get printed copies to the bus station and tourist information office, and add an article to the local paper describing the change. They would also drive around the affected bus stops to update the timetables, and ensure the affected buses carried new copies of the timetable onboard for their regular customers.

In the example above, the managing director identified where the action was occurring that should trigger the change. She assigned an owner and ensured they had the resources (ie the new assistant) to ensure the updates were done.

When a change to the timetable occurred, the bus drivers' manager triggered the assistant to update the master bus timetable. The updated timetable was then transformed into multiple formats and disseminated to all the places where customers were likely to notice the change. The challenge was both to create awareness that the change had happened and to provide the updated information.

Metadata governance three-step process

We can generalise this process as follows, creating a reusable specification pattern for all forms of governance:

Governance Specification Pattern

A three-step specification pattern of Trigger, Take Action and Make Visible.

For metadata governance, the Take Action is typically an update to metadata. For example, if a new deployment of a database occurs in the digital world, it could trigger a metadata update to capture any schema changes and then information about these changes is disseminated to the tools and consumers that need the information.
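The three-step pattern can be sketched in code. Everything below (the class name, the event shape, the callbacks) is invented for illustration and is not part of the Egeria API:

```python
# Illustrative sketch of the Trigger / Take Action / Make Visible pattern.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GovernanceAction:
    name: str
    take_action: Callable[[dict], dict]   # step 2: e.g. update metadata
    make_visible: Callable[[dict], None]  # step 3: e.g. publish the change

    def on_trigger(self, event: dict) -> None:
        """Step 1: an event arrives and drives the other two steps."""
        result = self.take_action(event)
        self.make_visible(result)

published = []  # stands in for a metadata distribution channel

schema_capture = GovernanceAction(
    name="capture-schema-change",
    take_action=lambda event: {"asset": event["database"],
                               "schema": event["schema"]},
    make_visible=published.append,
)

# A trigger arrives, e.g. a new deployment of a database.
schema_capture.on_trigger({"database": "sales", "schema": ["id", "amount"]})
```

The same skeleton covers manual and automated governance: only the implementations of the trigger, action and visibility steps change.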

The dissemination of specific changes to metadata can also act as a trigger for other metadata updates. For example, the publishing of changes to a database schema could trigger a data profiling process against the database contents. The data profiling process adds new metadata elements to the existing metadata, and hence the knowledge graph of metadata grows.

Linked metadata updates

The specification pattern above applies whether the governance is manual or automated. All that changes is the mechanism.

Triggers may be time-based, or an unsolicited update to metadata by an individual. For example, data profiling may be triggered once a week as well as whenever the schema changes. A comment attached to a database description that reports errors in the data may trigger a data correction initiative.

Consider metadata as a collection of linked facts making up a knowledge graph that describes the resources and their use by the organization. The role of the tools, people and open metadata technology is to build, maintain and consume this knowledge graph to improve the operation of the organization.
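The idea of metadata as linked facts can be pictured with a minimal graph sketch. The element names and relationship labels here are illustrative only:

```python
# A tiny knowledge graph: metadata elements as nodes, relationships as edges.
graph = {"nodes": {}, "edges": []}

def add_element(guid, element_type, **properties):
    """Add a metadata element (a 'fact') to the graph."""
    graph["nodes"][guid] = {"type": element_type, **properties}

def link(from_guid, to_guid, relationship):
    """Connect two elements; each link adds context to both."""
    graph["edges"].append((from_guid, relationship, to_guid))

add_element("db1", "Database", name="Sales")
add_element("tbl1", "Table", name="ORDERS")
add_element("term1", "GlossaryTerm", name="Customer Order")
link("db1", "tbl1", "contains")
link("tbl1", "term1", "semantic-assignment")

# Traversing the edges reveals the context around an element.
neighbours = [t for (f, r, t) in graph["edges"] if f == "tbl1"]
```

Each new element or relationship, whether added by a tool, a person or an automated process, grows this graph.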

Enriching customer service

Since the implementation of the bus timetable management process, the managing director of the bus company notices a change of perspective among the head office employees. Rather than focusing on running the buses, there is a growing understanding that they are providing a service to their customers.

Eager to build on this change, the managing director encourages her employees to bring forward ideas that improve customer service. These ideas include disseminating information about local events (along with information about how to get to the event by bus, of course :)).

As a result, the bus timetables began to include a calendar of local events. Event organizers were able to register their events with the bus company and negotiate for additional buses where necessary. New bus routes were identified and began to operate. The effect was that the bus company became an active member of the community, with increasing bus use and associated profits.

When metadata governance is done well, a rich conversation develops between service providers and their consumers that can lead to both improvements in the quality of the services and the expansion in the variety and amount of consumption; the real value of the service is measured by consumption.

New uses for the services will then emerge ... growing the vitality and value to the organization.

The open metadata ecosystem (15 mins)

The open metadata ecosystem

The content of the data/metadata shared between teams needs to follow standards that ensure clarity both in meaning and how it should be used and managed. Its completeness and quality need to be appropriate for the organization's uses. These uses will change over time.

The ecosystem that supplies and uses this data/metadata must evolve and adapt to the changing and growing needs of the organization because trust is required not just for today's operation but also into the future.

You can make your own choices on how to build trust in your data/metadata. Egeria provides standards, mechanisms and practices built from industry experiences and best practices that help in the maintenance of data/metadata:

  • Egeria defines a standard format for storing and distributing metadata. This includes an extendable type system so that any type of metadata that you need can be supported.

  • Egeria provides technology to manage, store and distribute this standardized metadata. This technology is inherently distributed, enabling you to work across multiple cloud platforms, data centres and other distributed environments. Collectively, a deployment of this technology is referred to as the open metadata ecosystem.

  • Egeria provides connector interfaces to allow third party technology to plug into the open metadata ecosystem. These connectors translate metadata from the third party technology's native format to the open metadata format. This allows:

    • Collaboration
    • Blending automation and manual processes
    • Comprehensive security and privacy controls
  • Egeria's documentation provides guidance on how to use this technology to deliver business value.
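The extendable type system mentioned in the first bullet can be pictured as types that name a supertype and add attributes. The sketch below is illustrative only; Egeria's actual type definitions are richer and are expressed through its own APIs:

```python
# Hypothetical sketch of an extendable type system: each new type inherits
# the attributes of its supertype and adds its own.
from dataclasses import dataclass, field

@dataclass
class MetadataType:
    name: str
    supertype: "MetadataType | None" = None
    attributes: dict = field(default_factory=dict)

    def all_attributes(self) -> dict:
        """Attributes defined here plus those inherited from supertypes."""
        inherited = self.supertype.all_attributes() if self.supertype else {}
        return {**inherited, **self.attributes}

referenceable = MetadataType("Referenceable", None, {"qualifiedName": "string"})
asset = MetadataType("Asset", referenceable, {"displayName": "string"})
database = MetadataType("Database", asset, {"deployedImplementationType": "string"})
```

Extending the system is then a matter of defining a new type with an existing type as its supertype.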

In this dojo we will cover these mechanisms and practices, showing how they fit in the metadata update specification pattern described above. You can then select which are appropriate to your organization and when/where to consider using them.

Different categories of metadata (25 mins)

Categories of metadata

Metadata is often described as data about data. However, this definition does not fully convey the breadth and depth of information that is needed to govern your digital operations.

The categories of metadata listed below help you organize your metadata needs around specific triggers that drive your metadata ecosystem.

Technical metadata


The most commonly collected metadata is technical metadata that describes the way something is implemented. For example, technical metadata for common digital resources includes:

  • The databases and their database schema (table and column definitions) configured in a database server.
  • APIs and their interface specification implemented by applications and other software services to request actions and query data.
  • The events and their schemas used to send notifications between applications, services and servers to help synchronize their activity.
  • The files stored on the file system.

Technical metadata is the easiest type of metadata to maintain since many technologies provide APIs/events to query the technical metadata for the digital resources they manage.

To keep your technical metadata up-to-date you need to consider the following types of metadata update triggers:

  • whenever new digital resources are deployed into production,
  • events that indicate that the digital resources have changed,
  • regular scanning of the deployed IT environment to validate that all technical metadata has been captured (and nothing rogue has been added).
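The third trigger, a reconciliation scan, can be sketched as a comparison between what is catalogued and what is actually deployed. The resource names here are invented for illustration:

```python
# Sketch of a reconciliation scan: compare the catalogued inventory with
# what a scan of the deployed IT environment actually finds.
catalogued = {"sales_db", "hr_db", "orders_api"}        # from the metadata catalog
deployed   = {"sales_db", "orders_api", "shadow_db"}    # from scanning the estate

missing_from_catalog = deployed - catalogued   # rogue / uncatalogued resources
stale_in_catalog     = catalogued - deployed   # catalogued but no longer deployed
```

Each difference found is itself a trigger for a metadata update (or an investigation).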

Ideally you want your metadata governance to engage in the lifecycles that drive changes in the technical metadata - for example, adding management of technical metadata as part of the software development lifecycle (SDLC) or CI/CD pipelines.

Collecting and maintaining technical metadata builds an inventory of your digital resources that can be used to count each type of digital resource and act as a list to work through when regular maintenance is required. It also helps people locate specific types of digital resources.

Data content analysis results


The technical metadata typically describes the structure and configuration for digital resources. Analysis tools can add to this information by analysing the data content of the digital resources. The results create a characterization of the data content that helps potential consumers select the digital resources best suited for their needs.

Data content analysis is often triggered periodically, based on the update frequency that the digital resource typically experiences. It can also be triggered when the technical metadata is first catalogued or updated. If the digital resource is particularly important, and used in analytics/AI, it should be checked more often to validate that no significant changes have occurred.
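As a minimal sketch, the characterization produced by data content analysis might include counts, null rates and value ranges per column. Real profiling tools produce far richer results; this function is purely illustrative:

```python
# A minimal column-profiling sketch: characterize the values in one column.
def profile_column(values):
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                    # total rows seen
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),    # cardinality of the column
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

stats = profile_column([10, 25, None, 10, 40])
```

Results like these are attached to the asset's metadata so consumers can judge whether the data suits their needs.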

Consumer metadata


Consumer metadata includes the comments, reviews, tags added by the users that are consuming the metadata and the digital resources it describes. This metadata is gathered from the tools through which the users consume the metadata and the digital resources. It is then used to assess the value and popularity of the metadata and digital resources to the broader community.

Metadata update triggers should focus on the tools where the consumer feedback is captured. Typically, each piece of consumer feedback is treated as a separate trigger. The feedback should then be distributed to the tools that are used by the owning team. This could trigger changes to the resource. Ideally, the owning team should be able to respond and demonstrate they are listening and taking action.

Subject area materials


Subject areas are topics or domains of knowledge that are important to the organization. Typically, they cover content that is widely shared across the organization and there is business value in maintaining consistency. The materials for a subject area typically include:

Assets that are managed using the subject area's materials are said to be part of the subject area's domain. This is called the Subject Area Domain and is synonymous with Data Domain - although a subject area domain may manage assets that are not data assets (such as systems and infrastructure) which is why open metadata uses a more generic name.

Updates to the subject area materials are typically made offline, collected, and then disseminated together as a new release. Therefore, the metadata update trigger is often related to the release of a collection of subject area materials.

Governance metadata


Governance metadata describes the requirements of a particular Governance Domain and the associated controls, metrics and implementations. It is managed in releases in a similar way to subject area materials. Therefore, its releases act as triggers for further actions.

Organizational metadata


Organizational metadata describes the teams, people, roles, projects and communities in the organization. This metadata is used to coordinate the responsibilities and activities of the people in the organization. For example, roles can be defined for the owners of specific resources, and they can be linked to the profiles of the individuals appointed as owner. This information can be used to route requests and feedback to the right person.

Organizational metadata is often managed in existing applications run by Human Resources and Corporate Security. Therefore, updates in these applications are used to trigger updates to the organizational metadata in the open metadata ecosystem.

Governance actions can trigger the creation of new roles and appointments to these roles. These elements can then be disseminated to the appropriate applications for information, verification and/or approval.

Business context metadata


An organization has capabilities, facilities and services, and the digital resources it uses serve these purposes. When decisions need to be made about which digital resources to invest in, it is helpful to understand which part of the business will be impacted.

Information about the organization's capabilities, facilities and services is called business context. Often, an individual's core role focuses on these aspects. Linking them to the digital resources that they depend on (often invisible to these individuals) helps to raise awareness of the mutual dependency, and to build understanding of the impact/value of change at either level.

Triggers that detect change in digital resources (for example an outage) can result in information flowing to the appropriate business teams. If the change (or outage) is extensive, the linked business context can be used to prioritize the associated work.

Process metadata


Data is copied, combined and transformed by applications, services, processes and activities running in the digital landscape. Capturing the structure of this processing shows which components are accessing and changing the data. This Process Metadata is a key element in providing lineage, used for traceability, impact analysis and data observability.

The capture and maintenance of process metadata is typically triggered as process implementations are deployed into production.

Operational metadata


Operational metadata describes the activity running in the digital landscape. For example, process metadata could describe the steps in an ETL job that copies data from one database to another. The operational metadata captures how often it runs, how many rows it processed and the errors it found.

Operational metadata is often captured in log files. As they are created, they trigger the cataloging and linking of their information into other types of metadata.
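Cataloging a log record can be sketched as parsing it into structured operational metadata. The log format and field names below are invented for illustration:

```python
# Sketch: turn one ETL log line into an operational metadata record.
log_line = "2024-05-01T02:00:05Z etl-sales-copy rows=15230 errors=2"

timestamp, process, *pairs = log_line.split()
record = {"timestamp": timestamp, "process": process}
for pair in pairs:
    key, value = pair.split("=")
    record[key] = int(value)   # e.g. rows processed, errors found
```

Once structured, the record can be linked to the process metadata describing the ETL job and the assets it reads and writes.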

Metadata relationships and classifications

The other type of trigger to consider is when/where the metadata elements described above can be connected together and augmented.

This linking and augmentation of metadata has a multiplying effect on the value of your metadata.

You can think of the metadata described above as the facts about your organization's resources and operation. Metadata relationships, which show how one element relates to another, begin to reveal the context in which decisions are made and these resources are used.

Metadata classifications are used to label metadata as having particular characteristics. This helps group together similar elements, or elements that represent resources that need similar processing.

Summary

Hopefully the discussion above has helped to illustrate that metadata is varied and can be built into a rich knowledge base that drives organizational objectives through increased visibility, utilization and management of an organization's digital assets.

Open metadata types (25 mins)

The open metadata type system

Knowledge about data is spread amongst many people and systems. One of the roles of a metadata repository is to provide a place where this knowledge can be collected and correlated, in as automated a way as possible. To enable different tools and processes to populate the metadata repository we need agreement on what data should be stored and in what format (structures).

Open metadata subject areas

The different subject areas of metadata that we need to support for a wide range of metadata management and governance tasks include:

Open metadata areas

This metadata may be spread across different metadata repositories that each specialize in particular use cases or communities of users.

Area Description
Area 0 describes base types and infrastructure. This includes the root type for all open metadata entities called OpenMetadataRoot and types for Asset, DataSet, Infrastructure, Process, Referenceable, SoftwareServer and Host.
Area 1 collects information from people using the data assets. It includes their use of the assets and their feedback. It also manages crowd-sourced enhancements to the metadata from other areas before it is approved and incorporated into the governance program.
Area 2 describes the data assets. These are the data sources, APIs, analytics models, transformation functions and rule implementations that store and manage data. The definitions in Area 2 include connectivity information that is used by the open connector framework (and other tools) to get access to the data assets.
Area 3 describes the glossary. This is the definitions of terms and concepts and how they relate to one another. Linking the concepts/terms defined in the glossary to the data assets in Area 2 defines the meaning of the data that is managed by the data assets. This is a key relationship that helps people locate and understand the data assets they are working with.
Area 4 defines how the data assets should be governed. This is where the classifications, policies and rules are defined.
Area 5 is where standards are established. This includes data models, schema fragments and reference data that are used to assist developers and architects in using best practice data structures and valid values as they develop new capabilities around the data assets.
Area 6 provides the additional information that automated metadata discovery engines have discovered about the data assets. This includes profile information, quality scores and suggested classifications.
Area 7 provides the structures for recording lineage and providing traceability to the business.

The following diagram provides more detail of the metadata structures in each area and how they link together:

Metadata is highly interconnected

Metadata detail within the metadata areas

Bottom left is Area 0 - the foundation of the open metadata types along with the IT infrastructure that digital systems run on such as platforms, servers and network connections. Sitting on the foundation are the assets. The base definition for Asset is in Area 0 but Area 2 (middle bottom) builds out common types of assets that an organization uses. These assets are hosted and linked to the infrastructure described in Area 0. For example, a data set could be linked to the file system description to show where it is stored.

Area 5 (right middle) focuses on defining the structure of data and the standard sets of values (called reference data). The structure of data is described in schemas and these are linked to the assets that use them.

Many assets have technical names. Area 3 (top middle) captures business and real world terminologies and organizes them into glossaries. The individual terms described can be linked to the technical names and labels given to the assets and the data fields described in their schemas.

Area 6 (bottom right) captures additional metadata captured through automated analysis of data. These analysis results are linked to the assets that hold the data so that data professionals can evaluate the suitability of the data for different purposes. Area 7 (left middle) captures the lineage of assets from a business and technical perspective. Above that in Area 4 are the definitions that control the governance of all of the assets. Finally, Area 1 (top right) captures information about users (people and automated processes), their organization (such as teams and projects) and their feedback.

Within each area, the definitions are broken down into numbered packages to help identify groups of related elements. The numbering system relates to the area that the elements belong to. For example, area 1 has models 0100-0199, area 2 has models 0200-0299, etc. Each area's sub-models are dispersed along its range, ensuring there is space to insert additional models in the future.
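This numbering convention means an element's area can be read directly off its model number, as the following small helper illustrates:

```python
# The model number's hundreds digit identifies the open metadata area:
# 0000-0099 belong to area 0, 0100-0199 to area 1, ... 0700-0799 to area 7.
def area_for_model(model_number: int) -> int:
    return model_number // 100

# e.g. model 0110 (in area 1), model 0045 (in area 0), model 0770 (in area 7)
```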

Test yourself ...

Fill in the following table to map the areas of the open metadata type system to the different categories of metadata.

Open Metadata Area Categories of metadata covered by this Area - choose from Technical metadata, Data content analysis results, Consumer metadata, Subject area materials, Governance metadata, Organizational metadata, Business context metadata, Process metadata and Operational metadata.
0 - Base Types, Systems and Infrastructure
1 - Collaboration
2 - Data Assets
3 - Glossary
4 - Governance
5 - Models and Reference Data
6 - Metadata Discovery
7 - Lineage
Answer
Open Metadata Area Categories of metadata covered by this Area
0 - Base Types, Systems and Infrastructure Technical Metadata
1 - Collaboration Consumer Metadata, Organizational Metadata
2 - Data Assets Technical Metadata
3 - Glossary Subject Area Materials
4 - Governance Subject Area Materials, Governance Metadata, Operational Metadata (associated assets)
5 - Models and Reference Data Technical Metadata, Subject Area Materials
6 - Metadata Discovery Data Content Analysis Results
7 - Lineage Business Context, Process Metadata

Designing your metadata supply chains (45 mins)

Metadata supply chains

The open metadata ecosystem collects, links and disseminates metadata from many sources. However, it is designed in an iterative, agile manner, adding new use cases and capabilities over time.

Each stage of development considers a particular source of metadata and where it needs to be distributed to. Consider this scenario...

Database schema capture and distribution

There is a database server (Database Server 1) that is used to store application data that is of interest to other teams. An initiative is started to automatically capture the schemas of the databases on this database server. This schema information will be replicated to two destinations:

  • Another database server (Database Server 2) is used by a data science team as a source of data for their work. An ETL job runs every day to refresh the data in this second database with data from the first database. The data is anonymized by the ETL job, but the schema and data profile remains consistent. If the schema in the first database changes, the ETL job is updated at the same time. However, the schema in the second database is not updated because the team making the change do not have access to it. Nevertheless it must be updated consistently before the ETL job runs; otherwise it will fail.
  • The analytics tool that is also used by the data science team has a catalog of data sources to show the data science team what data is available. This needs to be kept consistent with the structure of the databases. The tool does provide a feature to refresh any data source schema in its catalog, but the team are often unaware of changes to their data sources, or simply forget to do it, and only discover the inconsistency when their models fail to run properly.

metadata supply chain scenario

The integration of these third party technologies with the open metadata ecosystem can be thought of as having four parts to it.

  1. Any changes to the database schema are extracted from Database Server 1 and published to the open metadata ecosystem.
  2. The new schema information from Database Server 1 is detected in the open metadata ecosystem and deployed to Database Server 2.
  3. The changes to Database Server 2's schema are detected and published to the open metadata ecosystem.
  4. The new schema information from Database Server 2 is detected in the open metadata ecosystem and distributed to the Analytics Workbench.

metadata supply chain integration points

Four integration steps to capture and distribute the database schema metadata from Database Server 1.
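The four integration steps can be simulated with a small sketch in which the open metadata ecosystem is reduced to a shared list of events. None of these function names come from the Egeria API; they only illustrate the flow:

```python
# Hypothetical simulation of the four integration steps in the scenario.
ecosystem = []   # stands in for the open metadata ecosystem

def publish(source, schema):
    """Steps 1 and 3: publish schema metadata to the ecosystem."""
    ecosystem.append({"source": source, "schema": schema})

def latest_schema_from(source):
    """Steps 2 and 4: detect the newest schema metadata from a source."""
    events = [e for e in ecosystem if e["source"] == source]
    return events[-1]["schema"] if events else None

# 1. A schema change is extracted from Database Server 1 and published.
publish("database-server-1", ["id", "amount", "region"])
# 2. The change is detected and deployed to Database Server 2.
server2_schema = latest_schema_from("database-server-1")
# 3. Database Server 2's schema change is published in turn.
publish("database-server-2", server2_schema)
# 4. The change is distributed to the Analytics Workbench catalog.
workbench_catalog = {"db2": latest_schema_from("database-server-2")}
```

Note that steps 2-3 and step 4 only react to events in the ecosystem; they do not need to know where the change originated.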

Here is another view of the process, but shown as a flow from left to right.

Metadata supply chain integration points

At each stage, there is a trigger (typically detecting something has changed), metadata is assembled and updated, and when it is ready, made visible through the open metadata ecosystem.

Metadata Update Specification Pattern

A three-step specification pattern of Trigger, Maintain Metadata and Make Visible.

The implementation of the three-step pattern for each part of the integration is located in an integration connector. Integration connectors are configurable components that are designed to work with a specific third party technology. There would be four configured integration connectors to support the scenario above. However, the integration connectors for parts 1 and 3 would share the same implementation: two instances, each configured to work with a different database server.
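The idea of one implementation with multiple configured instances can be illustrated with a sketch. The class name, configuration parameters and URLs below are invented, not Egeria's actual connector interface:

```python
# Sketch: one connector implementation, configured once per database server.
class DatabaseSchemaConnector:
    def __init__(self, server_url: str, direction: str):
        self.server_url = server_url
        self.direction = direction   # e.g. "to-ecosystem" or "from-ecosystem"

    def describe(self) -> str:
        return f"{self.direction} connector for {self.server_url}"

# Parts 1 and 3 of the scenario: same class, two configurations.
part1 = DatabaseSchemaConnector("https://db-server-1:1521", "to-ecosystem")
part3 = DatabaseSchemaConnector("https://db-server-2:1521", "to-ecosystem")
```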

The integration connectors supplied with Egeria are described in the connector catalog. It is also possible to write your own integration connectors if the ones supplied by Egeria do not meet your needs.

Integration connectors run in the Integration Daemon. It is possible to have all 4 integration connectors running in the same integration daemon. Alternatively, they each may run in a different integration daemon - or any combination in between. The choice is determined by the organization of the teams that will operate the service. For example, if this metadata synchronization process was run by a centralized team, then all 4 integration connectors would probably run in the same integration daemon. If the work is decentralized, the integration connector for part 1 may be in an integration daemon operated by the same team that operates Database Server 1. The other integration connectors may run together in an integration daemon operated by the team that operates Database Server 2 and the Analytics Workbench.

The diagram below shows the decentralized option.

Decentralized deployment

This type of deployment choice keeps control of the metadata integration with the teams that own the third party technology, and so upgrades, back-ups and outages can be coordinated.

The implementation of the open metadata ecosystem that connects the integration daemons can also be centralized or decentralized. This next diagram shows two integration daemons connecting into a centralized metadata access store that provides the open metadata repository.

Centralized metadata store

Alternatively, each team could have their own metadata access store, giving them complete control over their metadata. The two metadata access stores are connected via an Open Metadata Repository Cohort (or just "cohort" for short). The cohort enables the two metadata access stores to operate as one logical metadata store.

Decentralized metadata stores with cohort

The behaviour of the integration daemons is unaffected by the deployment choice made for the metadata access stores.

Adding lineage

In the scenario above, data from Database Server 1 is extracted, anonymized and stored in Database Server 2 by an ETL job running in an ETL engine.

Role of the ETL engine

The data scientist team want to know the source of each of the databases they are working with. The metadata that describes the source of data is called lineage. Ideally it is captured by the ETL engine to ensure it is accurate.

ETL engines have a long history of capturing lineage, since it is a common requirement in regulated industries. The diagram below shows three choices on how an ETL engine may handle its lineage metadata.

  • In the first box on the left, the ETL engine has its own metadata repository and so it is integrated into the open metadata ecosystems via the integration daemon (in the same way as the database and analytics workbench).
  • In the middle box, the ETL engine is producing lineage events that follow the OpenLineage Standard. The integration daemon has native support for this standard and so the ETL Engine can send these events directly to the integration daemon which will pass them to any integration connector that is configured to receive them.
  • The final box on the right-hand side shows an ETL engine that is part of a suite of tools that share a metadata repository. These types of third party metadata repositories often have a wide variety of metadata that is good for many use cases. So, although it is possible to integrate them through the integration connectors running in the integration daemon, it is also possible to connect them directly into the cohort via a Repository Proxy. This is a more complex integration to perform. However, it has the benefit that the metadata stored in the third party metadata repository is logically part of the open metadata ecosystem and available through any of the open metadata and governance APIs without needing to copy its metadata into a metadata access store.

Choices when integrating lineage
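For the middle option, the shape of a lineage event can be sketched as follows. The field names follow the OpenLineage specification's RunEvent; the producer URL, job and dataset names are invented for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# Sketch of a minimal OpenLineage RunEvent as an ETL engine might emit it.
# Field names follow the OpenLineage specification; the producer, namespaces,
# job name and dataset names are made up for illustration.
def make_run_event(event_type: str) -> dict:
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.org/etl-engine",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "etl", "name": "anonymize-and-copy"},
        "inputs": [{"namespace": "db1", "name": "customer_table"}],
        "outputs": [{"namespace": "db2", "name": "customer_table_anon"}],
    }

event = make_run_event("COMPLETE")
print(json.dumps(event, indent=2))
```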

Summary

In this guide you have seen that integration with the open metadata ecosystem is built up iteratively using integration connectors running in an integration daemon. Open metadata is stored in metadata access stores and shared across the open metadata ecosystem using a cohort. It is also possible to plug in a third party metadata repository using a repository proxy.

Complete integration solution

Automating metadata capture (30 mins)

Automating metadata capture

People are not good at repetitive admin tasks, particularly if these tasks are not the primary focus of their work. Therefore, the open metadata ecosystem is more complete and accurate when automation supports these mundane tasks. Following our three-step model, the automation needs to:

  • monitor for triggers that indicate that metadata needs to be updated,
  • drive the necessary changes to metadata and then
  • make those changes visible to downstream processing.

Metadata Update Specification Pattern

A three-step specification pattern of Trigger, Maintain Metadata and Make Visible.
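The three steps can be sketched as a simple synchronizer (illustrative Python, not Egeria code; the stores and change format are stand-ins):

```python
# Illustrative sketch of the three-step pattern:
# Trigger -> Maintain Metadata -> Make Visible.
class ThreeStepSynchronizer:
    def __init__(self):
        self.metadata = {}   # stand-in for the open metadata store
        self.published = []  # stand-in for events sent to downstream processing

    def on_trigger(self, change: dict) -> None:
        """Step 1: a trigger reports that something has changed."""
        self.maintain_metadata(change)

    def maintain_metadata(self, change: dict) -> None:
        """Step 2: assemble and update the affected metadata."""
        self.metadata[change["element"]] = change["new_value"]
        self.make_visible(change["element"])

    def make_visible(self, element: str) -> None:
        """Step 3: make the change visible to downstream processing."""
        self.published.append(element)

sync = ThreeStepSynchronizer()
sync.on_trigger({"element": "table:CUSTOMER", "new_value": "schema v2"})
print(sync.metadata, sync.published)
```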

Introducing the integration daemon

Inside the Integration Daemon

Recap: The Integration Daemon is an Egeria OMAG Server that sits at the edge of the open metadata ecosystem, synchronizing metadata with third party tools. It is connected to a Metadata Access Server that provides the APIs and events to interact with the open metadata ecosystem.

Integration Daemon

The integration can be:

  • Triggered by an event from a third party technology that indicates that metadata needs to be updated in the open metadata ecosystem to make it consistent with the third party technology's configuration.

  • Triggered at regular intervals so that the consistency of the open metadata ecosystem with the third party technology can be verified and, where necessary, corrected.

  • Triggered by a change in the open metadata ecosystem indicating that changes need to be replicated to the third party technology.

Running in the integration daemon are integration connectors that each support the API of a specific third party technology. The integration daemon starts and stops the integration connectors and provides them with access to the open metadata ecosystem APIs. Its action is controlled by configuration, so you can set it up to exchange metadata with a wide range of third party technologies.

An integration connector is specialized for a particular technology. The integration daemon provides specialized services focused on different types of technology, in order to simplify the work of the integration connector. These specialized services are called the Open Metadata Integration Services (OMISs). Each integration connector is paired with an OMIS, and the OMIS is paired with a relevant Open Metadata Access Service (OMAS) running in a Metadata Access Server.

Inside Integration Daemon

Further information

Subject Areas (1.5 hours)

Subject Areas

Back at the bus company ...

The new focus on ensuring bus timetable information is accurate and available has created an increase in passengers for the bus company. The assistant who is responsible for managing and disseminating the bus timetable information is not happy, however. What should be a simple task is tedious and complicated because each form of the timetable (at the bus stop, internet download, printed timetables) uses a different format. For example, the master timetable and the timetables for the bus stop are in 24-hour clock, whereas the internet downloadable timetable and the printed timetable use am/pm. This means the assistant has to translate the bus times from one format to another in order to publish a new set of timetables. Similarly, the names of the bus stops are not consistent across the different formats. The assistant realizes that some form of standardization is needed so that the different formats can be created automatically from the master timetables. Otherwise, they are going to look for a new job!

Using standard formats, names and meanings for resources, such as data, is critical to ensure they can be shared and reused for multiple purposes. Open metadata provides the means to describe these standards and use them to create consistency across all copies, formats and uses. The effort required to author and maintain these standards, plus the governance processes required to ensure they are used wherever appropriate, is offset by the savings in managing and using the resources associated with the subject area.

Recap:

  • Subject Areas define the standards for your resources and their use. They cover resources that are widely shared across the organization and there is business value in maintaining consistency.
  • Each subject area has an owner who is responsible for coordinating the development and maintenance of the subject area's materials.
  • Resources that are managed using the subject area's materials are said to be part of the subject area's "domain". This is called the Subject Area Domain and is synonymous with Data Domain - although since a subject area domain may manage resources that are not just data assets (such as systems and infrastructure), open metadata uses a more generic name.

The glossary

The glossary is at the heart of the materials for a subject area. Figure 1 shows that the glossary contains glossary terms. Each glossary term describes a concept used by the business. It is also possible to link two glossary terms together with a relationship. The relationship may describe a semantic relationship or a structural one.

Figure 1

Figure 1: Glossaries for describing concepts and the relationships between them

Semantic relationships include:

  • RelatedTerm is a relationship used to say that the linked glossary term may also be of interest. It is like a "see also" link in a dictionary.
  • Synonym is a relationship between glossary terms that have the same, or a very similar meaning.
  • Antonym is a relationship between glossary terms that have the opposite (or near opposite) meaning.
  • PreferredTerm is a relationship that indicates that one term should be used in place of the other term linked by the relationship.
  • ReplacementTerm is a relationship that indicates that one term must be used instead of the other. This is stronger version of the PreferredTerm.
  • Translation is a relationship that defines that the linked terms represent the same meaning but each are written in a different language. Hence, one is a translation of the other. The language of each term is defined in the Glossary that owns the term.
  • IsA is a relationship that defines that the one term is a more generic term than the other term. For example, this relationship would be used to say that "Cat" IsA "Animal".

Structural relationships in the glossary are relationships that show how terms are typically used together.

  • UsedInContext links a term to another term that describes a context. This helps to distinguish between terms that have the same name but different meanings depending on the context.
  • HasA is a term relationship between a term representing a SpineObject (see glossary term classifications below) and a term representing a SpineAttribute.
  • IsATypeOf is a term relationship between two SpineObjects saying that one is the subtype (specialisation) of the other.
  • TypedBy is a term relationship between a SpineAttribute and a SpineObject to say that the SpineAttribute is implemented using a type represented by the SpineObject
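A glossary with typed relationships can be pictured as a small graph. The sketch below is purely illustrative - open metadata stores glossary terms as entities linked by typed relationships, not as Python dictionaries - and reuses the relationship names listed above:

```python
# Illustrative sketch: glossary terms linked by typed relationships.
# The relationship names come from the list above; the storage model is made up.
glossary = {
    "terms": {"Cat", "Animal", "Customer", "Client"},
    "relationships": [
        ("Cat", "IsA", "Animal"),        # "Cat" IsA "Animal"
        ("Customer", "Synonym", "Client"),
    ],
}

def related(term: str, rel_type: str) -> list:
    """Return the terms linked to `term` by relationships of `rel_type`."""
    return [end for start, rel, end in glossary["relationships"]
            if start == term and rel == rel_type]

print(related("Cat", "IsA"))
print(related("Customer", "Synonym"))
```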

Further information

  • See Area 3 in the Open Metadata Types to understand how these concepts are represented on open metadata.
Data classes

A data class provides the specification of a data type that is important to the subject area. Date, Social Security Number and Credit Card Number are examples of data classes.

The data class specification defines how to identify data fields of its type by inspecting the data values stored in them. The specification is independent of a particular technology, which is why they are often described as logical data types. The specification may include preferred implementation types for different technologies using Implementation Snippets.

Data classes are used during metadata discovery (see below) to identify the types of data in the discovered data fields. This is an important step in understanding the meaning and business value of the data fields. They can also be used in quality rules to validate that data values match the prescribed data class.
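As a sketch, a data class specification can be thought of as a recognizer for values of its logical type. The patterns below are deliberately simplified and hypothetical, not real data class specifications:

```python
import re

# Hypothetical data class specifications: each defines how to recognize
# values of its logical type by inspecting sample values (simplified patterns).
DATA_CLASSES = {
    "Date":       re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "CreditCard": re.compile(r"^\d{4}(-?\d{4}){3}$"),
}

def identify_data_class(values: list) -> list:
    """Return the data classes whose specification every sample value matches."""
    return [name for name, pattern in DATA_CLASSES.items()
            if all(pattern.match(v) for v in values)]

print(identify_data_class(["2023-01-31", "2024-06-05"]))
print(identify_data_class(["1234-5678-9012-3456"]))
```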

Data classes can be linked together in part-of and is-a hierarchies to create a logical type system for a subject area. A glossary term can be linked to a data class via an ImplementedBy relationship to identify the preferred data class to use when implementing a data field with meaning described in the glossary term. A data class can be linked to a glossary term that describes the meaning of the data class via a SemanticAssignment relationship.

Figure 2

Figure 2: Data classes for describing the logical data types and implementation options

Further information

  • See Model 0540 in the Open Metadata Types to understand how data classes are represented on open metadata.
  • See Model 0737 in the Open Metadata Types to understand the ImplementedBy relationship.
  • See Model 0370 in the Open Metadata Types to understand the SemanticAssignment relationship.
  • See Model 0504 in the Open Metadata Types to understand ImplementationSnippets.
Consuming the glossary in design models

Design models (such as Concept models, E-R Models, UML models) and ontologies capture similar concepts to those described in the glossary. It helps if their definitions are consistent. When a new glossary is being built, existing models and ontologies can be used to seed the glossary. The models/ontologies themselves can be loaded in open metadata and the model elements linked to their corresponding glossary terms. Then new versions of the data models/ontologies can be generated from open metadata.

Figure 3

Figure 3: Linking to models

Any linked data classes provide details of language types to use when generating compliant artifacts from the models.

Further information

  • See Model 0571 in the Open Metadata Types to understand how concept models are represented on open metadata.
  • See Model 0565 in the Open Metadata Types to understand how design models are represented on open metadata.
Schemas

Schemas document the structure of data, whether it is stored or moving through APIs, events and data feeds. A schema is made up of a linked subgraph of schema elements. A schema begins with a schema element called a schema type. This may be a single primitive field, a set of values, an array of values, a map between two sets of values or a nested structure. The nested structure is the most common. In this case the schema type has a list of schema attributes (another type of schema element) that describe the fields in the structure. Each of these schema attributes has its own schema type located in its TypeEmbeddedAttribute classification.
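The nested structure can be sketched as follows (an illustrative representation, not the open metadata format; the attribute names are made up):

```python
# Illustrative sketch of a nested schema: a schema type whose schema
# attributes each carry their own schema type (as in TypeEmbeddedAttribute).
schema = {
    "schemaType": "struct",
    "schemaAttributes": [
        {"name": "customerId",
         "typeEmbeddedAttribute": {"schemaType": "primitive", "dataType": "string"}},
        {"name": "orders",
         "typeEmbeddedAttribute": {"schemaType": "array", "elementType": "orderId"}},
    ],
}

def attribute_names(schema_type: dict) -> list:
    """List the fields described by a nested schema type."""
    return [attr["name"] for attr in schema_type.get("schemaAttributes", [])]

print(attribute_names(schema))
```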

Figure 4 shows a simple schema structure.

Figure 4

Figure 4: Schemas for documenting the structure of data

Further information

  • See Schemas to understand how different types of schema are represented.
  • See Model 0501 in the Open Metadata Types to see the formal definition of the different types of schema elements.
  • See Model 0505 in the Open Metadata Types to understand schema attributes and the TypeEmbeddedAttribute classification.
Schemas and assets

An asset describes a valuable resource (typically digital). Such resources include databases, data files, documents, APIs, data feeds, and applications. A digital resource can be dependent on other digital resources to fulfill its implementation. This relationship is also captured in open metadata with relationships such as DataContentForDataSet. These relationships help to highlight inconsistencies in the assets' linkage to the subject area's materials, which may be due to errors in either the metadata or the implementation/deployment/use of the associated digital resources.

Figure 5

Figure 5: Dependencies between digital resources are reflected in open metadata by relationships between assets

Since schema types describe the structure of data, they can be attached to assets using the AssetSchemaType relationship to indicate that this asset's data is organized as described by the schema. Schemas are important because they show how individual data values are organized. Governance is often concerned with the meaning, correctness and use of individual data values since they are used to influence the decisions made within the organization. Therefore, even though the content of a schema bulks up the size and complexity of the metadata, it is necessary to capture this detail.

Figure 6

Figure 6: Schemas describe the structure of the data stored in a digital resource (described by the asset in the catalog)

A schema is typically attached to only one asset since it is classified and linked to other elements assuming that the asset/schema combination describes the particular collection of data stored in the associated digital resource. However, there is still a role for the subject area materials to provide preferred schema structures for software developers, data engineers and data scientists to use when they create implementations of new digital resources.

When a new asset is created, the schema definition in the subject area can be used as a template to define the schema for the asset (see figure 7). Then:

  • The digital resource can be generated from the asset/schema, or
  • Metadata discovery (see below) can be used to validate that the schema defined in the digital resource matches the schema associated with the asset.

Figure 7

Figure 7: Using a schema from a subject area as a template for a new asset

There is also an opportunity to share schemas between assets using an ExternalSchemaType. This option has the advantage that there is only one copy of the schema. However, it is only used when all classifications and relationships attached to the shared part of the schema apply to all data in the associated digital resources.

Figure 8

Figure 8: Using an external schema type to share a common schema

Further information

  • See Model 0503 in the Open Metadata Types to understand the AssetSchemaType relationship.
  • See Model 0501 in the Open Metadata Types to understand how schemas are represented on open metadata.
  • See Model 0505 in the Open Metadata Types to understand how schema attributes are represented on open metadata.
Reference value assignments

The materials for a subject area may include sets of values used to label metadata elements to show that they are in a particular state or have a specific characteristic that is important in the subject area. For example, a subject area about people may include the notion of an Adult and a Child (or Minor). The age of majority is different in each country, so a simple label assigned to a Person profile to indicate that a person is an adult allows the knowledge of how to determine adulthood to be encapsulated in the processes that maintain the person profiles, while the reference data value itself can be used in multiple places.

These labels are called reference data values and are managed in Valid Value Sets. The association between a reference data value and a metadata element is ReferenceValueAssignment.

Figure 9

Figure 9: Labelling using reference data values

Further information

Schema assignments

Figure 10 shows three types of assignments between the metadata associated with a digital resource (technical metadata) and the subject area materials:

  • SemanticAssignment - Semantic assignments indicate that the data stored in the associated data field has the meaning described in the glossary term.
  • ValidValuesAssignment - Valid value sets define a list of valid values. They can be used to define the values that are allowed to be stored in a particular data field if they can be described as a discrete set.
  • DataClassAssignment - A data class assignment means that the data in the data field conforms to the type described in the data class.

When these relationships are used in combination, there should be consistency between the assignments to the data field and those to the associated glossary term.
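The consistency rule can be sketched as a simple check. All element names here are hypothetical, and a real check would work against open metadata relationships rather than Python dictionaries:

```python
# Sketch of the consistency rule described above: a data field's assigned
# data class should agree with the data class linked to its glossary term.
term_to_data_class = {"Birth Date": "Date"}  # via the glossary term's links

data_field = {
    "name": "date_of_birth",
    "semanticAssignment": "Birth Date",
    "dataClassAssignment": "Date",
}

def assignments_consistent(field: dict) -> bool:
    """True if the field's data class matches the one linked to its term."""
    expected = term_to_data_class.get(field["semanticAssignment"])
    return expected is None or expected == field["dataClassAssignment"]

print(assignments_consistent(data_field))
```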

Figure 10

Figure 10: Using assignment relationships to create a rich description of the data stored in a schema attribute (data field)

Governance action classifications

Governance action classifications can be attached to most types of metadata elements. They can also be assigned to glossary terms to indicate that the classification applies to all data values associated with the glossary term. The governance action classifications have attributes that identify a particular level that applies to the attached element. The definition for each level can be linked to appropriate Governance Definitions that define how digital resources classified at that level should be governed. Governance Classification Levels are linked to Governance Definitions using the GovernedBy relationship.

Figure 11

Figure 11: Classifying glossary terms to identify the governance definitions that apply to all data values associated with the glossary term

Further information

Connectors and connections

The digital resources associated with the assets in the catalog are accessed through connectors. A Connector is a client library that applications use to access the data/function held by the digital resource. Typically, there is a specialized connector for each type of Asset/technology.

Sometimes there are multiple connectors to access a specific type of asset, each offering a different interface for the application to use.

Instances of connectors are created using the Connector Broker. The connector broker creates the connector instance using the information stored in a Connection. These can be created by the application or retrieved from the open metadata stores.

A connection is stored in the open metadata stores and linked to the appropriate asset for the digital resource.

Figure 12

Figure 12: Connection information needed to access the data held by an asset

Further information

  • See the connector catalog to understand how connectors are used in Egeria.
  • See Model 0201 in the Open Metadata Types to understand how connections are represented.
Metadata discovery

An open discovery service is a process that runs a pipeline of analytics to describe the data content of a resource. It uses statistical analysis, reference data and other techniques to determine the data class and range of values stored, potentially what the data means and its level of quality. The result of the analysis is stored in metadata objects called annotations.

Part of the discovery process is called Schema Extraction. This is where the discovery service inspects the schema in the digital resource and builds a matching structure of DataField elements in open metadata. As it goes on to analyse the content of a particular data field in the resource, it can add its results to an annotation that is attached to the DataField element. It can also maintain a link between the DataField element and its corresponding SchemaAttribute element if the schema has already been attached. Through this process it is possible to detect any anomalies between the documented schema and what is actually implemented.

Part of the analysis of a single data field may be to identify its data class (or a list of possible data classes if the analysis is not conclusive). The data class in turn may identify a list of possible glossary terms that could apply to the data field.

For example, there may be a data class called address. A discovery service may detect that an address is stored in a digital resource. The data class may be linked to glossary terms for Home Address, Work Location, Delivery Address, ... The discovery service may not be able to determine which glossary term is appropriate in order to establish the SemanticAssignment relationship, but providing a steward with a short list is a considerable saving.
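This shortlisting behaviour can be sketched as follows, reusing the illustrative address example above (the data class and glossary term names are made up):

```python
# Sketch of how a discovery service narrows a steward's choices:
# a detected data class yields candidate glossary terms for review.
data_class_to_terms = {
    "address": ["Home Address", "Work Location", "Delivery Address"],
}

def candidate_terms(detected_data_classes: list) -> list:
    """Collect the glossary terms linked to any detected data class."""
    terms = []
    for data_class in detected_data_classes:
        terms.extend(data_class_to_terms.get(data_class, []))
    return terms

print(candidate_terms(["address"]))
```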

Figure 13

Figure 13: Output from a metadata discovery service

Further information

  • See Discovery and Stewardship to understand how metadata discovery works.
  • See Area 6 in the Open Metadata Types to understand how discovery metadata is represented.
Bringing it all together

Figure 14 summarizes how the subject area materials create a rich picture around the resources used by your organization. As they link to the technical metadata, they complement and reinforce the understanding of your data. In a real-world deployment, the aim is to automate as much of this linkage as possible. This is made considerably easier if the implementation landscape is reasonably consistent. However, where the stored data values do not match the expected types defined in the schema, the metadata model reveals the inconsistencies, and human intervention is often required to ensure the links are correct.

Figure 14

Figure 14: Linking the metadata together

Defining subject areas in Egeria
Coco Pharmaceuticals Scenarios

There are descriptions of creating glossaries and other materials for subject areas in the Coco Pharmaceuticals Scenarios.

Using automated governance actions (3.5 hours)

Using automated governance actions

The open metadata ecosystem collects, links and disseminates metadata from many sources. Inevitably there will be inconsistencies and errors in the metadata, so there need to be mechanisms that help identify errors and control how they are corrected. This section focuses on the automated processes that validate, correct and enrich the metadata in the open metadata ecosystem.

The shearing layers of governance actions (1 hour)

The building industry has the principle of shearing layers in the design of a building. This principle is as follows:

... Things that need to change frequently should be easy to change. Those aspects that change infrequently can take more effort and time.

In Egeria, the shearing layer principle is evident in the design of automated governance. An organization that is maturing their governance capability needs to be able to move fast. These automations need to be quick to create and quick to change. There is no time to wait for a software developer to code each one.

Egeria defines flexible components called governance services that can be re-configured and reused in many situations. Collectively the governance services form a palette of configurable governance functions. The governance team link them together into a new process every time they need a new governance automation.

The advantage of this approach is the ability to rapidly scale out your governance capability. The downside is that there are more moving parts and concepts to understand.

The diagram below summarizes how Egeria's governance automation works. Descriptions of each layer follows the diagram.

Layers of governance automation

At the base are the governance service components

Governance services are coded in Java and packaged in Java Archive (JAR) files. They need to be passed information about the function to perform and the metadata elements on which to operate since these will differ each time they are called.

Part of the implementation is a connector provider that is able to return a description of its governance service in the form of a connector type. The connector type provides information about how to configure and run the governance service. This includes:

  • A description of the connector's function
  • Names of configuration properties that can modify the behaviour of the governance service.
  • A list of request types that select which function it is to perform.
  • Names of request parameters that can be supplied (typically by the caller) that can override the configuration properties and/or provide the identifier(s) of any metadata element(s) to work on.

  • Names of supported action targets that provide links to the metadata element(s) to work on. The action target mechanism is typically used when governance services are being called in a sequence from a governance action process. The action targets are used to pass details of the metadata elements to work on from service to service.

The governance service definitions

The JAR file is added to the CLASSPATH of Egeria's platform where it can be loaded and inspected. The architect extracts the connector type of the governance service implementation and creates at least one governance service definition for it. The governance service definition is metadata that includes a GovernanceService entity, a Connection entity and a ConnectorType entity (based on the connector type extracted from the implementation) linked together. The connection entity will include the settings for the various configuration properties described in the connector type. If different combinations of configuration properties are desired, they are configured in different governance service definitions.

The governance engine definitions

The architect then builds a governance engine definition. This is metadata that defines a list of governance request types. These are the names of the functions needed by the governance team.

Each governance request type is mapped to a governance service definition (defined above). The governance engine definition can include a mapping from the governance request type to a request type understood by the governance service implementation (called the serviceRequestType). Without this mapping, the governance request type is passed directly to the governance service implementation when it is called.
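As a sketch, this mapping might look like the following. All names are hypothetical, and a real governance engine definition is metadata, not a Python dictionary:

```python
# Sketch of a governance engine definition: governance request types map to
# governance services, optionally translating to a serviceRequestType.
engine_definition = {
    "validate-asset": {"service": "AssetValidatorService",
                       "serviceRequestType": "validate"},
    "set-zones":      {"service": "ZonePublisherService"},  # no mapping: passed through
}

def resolve(request_type: str):
    """Return the service to call and the request type it receives."""
    entry = engine_definition[request_type]
    return entry["service"], entry.get("serviceRequestType", request_type)

print(resolve("validate-asset"))
print(resolve("set-zones"))
```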

Typically, the governance engine definition is packaged in an open metadata archive called a governance engine pack. This can be loaded into any platform that is going to run the governance engine/services.

Governance actions

The governance engine is configured in an Engine Host server running on the platform. The governance engine can be called by creating a governance action. This is a metadata entity that describes the governance request type and request parameters to run on a specific governance engine. The governance action content is broadcast to all the running engine hosts via the Governance Engine OMAS Out Topic. On receiving this event, each engine host consults its active governance engines to see if the governance request type is supported. The first engine host to detect the new governance action claims it, which changes its status from WAITING to ACTIVATING in the open metadata ecosystem. The successful engine host then passes the request to its governance engine to execute, and the governance action's status moves to IN_PROGRESS. The results of the execution are also stored in the governance action, including the final status (ACTIONED, INVALID or FAILED).

Governance services produce one or more guards when they complete. Guards describe the outcome of running the governance service. They are stored in the governance action entity that kicked off the governance service. The governance action entities provide an audit trail of the automated governance actions that were requested, and their outcome.
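The status lifecycle described above can be sketched as a small state machine. The transition table is inferred from the description and is illustrative, not Egeria code:

```python
# Sketch of the governance action status lifecycle; status names come from
# the text above, the transition table is inferred for illustration.
TRANSITIONS = {
    "WAITING":     {"ACTIVATING"},
    "ACTIVATING":  {"IN_PROGRESS"},
    "IN_PROGRESS": {"ACTIONED", "INVALID", "FAILED"},
}

def advance(current: str, new: str) -> str:
    """Move a governance action to a new status, rejecting illegal jumps."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new

status = "WAITING"
for next_status in ("ACTIVATING", "IN_PROGRESS", "ACTIONED"):
    status = advance(status, next_status)
print(status)
```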

Governance Action statuses

Governance action processes

Governance action processes are defined in metadata using a set of linked governance action types. They are choreographed in a Metadata Access Server running the Governance Engine OMAS. When the process is called to run, the Governance Engine OMAS navigates to the first governance action type in the governance process definition. It creates a matching governance action entity. This is picked up by the engine host and executed in the governance engine just as if it was called independently. The guards are returned to the governance action entity as normal. This change is detected by Governance Engine OMAS which uses the guards to navigate to the next governance action type(s) found in the governance action process definition. A governance action is created for each of the next governance action types and the cycle is repeated until there are no more governance action types in the governance action process definition.

A governance action process can be run many times with different parameters. It can be changed, simply by updating the governance action type metadata entities in the governance action process definition. New processes can be created by creating the appropriate governance process definition.
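The guard-driven choreography can be sketched as follows. The process structure, step names, guard names and outcomes are invented for illustration; a real process is metadata choreographed by the Governance Engine OMAS:

```python
# Sketch of guard-driven choreography: each link names the guard that causes
# the next governance action type to run after its predecessor.
process = {
    "validate-asset": {"on-valid": "enrich-asset", "on-invalid": "quarantine-asset"},
    "enrich-asset": {},
    "quarantine-asset": {},
}

def run_process(start: str, service_outcomes: dict) -> list:
    """Follow the guards produced by each (simulated) governance service."""
    trail, current = [start], start
    while True:
        guard = service_outcomes.get(current)      # guard produced by this step
        next_step = process[current].get(guard)    # step triggered by that guard
        if next_step is None:
            return trail
        trail.append(next_step)
        current = next_step

print(run_process("validate-asset", {"validate-asset": "on-valid"}))
```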

If a desired request type can not be supported by the existing governance services, a developer is asked to extend a governance service implementation or create a new one that can be configured into a governance engine to support the desired governance request type.

Governance services supplied with Egeria

Designing your governance processes (30 mins)

Governance Action Process

A governance action process is a predefined sequence of governance actions that are coordinated by the Governance Engine OMAS.

The steps in a governance action process are defined by linked governance action types stored in the open metadata ecosystem. Each governance action type provides the specification of the governance action to run. The links between them show which guards cause the governance action to run.

The governance action process support enables governance professionals to assemble and monitor governance processes without needing to be a Java programmer.

Examples

In the two examples below, each rounded box represents a governance action and the links between them are possible flows - where the label on a link is the guard that must be produced by the predecessor if the linked governance action is to run.

The governance actions in example 1 are all implemented using governance action services. When these services complete, they supply a completion status. If a service completed successfully, it optionally supplies one or more guards and a list of action targets for the subsequent governance action(s) to process.

The first governance action in example 1 is called when a new asset is created. For example, the Generic Element Watchdog Governance Action Service could be configured to monitor for new/refresh events for particular types of assets and initiate this governance process when this type of event occurs.

The first governance action to run is Validate Asset. It retrieves the asset and tests that it has the expected classifications assigned. The guards it produces control which actions follow.

Governance actions from the same governance action process can run in parallel if a predecessor governance action produces multiple guards.

Example 1

Example 1: Governance Action Process to validate and augment a newly created asset

Governance action processes can include any type of governance service. Example 2 shows an open discovery service amongst the governance action services.

Example 2

Example 2: Governance Action Process to validate and augment a newly created asset

Capturing lineage for a governance action process

The governance actions generated when a governance action process runs provide a complete audit trail of the governance services that ran and their results. The Governance Action Open Lineage Integration Connector is able to monitor the operation of the governance actions and produce OpenLineage events to provide operational lineage for governance action processes. Egeria is also able to capture these events (along with OpenLineage events from other technologies) for later analysis.

Governance Action Process Lifecycle

The diagram below shows a governance action process assembly tool taking in information from a governance engine pack to build a governance action process flow. This is shared with the open metadata ecosystem either through direct calls to the Governance Engine OMAS or via an open metadata archive (possibly the archive that holds the governance engine definition).

Once the definition of the governance action process is available, an instance of the process can be started, either by a watchdog governance action service or through a direct call to the Governance Engine OMAS. Whichever mechanism is used, it results in the Governance Engine OMAS using the definition to choreograph the creation of governance action entities that drive the execution of the governance services in the Engine Host.

Lifecycle

Further information

Setting up an engine host, governance engines and services (30 mins)

Setting up the governance engine

Recap: A governance engine runs in an Engine Host on an OMAG Server Platform.

Like all types of OMAG Servers, the Engine Host is configured through Egeria's Administration Service and the result is a configuration document for the server.

The configuration document is loaded when the Engine Host is started. It contains a list of the governance engines that are to run in the Engine Host. The configuration document also identifies the metadata access server that the Engine Host is paired with, and from which the named Governance Engine Definitions will be retrieved.

For each governance engine name listed in the configuration document, the Engine Host calls its metadata access server to retrieve its Governance Engine Definition. Based on the contents of the Governance Engine Definition, the Engine Host starts up the appropriate Open Metadata Engine Services (OMESs) that hold the logic to run the different types of Governance Services defined in the Governance Engine Definition.

Engine Host Configuring

The Governance Engine Definition does not need to be loaded directly into the metadata access server paired with the Engine Host. It just needs to be in one of the Metadata Access Stores connected to the same cohort. A federated query is used to retrieve the Governance Engine Definition. This searches across all the connected Metadata Access Stores. In fact, different parts of the Governance Engine Definition could be in different Metadata Access Stores. The team that builds the Governance Services may publish their Governance Service Definitions to their local metadata store. An architect team responsible for building the Governance Engine Definition may have their own Metadata Access Store that holds the GovernanceEngine entity and the relationships to the GovernanceService entities in the Governance Services team's Metadata Access Store. As long as they are all connected by a cohort, they can all operate as if the whole Governance Engine Definition were in a single Metadata Access Store.

Engine Host Remote Configuring

Recap: calls to a governance engine are made by initiating a Governance Action. This can be directly through an API call to the Governance Engine OMAS running in a Metadata Access Store, or via a Governance Action Process.

A Governance Action is an entity in open metadata. When it is created in the Metadata Access Store, an event is sent to all connected Engine Hosts. If the event describes a governance request type that one of an engine host's governance engines supports, that engine host claims the Governance Action and passes the request to its governance engine to run.

The nature of Egeria's open metadata ecosystem means that a Governance Action can be created in the paired Metadata Access Store ...

Local Governance Action

... or in a connected Metadata Access Store. Therefore, the Governance Services are available to any member of the open metadata ecosystem. They do not need to know where the Engine Hosts are deployed.

Remote Governance Action

Further information

Configuring an Engine Host

Using metadata discovery (1 hour)

Metadata discovery and stewardship

Metadata discovery is an automated process that extracts metadata about a digital resource. This metadata may be:

  • embedded within the asset (for example a digital photograph has embedded metadata), or
  • managed by the platform that is hosting the asset (for example, a relational database platform maintains schema information about the data store in its databases), or
  • determined by analysing the content of the asset (for example a quality tool may analyse the data content to determine the types and range of values it contains and, maybe from that analysis, determine a quality score for the data).

Some metadata discovery may occur when the digital resource is first catalogued as an asset. Integrated cataloguing typically automates the creation of the basic asset entry, its connection and, optionally, its schema. This is sometimes called technical metadata.

Cataloguing database with integrated cataloguing

For example, the schema of a database may be catalogued through the Data Manager OMAS API. This schema may have been automatically extracted by an integration connector hosted in Egeria's Database Integrator OMIS.

The open discovery services build on this initial cataloguing. They use advanced analysis to inspect the content of a digital resource to derive new insights that can augment or validate their catalog entry.

The results of this analysis are added to a discovery analysis report linked off of the asset for the digital resource.

The analysis results documented in the discovery analysis report can either be automatically applied to the asset's catalog entry or it can go through a stewardship process where a subject-matter expert confirms the findings (or not).

Discovery and stewardship are the most advanced form of automation for asset cataloging. Egeria provides the server runtime environment and component framework to allow third parties to create discovery services and governance action implementations. It has only simple implementations of these components, mostly for demonstration purposes. This is an area where vendors and other open source projects are expected to provide additional value.

Open discovery services

An open discovery service is a component that performs specific analysis of the contents of a digital resource on request. The aim of the open discovery service is to enable a detailed picture of the properties of a resource to be built up.

Each time an open discovery service runs, it creates a new discovery analysis report linked off of the digital resource's Asset metadata element that records the results of the analysis.

Asset with discovery analysis reports

Each time an open discovery service runs to analyse a digital resource, a new discovery analysis report is created and attached to the resource's asset. If the open discovery service is run regularly, it is possible to track how the contents are changing over time.

The discovery analysis report contains one or more sets of related properties that the discovery service has discovered about the resource, its metadata, structure and/or content. These are stored in a set of discovery annotations linked off of the discovery analysis report.

An open discovery service is designed to run at regular intervals to gather a detailed perspective on the contents of the digital resource and how they are changing over time. Each time it runs, it is given access to the results of previously run open discovery services, along with a review of these findings made by individuals responsible for the digital resource (such as stewards, owners, custodians).

Operation of an open discovery service

Operation of an open discovery service

  1. Each time an open discovery service runs, Egeria creates a discovery analysis report to describe the status and results of the open discovery service's execution. The open discovery service is passed a discovery context that provides access to metadata.
  2. The discovery context is able to supply metadata about the asset and create a connector to the digital resource using the connection information linked to the asset. The discovery service uses the connector to access the digital resource's contents in order to perform the analysis.
  3. The discovery service creates discovery annotations to record the results of its analysis. It adds them to the discovery context which stores them in open metadata attached to the discovery analysis report.
  4. The discovery annotations can be reviewed and commented on through an external stewardship process. This means choices from, for example, a list of potential options proposed by the discovery services, can be verified and the best one selected by an individual expert. The resulting choices are added to annotation reviews attached to the appropriate annotations.
  5. The next time the open discovery service runs, a new discovery analysis report is created to link new attachments.
  6. The discovery context provides access to the existing attachments for that asset along with any annotation reviews. The discovery service is able to link its new annotations to the existing annotations as an annotation extension. This means that the stewards can see the history associated with the new information.
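The steps above can be sketched as follows. This is a conceptual illustration of the flow, not Egeria's Open Discovery Framework API: the class names, methods, and the row-count service are all hypothetical.

```python
# Sketch of the numbered flow above: a discovery context hands the service a
# connector and the previous reports; the service records annotations in a new
# discovery analysis report. All names are illustrative.

class DiscoveryContext:
    def __init__(self, asset, previous_reports):
        self.asset = asset
        self.previous_reports = previous_reports      # step 6: history is available
        self.report = {"asset": asset["name"], "annotations": []}  # step 1

    def get_connector(self):
        # Step 2: in Egeria this builds a connector from the asset's connection.
        return self.asset["contents"]

    def record_annotation(self, annotation):
        self.report["annotations"].append(annotation)  # step 3

def row_count_discovery_service(context):
    """A toy discovery service: count the rows in the resource."""
    rows = context.get_connector()
    context.record_annotation({"type": "DataSourceMeasurement",
                               "rowCount": len(rows)})

asset = {"name": "weekly-measurements", "contents": [(1, "a"), (2, "b")]}
context = DiscoveryContext(asset, previous_reports=[])
row_count_discovery_service(context)
assert context.report["annotations"] == [{"type": "DataSourceMeasurement",
                                          "rowCount": 2}]
```

Steps 4 and 5 (annotation reviews and the next report) happen outside the service itself, which is why the context is the only interface the service needs.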

Runtime for an open discovery service

Open discovery services are packaged into Open Discovery Engines that run in the Asset Analysis OMES hosted in an Engine Host.

The metadata repository interface for metadata discovery tools is implemented by the Discovery Engine OMAS that runs in a Metadata Access Server.

An open discovery service may be triggered by a REST call to the Asset Analysis OMES, via a Governance Action or as part of a governance action process.

Open Discovery Service

Open discovery pipelines

There are many common functions that are used repeatedly during the discovery process.

An open discovery pipeline is a specialized implementation of an open discovery service that runs a set of open discovery services against a single digital resource. The implementation of the open discovery pipeline determines the order that these open discovery services are run.

Open discovery pipeline example

Each open discovery service in the pipeline is able to access the results of the open discovery services that have run before it through the discovery context. The combined results of the open discovery pipeline are grouped into a single discovery analysis report linked off of the asset.

The aim of the open discovery pipeline is to enable reusable open discovery service implementations to be choreographed together for different types of digital resource.
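A pipeline can be pictured as a discovery service that simply runs other discovery services in order against one shared context, so each later service can build on the annotations left by the earlier ones. The sketch below is illustrative only; the service names and context shape are hypothetical.

```python
# Sketch of an open discovery pipeline: the pipeline fixes the order in which
# reusable discovery services run, and they all share one context (and hence
# one discovery analysis report). Names are illustrative.

def make_pipeline(services):
    def pipeline(context):
        for service in services:
            service(context)       # each service reads/writes the same context
    return pipeline

def schema_extraction(context):
    context.setdefault("annotations", []).append("schema-analysis")

def data_profiling(context):
    # Runs after schema extraction, so its annotation is already visible here.
    assert "schema-analysis" in context["annotations"]
    context["annotations"].append("data-profile")

context = {}
make_pipeline([schema_extraction, data_profiling])(context)
assert context["annotations"] == ["schema-analysis", "data-profile"]
```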

Inside the discovery analysis report

The discovery analysis report structures the annotations in two ways:

  • Annotations that describe a characteristic of the whole digital resource.
  • Annotations that describe a characteristic of a single data field within the digital resource.

The annotations for the data fields are linked off of the data fields created by schema extraction.

Discovery analysis report structure

Discovery actions

Open discovery can be used for the following types of analysis.

Schema extraction

For digital resources that include structured data, schema extraction documents the data fields present in the digital resource and, if a schema is attached to the asset, attempts to match the data fields it finds to the schema's attributes.

Schema extraction uses the schema analysis annotation. It is linked directly off of the discovery analysis report.

Data field entities, one for each data field in the digital resource, are then linked together to show the structure of the data in the digital resource and this structure is linked off of the schema analysis annotation.

Open discovery schema extraction

The schema of the data in the digital resource is defined in a SchemaType linked from the digital resource's asset using the AssetSchemaType relationship. This may be established before the open discovery service runs, or may be derived by a governance action once the open discovery service has run.

If the schema is defined, the open discovery service that creates the data fields may maintain relationships between the schema and the data fields:

  • The SchemaTypeDefinition links the schema analysis annotation to the top level schema type.
  • The SchemaAttributeDefinition links a data field to its corresponding schema attribute.

Alternatively, these relationships can be established by a governance action that is processing the results of the schema extraction. They are useful for consumers of the asset to be able to navigate to the specific data field annotations from the schema.

Schema Extraction in Report

Where a digital resource has a fixed structure that does not support repeating fields, such as a relational database, the schema extraction can use the schema to create the data fields since the result will always be one-to-one (assuming the schema is being actively maintained).

However, if there are repeating groups in the digital resource's data fields then the schema extraction needs to work off of the data in the digital resource.

Data profiling

Profiling analysis looks at the data values in the resource and summarizes their characteristics. There are three types of annotations used in data profiling.

  • Data Profile Annotation - Captures the characteristics of the data values stored in a specific data field in a data source.
  • Data Profile Log Annotation - Captures the names of the log files that hold the profile characteristics of the data values stored in a specific data field. This is used when the profile results are too large to store in open metadata.
  • Fingerprint Annotation - Captures the characteristics of the data values stored in a specific data field, or the whole digital resource, and expresses them as a single value.

Open discovery data profiling

For structured data, data profiling needs to run after schema extraction to allow the data profiling annotations that refer to a specific data field to be linked from the appropriate data field entity.
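A data profile annotation for a single field might summarize value frequencies and lengths, as in the sketch below. The annotation fields shown are illustrative, not the exact open metadata type definition.

```python
# Sketch of data profiling for one data field: summarize the values and
# capture the result as a data profile annotation. The annotation field
# names here are illustrative.

from collections import Counter

def profile_data_field(values):
    counts = Counter(values)
    return {
        "annotationType": "DataProfileAnnotation",
        "valueCount": counts.most_common(),   # frequency of each distinct value
        "length": {"min": min(len(v) for v in values),
                   "max": max(len(v) for v in values)},
    }

annotation = profile_data_field(["GBP", "USD", "GBP", "EUR"])
assert annotation["valueCount"][0] == ("GBP", 2)
assert annotation["length"] == {"min": 3, "max": 3}
```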

Data class discovery

Data class discovery captures the analysis of how closely a data field matches the specification defined in a data class.

Data class discovery

The recommendation for a specific data class is stored in a data class annotation linked off of the appropriate data field. Data class discovery needs to run after schema extraction. It often builds on the information provided by data profiling.

Subsequent stewardship - either automated or with human assistance - can confirm the correct assignment using the DataClassAssignment relationship.
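The matching itself can be as simple as testing each value against the data class's specification and recording the match rate as a confidence, as in this sketch. The "EmailAddress" data class and its pattern are hypothetical examples, and the annotation fields are illustrative.

```python
# Sketch of data class discovery: test how closely a data field's values match
# a data class specification and record the result as a candidate data class
# with a confidence level. The data class and field names are hypothetical.

import re

def data_class_discovery(values, data_class_name, pattern):
    matches = sum(1 for v in values if re.fullmatch(pattern, v))
    return {
        "annotationType": "DataClassAnnotation",
        "candidateDataClass": data_class_name,
        "confidence": round(100 * matches / len(values)),
    }

values = ["a@example.org", "b@example.org", "not-an-email"]
annotation = data_class_discovery(values, "EmailAddress", r"[^@\s]+@[^@\s]+")
assert annotation["confidence"] == 67
```

A steward (or an automated governance action) would then confirm or reject the candidate, which is what the DataClassAssignment relationship records.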

Semantic discovery

Semantic discovery attempts to define the meaning of the data values in the asset. The result is a recommended glossary term stored as a semantic annotation.

Semantic discovery

These annotations are the metadata discovery equivalent of the Informal Tag shown in 0150 - Feedback in Area 1. It typically takes confirmation by a subject-matter expert to convert this into a Semantic Assignment. Semantic discovery needs to run after schema extraction. It often builds on the information provided by data profiling and data class discovery.

Classification discovery

Classification discovery adds recommendations for new classifications that should either be added to the asset, or to a schema attribute in the asset. It uses the classification annotation to describe the classification and its properties. If the classification is for the asset, the classification annotation is linked off of the discovery analysis report. If it is for a specific schema attribute, it is linked off of the corresponding data field.

Classification discovery

Calculating quality scores

Quality scores describe how well the data values, typically in a data field, conform to a specification. For example, do the values match a list of valid values? This type of annotation is often used within a data quality program to provide assessments of the data for different purposes.

Quality Scores
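A quality score can be computed as the fraction of values that conform to the specification, as sketched below against a hypothetical valid-value list. The annotation field names are illustrative.

```python
# Sketch of a quality score: measure how well a data field's values conform
# to a specification, here a hypothetical list of valid values.

def quality_score(values, valid_values):
    conforming = sum(1 for v in values if v in valid_values)
    return {
        "annotationType": "QualityAnnotation",
        "qualityDimension": "validity",
        "qualityScore": round(100 * conforming / len(values)),
    }

score = quality_score(["red", "green", "mauve", "blue"], {"red", "green", "blue"})
assert score["qualityScore"] == 75
```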

Relationship discovery

Relationship discovery identifies relationships between different resources (or data fields), such as two columns that have a foreign key relationship.

It is possible to create the relationship as a relationship annotation or attach a relationship advice to the discovery analysis report.

Relationship discovery

Capturing measurements

The measurement annotations capture a snapshot of the physical dimensions and activity levels at a particular moment in time. For example, it may calculate the size of the data source or the number of users accessing it.

Data source measurements

Requesting stewardship action

A RequestForAction entity (RfA) is used when an open discovery service performs a test on the data (such as a quality rule) or discovers an anomaly in the data landscape compared to its metadata that potentially needs a steward's or curator's action.

Request for action

The Stewardship Action OMAS is designed to respond to the requests for actions (RfAs).

Working with external engines

Open discovery services may directly implement the analysis function or may invoke an external service to create the annotations.

Initiating stewardship

Stewardship is initiated either through the creation of a Request for Action annotation or when the discovery analysis report's status changes to COMPLETE.

Incident reporting and management (30 mins)

Incident Reporting

When errors, unexpected situations or special requests occur, it is not always obvious how they should be handled. An incident report is a metadata entity that describes a situation that needs special attention. It provides a focus point to coordinate efforts to resolve the situation.

The incident report is created when the situation is detected. The originator and known affected resources are linked to it.

As the incident is handled, details of the cause, additional affected resources, related incident reports and actions taken are attached to the incident report to create a complete record of the incident for future analysis.

There is a status in the incident report that records the progress to resolving the situation:

  • Raised
  • Reviewed
  • Validated
  • Resolved
  • Invalid
  • Ignored
  • Other

The people working on the incident can add notes to the incident report's note log to communicate the diagnosis, steps taken and decisions made.

Incident Report

Overview of the Incident Management process

  1. When an incident occurs, an incident report is created. There is support to create an incident on most Open Metadata Access Services (OMASs), Open Metadata Integration Services (OMISs) supporting integration connectors and the Open Metadata Engine Services (OMESs) supporting governance services.
  2. The request is routed to the Metadata Access Store and an IncidentReport entity, linked to metadata describing the originator and any impacted resources, is saved to its open metadata repository.
  3. The content of the incident report is managed via the Stewardship Action OMAS's interface. This could be through direct calls to the API or via an integration connector running in the Stewardship Integrator OMIS.
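A minimal incident report record following this lifecycle might look like the sketch below. The status values come from the list earlier in this section; the field names and helper functions are illustrative, not the open metadata type definition.

```python
# Sketch of an incident report record. Statuses are taken from this section;
# field names are illustrative.

INCIDENT_STATUSES = ["Raised", "Reviewed", "Validated", "Resolved",
                     "Invalid", "Ignored", "Other"]

def create_incident_report(originator, affected_resources):
    return {
        "status": "Raised",                      # initial status on detection
        "originator": originator,
        "affectedResources": list(affected_resources),
        "noteLog": [],                           # stewards' diagnosis and decisions
    }

def add_note(report, note):
    report["noteLog"].append(note)

report = create_incident_report("LDAP Verifier Governance Action Service",
                                ["founders-security-group"])
add_note(report, "Unexpected userId matt-darker found; escalating to security.")
assert report["status"] == "Raised"
assert len(report["noteLog"]) == 1
```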

Incident Management

An Incident Management Example

In this example, there is a governance service called LDAP Verifier Governance Action Service that is running in an Engine Host. It is responsible for detecting whether there are any unexpected entries in Coco Pharmaceuticals' LDAP server that supports access control.

It detects an unexpected member in the founders security group: a userId called matt-darker. Is this a valid entry, with Egeria's list of user identities out of date, or is it part of a cyber-attack? The governance action service creates an incident report identifying the security group and the unexpected userId.

The creation of the incident report in the open metadata ecosystem causes an event to be published by the Stewardship Action OMAS. It is picked up by an integration connector called Incident Manager Integration Connector. This integration connector publishes the incident report to Coco Pharmaceuticals' security incident management system as an issue, where it is picked up by the security team to work on. Any relevant updates made to the issue in the security incident management system are detected by the Incident Manager Integration Connector and reflected back in the incident report in the open metadata ecosystem.

Incident Management Example

Examples of using automated governance

Working with governance metadata (1.5 hours)
Governance Zoning (30 mins)

Governance Zoning

A Governance Zone defines a list of assets that are grouped together for a specific purpose. It may represent assets that are consumed or managed in a particular way, that should only be visible to particular groups of users, or that are processed by particular types of engine. There may also be zones used to indicate that an asset is in a particular state.

Figure 1

Figure 1: Examples of types of zones

For example, Coco Pharmaceuticals use a quarantine zone for data that has arrived from an external partner. It is not visible to the researchers until it has been cataloged and verified. Then it is added to the zones that others can see.

Figure 2

Figure 2: Visibility rules associated with governance zones

Zones are typically independent of one another, but they can be nested if desired.

Defining governance zones

Governance zones are defined as part of the governance program. They are stored in open metadata as GovernanceZone entities. They can be linked to governance definitions that may directly or indirectly identify the visibility rules that apply to the zone.

Figure 3

Figure 3: Visibility rules associated with governance zones

The visibility rules associated with the zones are typically executed:

Use cases

Governance zones help to support the following types of use cases:

  • Data sovereignty - by defining zones that represent the origin of data and using visibility rules that match consumer to origin.
  • Adjustments for legal jurisdiction - by defining zones for each jurisdiction and attaching appropriate rules to each.
  • Asset visibility and access control - by defining zones that group assets into
  • Data access control - using the zones assigned to the asset when setting up security tags.
  • Maintenance and backup processing - by defining zones that represent different archiving or maintenance requirements. The engines that perform the automated maintenance work through the list of assets in the relevant zones.
  • Understanding dependencies between different parts of the organizations to support the information supply chains - by defining zones for different parts of the organization and visualizing
  • Metering and billing - by defining zones that represent the cost structure of assets and using the zone membership of assets being used to determine the charge.

Membership of a zone

An asset can belong to all, one or many zones. The list of zones that an asset belongs to is configured in the zoneMembership property of its AssetZoneMembership classification. If it is blank, it means the asset logically belongs to all zones. Otherwise, it belongs only to the zones that are listed. It is added to or removed from a zone by updating the AssetZoneMembership classification.

As new assets are onboarded, use templates or governance action services to ensure the asset is assigned to the right set of zones as it is added to the catalog. The zones can then be maintained as needed throughout the lifetime of the asset.

Figure 4

Figure 4: Examples of types of zones

When designing the governance zones for your organization, it is necessary to take care that the visibility rules associated with an asset's zones are complementary rather than conflicting.

Figure 5

Figure 5: Thinking through the meaning of belonging to multiple zones

Controlling access to assets through the OMASs

All Open Metadata Access Services (OMASs) that retrieve assets, such as Asset Catalog, Asset Consumer and Asset Owner, use the supportedZones option that is configured for the service in their server's configuration document. This property defines the zones of assets that can be returned by that instance of the access service.

In addition, access services that create assets, use the defaultZones option to define the list of zones set up in any new asset they create.

Finally, access services that synchronize assets between different third party technologies, such as the Data Manager OMAS, also use the publishZones option to publish an asset to consumer zones once it is completely defined in the catalog.

Figure 6

Figure 6: Visibility setting in the Open Metadata Access Services (OMASs)
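The visibility rule behind supportedZones, and the use of defaultZones on creation, can be sketched as follows. The logic (an empty zoneMembership means the asset logically belongs to all zones) comes from this section; the function and zone names are illustrative.

```python
# Sketch of the zone visibility rule: an asset with an empty zoneMembership
# logically belongs to all zones; otherwise it is visible only if one of its
# zones appears in the service's supportedZones. Names are illustrative.

def is_visible(asset_zone_membership, supported_zones):
    if not asset_zone_membership:          # blank membership = all zones
        return True
    return any(zone in supported_zones for zone in asset_zone_membership)

def create_asset(name, default_zones):
    # Services that create assets use defaultZones for the new asset.
    return {"name": name, "zoneMembership": list(default_zones)}

asset = create_asset("clinical-measurements", default_zones=["quarantine"])
assert is_visible(asset["zoneMembership"], supported_zones=["research"]) is False
asset["zoneMembership"] = ["research"]     # published once verified
assert is_visible(asset["zoneMembership"], supported_zones=["research"]) is True
assert is_visible([], supported_zones=["research"]) is True
```

This mirrors the quarantine-zone story above: the new asset starts in quarantine, invisible to services that only support the research zone, until its membership is updated.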

The meaning, purpose and governance requirements for assets within a specific zone are maintained through the Governance Program OMAS.

It is also possible to associate security access control with a governance zone.

Anchor management (30 mins)

Anchors

Anchors are Referenceable metadata entities that group other entities together. They act like containers. This means, for example, if the anchor entity is deleted, all the entities anchored to this entity are also deleted. The value of establishing this grouping is to ensure that entities that have little meaning without their anchor entity are cleaned up properly and are not left to uselessly clutter the repository when the anchor is deleted.

Example: personal messages and profiles

For example, if a personal message is attached to a personal profile then that personal profile is its anchor. If the personal profile is deleted then the personal message is deleted too.
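The cascade-delete behavior in this example can be sketched as follows. The repository structure and entity names here are illustrative, not Egeria's repository services API.

```python
# Sketch of anchor behavior: deleting an anchor entity also deletes the
# entities anchored to it, while independent entities survive.
# The repository shape and names are illustrative.

repository = {
    "profile-1": {"type": "PersonalProfile", "anchorGUID": None},
    "message-1": {"type": "PersonalMessage", "anchorGUID": "profile-1"},
    "term-1":    {"type": "GlossaryTerm",    "anchorGUID": None},  # independent
}

def delete_entity(repo, guid):
    anchored = [g for g, e in repo.items() if e["anchorGUID"] == guid]
    for g in anchored:
        delete_entity(repo, g)        # cascade through the anchored entities
    del repo[guid]

delete_entity(repository, "profile-1")
assert "message-1" not in repository       # deleted with its anchor
assert "term-1" in repository              # independent entities survive
```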

Anchored entities are also bound by the visibility and security restrictions of their anchor.

Example: Assets

For example, Asset visibility is controlled by governance zones. An Asset is only visible through a service if it is a member of that service's supportedZones. Similarly, authorization to perform specific operations on an Asset is granted by the Open Metadata Security Services. When a SchemaType is attached to an Asset, it is anchored to that Asset. Subsequent requests to read or update the SchemaType will result in visibility and authorization checks for the requesting user being made with respect to its Asset anchor.

The anchor is also important in templated cataloguing where an existing entity is used as a template to create another. When the template entity is an anchor, it and all the other entities that are anchored to it are duplicated to create the new entry and relationships are created to all other entities that are linked to the template and its anchored entities.

Anchors classification

The Anchors classification contains the unique identifier (GUID) of the anchor entity. It is attached to each anchored entity, which makes it easier to navigate from an anchored entity to its anchor and on to the other entities anchored there.

Example: SchemaElements and Comments

Figure 1 is an illustration of this example, with the addition of an Asset. The entities that have the Anchors classification are those that are anchored to the Asset. This includes entities such as Ratings, Likes and Attachments (from the Open Discovery Framework (ODF)).

Figure 1

Figure 1: Examples of dependent entities that are anchored to an Asset

If a GlossaryTerm or InformalTag is attached to an Asset through a relationship, it is not anchored to the Asset. GlossaryTerms and InformalTags are independent entities and hence do not have an Anchors classification.

NoteLogs can be attached to many other Referenceables. They can be set up either to be anchored to a single Referenceable or to be their own anchor, allowing them to be attached to and detached from many Referenceables over their lifetime.

Example: NoteLog and Referenceables

For example, these are cases where the NoteLog is anchored to another Referenceable

  • NoteLogs are used to support the personal blog linked off of the Personal Profile in Community Profile OMAS.
  • Assets may have a NoteLog to record "news" for consumers such as planned maintenance and unexpected situations.

Egeria uses the Anchors classification on a NoteLog to indicate that the NoteLog is tied to the Referenceable it is attached to. The presence of this classification would prevent it from being linked to another Referenceable.

Figure 2 is an illustration of the additional objects connecting to an asset that do not have the Anchors classification because they are not anchored to the Asset. Also notice there are two NoteLogs attached to the asset, one with the Anchors classification and one without. The one with the Anchors classification is anchored to the Asset. The one without the Anchors classification is independent of the Asset.

Figure 2

Figure 2: Examples of other types of entities that are linked to an Asset but not necessarily anchored there

It is worthwhile maintaining the Anchors classification because reads of, and updates to, the anchored entities happen many times, and it is rare that an anchored entity changes its anchor during its lifetime.

LatestChange classification

The LatestChange classification is attached to Assets and GlossaryTerms. It is used to record the latest change to these types of entities and any of the entities anchored to them.

Example: LatestChange on Asset

So for example, if a hierarchy of SchemaElements, or a hierarchy of Comments, is anchored to an Asset, then the LatestChange classification attached to the Asset records changes to any of these entities. This includes changing property values, attaching or detaching entities through relationships, as well as any changes to their classifications.

If a GlossaryTerm is attached to an Asset through a relationship, they are not anchored together since they each have an independent lifecycle. A change to one of these entities is not reflected in the other's LatestChange classification. However, the act of attaching them to, or detaching them from, each other is recorded in both entities' LatestChange classifications.

Maintaining the LatestChange classification on an Asset means that it is easier to monitor for changes affecting the Asset and any of its anchored entities. However, it also means that it must be easy to locate the Asset from any of the anchored entities when they change, even though they may not be directly connected.
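
The bookkeeping described above can be sketched in a few lines. This is an illustrative Python model, not Egeria's Java implementation: the entity structures and the `anchorGUID` field name are simplifications chosen for the example. It shows how the Anchors classification lets a repository service jump straight from a changed entity to its anchor and update the anchor's LatestChange classification.

```python
# Simplified in-memory model of two entities: an Asset and a Comment
# anchored to it via the Anchors classification.
entities = {
    "asset-1":   {"type": "Asset", "classifications": {"LatestChange": None}},
    "comment-7": {"type": "Comment",
                  "classifications": {"Anchors": {"anchorGUID": "asset-1"}}},
}

def record_change(entity_guid: str, action: str) -> None:
    """Walk from a changed entity to its anchor and update LatestChange."""
    entity = entities[entity_guid]
    anchors = entity["classifications"].get("Anchors")
    # An entity with no Anchors classification is its own anchor.
    anchor_guid = anchors["anchorGUID"] if anchors else entity_guid
    anchor = entities[anchor_guid]
    anchor["classifications"]["LatestChange"] = {
        "changedEntity": entity_guid,
        "action": action,
    }

record_change("comment-7", "updateEntityProperties")
latest = entities["asset-1"]["classifications"]["LatestChange"]
# The asset's LatestChange now records the change made to the comment.
```

Because the anchor's GUID is held directly in the classification, no relationship traversal is needed to find the Asset when one of its anchored entities changes.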

Working with templates (30 mins)

When a new resource is catalogued, the catalog entry of a similar resource can be used as a template to set up the asset for the new resource. This means that the new asset can contain governance metadata attachments, not just the technical metadata extracted from the digital resource.

Templated cataloguing is useful for situations where new resources of the same kind are regularly catalogued.

An example

Peter Profile is responsible for cataloguing the weekly measurements supplied by the various hospitals as part of a clinical trial. These measurements are supplied with certain terms and conditions (also known as a license) that Coco Pharmaceuticals must not only adhere to, but prove that they are doing so. For that reason, when the measurements are catalogued, the asset for the measurements data set is linked to the license as well as other elements that help to ensure that the measurements data sets are appropriately used and governed.

Figure 1 shows Peter making calls to Egeria to catalog the first set of measurements received for the clinical trial. This includes an asset to represent the data set that is linked to the license along with a connection to allow the data scientist to connect to the data set and access the data and the schema showing the structure of the data in the data set. The data fields identified in the schema each link to the glossary term that describes the meaning of the data stored in the field. There are also two classifications on the asset:

  • AssetZoneMembership - The governance zones that the asset is a member of. This controls who can access the asset and its related metadata elements such as the connection and the schema.
  • Ownership - The owner of the data set. This is the person who is accountable for ensuring that Coco Pharmaceuticals adheres to the license.

Figure 1

Figure 1: In week 1, Peter manually creates the asset and links it to the governance elements needed to ensure the data set is used and protected as laid out in the license.

Without templating, Peter would need to issue the same sequence of requests to catalog each week's results from each of the hospitals. This is a lot of work for Peter, particularly as the number of clinical trials and participating hospitals rises. He may also make a mistake and forget one of the steps in the cataloguing process.

What if the catalog entry for the Week 1 measurements could be used as a template for cataloguing the subsequent weeks' measurements as shown in figure 2?

Figure 2

Figure 2: For subsequent weeks, the week 1 entry is used as a template. The result is an asset for each data set with a connection and a schema, along with the ownership and zone membership classifications. All of the assets are linked to the license, and the data fields in each schema are linked to the correct glossary terms.

This is the idea behind templated cataloguing. A template defining the common settings for a set of digital resources is created, and this template is used whenever one of these resources is catalogued.

Figure 3 shows a set of templates used by Coco Pharmaceuticals when cataloguing their digital landscape. There are different templates for different types of digital resources. Each would include the classifications and relationships that are relevant for the resources that they catalog. They are decorated with the Template classification to indicate that they do not represent real digital resources and should be used as templates.

Figure 3

Figure 3: A set of templates defined to use when cataloguing digital resources

When a template is used in cataloguing a digital resource, the caller needs to supply the values that must be unique for the new asset. These are typically the qualifiedName, displayName and description, and may also include the networkAddress for its connection's endpoint. These values override those in the template.

Egeria uses the Anchors classification to determine which elements linked to the template are duplicated and which elements are just linked to by the new catalog entry. In figure 2, for example, the connection and schema are anchored to the asset whilst the glossary terms and license are not. This means that copies of the connection and schema elements are made for the new catalog entry whilst the glossary terms and license just receive new relationships to the new catalog entry.

Finally, when a template is used, it is linked to the resulting element with the SourcedFrom relationship. This makes it easier to identify the elements that need changing if the template needs to be corrected or enhanced at a later date.

Figure 4

Figure 4: The SourcedFrom relationship links a template to the elements that are created from it
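
The copy-versus-link behaviour described above can be sketched as follows. This is an illustrative Python model with simplified structures (the `anchored` flag stands in for the Anchors classification; real Egeria templates and catalog entries are richer), intended only to show the decision the template mechanism makes for each attached element.

```python
# A template catalog entry: the connection and schema are anchored to it,
# the license and glossary term are independent entities.
template = {
    "properties": {"qualifiedName": "week1", "displayName": "Week 1"},
    "attachments": [
        {"name": "connection",   "anchored": True},
        {"name": "schema",       "anchored": True},
        {"name": "license",      "anchored": False},
        {"name": "glossaryTerm", "anchored": False},
    ],
}

def catalog_from_template(template: dict, overrides: dict) -> dict:
    """Create a new catalog entry: copy anchored elements, link the rest."""
    new_entry = {"properties": {**template["properties"], **overrides},
                 "copies": [], "links": ["SourcedFrom:template"]}
    for attachment in template["attachments"]:
        if attachment["anchored"]:
            new_entry["copies"].append(dict(attachment))   # duplicate element
        else:
            new_entry["links"].append(attachment["name"])  # relationship only
    return new_entry

week2 = catalog_from_template(
    template, {"qualifiedName": "week2", "displayName": "Week 2"})
```

The caller's overrides replace the unique property values, the anchored connection and schema are duplicated, and the license and glossary term simply gain new relationships, matching the behaviour shown in figure 2.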

Securing your metadata (1 hour)

Open Metadata Security

Open Metadata Security provides fine-grained authorization services for open metadata services, types and instances. Since each organization will have different security requirements, the support is implemented through connectors. Egeria defines the interfaces and when it will call the connector. You define the behavior the connector implements, and Egeria acts on the returned decision.

The metadata-security module defines the base classes and interfaces for the open metadata security connectors as well as the server implementation to host and call them.

There are two types of connector:

  • Platform metadata security connector - secures access to the platform services that are not specific to an OMAG Server. This includes the admin services to create new servers, the ability to start and stop servers, as well as the ability to query whether a server is running, and if it is, what services are active.

  • Server metadata security connector - secures access to the specific services of an OMAG server. This includes the server itself, specific services within the server, specific Assets and Connections managed by the server and the types and instances stored in the local repository.

The two types of connector are shown in Figure 1:

Figure 1

Figure 1: positioning of the security connectors

Within an OMAG Server Platform there is one instance of the open metadata platform security connector. This connector is configured once the platform is running using the admin service call:

POST {{platformURLRoot}}/open-metadata/admin-services/users/{{adminUserId}}/platform/security/connection
where the {{adminUserId}} is the administrator's userId. The connection for the connector and the platform URL root are passed in the request body. There are GET and DELETE services with the same URL to retrieve and remove this connector respectively.

The open metadata server security connector is configured for each OMAG server to allow for each server to have a different implementation. The admin services command to configure a security connector for a server is:

POST {{platformURLRoot}}/open-metadata/admin-services/users/{{adminUserId}}/servers/{{serverName}}/security/connection
where the {{adminUserId}} is the administrator's userId and {{serverName}} is the name of the server where the connector is to run. The connection for the server security connector is passed in the request body. Again, there are GET and DELETE services with the same URL to retrieve and remove this connector respectively.
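
The two admin-services URLs above can be assembled by a small helper. This sketch only builds the URL strings shown in the text; it does not send requests, and the userId and server name in the usage line are illustrative values in the style of the Coco Pharmaceuticals samples.

```python
# Build the admin-services URLs for configuring the two security connectors.
# The endpoint paths are taken from the documentation text above.

def platform_security_url(platform_url_root: str, admin_user_id: str) -> str:
    """URL for the platform metadata security connector configuration."""
    return (f"{platform_url_root}/open-metadata/admin-services/users/"
            f"{admin_user_id}/platform/security/connection")

def server_security_url(platform_url_root: str, admin_user_id: str,
                        server_name: str) -> str:
    """URL for a specific server's metadata security connector configuration."""
    return (f"{platform_url_root}/open-metadata/admin-services/users/"
            f"{admin_user_id}/servers/{server_name}/security/connection")

url = server_security_url("https://localhost:9443", "garygeeke", "cocoMDS1")
# POST this URL with the security connector's Connection in the request body;
# GET with the same URL retrieves the setting and DELETE removes it.
```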

The security implementation in a server potentially invokes the server security connector multiple times as the request (shown as dotted white arrow) is handled by the server code. Figure 2 shows the different layers of checks. Each layer is optional and so the server security connector can be implemented to support the most appropriate granularity of security for the situation. Details of the implementation choices are given in the security connector API.

Figure 2

Figure 2: layers of security checks within the server

The security connectors are optional. If they are not defined then there are no additional authorization checks performed inside the OMAG Server Platform nor the OMAG Servers hosted on the platform. As such, it is important that the open metadata platform security connector is configured as soon as the platform is started, and the server security connector is configured before the server is started for the first time.

Metadata security APIs

Below is a description of the API of the two Open Metadata Security Connectors.

Open metadata platform security connector interface

The connector that plugs into the platform implements the following interface.

  • OpenMetadataPlatformSecurity - provides the interface for a plugin connector that validates whether a calling user can access any service on an OMAG Server Platform. It is called within the context of a specific OMAG Server Platform request. Each OMAG Server Platform can define its own plugin connector implementation and will have its own instance of the connector.

    • validateUserForPlatform - Check that the calling user is authorized to issue a (any) request to the OMAG Server Platform.
    • validateUserAsAdminForPlatform - Check that the calling user is authorized to issue administration requests to the OMAG Server Platform.
    • validateUserAsOperatorForPlatform - Check that the calling user is authorized to issue operator requests to the OMAG Server Platform.
    • validateUserAsInvestigatorForPlatform - Check that the calling user is authorized to issue investigator requests to the OMAG Server Platform.
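
Egeria security connectors are implemented in Java; the Python sketch below only illustrates the kind of decision logic a connector implementing a subset of this interface might encode. The user lists and names are invented for the example.

```python
# Illustrative decision logic for two of the platform security checks.
PLATFORM_USERS = {"garygeeke", "erinoverview", "peterprofile"}
PLATFORM_ADMINS = {"garygeeke"}

class UserNotAuthorizedError(Exception):
    """Raised when a security check fails, mirroring the connector pattern."""

def validate_user_for_platform(user_id: str) -> None:
    """Check the user may issue any request to the platform."""
    if user_id not in PLATFORM_USERS:
        raise UserNotAuthorizedError(f"{user_id} may not call this platform")

def validate_user_as_admin_for_platform(user_id: str) -> None:
    """Check the user may issue administration requests to the platform."""
    validate_user_for_platform(user_id)
    if user_id not in PLATFORM_ADMINS:
        raise UserNotAuthorizedError(f"{user_id} is not a platform administrator")
```

Each method either returns quietly (access granted) or raises an exception (access denied), which is the general shape of the validation calls in this interface.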

Open metadata server security connector interface

The connector that can be defined for an OMAG Server offers a series of layers of security checks. An organization can choose which layers to make use of and which to allow all requests to pass. Figure 2 shows the layers. Each layer is implemented in a separate interface and the connector can choose which interfaces to implement. Below are the interfaces and methods for the different layers:

  • OpenMetadataServerSecurity - provides the root interface for a connector that validates access to Open Metadata services and instances for a specific user. There are other optional interfaces that define which actions should be validated.

    • validateUserForServer - Checks that the calling user is authorized to issue a (any) request to the OMAG Server.
    • validateUserAsServerAdmin - Checks that the calling user is authorized to update the configuration for a server.
    • validateUserAsServerOperator - Checks that the calling user is authorized to issue operator requests to the OMAG Server.
    • validateUserAsServerInvestigator - Checks that the calling user is authorized to issue investigator requests to the OMAG Server.
  • OpenMetadataServiceSecurity - provides the interface for a plugin connector that validates whether a calling user can access a specific metadata service. It is called within the context of a specific OMAG Server. Each OMAG Server can define its own plugin connector implementation and will have its own instance of the connector. However, the server name is supplied so a single connector can use it for logging error messages and locating the valid user list for the server.

    • validateUserForService - Checks that the calling user is authorized to issue this request.
    • validateUserForServiceOperation - Checks that the calling user is authorized to issue this specific request.
  • OpenMetadataRepositorySecurity - defines security checks for accessing and maintaining open metadata types and instances in the local repository. An instance is an entity or a relationship. There is also a special method for changing classifications added to an entity.

    • validateUserForTypeCreate - Tests for whether a specific user should have the right to create a typeDef within a repository.
    • validateUserForTypeRead - Tests for whether a specific user should have read access to a specific typeDef within a repository.
    • validateUserForTypeUpdate - Tests for whether a specific user should have the right to update a typeDef within a repository.
    • validateUserForTypeDelete - Tests for whether a specific user should have the right to delete a typeDef within a repository.
    • validateUserForEntityCreate - Tests for whether a specific user should have the right to create an instance within a repository.
    • validateUserForEntityRead - Tests for whether a specific user should have read access to a specific instance within a repository. May also remove content from the entity before it is passed to caller.
    • validateUserForEntitySummaryRead - Tests for whether a specific user should have read access to the summary (header and classifications) of a specific instance within a repository.
    • validateUserForEntityProxyRead - Tests for whether a specific user should have read access to the proxy (header, classifications and unique properties) of a specific instance within a repository.
    • validateUserForEntityUpdate - Tests for whether a specific user should have the right to update an instance within a repository.
    • validateUserForEntityClassificationUpdate - Tests for whether a specific user should have the right to update the classification for an entity instance within a repository.
    • validateUserForEntityDelete - Tests for whether a specific user should have the right to delete an instance within a repository.
    • validateUserForRelationshipCreate - Tests for whether a specific user should have the right to create an instance within a repository.
    • validateUserForRelationshipRead - Tests for whether a specific user should have read access to a specific instance within a repository. May also remove content from the relationship before it is passed to caller.
    • validateUserForRelationshipUpdate - Tests for whether a specific user should have the right to update an instance within a repository.
    • validateUserForRelationshipDelete - Tests for whether a specific user should have the right to delete an instance within a repository.
    • validateEntityReferenceCopySave - Tests for whether a reference copy should be saved to the repository.
    • validateRelationshipReferenceCopySave - Tests for whether a reference copy should be saved to the repository.
  • OpenMetadataEventsSecurity - defines security checks for sending and receiving events on the open metadata repository cohorts.

    • validateInboundEvent - Validates whether an event received from another member of the cohort should be processed by this server. May also remove content from the event before it is processed by the server.
    • validateOutboundEvent - Validates whether an event should be sent to the other members of the cohort by this server. May also remove content from the event before it is sent to the cohort.
  • OpenMetadataAssetSecurity - validates what a user is allowed to do with Assets. The methods are given access to the whole asset to allow a variety of values to be tested.

    • setSupportedZonesForUser - Provides an opportunity to override the deployed module setting of supportedZones with a user-specific list.
    • validateUserForAssetCreate - Tests for whether a specific user should have the right to create an asset.
    • validateUserForAssetRead - Tests for whether a specific user should have read access to a specific asset.
    • validateUserForAssetDetailUpdate - Tests for whether a specific user should have the right to update an asset. This is used for a general asset update, which may include changes to the zones and the ownership.
    • validateUserForAssetAttachmentUpdate - Tests for whether a specific user should have the right to update elements attached directly to an asset such as schema and connections.
    • validateUserForAssetFeedback - Tests for whether a specific user should have the right to attach feedback - such as comments, ratings, tags and likes, to the asset.
    • validateUserForAssetDelete - Tests for whether a specific user should have the right to delete an asset.
  • OpenMetadataConnectionSecurity - defines the interface of a connector that is validating whether a specific user should be given access to a specific Connection object. This connection information is retrieved from an open metadata repository. It is used to create a Connector to an Asset. It may include user credentials that grant access to data and functions within the Asset far beyond the specific user's own approval level. This is why this optional check is performed by any open metadata service that is returning a Connection object (or a Connector created with the Connection object) to an external party.

    • validateUserForConnection - Tests for whether a specific user should have access to a connection.
    • validateUserForAssetConnectionList - Selects an appropriate connection for a user from the list of connections attached to an Asset.
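
To make the asset-level checks concrete, here is a hedged sketch of a zone-based validateUserForAssetRead decision, in the style an OpenMetadataAssetSecurity implementation might use. The zone names and the user-to-zone mapping are invented for illustration; a real connector receives the whole Asset element and is written in Java.

```python
# Invented mapping of users to the governance zones they may see.
USER_ZONES = {
    "peterprofile":  {"quarantine", "clinical-trials"},
    "calliequartile": {"data-lake"},
}

def validate_user_for_asset_read(user_id: str, asset_zones: list) -> bool:
    """Allow the read if the user can see at least one of the asset's zones."""
    visible = USER_ZONES.get(user_id, set())
    return bool(visible.intersection(asset_zones))
```

A deny decision in the real connector would be signalled by throwing an exception rather than returning a boolean; the boolean keeps this sketch short.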

Sample connectors

There are sample implementations of the security connectors for Coco Pharmaceuticals in the "samples" module under open-metadata-security-samples.

Linking metadata governance to your governance program (1 hour)

Setting up your Governance Program

An organization's governance program describes the cross-cutting initiatives that ensure the organization is operating efficiently and ethically with optimal use of resources. Egeria supports open metadata definitions that aid the management and exchange of information with different tools and systems that drive the governance program.

The governance program is divided into governance domains.

Governance domains

Governance domains are the areas of an organization's operation that need a specific governance focus. For example, a governance domain may be driving a transformation strategy, planning support for a regulation, driving an efficiency campaign or developing an assurance program for your organization.

A governance domain typically involves the ongoing cooperation of multiple teams from different business areas. There is often a business strategy and a set of targets associated with the domain since its purpose is to provide some level of improvement or benefit to the organization.

Each domain is typically the responsibility of a different executive in the organization. Different domains may use slightly different terminology and often run different tools but in fact they are very similar in the way that they operate. Egeria allows the teams from the different governance domains to collaborate and benefit from each other's efforts.

Examples of governance domains

The governance domains can vary in scope and importance to the business. In the example below, Corporate Governance ensures that the business operates legally. It is the key focus of the board of directors and includes financial reporting.

At the heart of the organization's operation are three governance domains that are often run separately, but in fact are highly dependent on one another:

  • Data (or Information) Governance focuses on the appropriate use and management of data.
  • Information Security (InfoSec) focuses on the security of the IT Systems (and sometimes the physical security of buildings and plant).
  • IT Infrastructure ensures systems are correctly set up and managed so that they deliver the level of service required by the business. This governance domain often uses a tailored version of the IT Infrastructure Library (ITIL).

Software development needs governance to ensure it is properly designed, built and tested - and fit for purpose. This governance is typically guided by a software development method, such as Agile Development, and covers all aspects of the software development lifecycle.

There are other domains that are more specialized. For example:

  • Privacy focuses on compliance with data subject privacy. It interacts with the data governance, software development lifecycle and IT Infrastructure domains.
  • Human Capital Management combines Human Resources (HR) and the management team of the organization to ensure the people they employ receive the training and support to build the right skills to support the business.
  • Risk Management assesses and, where necessary, mitigates the risks that may impact the organization. It is often led by the finance team because the level of risk can affect the organization's credit rating. However, in some industries, such as banking, where the reporting and management of risk is covered by specific regulations, there is a separate organization that interfaces with the regulators.
  • Physical Asset Management manages the physical buildings, furniture, machinery, computers and other equipment that the organization owns. It is responsible for maintenance and replacement of these items.
  • Product Assurance ensures the products that the organization sells meet regulatory requirements and match the ethos of the organization.
  • Procurement focuses on the policies and rules associated with the products and services bought by people across the organization. There may be a separate procurement team that interacts with suppliers to ensure bulk discounts and quality. They would work with other leaders in the organization to ensure the approach matches the needs of the business.
  • Sustainability is a relatively new governance domain that focuses on the effective use of resources to improve long-term sustainability of the organization's operations.

Governance Domain Examples

Examples of different governance domains within an organization.

Governance domains are represented by GovernanceDomainDescription entities in open metadata. They are organized into a GovernanceDomainSet collection.

Governance Domain Descriptions

The governance domain descriptions organized in a governance domain set

The governance domain descriptions include the domainIdentifier property, a displayName and a description. The domainIdentifier property is an integer and by convention "0" means "applies to all domains". For example:

Domain Identifier | Display Name | Description
--- | --- | ---
0 | All | Governance program leadership and shared definitions.
1 | Data | The governance of data and its use.
2 | Privacy | The support for data privacy.
3 | Information Security (InfoSec) | The governance that ensures IT systems and the data they hold are secure.
4 | IT Infrastructure | The governance of the configuration and management of IT infrastructure and the software that runs on it.
5 | Software Development Lifecycle | The governance of the software development lifecycle.
6 | Corporate Governance | The governance of the organization as a legal entity.
7 | Physical Asset Management | The governance of physical assets.
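
The domainIdentifier convention above, where "0" means "applies to all domains", can be captured in a small helper. This is an illustrative sketch, not an Egeria API; the dictionary just mirrors the example table.

```python
# Example governance domain identifiers, mirroring the table above.
GOVERNANCE_DOMAINS = {
    0: "All",
    1: "Data",
    2: "Privacy",
    3: "Information Security (InfoSec)",
    4: "IT Infrastructure",
    5: "Software Development Lifecycle",
    6: "Corporate Governance",
    7: "Physical Asset Management",
}

def applies_to_domain(definition_domain: int, target_domain: int) -> bool:
    """A definition with domainIdentifier 0 applies to every domain."""
    return definition_domain == 0 or definition_domain == target_domain
```

So a role or definition created with domainIdentifier 0 is picked up by tooling working in any domain, while one created with, say, 1 is specific to Data governance.
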
Defining governance domains

Governance domain descriptions can be defined in an Open Metadata Archive or through the Governance Program OMAS.

The egeria-samples.git repository includes a sample called Sample metadata archives for Coco Pharmaceuticals that creates open metadata archives with basic definitions for Coco Pharmaceuticals. This includes the definition of this organization's governance domains with their communities and governance officers.

Governance Leadership

A governance domain needs a leader to drive the change and focus that it demands. When an organization decides to create a new governance domain, the appointment of the leader of the domain is the first decision. This person then organizes the people and resources that will drive the definition and rollout of changes needed to make the governance domain successful.

The leadership of the governance domain is a type of governance role.

Governance Roles

Governance roles define the additional responsibilities and tasks that people need to do in order to make the governance domain successful. These roles lead initiatives or take ownership of resources. Some governance roles are full-time jobs, but most require only a few hours from time to time. The governance roles are identified and defined as the operation of the domain is worked out. This includes specifying the responsibilities so that an assessment of the cost/benefits can be made. People are appointed to the roles later, as the governance domain is put into operation.

Egeria defines a set of governance role types to provide a framework for your governance roles. Governance roles have a domainIdentifier property to identify that the role is defined for a particular domain. You create instances of these types to define the roles for the domain. For example, there is a governance role type called AssetOwner. You may want a role that is responsible for granting access to a data asset. So you may create a role of type AssetOwner called Asset access manager with responsibilities for approving requests to access the data.

Multiple people may be assigned to a role. So you can choose to define generic roles and appoint multiple people to them. Alternatively, if you want to define precisely which resources they are responsible for, then you need more fine-grained roles.

For example, you may have 100 data assets and each needs at least one person to grant access to it. Here are some different choices on how you could set this up.

  • You may choose to have one Asset access manager role and appoint a small group of people to the role. These people can grant access to any of the 100 data sets.

  • You may want to organize the data assets into different groups and appoint different people to grant access to the assets in each group. In this case, you would have a governance role instance for each group. They would be linked to the element that represents the group via the AssignmentScope relationship.

Governance roles with specific assignment scopes

  • You may want to have specific roles for each data asset. Defining all of these roles explicitly could create a lot of overhead. Instead, you could choose to have a generic role where the appointment and scope are handled via the Ownership classification.

Using the ownership classification to create an explicit assignment of responsibility

These different approaches allow you to have enough detail in your open metadata definitions to configure tools and report on governance activity, whilst minimising effort to keep the definitions up to date.

Representing governance roles in open metadata

Governance roles are types of PersonRole. They link to a governance responsibility (a type of governance definition) using the GovernanceResponsibilityAssignment relationship. The profile of a person is linked to the governance role using the PersonRoleAppointment relationship.

Governance Roles

The Governance Program OMAS supports the setting up of governance roles. The Community Profile OMAS supports the appointment of people to roles.

Governance Leadership Communities

The governance domain covers activity occurring in different parts of the business. The people working for the domain will not be reporting directly to the governance domain leader - they continue to report through their line of business. Therefore, the governance domain leader needs a mechanism to bring people together, share progress and the latest information. This can be done through a community.

Governance Domain Communities

Each governance domain would typically have a community, led by the governance domain leader and with a membership consisting of the people appointed to the governance roles supporting the governance domain.

Governance Communities

As the governance roles are defined, they are added to the governance domain community using the CommunityMembership relationship. As people are appointed to the roles, they automatically become a member of the community.

Governance Leaders Community

Often the leaders of the governance domains need a forum to share ideas and collaborate. This can be achieved by setting up a community for the governance leaders. This means the governance domain leader is the head of their governance domain community and is a member of the governance leadership community.

Governance Leadership Community

Defining the governance communities' membership in Egeria means that as people are appointed to or removed from roles, Egeria can automatically maintain access control lists and email lists for the membership.
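
The derivation described above can be sketched as follows. This is an illustrative Python model, not Egeria code: the role names, userIds and email domain are invented, and the two dictionaries stand in for the PersonRoleAppointment and CommunityMembership relationships.

```python
# Who is appointed to each governance role (PersonRoleAppointment).
ROLE_APPOINTMENTS = {
    "data-governance-leader": ["erinoverview"],
    "asset-access-manager":   ["peterprofile", "calliequartile"],
}

# Which roles belong to each community (CommunityMembership).
COMMUNITY_ROLES = {
    "data-governance-community": ["data-governance-leader",
                                  "asset-access-manager"],
}

def community_email_list(community: str,
                         domain: str = "cocopharma.com") -> list:
    """Derive a community's email list from its roles' current appointees."""
    members = set()
    for role in COMMUNITY_ROLES.get(community, []):
        members.update(ROLE_APPOINTMENTS.get(role, []))
    return sorted(f"{name}@{domain}" for name in members)
```

Because the list is derived from the role appointments, appointing or removing a person automatically updates every community (and email list) that includes their role.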

Governance domain management process

Most activity within each governance domain is iteratively developed and reviewed. Managing the governance domain includes:

  • Designing how the governance domain will operate:

    • Understanding the business drivers and regulations that provide the motivation and direction to the governance program.

    • Laying down the governance policies (principles, obligations and approaches) that frame the governance program.

    • Planning and defining the governance controls that detail how these governance policies will be implemented in the organization, and enumerating the implications of these decisions and the expected outcomes.

    • Defining the organization's roles and responsibilities that will support the governance program.

    • Defining the classifications and governance zones that will organize the assets being governed.

    • Defining the subject areas that will organize the data-oriented definitions such as glossary terms, valid values and quality rules.

    • Defining the governance metrics that will measure the success of the governance domain.

    • Defining the execution points that identify how the governance domain's decisions and actions taken are to be implemented.

    • Planning the rollout of changes to the organization, processes and technology that will drive the governance domain.

  • Reviewing the impact of the governance activity against the goals of the governance domain:

    • Adjusting the governance implementation as necessary.

  • Reviewing the strategy, business and regulatory landscape:

    • Adjusting the governance definitions and metrics as necessary.

Governance Definitions

A Governance Definition is a metadata element that describes the context or purpose for an activity that supports the organization's operation. The picture shows the main types of governance definition and how they link together to create a coherent response to a business strategy or regulation.

Traceability through governance definitions

Using governance definitions to provide traceability from business drivers and regulations to actions.

Subject areas

Subject areas are topics or domains of knowledge that are important to the organization. Typically, they cover types of assets (such as data) that are widely shared across the organization and there is business value in maintaining consistency in the data values in each copy.

The role of a subject area definition is to act as a collection point for all the subject area materials. This includes data-oriented definitions such as glossary terms, valid values and quality rules.

The materials that are part of the subject area are classified as such using the SubjectArea classification.

Subject Area Definition

Defining a subject area

Each subject area has an owner (see SubjectAreaOwner) who is responsible for the materials relating to the subject area. Often the subject area owner is a senior person in the organization with expertise in the subject area. They coordinate other subject-matter experts in authoring and maintaining the materials and guiding their use. It is helpful to set up a community of people working on the subject area's materials to coordinate email distribution lists, news and events.

Subject Area Community

People working on a subject area come together in a community

The subject area definition can be linked to governance definitions via the GovernedBy relationship.

The organization of the subject areas is orthogonal to the governance domains. Some subject areas are common to multiple governance domains; others are specialized within a governance domain. Similarly, an organization can create governance definitions that are applicable to all subject areas, or are specific to the subject area they are linked to. Typically, they will have a mixture of these.

Further information

Governance classification, tagging and linking

One of the ways to reduce the cost of governance is to define groups of similar assets/resources along with the governance definitions that apply to members of the group. This avoids having to make decisions on how to manage each asset/resource. The cataloguing process just needs to work out which group(s) to place the asset in. Labels such as classifications, and tags of different types are used to identify these group assignments. When a governance process is operating on the asset/resource, it looks up the labels and follows the governance definitions for the group.

Figure 4

Figure 4: Different types of tags used to group assets for governance

The different types of labels used to group assets/resources serve different purposes and may indicate how official they are:

  • Governance Zones group assets according to their use. They are typically used to control the visibility of the resource's asset definition.

  • Governance Classifications define the groups used for specific types of governance.

    • Confidence Governance Classification defines the level of confidence that should be placed in the accuracy of related data items. This limits the scope that the data can be used in.
    • Confidentiality Governance Classification defines the level of confidentiality or secrecy needed with particular data.
    • Criticality Governance Classification defines how critical the related resources are to the continued operation of the organization.
    • Impact Governance Classification defines how much impact a particular situation has on the operation of the organization.
    • Retention Governance Classification defines how long a resource (typically data) must be retained by the organization.
  • License Types define the contract (also known as the terms and conditions) that describes how the asset/resource can be used.

  • Certification Types define specific characteristics of an asset/resource that has been verified for a particular span of time.

  • SecurityTags identify labels and properties that are used in determining which data protection rules should be executed when particular data is requested. They can be attached to assets or schema elements depending on the scope of data that the security tags apply to. The synchronized access control feature describes how security tags are set up and used.

The labels may be assigned directly to the asset, or to elements, such as schemas and glossary terms that are linked to the asset.
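The mechanics of group-based governance can be sketched as follows. This is a minimal, hypothetical illustration of the lookup described above, not Egeria API code: the `Asset` class, the label names and the rule mapping are all invented for the example.

```python
# Hypothetical sketch: resolving governance definitions from an asset's labels.
# The class, label values and rule names are illustrative, not Egeria APIs.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    classifications: dict = field(default_factory=dict)  # label type -> value

# Governance definitions registered once per classification value (the "group"),
# so no per-asset decisions are needed at cataloguing time.
GOVERNANCE_RULES = {
    ("Confidentiality", "restricted"): ["encrypt-at-rest", "mask-on-read"],
    ("Retention", "7-years"): ["archive-after-1-year", "delete-after-7-years"],
}

def applicable_rules(asset: Asset) -> list[str]:
    """Look up the governance definitions for every group the asset belongs to."""
    rules: list[str] = []
    for label, value in asset.classifications.items():
        rules.extend(GOVERNANCE_RULES.get((label, value), []))
    return rules

patient_data = Asset("patient-records",
                     {"Confidentiality": "restricted", "Retention": "7-years"})
print(applicable_rules(patient_data))
# → ['encrypt-at-rest', 'mask-on-read', 'archive-after-1-year', 'delete-after-7-years']
```

The cataloguing process only assigns labels; the governance process resolves them to rules at run time, which is what keeps the per-asset governance cost low.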

Setting up the levels for your governance classifications

The values used in governance classifications identify the specific group that the classified asset belongs to. Often an organization defines its own levels, and these can be set up as GovernanceClassificationLevel definitions.

Figure 5

Figure 5: Governance classifications that use governance level definitions

Egeria has a set of default values that can be set up using the createStandardGovernanceClassificationLevels method.
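A governance classification level pairs an ordered identifier with a display name and description. The sketch below shows the shape of such definitions; the confidentiality level names are invented for illustration and are not Egeria's exact standard values.

```python
# Illustrative GovernanceClassificationLevel definitions.
# The level names below are example values, not Egeria's defaults.
from dataclasses import dataclass

@dataclass(frozen=True)
class GovernanceClassificationLevel:
    level_identifier: int   # ordering of the level within the classification
    display_name: str
    description: str

CONFIDENTIALITY_LEVELS = [
    GovernanceClassificationLevel(0, "Unclassified", "No confidentiality assessment made."),
    GovernanceClassificationLevel(1, "Internal", "For use inside the organization only."),
    GovernanceClassificationLevel(2, "Confidential", "Limited to named individuals or roles."),
    GovernanceClassificationLevel(3, "Restricted", "Highly sensitive; strictly controlled."),
]

def level_for(identifier: int) -> GovernanceClassificationLevel:
    """Resolve a level identifier stored in a classification to its definition."""
    return next(l for l in CONFIDENTIALITY_LEVELS if l.level_identifier == identifier)

print(level_for(2).display_name)  # → Confidential
```

Because the identifiers are ordered, governance processes can compare levels (for example, "at least Confidential") rather than matching names.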

Measures and metrics

An important aspect of the governance program is the ability to measure its effectiveness and to identify the assets that are delivering the highest value or operating with the greatest efficiency.

A value that should be captured to demonstrate the effectiveness of the governance program is documented using the GovernanceMetric entity. It is linked to the appropriate governance definition and can be linked to a data set where the specific measurements are being gathered.

The calculation of governance metrics is often a summary of many other measurements associated with specific resources (such as data sources and processes) operating under the scope of the governance program. These resources are catalogued as Assets.

Figure 6

Figure 6: Measuring governance through an external data set

The definition of their expected behavior or content can be captured using the GovernanceExpectations classification attached to the Asset. The measurements that support the assessment of a particular resource can be gathered and stored in a GovernanceMeasurements classification attached to its Asset.

Figure 7

Figure 7: Setting expectations and gathering results in classifications

The measurement classification may be attached to a related element that describes an aspect of its operation. For example, in figure 8 the measurement is attached to a process instance that captures a specific run of a process. The expected values are attached to its parent process.

Figure 8

Figure 8: Attaching the measurements to related elements
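Comparing a GovernanceExpectations classification against the corresponding GovernanceMeasurements amounts to checking each measured property against its target. The property names and comparison scheme below are hypothetical, chosen only to show the idea.

```python
# Hypothetical comparison of expectations against gathered measurements.
# Property names and targets are illustrative, not Egeria-defined values.

# Each expectation carries a comparison direction and a target value.
expectations = {
    "rows_processed_per_run": (">=", 10_000),
    "error_rate": ("<=", 0.01),
}
measurements = {"rows_processed_per_run": 9_500, "error_rate": 0.02}

OPS = {">=": lambda actual, target: actual >= target,
       "<=": lambda actual, target: actual <= target}

def assess(expected: dict, measured: dict) -> dict:
    """Return, per property, whether the measurement meets the expectation."""
    return {name: OPS[op](measured[name], target)
            for name, (op, target) in expected.items() if name in measured}

print(assess(expectations, measurements))
# → {'rows_processed_per_run': False, 'error_rate': False}
```

A governance metric can then summarize these per-asset assessments, for example as the percentage of assets meeting all of their expectations.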

Execution points

A governance execution point defines a specific activity that supports governance.

There are three types:

  • A Control Point is a place in the processing where a decision needs to be made. It may be a choice on whether to tolerate a reported situation or to resolve it - or it may be a decision on how to solve it.

  • A Verification Point describes processing that is testing if a desired condition is true. Quality rules are examples of verification points. The result of a verification point is the output of the test. It may, for example, be a boolean, classification or a set of invalid values.

  • An Enforcement Point describes processing that enforces a specific condition. For example, data may need to be encrypted at a certain point in the processing. The encryption processing is an enforcement point.
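The three execution point types can be illustrated with a small, hypothetical data pipeline. The functions below are illustrative only; they are not Egeria APIs.

```python
# Sketch of the three execution point types in a data pipeline.
# Function names and the record layout are invented for illustration.

def verification_point(record: dict) -> bool:
    """Verification point: test whether a desired condition is true
    (a quality rule checking that an email address is present)."""
    return bool(record.get("email"))

def enforcement_point(record: dict) -> dict:
    """Enforcement point: enforce a condition unconditionally
    (mask a sensitive value before the record leaves the pipeline)."""
    masked = dict(record)
    if "ssn" in masked:
        masked["ssn"] = "***-**-" + masked["ssn"][-4:]
    return masked

def control_point(record: dict) -> str:
    """Control point: decide whether to tolerate the situation or resolve it."""
    if verification_point(record):
        return "accept"
    return "route-to-steward"   # resolve by handing off to a data steward

record = {"email": "", "ssn": "123-45-6789"}
print(control_point(record))              # → route-to-steward
print(enforcement_point(record)["ssn"])   # → ***-**-6789
```

Note the division of labour: the verification point only reports, the control point decides, and the enforcement point acts regardless of any decision.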

The ExecutionPointDefinition elements are created during the design of the governance program. They characterize the types of execution points that are needed to support the governance requirements. They are linked to the Governance Definition that they support using the ExecutionPointUse relationship. Typically, the governance definitions linked to the governance execution point definitions are:

  • Governance Processes
  • Governance Procedures

Often execution points need to be integrated with the normal activity of the business, but they may also represent additional standalone activity.

The classifications ControlPoint, VerificationPoint and EnforcementPoint are used to label governance implementation elements with the type of execution point and the qualified name of the corresponding definition, if any. They are often found on elements such as:

  • Governance Action Types
  • Governance Actions

These classifications help in the review of the implementation of the governance program and can be used to drive additional audit logging.

Figure 9

Implementing the actions defined in your governance definitions.

Governance Rollout

Governance Campaigns

Governance Projects

Open metadata implementation

How the Open Metadata Access Services (OMASs) support the governance program
  • The Governance Program OMAS supports the setting up of a governance domain and its associated definition elements.

  • The Community Profile OMAS supports the definition of the profiles for people and teams that will support the governance program. These are linked to the governance roles defined by the governance program.

  • The Project Management OMAS supports the rollout of the governance program by commissioning campaigns and projects to implement the governance controls and the collection of measurements to assess the success of the program.

  • The Digital Architecture OMAS provides the setup of the digital landscape that supports the governance program. This includes the definitions of the information supply chains and solution components that support the organization's activities.

  • The Digital Service OMAS documents the business capabilities along with their digital services that are supported by the governance program.

  • The Governance Engine OMAS supports the implementation of technical controls and the choreography of their execution.

  • The Stewardship Action OMAS supports the stewards as they manage the exceptions detected to the governance program.

  • The Data Privacy OMAS supports the operational aspects of managing privacy as part of the organization's activities.

  • The Subject Area OMAS supports the definitions of the vocabularies associated with a subject area.

  • The Data Manager OMAS supports the automated cataloging of assets and configuration of technology that is managing them.

  • The Security Manager OMAS supports the configuration of technology that is managing the security of assets.

  • The Security Officer OMAS supports the definitions of users and groups and related definitions that make up the user directory.

  • The Asset Manager OMAS supports the automated exchange of governance definitions between catalogs and asset managers to create a consistent rollout of governance requirements.

  • The Asset Owner OMAS supports the linking of governance definitions and classifications to assets to define how they should be governed.

  • The Asset Consumer OMAS supports the visibility of the governance definitions and classification by consumers of the assets.

The egeria-samples.git repository includes a sample called Sample metadata archives for Coco Pharmaceuticals that creates open metadata archives with basic definitions for Coco Pharmaceuticals. This includes the definition of this organization's governance domains with their communities and governance officers.