Building Subject Area materials
Egeria provides a comprehensive set of open metadata types for managing common data definitions. These types provide a common language and format for exchanging these definitions between tools and metadata repositories. Each tool/repository provides a mapping to the Egeria types and Egeria manages the exchange of metadata between these parties.
The glossary is at the heart of the materials for a subject area. Figure 1 shows that the glossary contains glossary terms. Each glossary term describes a concept used by the business. It is also possible to link two glossary terms together with a relationship. The relationship may describe a semantic relationship or a structural one.
Figure 1: Glossaries for describing concepts and the relationships between them
Semantic relationships include:
- RelatedTerm is a relationship used to say that the linked glossary term may also be of interest. It is like a "see also" link in a dictionary.
- Synonym is a relationship between glossary terms that have the same, or a very similar meaning.
- Antonym is a relationship between glossary terms that have the opposite (or near opposite) meaning.
- PreferredTerm is a relationship that indicates that one term should be used in place of the other term linked by the relationship.
- ReplacementTerm is a relationship that indicates that one term must be used instead of the other. This is a stronger version of the PreferredTerm.
- Translation is a relationship that defines that the linked terms represent the same meaning but each is written in a different language. Hence, one is a translation of the other. The language of each term is defined in the Glossary that owns the term.
- IsA is a relationship that defines that one term is a more generic term than the other term. For example, this relationship would be used to say that "Cat" IsA "Animal".
Structural relationships in the glossary are relationships that show how terms are typically used together.
- UsedInContext links a term to another term that describes a context. This helps to distinguish between terms that have the same name but different meanings depending on the context.
- HasA is a term relationship between a term representing a SpineObject (see glossary term classifications below) and a term representing a SpineAttribute.
- IsATypeOf is a term relationship between two SpineObjects saying that one is the subtype (specialisation) of the other.
- TypedBy is a term relationship between a SpineAttribute and a SpineObject to say that the SpineAttribute is implemented using a type represented by the SpineObject.
- See Area 3 in the Open Metadata Types to understand how these concepts are represented on open metadata.
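The term relationships listed above can be pictured as typed links in a small graph. The following is a minimal sketch in Python, assuming a hypothetical in-memory `TermGraph` class; it is not the Egeria API, and only the relationship type names come from the lists above.

```python
from dataclasses import dataclass, field

# Relationship type names are taken from the lists above. Everything else
# (class names, method names) is hypothetical and for illustration only.
SEMANTIC = {"RelatedTerm", "Synonym", "Antonym", "PreferredTerm",
            "ReplacementTerm", "Translation", "IsA"}
STRUCTURAL = {"UsedInContext", "HasA", "IsATypeOf", "TypedBy"}


@dataclass(frozen=True)
class TermRelationship:
    type_name: str
    end1: str  # name of the first glossary term
    end2: str  # name of the second glossary term


@dataclass
class TermGraph:
    terms: dict = field(default_factory=dict)          # term name -> description
    relationships: list = field(default_factory=list)  # TermRelationship objects

    def add_term(self, name, description):
        self.terms[name] = description

    def link(self, type_name, end1, end2):
        """Link two known terms with one of the relationship types above."""
        if type_name not in SEMANTIC | STRUCTURAL:
            raise ValueError(f"unknown relationship type: {type_name}")
        if end1 not in self.terms or end2 not in self.terms:
            raise KeyError("both terms must be added before they are linked")
        self.relationships.append(TermRelationship(type_name, end1, end2))

    def linked(self, term, type_name):
        """Return the terms linked to `term` by `type_name`, in either direction."""
        result = []
        for rel in self.relationships:
            if rel.type_name != type_name:
                continue
            if rel.end1 == term:
                result.append(rel.end2)
            elif rel.end2 == term:
                result.append(rel.end1)
        return result


graph = TermGraph()
graph.add_term("Cat", "A small domesticated feline.")
graph.add_term("Feline", "Another word for a cat.")
graph.add_term("Animal", "A living organism that is not a plant.")
graph.link("IsA", "Cat", "Animal")      # "Cat" IsA "Animal"
graph.link("Synonym", "Cat", "Feline")

print(graph.linked("Cat", "Synonym"))
```

Treating the relationship type as data, rather than as a separate class per type, keeps the sketch short; a real repository would also record properties on each relationship.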
A data class provides the specification of a data type that is important to the subject area. Date, Social Security Number and Credit Card Number are examples of data classes.
The data class specification defines how to identify data fields of its type by inspecting the data values stored in them. The specification is independent of a particular technology, which is why data classes are often described as logical data types. The specification may include preferred implementation types for different technologies using Implementation Snippets.
Data classes are used during metadata discovery (see below) to identify the types of data in the discovered data fields. This is an important step in understanding the meaning and business value of the data fields. They can also be used in quality rules to validate that data values match the prescribed data class.
Data classes can be linked together in part-of and is-a hierarchies to create a logical type system for a subject area. A glossary term can be linked to a data class via an ImplementedBy relationship to identify the preferred data class to use when implementing a data field with the meaning described in the glossary term. A data class can be linked to a glossary term that describes the meaning of the data class via a SemanticAssignment relationship.
Figure 2: Data classes for describing the logical data types and implementation options
- See Model 0540 in the Open Metadata Types to understand how data classes are represented on open metadata.
- See Model 0737 in the Open Metadata Types to understand the ImplementedBy relationship.
- See Model 0370 in the Open Metadata Types to understand the SemanticAssignment relationship.
- See Model 0504 in the Open Metadata Types to understand ImplementationSnippets.
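To illustrate how a data class specification might identify a data field by inspecting its values, here is a minimal sketch. It assumes the specification is a regular expression plus a match threshold; both the `pattern` and `threshold` fields are hypothetical simplifications and do not reflect the properties of the open metadata DataClass type.

```python
import re
from dataclasses import dataclass


@dataclass
class DataClass:
    name: str
    pattern: str            # assumption: valid values are described by a regex
    threshold: float = 0.9  # assumption: fraction of values that must match

    def match_ratio(self, values):
        """Fraction of the non-null values that match the pattern."""
        present = [v for v in values if v is not None]
        if not present:
            return 0.0
        hits = sum(1 for v in present if re.fullmatch(self.pattern, str(v)))
        return hits / len(present)

    def matches(self, values):
        """True if enough values match for the field to be of this data class."""
        return self.match_ratio(values) >= self.threshold


date_class = DataClass("Date", r"\d{4}-\d{2}-\d{2}")
ssn_class = DataClass("Social Security Number", r"\d{3}-\d{2}-\d{4}")

# Sample values from a single data field; None represents a missing value.
column = ["2021-03-04", "2020-12-31", "1999-01-01", None]

for data_class in (date_class, ssn_class):
    print(data_class.name, data_class.matches(column))
```

The same `matches` check could serve as a simple quality rule: a field assigned to a data class fails the rule when its match ratio falls below the threshold.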
Consuming the glossary in design models
Design models (such as Concept models, E-R Models, UML models) and ontologies capture similar concepts to those described in the glossary. It helps if their definitions are consistent. When a new glossary is being built, existing models and ontologies can be used to seed the glossary. The models/ontologies themselves can be loaded in open metadata and the model elements linked to their corresponding glossary terms. Then new versions of the data models/ontologies can be generated from open metadata.
Figure 3: Linking to models
Any linked data classes provide details of language types to use when generating compliant artifacts from the models.
- See Model 0571 in the Open Metadata Types to understand how concept models are represented on open metadata.
- See Model 0565 in the Open Metadata Types to understand how design models are represented on open metadata.
Schemas document the structure of data, whether it is stored or moving through APIs, events and data feeds. A schema is made up of a linked subgraph of schema elements. A schema begins with a schema element called a schema type. This may be a single primitive field, a set of values, an array of values, a map between two sets of values or a nested structure. The nested structure is the most common. In this case the schema type has a list of schema attributes (another type of schema element) that describe the fields in the structure. Each of these schema attributes has its own schema type located in its TypeEmbeddedAttribute classification.
Figure 4 shows a simple schema structure.
Figure 4: Schemas for documenting the structure of data
- See Schemas to understand how different types of schema are represented.
- See Model 0501 in the Open Metadata Types to see the formal definition of the different types of schema elements.
- See Model 0505 in the Open Metadata Types to understand schema attributes and the TypeEmbeddedAttribute classification.
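The nested structure described above, where a schema type lists schema attributes and each attribute carries its own schema type, can be sketched as a small recursive data structure. The class and field names below are illustrative assumptions; in open metadata the attribute's type is held in the TypeEmbeddedAttribute classification rather than in a plain field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaType:
    type_name: str  # for example "string" or "struct" (hypothetical names)
    attributes: List["SchemaAttribute"] = field(default_factory=list)


@dataclass
class SchemaAttribute:
    name: str
    # Stands in for the schema type held in TypeEmbeddedAttribute.
    schema_type: SchemaType


def field_paths(schema_type, prefix=""):
    """List every attribute in the schema as a dotted path, depth first."""
    paths = []
    for attribute in schema_type.attributes:
        path = prefix + attribute.name
        paths.append(path)
        paths.extend(field_paths(attribute.schema_type, prefix=path + "."))
    return paths


customer = SchemaType("struct", [
    SchemaAttribute("name", SchemaType("string")),
    SchemaAttribute("address", SchemaType("struct", [
        SchemaAttribute("city", SchemaType("string")),
        SchemaAttribute("postcode", SchemaType("string")),
    ])),
])

print(field_paths(customer))
```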
Schemas and assets
An asset describes a valuable resource (typically digital). Such resources include databases, data files, documents, APIs, data feeds, and applications. A digital resource can depend on other digital resources to fulfill its implementation. This dependency is also captured in open metadata with relationships such as DataContentForDataSet. These relationships help to highlight inconsistencies in the assets' linkage to the subject area's materials, which may be due to errors in either the metadata or the implementation/deployment/use of the associated digital resources.
Figure 5: Dependencies between digital resources are reflected in open metadata by relationships between assets
Since schema types describe the structure of data, they can be attached to assets using the AssetSchemaType relationship to indicate that this asset's data is organized as described by the schema. Schemas are important because they show how individual data values are organized. Governance is often concerned with the meaning, correctness and use of individual data values since they are used to influence the decisions made within the organization. Therefore, even though the content of a schema bulks up the size and complexity of the metadata, it is necessary to capture this detail.
Figure 6: Schemas describe the structure of the data stored in a digital resource (described by the asset in the catalog)
A schema is typically attached to only one asset since it is classified and linked to other elements assuming that the asset/schema combination describes the particular collection of data stored in the associated digital resource. However, there is still a role for the subject area materials to provide preferred schema structures for software developers, data engineers and data scientists to use when they create implementations of new digital resources.
When a new asset is created, the schema definition in the subject area can be used as a template to define the schema for the asset (see figure 7). Then:
- The digital resource can be generated from the asset/schema, or
- Metadata discovery (see below) can be used to validate that the schema defined in the digital resource matches the schema associated with the asset.
Figure 7: Using a schema from a subject area as a template for a new asset
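The template-then-validate flow can be sketched as follows. This is a minimal illustration that assumes a schema can be flattened to a mapping of field name to type name; the function names and the sample fields are hypothetical.

```python
import copy

# Hypothetical subject area template: field name -> type name.
template_schema = {
    "customer_id": "string",
    "date_of_birth": "date",
    "balance": "decimal",
}


def schema_from_template(template):
    """Give the new asset its own copy so later edits do not alter the template."""
    return copy.deepcopy(template)


def schema_differences(documented, discovered):
    """Compare the schema attached to the asset with the one found in the resource."""
    documented_names = set(documented)
    discovered_names = set(discovered)
    shared = documented_names & discovered_names
    return {
        "missing": sorted(documented_names - discovered_names),
        "unexpected": sorted(discovered_names - documented_names),
        "type_mismatch": sorted(n for n in shared if documented[n] != discovered[n]),
    }


asset_schema = schema_from_template(template_schema)

# What a discovery run might report for the deployed resource (made-up values).
discovered_schema = {
    "customer_id": "string",
    "date_of_birth": "string",
    "notes": "string",
}

print(schema_differences(asset_schema, discovered_schema))
```

Copying the template, rather than sharing it, matches the point made above that a schema is typically attached to only one asset.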
There is also an opportunity to share schemas between assets using an ExternalSchemaType. This option has the advantage that there is only one copy of the schema. However, it is only used when all classifications and relationships attached to the shared part of the schema apply to all data in the associated digital resources.
Figure 8: Using an external schema type to share a common schema
- See Model 0503 in the Open Metadata Types to understand the AssetSchemaType relationship.
- See Model 0501 in the Open Metadata Types to understand how schemas are represented on open metadata.
- See Model 0505 in the Open Metadata Types to understand how schema attributes are represented on open metadata.
Reference value assignments
The materials for a subject area may include sets of values used to label metadata elements to show that they are in a particular state or have a specific characteristic that is important in the subject area. For example, a subject area about people may include the notion of an Adult and a Child (or Minor). The age of majority is different in each country. A simple label on a Person profile that indicates that the person is an adult keeps the knowledge of how to determine adulthood within the processes that maintain the person profiles, while the resulting reference data value can be used in multiple places.
These labels are called reference data values and are managed in Valid Value Sets. The association between a reference data value and a metadata element is ReferenceValueAssignment.
Figure 9: Labelling using reference data values
- Reference Data Management describes different uses of valid value sets.
Figure 10 shows three types of assignments between the metadata associated with a digital resource (technical metadata) and the subject area materials:
- SemanticAssignment - Semantic assignments indicate that the data stored in the associated data field has the meaning described in the glossary term.
- ValidValuesAssignment - Valid value sets define a list of valid values. They can be used to define the values that are allowed to be stored in a particular data field if its content can be described as a discrete set.
- DataClassAssignment - A data class assignment means that the data in the data field conforms to the type described in the data class.
When these relationships are used in combination, there should be consistency between the assignments to the data field and those to the associated glossary term.
Figure 10: Using assignment relationships to create a rich description of the data stored in a schema attribute (data field)
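The consistency expectation between a data field's assignments and those of its glossary term could be checked along these lines. This is a sketch under stated assumptions: elements are plain dictionaries keyed by relationship name, the term's preferred data class comes from its ImplementedBy link, and the term may carry its own ValidValuesAssignment. All sample names are hypothetical.

```python
# Assumption: which relationship on the glossary term corresponds to each
# assignment on the data field.
FIELD_TO_TERM_RELATIONSHIP = {
    "DataClassAssignment": "ImplementedBy",
    "ValidValuesAssignment": "ValidValuesAssignment",
}


def consistency_issues(data_field, glossary_terms):
    """Report assignments on the field that disagree with its glossary term."""
    issues = []
    term_name = data_field.get("SemanticAssignment")
    term = glossary_terms.get(term_name)
    if term is None:
        return issues  # no semantic assignment, so nothing to compare
    for field_rel, term_rel in FIELD_TO_TERM_RELATIONSHIP.items():
        on_field = data_field.get(field_rel)
        on_term = term.get(term_rel)
        if on_field and on_term and on_field != on_term:
            issues.append(
                f"{field_rel}: field uses {on_field!r} "
                f"but term {term_name!r} expects {on_term!r}"
            )
    return issues


glossary_terms = {
    "Country Code": {
        "ImplementedBy": "Two Letter Country Code",
        "ValidValuesAssignment": "Country Code List",
    },
}

consistent_field = {
    "name": "country",
    "SemanticAssignment": "Country Code",
    "DataClassAssignment": "Two Letter Country Code",
}
inconsistent_field = {
    "name": "ctry",
    "SemanticAssignment": "Country Code",
    "DataClassAssignment": "Free Text",
}

for data_field in (consistent_field, inconsistent_field):
    print(data_field["name"], consistency_issues(data_field, glossary_terms))
```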
Governance action classifications
Governance action classifications can be attached to most types of metadata elements. They can also be assigned to glossary terms to indicate that the classification applies to all data values associated with the glossary term. The governance action classifications have attributes that identify a particular level that applies to the attached element. The definition for each level can be linked to appropriate Governance Definitions that define how digital resources classified at that level should be governed. Governance Classification Levels are linked to Governance Definitions using the GovernedBy relationship.
Figure 11: Classifying glossary terms to identify the governance definitions that apply to all data values associated with the glossary term
- Setting up your Governance Program describes how different types of governance metadata are used.
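The chain from data field to glossary term to classification level to governance definitions can be sketched as a pair of lookups. The classification name, level number, definition titles and field names below are made-up examples, and the dictionaries are a stand-in for the relationships held in open metadata.

```python
# GovernedBy: (classification name, level) -> governance definitions.
governed_by = {
    ("Confidentiality", 3): ["Encrypt at rest", "Restrict access to named roles"],
}

# Governance action classifications attached to glossary terms.
term_classifications = {
    "Home Address": ("Confidentiality", 3),
}

# SemanticAssignment: data field -> glossary term.
semantic_assignments = {
    "customer.address": "Home Address",
    "customer.id": "Customer Identifier",
}


def governance_definitions_for(field_name):
    """Find the definitions that apply to a field through its glossary term."""
    term = semantic_assignments.get(field_name)
    classification = term_classifications.get(term)
    return governed_by.get(classification, [])


print(governance_definitions_for("customer.address"))
print(governance_definitions_for("customer.id"))
```

Classifying the term once means every data field semantically assigned to it picks up the same governance definitions without being classified individually.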
Connectors and connections
The digital resources associated with the assets in the catalog are accessed through connectors. A Connector is a client library that applications use to access the data/function held by the digital resource. Typically, there is a specialized connector for each type of Asset/technology.
Sometimes there are multiple connectors to access a specific type of asset, each offering a different interface for the application to use.
Instances of connectors are created using the Connector Broker. The connector broker creates the connector instance using the information stored in a Connection. Connections can be created by the application or retrieved from the open metadata stores.
A connection is stored in the open metadata stores and linked to the appropriate asset for the digital resource.
Figure 12: Connection information needed to access the data held by an asset
- See the connector catalog to understand how connectors are used in Egeria.
- See Model 0201 in the Open Metadata Types to understand how connections are represented.
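The relationship between a connection, the connector broker and a connector can be illustrated with a simplified Python sketch. It is not Egeria's connector API: the class names, the `register` and `get_connector` methods, and the connection fields are all assumptions chosen to show the pattern of creating a connector from connection information.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    connector_type: str  # identifies which connector implementation to use
    endpoint: str        # where the digital resource is located
    properties: dict = field(default_factory=dict)


class CsvFileConnector:
    """Hypothetical connector for one type of asset."""

    def __init__(self, connection):
        self.connection = connection

    def describe(self):
        return f"CSV file at {self.connection.endpoint}"


class ConnectorBroker:
    """Creates connector instances from the information in a connection."""

    def __init__(self):
        self._factories = {}

    def register(self, connector_type, factory):
        self._factories[connector_type] = factory

    def get_connector(self, connection):
        factory = self._factories.get(connection.connector_type)
        if factory is None:
            raise ValueError(f"no connector for type {connection.connector_type!r}")
        return factory(connection)


broker = ConnectorBroker()
broker.register("csv-file", CsvFileConnector)

# This connection is built by the application; it could equally have been
# retrieved from a metadata store and linked to the asset.
connection = Connection("csv-file", "file:///data/customers.csv")
connector = broker.get_connector(connection)
print(connector.describe())
```

Because the application only depends on the broker and the connection, a different connector can be substituted by changing the connection rather than the application code.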
An open discovery service is a process that runs a pipeline of analytics to describe the data content of a resource. It uses statistical analysis, reference data and other techniques to determine the data class and range of values stored, potentially what the data means and its level of quality. The result of the analysis is stored in metadata objects called annotations.
Part of the discovery process is called Schema Extraction. This is where the discovery service inspects the schema in the digital resource and builds a matching structure of DataField elements (see Model 0615 in the Open Metadata Types) in open metadata. As it goes on to analyse the content of a particular data field in the resource, it can add its results to an annotation that is attached to the DataField element. It can also maintain a link between the DataField element and its corresponding SchemaAttribute element if the schema has already been attached. Through this process it is possible to detect any anomalies between the documented schema and what is actually implemented.
Part of the analysis of a single data field may be to identify its data class (or a list of possible data classes if the analysis is not conclusive). The data class in turn may identify a list of possible glossary terms that could apply to the data field.
For example, there may be a data class called address. A discovery service may detect that an address is stored in a digital resource. The data class may be linked to glossary terms for Home Address, Work Location, Delivery Address, ... The discovery service may not be able to determine which glossary term is appropriate in order to establish the SemanticAssignment relationship, but providing a steward with a short list is a considerable saving.
Figure 13: Output from a metadata discovery service
- See Discovery and Stewardship to understand how metadata discovery works.
- See Area 6 in the Open Metadata Types to understand how discovery metadata is represented.
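The address example above, where a data class leads to a short list of glossary terms for a steward, can be sketched as follows. The annotation is represented as a plain dictionary with hypothetical keys; it does not reflect the structure of the open metadata annotation types.

```python
# Glossary terms linked to each data class (from the example above).
terms_by_data_class = {
    "Address": ["Home Address", "Work Location", "Delivery Address"],
}


def build_annotation(data_field, candidate_data_classes, terms_by_data_class):
    """Record candidate data classes and the glossary terms they suggest."""
    candidate_terms = []
    for data_class in candidate_data_classes:
        for term in terms_by_data_class.get(data_class, []):
            if term not in candidate_terms:  # keep order, drop duplicates
                candidate_terms.append(term)
    return {
        "dataField": data_field,
        "candidateDataClasses": list(candidate_data_classes),
        "candidateGlossaryTerms": candidate_terms,
        # More than one candidate means a steward still needs to choose.
        "needsSteward": len(candidate_terms) > 1,
    }


annotation = build_annotation("ADDR_LINE", ["Address"], terms_by_data_class)
print(annotation["candidateGlossaryTerms"])
```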
Bringing it all together
Figure 14 summarizes how the subject area materials create a rich picture around the resources used by your organization. As they link to the technical metadata, they complement and reinforce the understanding of your data. In a real-world deployment, the aim is to automate as much of this linkage as possible. This is made considerably easier if the implementation landscape is reasonably consistent. However, where the stored data values do not match the expected types defined in the schema, the metadata model reveals the inconsistencies and often requires human intervention to ensure the links are correct.
Figure 14: Linking the metadata together