Stable

This component is complete and can be used. The interfaces will be supported until the function is removed from the project via the deprecation process. There will be ongoing extensions to this function, but they will be made in a way that preserves backward compatibility as far as possible. If there is a need to break backward compatibility, this will be discussed and reviewed in the community, with a documented timeline.

Apache Atlas Survey Action Service

Connector summary

Overview

Apache Atlas is a metadata catalog originally designed for the Hadoop ecosystem. It offers integration services called Hooks and Bridges to capture the schemas and data sets of data platforms such as Apache Hive, Apache HBase and Apache Hadoop Distributed File System (HDFS) along with the processes for creating and maintaining data sets on these platforms. The metadata descriptions of these data sets and processes are linked together using lineage relationships, allowing an understanding of how data is flowing through a Hadoop deployment. Apache Atlas also supports glossaries and a tagging system that can be used both in searches and to control access to data through Apache Ranger (using the TagSync integration).

In recent years, Apache Atlas has been embedded in popular data catalogs such as Microsoft Purview and Atlan, increasing the interest in being able to integrate with this metadata catalog.

The Apache Atlas Survey Action Service builds a survey report that describes the types defined in the Apache Atlas server and the numbers of instances that are found for these types.

This survey action service is described in a survey action engine. The engine is configured to run in an Engine Host running the Survey Action OMES service. The Survey Action OMES is configured with the network address of a Metadata Access Server running the Asset Owner OMAS service.

Figure 1: Operation of the Apache Atlas Survey Action Service

Once installed in the engine host, the survey action service can be called either by:

Each time the survey action service starts, the Survey Action OMES creates a new Survey Report via a call to the Asset Owner OMAS. As the survey action service runs, it retrieves metadata and stores annotations via its survey context. The Survey Action OMES routes these requests to the Asset Owner OMAS, which has access to the open metadata repositories.

Survey action service function

The Apache Atlas Survey Action Service provides a summary of the contents of the Apache Atlas repository found at the time it was run.

It has three analysis steps:

  1. Measure Resource - Retrieves the overall metrics from the Apache Atlas server. These are stored in a ResourceMeasureAnnotation entity linked to the SurveyReport entity generated for each run of the Apache Atlas Survey Action Service.
  2. Schema Extraction - Retrieves the types from Apache Atlas and organizes them in a linked graph of SchemaAttribute entities. All the graph schema attributes are linked to a GraphSchemaType entity, which is in turn linked to the SurveyReport entity.
  3. Profile Data - Retrieves each entity in the Apache Atlas server and adds the following counts to ResourceProfileAnnotation entities linked from the appropriate data field entities:

    • The number of instances of each entity type.
    • The number of classifications of a particular type attached to each type of entity.
    • The number of relationships of a particular type attached to each type of entity.
    • The number of each type of label attached to each type of entity.
    • The number of business metadata properties of a particular type attached to each type of entity.

Each analysis step builds on the work of its predecessor. The processing requirements increase with each step, so you can choose to stop the processing after any step using the finalAnalysisStep property. This can be set as a configuration property in the connection object for this survey action service, or as a request parameter passed when the Apache Atlas Survey Action Service is run.

The default value for finalAnalysisStep is Profile Data.
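
The step selection described above can be sketched as follows. This is a minimal illustration rather than the service's actual implementation: the function names are hypothetical, and the precedence rule (a request parameter overriding the configuration property) is an assumption that the text above does not confirm.

```python
# The three analysis steps, in the order listed in this section.
ANALYSIS_STEPS = ["Measure Resource", "Schema Extraction", "Profile Data"]
DEFAULT_FINAL_STEP = "Profile Data"


def resolve_final_step(configuration_properties=None, request_parameters=None):
    """Pick the final analysis step for one run.

    Assumption: a request parameter takes precedence over the
    configuration property; the documentation does not state the order.
    """
    configuration_properties = configuration_properties or {}
    request_parameters = request_parameters or {}
    step = request_parameters.get(
        "finalAnalysisStep",
        configuration_properties.get("finalAnalysisStep", DEFAULT_FINAL_STEP),
    )
    if step not in ANALYSIS_STEPS:
        raise ValueError(f"Unknown analysis step: {step!r}")
    return step


def steps_to_run(final_step):
    """Return every step up to and including final_step."""
    return ANALYSIS_STEPS[: ANALYSIS_STEPS.index(final_step) + 1]
```

For example, steps_to_run("Schema Extraction") yields the first two steps only, which avoids the cost of retrieving every entity in the Profile Data step.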

Metadata Setup

Prior to running the Apache Atlas Survey Action Service, an asset and connection must be created for the Apache Atlas server that is to be analysed.

Typically, the asset for Apache Atlas is of type SoftwareServer with a deployedImplementationType set to Apache Atlas Server. The networkAddress in the connection's endpoint is the hostname and port of the Apache Atlas server. For example, http://localhost:21000.
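
As a rough illustration of the values involved, the sketch below holds the asset and endpoint properties in a plain dictionary and checks that the networkAddress carries both a hostname and a port. The nested dictionary layout and the function name are assumptions made for this example; in the open metadata repository these are separate linked elements, not a single document.

```python
from urllib.parse import urlparse

# Illustrative only: property names come from the text above,
# the nesting is a convenience for this sketch.
atlas_asset = {
    "typeName": "SoftwareServer",
    "deployedImplementationType": "Apache Atlas Server",
    "connection": {"endpoint": {"networkAddress": "http://localhost:21000"}},
}


def endpoint_host_and_port(asset):
    """Extract (hostname, port) from the asset's connection endpoint."""
    address = asset["connection"]["endpoint"]["networkAddress"]
    parsed = urlparse(address)
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"networkAddress needs a host and port: {address!r}")
    return parsed.hostname, parsed.port
```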

Figure 2: Metadata added to the open metadata repository

Survey Reports

Each time the Apache Atlas Survey Action Service runs, a new survey report is created.

Figure 3: Survey reports linked from Apache Atlas's asset

Figure 4 shows the structure of the survey report. The annotations are labelled with the analysis steps that create them.

Figure 4: Analysis stages performed by the survey action service

Resource Measurements Annotation

The resource measurements annotation is created in the Measure Resource analysis step. It sets up the following properties in the dataSourceProperties map:

  • entityInstanceCount - number of active entity instances
  • entityInstanceCount:typeName - number of active entity instances of this type
  • entityWithSubtypesInstanceCount:typeName - number of active entity instances of this type and all subtypes.
  • classificationCount - number of classifications added to entity instances.
  • typeCount - number of defined types (and their versions).
  • typeUnusedCount - number of types with no instances.

This analysis is achieved using two REST API calls and so has minimal impact on the Apache Atlas Server.
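
To make the key layout concrete, here is a minimal sketch that assembles a map with the property names listed above from plain Python dictionaries. It does not call Apache Atlas; the function names and input shapes are hypothetical. As a simplification, typeCount and typeUnusedCount are computed over the supplied entity types only, which may be narrower than what the service reports.

```python
def all_ancestors(type_name, supertypes):
    """Collect every direct and indirect supertype of type_name."""
    found = set()
    pending = list(supertypes.get(type_name, []))
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(supertypes.get(current, []))
    return found


def build_data_source_properties(type_counts, supertypes, classification_count):
    """type_counts: entity type name -> active instance count (0 if unused).
    supertypes: entity type name -> list of direct supertype names.
    """
    # Each type's instances also count towards all of its supertypes.
    with_subtypes = {name: 0 for name in type_counts}
    for name, count in type_counts.items():
        for target in {name} | all_ancestors(name, supertypes):
            with_subtypes[target] = with_subtypes.get(target, 0) + count

    properties = {
        "entityInstanceCount": sum(type_counts.values()),
        "classificationCount": classification_count,
        "typeCount": len(type_counts),  # simplification: entity types only
        "typeUnusedCount": sum(1 for c in type_counts.values() if c == 0),
    }
    for name, count in type_counts.items():
        properties[f"entityInstanceCount:{name}"] = count
    for name, count in with_subtypes.items():
        properties[f"entityWithSubtypesInstanceCount:{name}"] = count
    return properties
```

The type names used when calling this function are up to the caller; the sketch only shows how the per-type and with-subtypes counts relate to one another.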

Schema Analysis Annotation

The schema analysis annotation is created in the Schema Extraction analysis step. It identifies the name/type of the schema created.

Apache Atlas Types as a schema

In the Schema Extraction analysis step, the Apache Atlas types extracted from the Apache Atlas server are used to create a schema that describes the graph structure of the metadata found in the Apache Atlas server:

  • A GraphVertex entity is created for each Apache Atlas entity type, business metadata type and classification type.
  • A GraphEdge entity is created for each Apache Atlas relationship type, and each permitted use of a classification type by an entity type.

GraphVertex and GraphEdge are types of SchemaAttribute. The graph schema attributes are connected together using GraphEdgeLink relationships that connect each GraphEdge schema attribute to two GraphVertex schema attributes.

  • The relationship type graph edges are each attached to two entity type graph vertices: one for the type of entity that can be attached at end 1 of the relationship; the other for the type of entity that can be attached at end 2.
  • The classification type permitted use graph edges are linked to each of the associated entity type graph vertices.

All the graph vertices are linked to a GraphSchemaType entity using the AttributeForSchema relationship. The GraphSchemaType entity is linked to the asset for the Apache Atlas server using the AssetSchemaType relationship.

Figure 5: Linkage of graph schema elements based on Apache Atlas type.
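
The mapping from Apache Atlas types to graph schema attributes can be sketched as below. The dataclasses only borrow the GraphVertex and GraphEdge names from the text; their fields, the edge naming scheme for classification uses, and the choice of which vertex sits at which end of a classification use edge are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphVertex:
    name: str
    category: str  # "entity", "business metadata" or "classification"


@dataclass(frozen=True)
class GraphEdge:
    name: str
    category: str  # "relationship" or "classification use"
    end1: str      # name of the vertex at end 1
    end2: str      # name of the vertex at end 2


def build_graph_schema(entity_types, business_metadata_types,
                       classification_uses, relationship_ends):
    """classification_uses: classification type -> permitted entity types.
    relationship_ends: relationship type -> (end 1 type, end 2 type).
    """
    # One vertex per entity, business metadata and classification type.
    vertices = [GraphVertex(n, "entity") for n in entity_types]
    vertices += [GraphVertex(n, "business metadata") for n in business_metadata_types]
    vertices += [GraphVertex(n, "classification") for n in classification_uses]

    # One edge per relationship type, linked to its two entity types.
    edges = [GraphEdge(n, "relationship", e1, e2)
             for n, (e1, e2) in relationship_ends.items()]

    # One edge per permitted use of a classification by an entity type.
    for classification, permitted in classification_uses.items():
        for entity in permitted:
            # Hypothetical naming scheme for the edge.
            edges.append(GraphEdge(f"{classification}:{entity}",
                                   "classification use", classification, entity))
    return vertices, edges
```

Every edge produced this way refers to exactly two vertices, which mirrors the GraphEdgeLink relationships described above.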

Resource Profile Annotation

This survey action service attaches multiple resource profile annotations to each graph schema attribute, depending on its category (entity, relationship, classification or business metadata).

Figure 6: Details of the resource profile annotations attached to each type of data field

It sets up the following fields in each resource profile annotation:

  • analysisStep - this is always set to Profile Data.
  • annotationType - this identifies the type of values that the annotation contains.
  • explanation - this provides more information about the annotation type.
  • valueCount - this is a map of typeName to count. For example, if this annotation was counting the classifications attached to the DataSet entity type, then the map would include an entry for each type of classification attached to this type of entity and a count of how many times it is used.
  • additionalProperties - contains the count of instances for the particular type that the data field represents.
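
As an illustration of the valueCount map, the sketch below counts the classifications attached to entities of one type. The input shape (dictionaries with typeName and classifications keys), the function name, and the instanceCount key inside additionalProperties are hypothetical; only the field names annotationType, explanation, valueCount and additionalProperties come from the list above.

```python
from collections import Counter


def classification_profile(entity_type, entities):
    """Build one profile for the classifications attached to entity_type.

    entities: iterable of dicts with 'typeName' and 'classifications' keys.
    """
    matching = [e for e in entities if e["typeName"] == entity_type]
    value_count = Counter(
        name for e in matching for name in e.get("classifications", [])
    )
    return {
        "annotationType": "Apache Atlas Attached Classification Types",
        "explanation": "Count of classification types attached to this type of entity.",
        "valueCount": dict(value_count),
        # Hypothetical key name; stored as a string on the assumption
        # that additionalProperties is a string-to-string map.
        "additionalProperties": {"instanceCount": str(len(matching))},
    }
```

The same counting pattern would apply to labels, business metadata and relationship ends, with a different annotationType for each.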

The table summarizes the values in each of the resource profile annotations depending on the category of the data field it is attached to.

Atlas Type Category | Annotation Type | Explanation | Value Count | Instance count in AdditionalProperties
Entity | Apache Atlas Attached Classification Types | Count of classification types attached to this type of entity. | Classification Name to Count | Entity instances for this type
Entity | Apache Atlas End 1 Attached Relationship Types | Count of different types of relationships attached to this type of entity at End 1. | Relationship Name to Count | Entity instances for this type
Entity | Apache Atlas End 2 Attached Relationship Types | Count of different types of relationships attached to this type of entity at End 2. | Relationship Name to Count | Entity instances for this type
Entity | Apache Atlas Attached Labels | Count of the different labels attached to this type of entity. | Label Name to Count | Entity instances for this type
Entity | Apache Atlas Attached Business Metadata Types | Count of the different types of business metadata properties attached to this type of entity. | Business Metadata Type Name to Count | Entity instances for this type
Classification | Apache Atlas Attached Entity Types | Count of entities where this classification is attached, organized by entity type. | Entity Type Name to Count | Classification instances for this type
Business Metadata | Apache Atlas Attached Entity Types | Count of entities where this type of business metadata properties are attached, organized by entity type. | Entity Type Name to Count | Business metadata instances for this type
Relationship | Apache Atlas Attached End 1 Entity Types | Count of entity types attached at end 1 of this type of relationship. | Entity Type Name to Count | Relationship instances for this type
Relationship | Apache Atlas Attached End 2 Entity Types | Count of entity types attached at end 2 of this type of relationship. | Entity Type Name to Count | Relationship instances for this type
Relationship | Apache Atlas Attached Entity Type Pairs | Count of entity type pairs for this type of relationship. | Entity Type Name to Count | Relationship instances for this type
