0220 Files and Folders¶
A metadata catalog typically contains information about the data files that can be processed and their location. Files and folders describe physical files and how they are organized on the file system.
DataFile¶
DataFile catalogs a physical file. It inherits from DataStore to declare that it is a physical artifact. It adds the following attributes to those inherited from DataStore:
- pathName - this is the fully qualified path name for the file.
- fileName - this is the file name for the file including file extension.
- fileType - this is the name of the file type. The values for this attribute can be managed in a file type valid value set.
- fileExtension - this is the actual file extension for this file. The values for this attribute can be managed in a valid value set.
There are subtypes for DataFile that identify the format of the file:
- CSVFile contains comma-separated values.
- AvroFile is organized according to the Apache Avro specification.
- JSONFile is encoded using JavaScript Object Notation (JSON).
- ParquetFile is encoded using the Apache Parquet format.
- SpreadsheetFile is a file containing tabular data and formulas.
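The attributes and subtypes above can be sketched as follows. This is a minimal illustration, not Egeria's API: the `DataFile` dataclass, the `KNOWN_FORMATS` lookup and the `describe_file` helper are hypothetical names. The fileType values and the choice to store fileExtension without its leading dot are also assumptions; in practice the values would come from the valid value sets mentioned above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class DataFile:
    """Illustrative stand-in for a DataFile entity; not an Egeria class."""
    pathName: str       # fully qualified path name of the file
    fileName: str       # file name including the file extension
    fileType: str       # name of the file type
    fileExtension: str  # actual file extension of this file


# Hypothetical lookup: file extension -> (subtype name, fileType value).
KNOWN_FORMATS = {
    "csv": ("CSVFile", "CSV"),
    "avro": ("AvroFile", "Avro"),
    "json": ("JSONFile", "JSON"),
    "parquet": ("ParquetFile", "Parquet"),
    "xlsx": ("SpreadsheetFile", "Spreadsheet"),
}


def describe_file(path_name):
    """Derive the DataFile attributes and a candidate subtype from a path name."""
    path = PurePosixPath(path_name)
    # Assumption: the extension is stored in lower case without the leading dot.
    extension = path.suffix.lstrip(".").lower()
    # Fall back to the generic DataFile type when the extension is not recognized.
    subtype, file_type = KNOWN_FORMATS.get(extension, ("DataFile", "unknown"))
    return subtype, DataFile(path_name, path.name, file_type, extension)
```

For example, `describe_file("/data/sales/orders.csv")` would yield the subtype name `CSVFile` together with a `DataFile` whose fileName is `orders.csv`.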
FileFolder¶
A FileFolder entity represents a folder or directory used to group related files together. It adds the pathName property which contains the fully qualified path name of the folder.
FolderHierarchy¶
The FolderHierarchy relationship links FileFolder elements together to form a hierarchical organization.
NestedFile¶
The NestedFile relationship links a file to a folder.
LinkedFile¶
A file can also have a symbolic link (LinkedFile relationship) to an element to show that it logically belongs with the other content in that element.
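As a rough sketch of how FolderHierarchy and NestedFile fit together, the hypothetical helper below splits a file's path name into the folders that contain it and pairs them up. The function name, the tuple layout (containing folder first) and the treatment of the root directory as a folder are assumptions made for illustration. LinkedFile is left out because it depends on a logical grouping that cannot be derived from the path.

```python
from pathlib import PurePosixPath


def relationships_for(path_name):
    """Return (folders, folder_hierarchy, nested_file) for one file path name."""
    path = PurePosixPath(path_name)
    # path.parents runs from the closest folder up to the root,
    # so reverse it to get the folders in top-down order.
    folders = [str(p) for p in list(path.parents)[::-1]]
    # FolderHierarchy: each folder is linked to the folder directly beneath it.
    folder_hierarchy = list(zip(folders, folders[1:]))
    # NestedFile: the file is linked to the folder that directly contains it.
    nested_file = (folders[-1], str(path))
    return folders, folder_hierarchy, nested_file
```

For `/data/sales/orders.csv` this gives the folders `/`, `/data` and `/data/sales`, two FolderHierarchy pairs, and one NestedFile pair linking `/data/sales` to the file.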
DataFolder¶
DataFolder is a special case of FileFolder for cataloguing directories that contain a collection of data. The files and nested folders within it collectively make up the data content. They are not individually catalogued.
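The difference between the two cataloguing styles can be illustrated with a small sketch. The `entities_to_catalog` function and its `as_data_folder` flag are hypothetical; deciding which directories count as a single collection of data is left to whoever catalogs them.

```python
def entities_to_catalog(folder_path, file_paths, as_data_folder):
    """List the (type name, path name) pairs that would be cataloged."""
    if as_data_folder:
        # The directory is cataloged as one unit; its contents are not listed.
        return [("DataFolder", folder_path)]
    # Otherwise the folder and each file in it get their own entry.
    return [("FileFolder", folder_path)] + [("DataFile", p) for p in file_paths]
```

With `as_data_folder=True` the result is a single DataFolder entry regardless of how many files the directory holds.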
Hierarchical file structures¶
The diagram below illustrates the structure of a file system.
The FileSystem is typically a Software Capability. The root folders (of type FileFolder) are connected to it using the ServerAssetUse relationship. Beneath the root folders are further FileFolder entities, with DataFile entities nested within them.
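The structure described above can be approximated with a small sketch that renders a set of relationships as an indented tree. The entity labels, the `local-fs` name and the `render_tree` function are invented for illustration; only the type and relationship names come from this page.

```python
# Each relationship is (relationship type, parent label, child label).
relationships = [
    ("ServerAssetUse", "FileSystem: local-fs", "FileFolder: /data"),
    ("FolderHierarchy", "FileFolder: /data", "FileFolder: /data/sales"),
    ("NestedFile", "FileFolder: /data/sales", "DataFile: /data/sales/orders.csv"),
]


def render_tree(relationships, node, depth=0, label=""):
    """Return the lines of an indented tree rooted at the given node."""
    prefix = "  " * depth + (f"[{label}] " if label else "")
    lines = [prefix + node]
    for rel_type, parent, child in relationships:
        if parent == node:
            # Recurse into each child, showing the relationship that links it.
            lines.extend(render_tree(relationships, child, depth + 1, rel_type))
    return lines


print("\n".join(render_tree(relationships, "FileSystem: local-fs")))
```

The printed tree starts at the file system, steps through the root folder and its nested folder, and ends at the data file, with each line showing the relationship that attaches it to its parent.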