Commit 526286e

Andrea Latella (andr3a87) authored and committed
update DP specification and example.yaml
1 parent 6134bd2 commit 526286e

File tree: 2 files changed (+101, -43 lines)

README.md

Lines changed: 30 additions & 13 deletions
@@ -8,7 +8,6 @@ This repository wants to define an open specification to define data products wi
 With an open specification it will be possible to create services for automatic deployment and interoperable components to build a Data Mesh platform.
 
 
-
 # Data Product structure
 
 The DP is composed of a general section with DP-level information and four sub-structures:
@@ -33,7 +32,9 @@ The fixed structure must be technology agnostic.
 * `Version: [String]` this represents the version of the DP. We consider the DP an independent unit of deployment, so if a breaking change is needed we create a brand new version of the DP; if we introduce a new feature or a patch it is not necessary to create a new version, we just change Y (new feature) or Z (patch). Displayed as X.Y.Z where X is the major version, Y is minor and Z is patch. The major version (X) is also shown in the ID, and those two fields (version and ID) are always aligned with one another (see the sketch after this hunk).
 * * Constraints:
 * * * Major version of the data product is always the same as the major version of the components and it is the same version that is shown in both data product ID and component ID
-* `DataProductOwner: [String]` Data Product Owner, the actual user that receives the notifications about data product
+* `Kind: [String]` type of component. Allowed values: `[dataproduct | outputport | workload | storage | resource]`
+* `DataProductOwner: [String]` Data Product Owner, the id of the actual user who receives notifications about the data product
+* `DataProductOwnerDisplayName: [String]` The human-readable version of `DataProductOwner`
 * `Email: [String]` Point of contact; it could be the owner or a distribution list, but it must be reliable and responsive.
 * `InformationSLA: [String]` Describes the SLA the DP team provides for answering additional information requests about the DP
 * `Status: [String]` This is an enum representing the status of this version of the DP `[Draft|Published|Retired]`
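As a concrete reading of the version/ID alignment constraint, here is a minimal sketch (values are illustrative, not taken from the spec):

```yaml
# The major version X must appear in both the DP ID and the version field.
id: my_domain.my_data_product.1   # trailing 1 = major version X
version: 1.2.3                    # X=1 (major), Y=2 (feature), Z=3 (patch)
```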
@@ -45,22 +46,22 @@ The fixed structure must be technology agnostic.
 The **unique identifier** of a DataProduct is the concatenation of Domain, Name and Version. So we will refer to the `DP_UK` as a string composed in the following way: `$DPDomain.$DPID.$DPVersion` (see the sketch below).
 
 
-
-
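For instance, a hypothetical DP with domain `finance`, name `sales`, and major version `2` would be referenced as:

```yaml
# DP_UK = $DPDomain.$DPID.$DPVersion (illustrative values)
DP_UK: finance.sales.2
```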
 ### Output Ports
 
 * `ID: [String]` the unique identifier of the output port (not modifiable)
 * * Constraints:
 * * * Allowed characters are `[a-zA-Z]` and `[_-]`
 * * * Output port ID is made of `$DPDomain.$DPIdentifier.$DPMajorVersion.$OutputPortIdentifier`
-* `Name: [String]` the name of the DP
+* `Name: [String]` the name of the Output Port
 * `FullyQualifiedName: [String]` Human-readable name that uniquely identifies an entity
-* `ResourceType: [String]` the kind of output port: Files - SQL - Events. This should be extendible with GraphQL or others.
+* `OutputPortType: [String]` the kind of output port: Files - SQL - Events. This should be extensible with GraphQL or others.
 * `Technology: [String]` the underlying technology; it is useful for the consumer to better understand how to consume the output port, and it is also needed for self-serve provisioning of technology-specific resources.
+* `Platform: [String]` This represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field, but it is useful to better understand the platform where the component will be running
 * `Description: [String]` detailed explanation about the function and the meaning of the output port
 * `Version: [String]` Specific version of the output port. Displayed as X.Y.Z where X is the major version of the data product, Y is a minor feature and Z is a patch. The major version (X) is also shown in the component ID, and those two fields (version and ID) are always aligned with one another.
 * * Constraints:
 * * * Major version of the data product is always the same as the major version of the components and it is the same version that is shown in both data product ID and component ID
+* `Kind: [String]` type of component. Allowed values: `[dataproduct | outputport | workload | storage | resource]`
 * `CreationDate: [String]` when this output port has been created
 * `StartDate: [String]` the first business date present in the dataset; leave it null for events, or use some standard semantic like "-7D, -1Y"
 * `ProcessDescription: [String]` the underlying process that contributes to generating the data exposed by this output port
@@ -71,14 +72,17 @@ The **unique identifier** of a DataProduct is the concatenation of Domain, Name
 * `IntervalOfChange: [String]` How often changes in the data are reflected
 * `Timeliness: [String]` The skew between the time a business fact occurs and when it becomes visible in the data
 * `Endpoint: [URL]` this is the API endpoint that self-describes the output port and provides insightful information at runtime about the physical location of the data, the protocol to be used, etc.
-* `Allow: [Array[String]]` It is an array of user/role/group related to the specific technology ( each technology will have an associated authentication system ( Azure AD, AWS IAM, etc ). This field is defining who has access in read-only to this specific output port
+* `Allows: [Array[String]]` an array of LDAP/AD users/roles/groups. This field defines who has read-only access to this specific output port
+* `Owners: [Array[String]]` an array of LDAP/AD users/roles/groups. This field defines who has all permissions on this specific output port
+* `InfrastructureTemplateId: [String]` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several `UseCaseTemplateId`s
+* `UseCaseTemplateId: [String]` the id of the template used in the builder to create the component
 * `DependsOn: [Array[String]]` An output port could depend on other output ports or storage areas; for example, a SQL output port could depend on a raw output port because it is just an external table.
 * * Constraints:
 * * * This array will only contain IDs
 * `Tags: [Array[Yaml]]` Free tags at OutputPort level ( please refer to OpenMetadata https://docs.open-metadata.org/openmetadata/schemas/entities/tagcategory )
 * `SampleData: [Yaml]` Provide sample data for your output port. See the OpenMetadata specification: https://docs.open-metadata.org/openmetadata/schemas/entities/table#tabledata
 * `Schema: [Array[Yaml]]` To describe a schema we propose to leverage the OpenMetadata specification: https://docs.open-metadata.org/openmetadata/schemas/entities/table#column. Each column can have a tag array, and you can choose between simple LabelTags, ClassificationTags or DescriptiveTags. Here is an example of a classification tag: https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json (see the sketch after this hunk)
-* `SemanticLinking: [Yaml]` Here we can express semantic relationships between this output port and other outputports ( also coming from other domains and data products )
+* `SemanticLinKind: [Yaml]` Here we can express semantic relationships between this output port and other output ports ( also coming from other domains and data products )
 * `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific technology or dependent on a standard/policy defined in the federated governance.
 
 
@@ -90,14 +94,17 @@ The **unique identifier** of a DataProduct is the concatenation of Domain, Name
 * * * Workload ID is made of `$DPDomain.$DPIdentifier.$DPMajorVersion.$WorkloadIdentifier`
 * `Name: [String]` the name of the workload
 * `FullyQualifiedName: [String]` Human-readable name that uniquely identifies an entity
-* `Description: [String]` detailed description about the process, its purpose and characteristics
-* `ResourceType: [String]` explain what type of workload is: Ingestion ETL, Streaming, Internal Process, etc.
-* `Type: [String]` This is an enum `[HouseKeeping|DataPipeline]`, `Housekeeping` is for all the workloads that are acting on internal data without any external dependency. `DataPipeline` instead is for workloads that are reading from outputport of other DP or external systems.
-* `Technology: [String]` this is a list of technologies: Airflow, Spark, Scala. It is a free field but it is useful to understand better how it is behaving
 * `Description: [String]` detailed explanation about the purpose of the workload, what sources it reads from, what business logic it applies, etc.
+* `Kind: [String]` type of component. Allowed values: `[dataproduct | outputport | workload | storage | resource]`
+* `WorkloadType: [String]` explains what type of workload it is: Ingestion ETL, Streaming, Internal Process, etc.
+* `ConnectionType: [String]` This is an enum `[HouseKeeping|DataPipeline]`. `Housekeeping` is for all the workloads that act on internal data without any external dependency; `DataPipeline` instead is for workloads that read from output ports of other DPs or from external systems.
+* `Technology: [String]` this is a list of technologies: Airflow, Spark, Scala. It is a free field, but it is useful to better understand how the workload behaves
+* `Platform: [String]` This represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field, but it is useful to better understand the platform where the component will be running
 * `Version: [String]` Specific version of the workload. Displayed as X.Y.Z where X is the major version of the data product, Y is a minor feature and Z is a patch. The major version (X) is also shown in the component ID, and those two fields (version and ID) are always aligned with one another.
 * * Constraints:
 * * * Major version of the data product is always the same as the major version of the components and it is the same version that is shown in both data product ID and component ID
+* `InfrastructureTemplateId: [String]` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several `UseCaseTemplateId`s
+* `UseCaseTemplateId: [String]` the id of the template used in the builder to create the component
 * `Tags: [Array[Yaml]]` Free tags at Workload level ( please refer to OpenMetadata https://docs.open-metadata.org/openmetadata/schemas/entities/tagcategory )
 * `ReadsFrom: [Array[String]]` This is filled only for `DataPipeline` workloads and represents the list of output ports or external systems that the workload reads from. Output Ports are identified with `DP_UK.OutputPort_ID`, while external systems are identified by a string `EX_$systemdescription`. Here we can elaborate a bit more and create a more semantic struct (see the sketch after this hunk).
 * * Constraints:
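As referenced in the `ReadsFrom` bullet above, here is a minimal sketch mixing an output port of another DP with an external system (identifiers are hypothetical):

```yaml
readsFrom:
  - other_domain.other_product.1.raw   # DP_UK.OutputPort_ID
  - EX_legacy_crm                      # external system: EX_$systemdescription
```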
@@ -107,11 +114,21 @@ The **unique identifier** of a DataProduct is the concatenation of Domain, Name
 
 ### Storage Area
 
-* `ID: [String]` the unique identifier of the Storage Area
+* `ID: [String]` the unique identifier of the Storage Area.
+* * Constraints:
+* * * Allowed characters are `[a-zA-Z]` and `[_-]`
+* * * Storage Area ID is made of `$DPDomain.$DPIdentifier.$DPMajorVersion.$StorageAreaIdentifier`
 * `Name: [String]` the name of the Storage Area
 * `FullyQualifiedName: [String]` Human-readable name that uniquely identifies an entity
+* `StorageType: [String]` the kind of storage: Files - SQL - Events.
 * `Technology: [String]` this is a list of technologies: S3, ADLS, SQLServer, Kafka.
+* `Platform: [String]` This represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field, but it is useful to better understand the platform where the component will be running
 * `Description: [String]` detailed explanation about the function and the meaning of this storage area
+* `Kind: [String]` type of component. Allowed values: `[dataproduct | outputport | workload | storage | resource]`
+* `Allows: [Array[String]]` an array of LDAP/AD users/roles/groups. This field defines who has read-only access to this specific storage area
+* `Owners: [Array[String]]` an array of LDAP/AD users/roles/groups. This field defines who has all permissions on this specific storage area
+* `InfrastructureTemplateId: [String]` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several `UseCaseTemplateId`s
+* `UseCaseTemplateId: [String]` the id of the template used in the builder to create the component
 * `Tags: [Array[Yaml]]` Free tags at Storage Area level ( please refer to OpenMetadata https://docs.open-metadata.org/openmetadata/schemas/entities/tagcategory )
 * `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific technology or dependent on a standard/policy defined in the federated governance.
 
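The example.yaml below does not exercise the Storage Area section; a minimal sketch consistent with the fields above (the `storage:` key and all identifiers are hypothetical) might look like:

```yaml
storage:
  - id: my_domain.my_data_product.1.staging
    name: my_staging_area
    fullyQualifiedName: My Staging Area
    storageType: Files
    technology: s3_cdp
    platform: CDP on AWS
    description: internal staging area feeding the batch workload
    kind: storage
    allows: [user-1]
    owners: [user-2]
    infrastructureTemplateId: microservice-id-4
    useCaseTemplateId: template-id-4
    tags: []
    specific:
      bucket: ms-datamesh-s3
      directory: staging
```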
example.yaml

Lines changed: 71 additions & 30 deletions
@@ -1,50 +1,82 @@
-name: my_data_product
+id: my_domain.my_data_product.1
+name: my data product
+fullyQualifiedName: My Data Product
 domain: my_domain
 description: this data product is representing the xxx functional entity
 version: 1.0.0
-owner: Tom Smith
-email: mailto:tom.smith@corp.com
+kind: dataproduct
+dataProductOwner: user:tom_smith_corp.com
+dataProductOwnerDisplayName: Tom Smith
+email: mailto:distribution_list@corp.com
 informationSLA: 2WD
-status: work in progress
-environment:
-  name: develop
-specific: {}
-outputPorts:
-  - name: my_raw_s3_port
-    resourceType: raw
+status: DRAFT
+maturity: Strategic
+billing: {}
+tags: []
+specific: {}
+components:
+  - id: my_domain.my_data_product.1.raw
+    name: my_raw_s3_port
+    fullyQualifiedName: My Raw S3 Port
+    outputPortType: Files
     technology: s3_cdp
+    platform: CDP on AWS
     description: s3 raw output port
-    issueDate: 20210901
-    startDate: 20110101
-    expirationData: null
+    version: 1.0.1
+    kind: outputport
+    creationDate: 05-04-2022 16:53:00
+    startDate: null
     process_description: this output port is generated by a Spark Job scheduled every day at 2AM and it lasts for approx 2 hours
     billing_policy: 5$ for each full scan
     security_policy: In order to consume this output port an additional security check with compliance must be done
     consumer_policy: This is only for HR department and not suitable for institutional reporting.
+    SLO: {
+      intervalOfChange: 1 hour,
+      timeliness: 1 minute
+    }
     endpoint: /develop/my_domain/my_data_product/1.0.0/my_raw_s3_port
-    allow:
-      - user-1
-    owner: user-2
+    allows: [user-1]
+    owners: [user-2]
     dependsOn: []
+    infrastructureTemplateId: microservice-id-1
+    useCaseTemplateId: template-id-1
+    tags: []
+    sampleData: {}
+    schema: []
+    semanticLinKind: {}
     specific:
       directory: history
       bucket: ms-datamesh-s3
-  - name: my_view_impala_port
-    resourceType: view
+  - id: my_domain.my_data_product.1.impala
+    name: my_view_impala_port
+    fullyQualifiedName: My View Impala Port
+    outputPortType: SQL
     technology: impala_cdp
+    platform: CDP on AWS
     description: impala view output port
-    issueDate: 20210901
-    startDate: 20110101
-    expirationData: null
+    version: 1.1.0
+    kind: outputport
+    creationDate: 05-04-2022 17:00:00
+    startDate: null
     billing_policy:
     process_description:
     security_policy:
     consumer_policy:
-    endpoint: /develop/my_domain/my_data_product/1.0.0/my_view_impala_port
-    allow:
-      - user-1
-    owner: user-2
-    dependsOn: [my_raw_s3_port]
+    SLO: {
+      intervalOfChange: 1 hour,
+      timeliness: 1 minute
+    }
+    allows: [user-1]
+    owners: [user-2]
+    infrastructureTemplateId: microservice-id-2
+    useCaseTemplateId: template-id-2
+    dependsOn: [my_domain.my_data_product.1.raw]
+    tags: []
+    sampleData: {}
+    schema: []
+    semanticLinKind: {}
     specific:
       database: my_database
       table: my_table
@@ -54,11 +86,20 @@ outputPorts:
         lastName: string
       format: PARQUET
 workloads:
-  - name: my_spark_workload
-    resourceType: batch
-    technology: spark
+  - id: my_domain.my_data_product.1.batch1
+    name: my_spark_workload
+    fullyQualifiedName: My Spark workload
     description: spark batch workload
-    dependsOn: [my_raw_s3_port]
+    kind: workload
+    workloadType: batch
+    connectionType: DataPipeline
+    technology: spark
+    platform: CDP on AWS
+    version: 1.1.1
+    infrastructureTemplateId: microservice-id-3
+    useCaseTemplateId: template-id-3
+    tags: []
+    readsFrom: [my_domain.my_data_product.1.raw]
     specific:
       artifactory: ms-datamesh-s3
       artefact: /path/to/my/spark/workload.jar
