Commit 7e171f5

data contracts and sharing agreements (#10)
1 parent 4226bc2 commit 7e171f5

2 files changed: +54 -30 lines changed

README.md

Lines changed: 19 additions & 11 deletions
@@ -37,6 +37,8 @@ The fixed structure must be technology agnostic. The first fields of teh fixed s
3737
* `DataProductOwner: [String]` Data Product owner, the unique identifier of the actual user that owns, manages, and receives notifications about the Data Product. To make it technology independent it is usually the email address of the owner.
3838
* `DataProductOwnerDisplayName [String]`: the human readable version of `DataProductOwner`.
3939
* `Email: [Option[String]]` point of contact between consumers and maintainers of the Data Product. It could be the owner or a distribution list, but must be reliable and responsive.
40+
* `OwnerGroup [String]`: LDAP user/group that owns the data product
41+
* `DevGroup [String]`: LDAP user/group that is in charge of developing and maintaining the data product
4042
* `InformationSLA: [Option[String]]` describes what SLA the Data Product team is providing to answer additional information requests about the Data Product itself.
4143
* `Status: [Option[String]]` this is an enum representing the status of this version of the Data Product. Allowed values are: `[Draft|Published|Retired]`. This is a metadata that communicates the overall status of the Data Product but is not reflected to the actual deployment status.
4244
* `Maturity: [Option[String]]` this is an enum to let the consumer understand if it is a tactical solution or not. It is really useful during migration from Data Warehouse or Data Lake. Allowed values are: `[Tactical|Strategic]`.
@@ -62,8 +64,6 @@ Constraints:
6264
* Major version of the Data Product is always the same as the major version of all of its components and it is the same version that is shown in both Data Product ID and component ID.
6365
* `InfrastructureTemplateId: [String]` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates.
6466
* `UseCaseTemplateId: [Option[String]]` the id of the template used in the builder to create the component. Could be empty in case the component was not created from a builder template.
65-
* `Allows: [Array[String]]` It is an array of user/role/group related to LDAP/AD user. This field is defining who has access in read-only to this specific output port.
66-
* `Owners: [Array[String]]` It is an array of user/role/group related to LDAP/AD user. This field defines who has all permissions on this specific output port.
6767
* `DependsOn: [Array[String]]` An output port could depend on other output ports or storage areas, for example a SQL Output port could be dependent on a Raw Output Port because it is just an external table.
6868
Constraints:
6969
* This array will only contain IDs of other components.
@@ -73,16 +73,24 @@ Constraints:
7373
* `CreationDate: [Optional[String]]` when this output port has been created.
7474
* `StartDate: [Optional[String]]` the first business date present in the dataset, leave it empty for events or we can use some standard semantic like: "-7D, -1Y".
7575
* `ProcessDescription: [Option[String]]` what is the underlying process that contributes to generate the data exposed by this output port.
76-
* `BillingPolicy: [Option[String]]` how a consumer will be charged back when it consumes this output port.
77-
* `SecurityPolicy: [Option[String]]` additional information related to security aspects, like restrictions, maskings, sensibile information.
78-
* `ConsumptionPolicy: [Option[String]]` any other information needed by the consumer in order to effectively consume the data, it could be related to technical stuff, regulation, security, etc.
79-
* `SLO:[Yaml]`
80-
* `IntervalOfChange: [Option[String]]` how often changes in the data are reflected.
81-
* `Timeliness: [Option[String]]` the skew between the time that a business fact occuts and when it becomes visibile in the data.
82-
* `Endpoint: [Option[URL]]` this is the API endpoint that self-describe the output port and provide insightful information at runtime about the physical location of the data, the protocol must be used, etc.
83-
* `Tags: [Array[Yaml]]` Tag labels at OutputPort level ( please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel).
76+
* `DataContract: [Yaml]`: Any change to this section is a breaking change, because the producer is breaking the contract; it requires creating a new version of the data product to keep backward compatibility.
77+
* `Schema: [Array[Yaml]]` to describe a schema we propose to leverage the OpenMetadata specification: Ref https://docs.open-metadata.org/metadata-standard/schemas/entities/table#column. Each column can have a tag array and you can choose between simple LabelTags, ClassificationTags or DescriptiveTags. Here is an example of a classification tag: https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json.
78+
* `SLA: [Yaml]` Service Level Agreement, describes the quality of data delivery and of the output port in general. It represents the producer's overall promise to the consumers.
79+
* `IntervalOfChange: [Option[String]]` how often changes in the data are reflected.
80+
* `Timeliness: [Option[String]]` the skew between the time that a business fact occurs and when it becomes visible in the data.
81+
* `UpTime: [Option[String]]` the percentage of port availability.
82+
* `TermsAndConditions: [Option[String]]` If the data is usable only in specific environments.
83+
* `Endpoint: [Option[URL]]` this is the API endpoint that self-describes the output port and provides insightful information at runtime about the physical location of the data, the protocol that must be used, etc.
84+
* `DataSharingAgreement: [Yaml]` This part covers usage, privacy, purpose, and limitations, and is independent of the data contract.
85+
* `Purpose: [Option[String]]` what is the goal of this data set.
86+
* `Billing: [Option[String]]` how a consumer will be charged back when it consumes this output port.
87+
* `Security: [Option[String]]` additional information related to security aspects, like restrictions, masking, sensitive information and privacy.
88+
* `IntendedUsage: [Option[String]]` any other information needed by the consumer in order to effectively consume the data; it could be related to technical aspects (e.g. extract no more than one year of data for good performance) or to business domains (e.g. this data is only useful in the marketing domain).
89+
* `Limitations: [Option[String]]` If any limitation is present it must be made very clear to the consumers.
90+
* `LifeCycle: [Option[String]]` Describes how the data will be historicized and how and when it will be deleted.
91+
* `Confidentiality: [Option[String]]` Describes what a consumer should do to keep the information confidential, how to process and store it, and whether it may be shared or reported.
92+
* `Tags: [Array[Yaml]]` Tag labels at OutputPort level, here we can have security classification for example (please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel).
8493
* `SampleData: [Option[Yaml]]` provides a sample data of your Output Port. See OpenMetadata specification: https://docs.open-metadata.org/openmetadata/schemas/entities/table#tabledata
85-
* `Schema: [Array[Yaml]]` when it comes to describe a schema we propose to leverage OpenMetadata specification: Ref https://docs.open-metadata.org/metadata-standard/schemas/entities/table#column. Each column can have a tag array and you can choose between simples LabelTags, ClassificationTags or DescriptiveTags. Here an example of classification Tag https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json.
8694
* `SemanticLinking: [Option[Yaml]]` here we can express semantic relationships between this output port and other outputports (also coming from other domains and data products). For example we could say that column "customerId" of our SQL Output Port references the column "id" of the SQL Output Port of the "Customer" Data Product.
8795
* `Specific: [Yaml]` this is a custom section where we must put all the information strictly related to a specific technology or dependent from a standard/policy defined in the federated governance.
8896

example.yaml

Lines changed: 35 additions & 19 deletions
@@ -9,6 +9,8 @@ environment: development
99
dataProductOwner: tom_smith_corp.com
1010
dataProductOwnerDisplayName: Tom Smith
1111
email: mailto:distribution_list@corp.com
12+
ownerGroup: dataproduct1_corp.com
13+
devGroup: dataproduct1_dev_corp.com
1214
informationSLA: 2WD
1315
status: DRAFT
1416
maturity: Strategic
@@ -24,25 +26,32 @@ components:
2426
version: 1.0.1
2527
infrastructureTemplateId: microservice-id-1
2628
useCaseTemplateId: template-id-1
27-
allows: [user-1]
28-
owners: [user-2]
2929
dependsOn: []
3030
platform: CDP on AWS
3131
technology: s3_cdp
3232
outputPortType: Files
3333
creationDate: 05-04-2022 16:53:00
3434
startDate:
3535
processDescription: this output port is generated by a Spark Job scheduled every day at 2AM and it lasts for approx 2 hours
36-
billingPolicy: 5$ for each full scan
37-
securityPolicy: In order to consume this output port an additional security check with compliance must be done
38-
consumerPolicy: This is only for HR department and not suitable for institutional reporting.
39-
SLO:
40-
intervalOfChange: 1 hours
41-
timeliness: 1 minutes
42-
endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
36+
dataContract:
37+
schema: []
38+
SLA:
39+
intervalOfChange: 1 hours
40+
timeliness: 1 minutes
41+
upTime: 99.9%
42+
termsAndConditions: only usable in development environment
43+
endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
44+
dataSharingAgreements:
45+
purpose: this output port aims to provide a rich set of profitability KPIs related to the customer
46+
billing: 5$ for each full scan
47+
security: In order to consume this output port an additional security check with compliance must be done
48+
intendedUsage: the dataset is huge so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
49+
limitations: it is not possible to use this data without a compliance check
50+
lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
51+
confidentiality: if you want to store this data somewhere else, PII columns must be masked
4352
tags: []
4453
sampleData: {}
45-
schema: []
54+
4655
semanticLinking: {}
4756
specific:
4857
directory: history
@@ -55,22 +64,29 @@ components:
5564
version: 1.1.0
5665
infrastructureTemplateId: microservice-id-2
5766
useCaseTemplateId: template-id-2
58-
allows: [user-1]
59-
owners: [user-2]
6067
dependsOn: [urn:dmb:cmp:my_domain.my_data_product.1.my_raw_s3_port]
6168
platform: CDP on AWS
6269
technology: impala_cdp
6370
outputPortType: SQL
6471
creationDate: 05-04-2022 17:00:00
6572
startDate:
6673
processDescription:
67-
billingPolicy:
68-
securityPolicy:
69-
consumerPolicy:
70-
SLO:
71-
intervalOfChange: 1 hours
72-
timeliness: 1 minutes
73-
endpoint:
74+
dataContract:
75+
schema: []
76+
SLA:
77+
intervalOfChange: 1 hours
78+
timeliness: 1 minutes
79+
upTime: 99.9%
80+
termsAndConditions: only usable in development environment
81+
endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
82+
dataSharingAgreements:
83+
purpose: this output port aims to provide a rich set of profitability KPIs related to the customer
84+
billing: 5$ for each full scan
85+
security: In order to consume this output port an additional security check with compliance must be done
86+
intendedUsage: the dataset is huge so it is recommended to extract at most 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
87+
limitations: it is not possible to use this data without a compliance check
88+
lifeCycle: the maximum retention is 10 years, and eviction happens on the first of January
89+
confidentiality: if you want to store this data somewhere else, PII columns must be masked
7490
tags: []
7591
sampleData: {}
7692
schema: []
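
For descriptors still written in the pre-commit layout, the field moves in this diff can be sketched as a small conversion. The function name `migrate_output_port` is hypothetical, and the mapping is inferred from the removed and added lines rather than stated by the commit: `SLO` becomes `SLA`, `schema` and `endpoint` move under `dataContract`, the three `*Policy` keys become `billing`, `security`, and `intendedUsage`, and `allows`/`owners` are dropped. The nesting of `dataContract` is also an assumption, since indentation is not visible in the diff.

```python
from typing import Any


def migrate_output_port(old: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an old-layout output port in the new nested layout."""
    port = dict(old)  # shallow copy, so the caller's dict is left untouched

    # The commit removes these per-port fields (ownerGroup/devGroup now
    # live at Data Product level), so they are simply dropped here.
    port.pop("allows", None)
    port.pop("owners", None)

    # Assumed nesting: schema, SLA and endpoint sit directly under dataContract.
    port["dataContract"] = {
        "schema": port.pop("schema", []),
        "SLA": port.pop("SLO", None) or {},
        "endpoint": port.pop("endpoint", None),
    }

    # Key spelled as in example.yaml ("dataSharingAgreements", plural).
    port["dataSharingAgreements"] = {
        "billing": port.pop("billingPolicy", None),
        "security": port.pop("securityPolicy", None),
        "intendedUsage": port.pop("consumerPolicy", None),
    }
    return port
```

New fields with no counterpart in the old layout (`upTime`, `termsAndConditions`, `purpose`, `limitations`, `lifeCycle`, `confidentiality`) are left for the producer to fill in by hand.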
