Data Discovery in GCP: Dataplex Tags, Tag Templates, Entry Groups and Entries
Efficient data discovery is crucial to the success of any data-driven organization. As data stewards and owners, our primary responsibility is to define business metadata that meets organizational needs.
A well-structured tagging system combined with organized entry groups in Google Cloud creates a robust environment for exploring, understanding, and using data, helping organizations get the full potential out of their data assets.
Tags
Tags are business metadata fields attached to data entries in platforms like BigQuery and Cloud Storage, providing essential context for understanding and categorizing data.
Tag Template
Tag templates serve as the cornerstone of a comprehensive data governance strategy. They let you create and manage metadata for data assets in a centralized manner, and act as a database schema for your metadata.
By utilizing tag templates, data stewards can create a standardized framework for tagging, promoting consistency and coherence across the entire data landscape. Whether it’s public tags supporting simple searches or private tags enabling predicate-based searches, the flexibility offered ensures that data discovery is not only efficient but also tailored to the specific needs of the organization.
- Public tags suit a broad set of scenarios and are intuitive to use. They support both simple search and search with predicates, while private tags support only search with predicates.
1. Create a Tag Template
Let's start by creating a data governance template.
1.1 Create the tag template body in `EmployeeDataGovernanceTagTemplate.json`
{
  "name": "EMPLOYEE_DATA_GOVERNANCE_TAG_TEMPLATE",
  "displayName": "Employee Data Governance Tag Template",
  "fields": {
    "num_rows": {
      "displayName": "Number of rows in data asset",
      "isRequired": false,
      "type": {
        "primitiveType": "DOUBLE"
      }
    },
    "source": {
      "displayName": "Source",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "tier": {
      "displayName": "Tier that this data asset belongs to",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "data_rentention_policy": {
      "displayName": "Data Retention Policy",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "data_classification": {
      "displayName": "Data Classification",
      "isRequired": true,
      "type": {
        "enumType": {
          "allowedValues": [
            {
              "displayName": "INTERNAL"
            },
            {
              "displayName": "PUBLIC"
            },
            {
              "displayName": "SENSITIVE"
            },
            {
              "displayName": "CONFIDENTIAL"
            },
            {
              "displayName": "NONE"
            }
          ]
        }
      }
    },
    "has_pii": {
      "displayName": "Has PII",
      "isRequired": false,
      "type": {
        "primitiveType": "BOOL"
      }
    },
    "pii_type": {
      "displayName": "PII type",
      "isRequired": false,
      "type": {
        "enumType": {
          "allowedValues": [
            {
              "displayName": "CUSTOMER_ID"
            },
            {
              "displayName": "EMAIL_ADDRESS"
            },
            {
              "displayName": "PHONE_NUMBER"
            },
            {
              "displayName": "NONE"
            }
          ]
        }
      }
    }
  }
}
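Before posting a template, it can help to sanity-check the JSON locally. The sketch below is only an illustration: `check_template` is a hypothetical helper (not part of any Google library), and the set of primitive types it accepts is an assumption based on the types used in this walkthrough plus `TIMESTAMP`, so verify it against the API reference for the version you call.

```python
# Hypothetical local check for a tag template body; not an official validator.
# Assumption: these are the primitive types accepted by the API version in use.
PRIMITIVE_TYPES = {"DOUBLE", "STRING", "BOOL", "TIMESTAMP"}


def check_template(template):
    """Return a list of human-readable problems found in a template dict."""
    problems = []
    for field_id, field in template.get("fields", {}).items():
        # isRequired is optional, but when present it should be a JSON boolean.
        if not isinstance(field.get("isRequired", False), bool):
            problems.append(f"{field_id}: isRequired should be a JSON boolean")

        field_type = field.get("type", {})
        has_primitive = "primitiveType" in field_type
        has_enum = "enumType" in field_type
        if has_primitive == has_enum:
            problems.append(
                f"{field_id}: set exactly one of primitiveType or enumType"
            )
        elif has_primitive and field_type["primitiveType"] not in PRIMITIVE_TYPES:
            problems.append(f"{field_id}: unknown primitive type")
        elif has_enum and not field_type["enumType"].get("allowedValues"):
            problems.append(f"{field_id}: enumType needs allowedValues")
    return problems
```

To use it, load the file with `json.load` and print whatever `check_template` returns; an empty list means none of these particular checks failed.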
1.2 Create the tag template `employee_data_governance_tag_template`
API:
https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us-west1/tagTemplates?tagTemplateId=employee_data_governance_tag_template
curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-goog-api-key: <API_KEY>" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @EmployeeDataGovernanceTagTemplate.json \
https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us-west1/tagTemplates\?tagTemplateId\=employee_data_governance_tag_template
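For readers who prefer scripting the call, here is a rough Python equivalent of the curl command above, using only the standard library. It builds the request but does not send it. The function name `build_create_template_request` is made up for this sketch, and the token is assumed to come from `gcloud auth print-access-token`.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://datacatalog.googleapis.com/v1beta1"


def build_create_template_request(project, location, template_id, body, token):
    """Build (but do not send) the POST request that creates a tag template."""
    # urlencode takes care of escaping, so no manual \? or \= is needed.
    query = urllib.parse.urlencode({"tagTemplateId": template_id})
    url = f"{BASE_URL}/projects/{project}/locations/{location}/tagTemplates?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# To actually send it (requires network access and a valid token):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before anything is created in the project.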
2. Look up the Data Catalog entry ID for your BigQuery table
curl -X GET \
-H "Content-Type: application/json; charset=utf-8" \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
https://datacatalog.googleapis.com/v1beta1/entries:lookup\?linkedResource\=//bigquery.googleapis.com/projects/data-proc-poc/datasets/employeedb/tables/employee
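The `linkedResource` value in the lookup call follows a fixed pattern for BigQuery tables, so it is easy to generate. The helpers below are a small sketch with hypothetical names; they only build strings and make no network calls.

```python
import urllib.parse


def bigquery_linked_resource(project, dataset, table):
    """Full resource name of a BigQuery table, as used in the lookup above."""
    return (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )


def lookup_url(linked_resource, api_version="v1beta1"):
    """URL for entries:lookup with the linkedResource query parameter encoded."""
    query = urllib.parse.urlencode({"linkedResource": linked_resource})
    return f"https://datacatalog.googleapis.com/{api_version}/entries:lookup?{query}"


resource = bigquery_linked_resource("data-proc-poc", "employeedb", "employee")
url = lookup_url(resource)
```

The slashes in the resource name come out percent-encoded here, which is a valid way to pass the same value that the curl command passes unencoded.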
3. Create a tag from the template and attach it to your BigQuery table
3.1 Request body in `employee_tag.json`
{
  "name": "projects/data-proc-poc/locations/us/entryGroups/@bigquery/entries/<entry-id>/tags/employee_data_governance",
  "template": "projects/data-proc-poc/locations/us-west1/tagTemplates/employee_data_governance_tag_template",
  "fields": {
    "pii_type": {
      "displayName": "PII type",
      "enumValue": {
        "displayName": "EMAIL_ADDRESS"
      }
    },
    "has_pii": {
      "displayName": "Has PII",
      "boolValue": true
    },
    "source": {
      "displayName": "Source",
      "stringValue": "External"
    },
    "data_rentention_policy": {
      "displayName": "Data Retention",
      "stringValue": "4 Years"
    },
    "tier": {
      "displayName": "Data tier",
      "stringValue": "2"
    },
    "data_classification": {
      "displayName": "Data Classification",
      "enumValue": {
        "displayName": "INTERNAL"
      }
    }
  },
  "templateDisplayName": "employee data governance"
}
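A tag is only accepted if it fits its template: required fields must be present and enum values must be among the template's allowed values. The following sketch checks both conditions locally before posting. `validate_tag` is a hypothetical helper written for this article, and it covers only these two rules, not everything the service checks.

```python
def validate_tag(tag_fields, template_fields):
    """Return a list of mismatches between a tag's fields and its template."""
    errors = []

    # Every field marked as required in the template must appear in the tag.
    for field_id, spec in template_fields.items():
        required = str(spec.get("isRequired", False)).lower() == "true"
        if required and field_id not in tag_fields:
            errors.append(f"missing required field: {field_id}")

    for field_id, value in tag_fields.items():
        spec = template_fields.get(field_id)
        if spec is None:
            errors.append(f"unknown field: {field_id}")
            continue
        # Enum fields must use one of the template's allowedValues.
        enum_type = spec.get("type", {}).get("enumType")
        if enum_type is not None:
            allowed = {v["displayName"] for v in enum_type["allowedValues"]}
            chosen = value.get("enumValue", {}).get("displayName")
            if chosen not in allowed:
                errors.append(f"{field_id}: {chosen!r} is not an allowed value")
    return errors
```

Pass it the `fields` object of the tag and the `fields` object of the template, both loaded with `json.load`. An empty list means the tag passed these checks.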
3.2 Post the tag
curl -X POST \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @employee_tag.json \
https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us/entryGroups/@bigquery/entries/<entry-id>/tags
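The `<entry-id>` in the URL above comes from the lookup in step 2, whose response includes the entry's full resource name in its `name` field. As a sketch, the tags endpoint can be derived from that name directly. `tags_url` is a hypothetical helper, and `abc123` below is a placeholder, not a real entry ID.

```python
def tags_url(entry_name, api_version="v1beta1"):
    """Build the tags endpoint from an entry resource name.

    Expects: projects/<p>/locations/<l>/entryGroups/<g>/entries/<e>
    """
    parts = entry_name.split("/")
    expected_keys = ["projects", "locations", "entryGroups", "entries"]
    if len(parts) != 8 or parts[0::2] != expected_keys:
        raise ValueError(f"unexpected entry name: {entry_name!r}")
    return f"https://datacatalog.googleapis.com/{api_version}/{entry_name}/tags"


# Placeholder entry name for illustration only.
example = tags_url(
    "projects/data-proc-poc/locations/US/entryGroups/@bigquery/entries/abc123"
)
```

Deriving the URL from the returned name avoids retyping the project, location, and entry group by hand.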
3.3 Check the Dataplex UI to confirm the tag is published and attached
Entry Groups
Entry Groups: Organizing Data for Intuitive Discovery
Custom entry groups let you manage entries for Cloud Storage filesets or custom data resource types. Grouping entries this way keeps data organized logically and makes data discovery more efficient.
1. Creating a Custom Entry Group
1.1 Create an Entry Group `cloud_storage_fileset`
Request body in cloud-storage-entry-group.json
{
  "description": "Cloud Storage fileset entries",
  "displayName": "Cloud Storage Fileset",
  "name": "cloud-storage-fileset"
}
1.2 Post `cloud-storage-entry-group.json`
curl -X POST \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @cloud-storage-entry-group.json \
"https://datacatalog.googleapis.com/v1/projects/data-proc-poc/locations/us-west1/entryGroups?entryGroupId=cloud_storage_fileset"
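Note that the `entryGroupId` in the request uses underscores. To my understanding, entry group IDs may contain only letters, numbers, and underscores, must start with a letter or underscore, and are limited to 64 bytes; treat these rules as an assumption and confirm them in the API reference. A quick local check, with a hypothetical helper name, might look like this:

```python
import re

# Assumed rule: letters, digits, underscores; starts with a letter or underscore.
_ENTRY_GROUP_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_entry_group_id(entry_group_id):
    """Check an entry group ID against the assumed naming rules."""
    if len(entry_group_id.encode("utf-8")) > 64:
        return False
    return _ENTRY_GROUP_ID.fullmatch(entry_group_id) is not None
```

Under these rules `cloud_storage_fileset` passes, while a hyphenated ID such as `cloud-storage-fileset` does not.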
1.3 Open the Dataplex UI and go to Entry Groups
2.1 Create an entry under the custom entry group
2.1.1 Create the JSON body for a `FILESET` entry in `custom_entry.json`
{
  "fullyQualifiedName": "gcs:<bucket_name>",
  "type": "FILESET",
  "dataSource": {
    "resource": "",
    "service": "CLOUD_STORAGE",
    "storageProperties": {
      "fileType": "CSV",
      "filePattern": [
        "data/input/employee.csv"
      ]
    }
  },
  "linkedResource": "storage.googleapis.com/storage/v1/b/<bucket_name>",
  "labels": {
    "appid": "app1234"
  },
  "gcsFilesetSpec": {
    "filePatterns": [
      "gs://<bucket_name>/data/input/employee.csv"
    ]
  },
  "schema": {
    "columns": [
      {
        "column": "employee_name",
        "type": "STRING",
        "ordinalPosition": 0
      },
      {
        "column": "department",
        "type": "STRING",
        "ordinalPosition": 1
      },
      {
        "column": "state",
        "type": "STRING",
        "ordinalPosition": 2
      },
      {
        "column": "salary",
        "type": "INTEGER",
        "ordinalPosition": 3
      },
      {
        "column": "age",
        "type": "INTEGER",
        "ordinalPosition": 4
      },
      {
        "column": "bonus",
        "type": "INTEGER",
        "ordinalPosition": 5
      },
      {
        "column": "employee_id",
        "type": "STRING",
        "ordinalPosition": 6
      }
    ]
  }
}
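Writing the `schema.columns` list by hand is tedious and makes it easy to leave stray spaces in column names. As an illustration, the list can be generated from the CSV header line. `columns_from_header` is a hypothetical helper, and the type mapping is supplied by the caller, since a header alone does not reveal column types.

```python
import csv
import io


def columns_from_header(header_line, types):
    """Turn a CSV header line into a list of schema column dicts.

    `types` maps column names to type strings; columns not listed
    default to STRING (an assumption made for this sketch).
    """
    names = next(csv.reader(io.StringIO(header_line)))
    columns = []
    for position, raw_name in enumerate(names):
        name = raw_name.strip()  # drop whitespace around header cells
        columns.append(
            {
                "column": name,
                "type": types.get(name, "STRING"),
                "ordinalPosition": position,
            }
        )
    return columns


header = "employee_name, department,state,salary,age,bonus, employee_id"
integer_columns = {"salary": "INTEGER", "age": "INTEGER", "bonus": "INTEGER"}
schema_columns = columns_from_header(header, integer_columns)
```

The resulting list can be placed under `schema.columns` in `custom_entry.json`, for example by building the whole body as a dict and writing it with `json.dump`.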
2.1.2 Create an Entry `employee_entry`.
curl -X POST \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @custom_entry.json \
https://datacatalog.googleapis.com/v1/projects/data-proc-poc/locations/us-west1/entryGroups/cloud_storage_fileset/entries\?entryId\=employee_entry
2.1.3 Open the Dataplex UI and go to Entry Groups, then Entries
Conclusion
As data stewards and data owners, it is our responsibility to define the business metadata for the data that serves our business needs. By following the outlined steps for Tags, Tag Templates, and Entry Groups in Google’s Data Catalog, we can effectively enhance data governance, ensuring the proper management and contextualization of our valuable data assets.