Data Discovery in GCP: Dataplex Tags, Tag Templates, Entry Groups and Entries

Mahendran
5 min read · Jan 16, 2024

Efficient data discovery is crucial to business success in a data-driven organization. As data stewards and owners, our primary responsibility is to define business metadata that meets organizational needs.

In Google Cloud, a well-structured tagging system combined with organized entry groups creates a robust environment for exploring, understanding, and using data, helping organizations realize the full potential of their data assets.

Tags

Tags are business metadata fields attached to data entries in platforms like BigQuery and Cloud Storage, providing essential context for understanding and categorizing data.

Tag Template

Tag templates are the cornerstone of a comprehensive data governance strategy. They let you create and manage metadata for data assets centrally, acting as a database schema for your metadata.

By utilizing tag templates, data stewards can create a standardized framework for tagging, promoting consistency and coherence across the entire data landscape. Whether it’s public tags supporting simple searches or private tags enabling predicate-based searches, the flexibility offered ensures that data discovery is not only efficient but also tailored to the specific needs of the organization.

  • Public tags suit a broad set of scenarios and are intuitive to use. They support both simple search and search with predicates, while private tags support only search with predicates.
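To make "search with predicates" concrete, here is a minimal Python sketch that builds a request body for the Data Catalog `catalog:search` method. It only constructs the payload and sends nothing. The predicate form `tag:<template_id>.<field>=<value>` is my reading of the Data Catalog search syntax and should be treated as an assumption to verify against the official docs. The helper name `build_search_body` is hypothetical.

```python
import json

# Endpoint for the Data Catalog search method (the request is not sent here).
SEARCH_URL = "https://datacatalog.googleapis.com/v1/catalog:search"


def build_search_body(project_id, template_id, field, value):
    """Build a catalog:search request body that filters on one tag field."""
    # Assumed predicate form: tag:<template_id>.<field>=<value>
    query = f"tag:{template_id}.{field}={value}"
    return {
        "query": query,
        # Restrict the search to a single project.
        "scope": {"includeProjectIds": [project_id]},
    }


body = build_search_body(
    "data-proc-poc",
    "employee_data_governance_tag_template",
    "has_pii",
    "true",
)
print(json.dumps(body, indent=2))
```

The printed JSON could be saved to a file and posted to `SEARCH_URL` with the same curl pattern used in the rest of this article.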

1. Create a Tag Template

Let's start by creating a data governance template.

1.1 Define the Tag Template in `employeeDataGovernanceTagTemplate.json`

{
  "name": "CUSTOMER_DATA_GOVERNANCE_TAG_TEMPLATE",
  "displayName": "Customer Data Governance Tag Template",
  "fields": {
    "num_rows": {
      "displayName": "Number of rows in data asset",
      "isRequired": false,
      "type": {
        "primitiveType": "DOUBLE"
      }
    },
    "source": {
      "displayName": "Source",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "tier": {
      "displayName": "Tier that this data asset belongs to",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "data_rentention_policy": {
      "displayName": "Data Retention Policy",
      "isRequired": true,
      "type": {
        "primitiveType": "STRING"
      }
    },
    "data_classification": {
      "displayName": "Data Classification",
      "isRequired": true,
      "type": {
        "enumType": {
          "allowedValues": [
            {"displayName": "INTERNAL"},
            {"displayName": "PUBLIC"},
            {"displayName": "SENSITIVE"},
            {"displayName": "CONFIDENTIAL"},
            {"displayName": "NONE"}
          ]
        }
      }
    },
    "has_pii": {
      "displayName": "Has PII",
      "isRequired": false,
      "type": {
        "primitiveType": "BOOL"
      }
    },
    "pii_type": {
      "displayName": "PII type",
      "isRequired": false,
      "type": {
        "enumType": {
          "allowedValues": [
            {"displayName": "CUSTOMER_ID"},
            {"displayName": "EMAIL_ADDRESS"},
            {"displayName": "PHONE_NUMBER"},
            {"displayName": "NONE"}
          ]
        }
      }
    }
  }
}

1.2 Create the Tag Template `employee_data_governance_tag_template`

API:

https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us-west1/tagTemplates?tagTemplateId=employee_data_governance_tag_template

curl -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "X-goog-api-key: <API_KEY>" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d @employeeDataGovernanceTagTemplate.json \
  "https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us-west1/tagTemplates?tagTemplateId=employee_data_governance_tag_template"

Response tag
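If you would rather script this call than type the curl command, the following Python sketch builds the same POST request with the standard library. It constructs the request object without sending it. The function name `build_create_template_request`, the `<ACCESS_TOKEN>` placeholder, and the trimmed-down template body are illustrative assumptions, not part of the Data Catalog API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://datacatalog.googleapis.com/v1beta1"


def build_create_template_request(project, location, template_id, template, token):
    """Return an unsent POST request that would create a tag template."""
    # urlencode takes care of escaping the query string.
    query = urllib.parse.urlencode({"tagTemplateId": template_id})
    url = f"{BASE_URL}/projects/{project}/locations/{location}/tagTemplates?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(template).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_create_template_request(
    "data-proc-poc",
    "us-west1",
    "employee_data_governance_tag_template",
    # Trimmed body for illustration; in practice, load the full JSON file.
    {"displayName": "Customer Data Governance Tag Template", "fields": {}},
    token="<ACCESS_TOKEN>",  # placeholder, e.g. from `gcloud auth print-access-token`
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before anything reaches the API.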

2. Look up the Data Catalog entry-id for your BigQuery table

curl -X GET \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  "https://datacatalog.googleapis.com/v1beta1/entries:lookup?linkedResource=//bigquery.googleapis.com/projects/data-proc-poc/datasets/employeedb/tables/employee"
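The lookup step can be sketched in Python as two small helpers: one that builds the lookup URL for a BigQuery table, and one that pulls the entry ID out of the `name` field of the returned entry. Both helper names are hypothetical, and the sample resource name with `example-entry-id` is made up for illustration. I am assuming the response `name` has the form `projects/.../entryGroups/@bigquery/entries/<entry-id>`.

```python
import urllib.parse

LOOKUP_URL = "https://datacatalog.googleapis.com/v1beta1/entries:lookup"


def build_lookup_url(project, dataset, table):
    """Build the entries:lookup URL for a BigQuery table."""
    linked_resource = (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )
    # safe="/" keeps the slashes in the resource path unescaped.
    query = urllib.parse.urlencode({"linkedResource": linked_resource}, safe="/")
    return f"{LOOKUP_URL}?{query}"


def entry_id_from_name(entry_name):
    """Return the trailing entry ID from an entry resource name."""
    parts = entry_name.split("/")
    if len(parts) < 2 or parts[-2] != "entries":
        raise ValueError(f"not an entry resource name: {entry_name!r}")
    return parts[-1]


lookup_url = build_lookup_url("data-proc-poc", "employeedb", "employee")
print(lookup_url)

# Hypothetical resource name, shaped like the "name" field of a lookup response.
sample_name = "projects/data-proc-poc/locations/US/entryGroups/@bigquery/entries/example-entry-id"
print(entry_id_from_name(sample_name))
```

The extracted ID is the value to substitute for `<entry-id>` in the next step.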

3. Create a tag from the template and attach it to your BigQuery table

3.1 Request body in employee_tag.json

{
  "name": "projects/project-id/locations/US/entryGroups/@bigquery/entries/<entry-id>/tags/employee_data_governance",
  "template": "projects/data-proc-poc/locations/us-west1/tagTemplates/employee_data_governance_tag_template",
  "fields": {
    "pii_type": {
      "displayName": "PII type",
      "enumValue": {
        "displayName": "CUSTOMER_ID"
      }
    },
    "has_pii": {
      "displayName": "Has PII",
      "boolValue": true
    },
    "source": {
      "displayName": "Source",
      "stringValue": "External"
    },
    "data_rentention_policy": {
      "displayName": "Data Retention Policy",
      "stringValue": "4 Years"
    },
    "tier": {
      "displayName": "Data tier",
      "stringValue": "2"
    },
    "data_classification": {
      "displayName": "Data Classification",
      "enumValue": {
        "displayName": "INTERNAL"
      }
    }
  },
  "templateDisplayName": "employee data governance"
}
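A tag must agree with its template: every required field has to be present, and each enum value has to be one of the template's allowed values. The Python sketch below runs that check locally before posting. The function `validate_tag` is a hypothetical helper, and it assumes `isRequired` is stored as a JSON boolean. The sample data is a cut-down template plus a deliberately broken tag, chosen to trigger both checks.

```python
def validate_tag(template_fields, tag_fields):
    """Return a list of problems found when checking a tag against its template."""
    problems = []
    # Every required template field must appear in the tag.
    for field_id, spec in template_fields.items():
        if spec.get("isRequired") and field_id not in tag_fields:
            problems.append(f"missing required field: {field_id}")
    # Every tag field must exist in the template, and enum values must be allowed.
    for field_id, value in tag_fields.items():
        spec = template_fields.get(field_id)
        if spec is None:
            problems.append(f"unknown field: {field_id}")
            continue
        enum_type = spec["type"].get("enumType")
        if enum_type is not None:
            allowed = {v["displayName"] for v in enum_type["allowedValues"]}
            chosen = value.get("enumValue", {}).get("displayName")
            if chosen not in allowed:
                problems.append(f"{field_id}: {chosen!r} is not an allowed value")
    return problems


# Cut-down template for illustration.
template_fields = {
    "source": {"isRequired": True, "type": {"primitiveType": "STRING"}},
    "pii_type": {
        "isRequired": False,
        "type": {
            "enumType": {
                "allowedValues": [
                    {"displayName": "CUSTOMER_ID"},
                    {"displayName": "EMAIL_ADDRESS"},
                ]
            }
        },
    },
}

# Deliberately broken tag: no "source", and an enum value outside the allowed list.
tag_fields = {"pii_type": {"enumValue": {"displayName": "BONUS"}}}

problems = validate_tag(template_fields, tag_fields)
for problem in problems:
    print(problem)
```

An empty list means the tag passed these two checks. The API may enforce further rules that this sketch does not cover.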

3.2 Post the Tag

curl -X POST \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d @employee_tag.json \
  "https://datacatalog.googleapis.com/v1beta1/projects/data-proc-poc/locations/us/entryGroups/@bigquery/entries/<entry-id>/tags"

3.3 Check the UI for the published and attached Tag

Employee Data Governance Tag

Entry Groups

Entry Groups: Organizing Data for Intuitive Discovery

Custom entry groups let you manage entries for Cloud Storage filesets or custom data resource types. They keep data logically organized, which makes data discovery more efficient.

1. Creating a Custom Entry Group

1.1 Create an Entry Group `cloud_storage_fileset`

Request body in cloud-storage-entry-group.json

{
  "description": "Cloud Storage fileset entries",
  "displayName": "Cloud Storage Fileset",
  "name": "cloud-storage-fileset"
}

1.2 Post `cloud-storage-entry-group.json`

curl -X POST \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d @cloud-storage-entry-group.json \
  "https://datacatalog.googleapis.com/v1/projects/data-proc-poc/locations/us-west1/entryGroups?entryGroupId=cloud_storage_fileset"

Response
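Note that the `entryGroupId` in the URL uses underscores rather than hyphens. As I understand the Data Catalog rules, an entry group ID may contain only letters, numbers, and underscores, must start with a letter or underscore, and is limited to 64 characters. The Python sketch below encodes that rule as an assumption and derives a candidate ID from a display name. Both function names are hypothetical.

```python
import re

# Assumed rule: letters, digits, underscores; starts with a letter or
# underscore; at most 64 characters in total.
ENTRY_GROUP_ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_valid_entry_group_id(entry_group_id):
    """Check an ID against the assumed entry group ID rule."""
    return ENTRY_GROUP_ID_PATTERN.fullmatch(entry_group_id) is not None


def to_entry_group_id(display_name):
    """Derive a candidate ID by replacing runs of invalid characters with underscores."""
    candidate = re.sub(r"[^A-Za-z0-9_]+", "_", display_name.strip()).lower()
    # An ID cannot be empty or start with a digit, so prefix an underscore.
    if not candidate or candidate[0].isdigit():
        candidate = "_" + candidate
    return candidate[:64]


print(is_valid_entry_group_id("cloud-storage-fileset"))
print(is_valid_entry_group_id("cloud_storage_fileset"))
print(to_entry_group_id("Cloud Storage Fileset"))
```

Checking the ID up front avoids a round trip to the API just to learn that a hyphenated name was rejected.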

1.3 Open the Dataplex UI and go to Entry Groups

2. Create an Entry under the Custom Entry Group

2.1 Create the JSON body for a `FILESET` entry in `custom_entry.json`

{
  "fullyQualifiedName": "gcs:<bucket_name>",
  "type": "FILESET",
  "dataSource": {
    "resource": "",
    "service": "CLOUD_STORAGE",
    "storageProperties": {
      "fileType": "CSV",
      "filePattern": [
        "data/input/employee.csv"
      ]
    }
  },
  "linkedResource": "storage.googleapis.com/storage/v1/b/<bucket_name>",
  "labels": {
    "appid": "app1234"
  },
  "gcsFilesetSpec": {
    "filePatterns": [
      "gs://<bucket_name>/data/input/employee.csv"
    ]
  },
  "schema": {
    "columns": [
      {
        "column": "employee_name",
        "type": "STRING",
        "ordinalPosition": 0
      },
      {
        "column": "department",
        "type": "STRING",
        "ordinalPosition": 1
      },
      {
        "column": "state",
        "type": "STRING",
        "ordinalPosition": 2
      },
      {
        "column": "salary",
        "type": "INTEGER",
        "ordinalPosition": 3
      },
      {
        "column": "age",
        "type": "INTEGER",
        "ordinalPosition": 4
      },
      {
        "column": "bonus",
        "type": "INTEGER",
        "ordinalPosition": 5
      },
      {
        "column": "employee_id",
        "type": "STRING",
        "ordinalPosition": 6
      }
    ]
  }
}
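Writing the `schema.columns` list by hand is error-prone, for example when a stray space slips into a column name. As a rough sketch, the list can be generated from the header row of the CSV file instead. The helper `columns_from_csv_header` and the sample header are assumptions for illustration. Column types default to `STRING` unless overridden, since a CSV header carries no type information.

```python
import csv
import io


def columns_from_csv_header(csv_text, column_types=None):
    """Build schema column definitions from the header row of CSV text."""
    column_types = column_types or {}
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = []
    for position, raw_name in enumerate(header):
        # Strip surrounding whitespace so names such as " department" are cleaned.
        name = raw_name.strip()
        columns.append(
            {
                "column": name,
                "type": column_types.get(name, "STRING"),
                "ordinalPosition": position,
            }
        )
    return columns


# Hypothetical header row; note the stray space before "department".
sample_csv = "employee_name, department,state,salary\n"
columns = columns_from_csv_header(sample_csv, {"salary": "INTEGER"})
for column in columns:
    print(column)
```

The resulting list can be dropped into the `schema.columns` field of the entry body.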

2.2 Create the Entry `employee_entry`

curl -X POST \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d @custom_entry.json \
  "https://datacatalog.googleapis.com/v1/projects/data-proc-poc/locations/us-west1/entryGroups/cloud_storage_fileset/entries?entryId=employee_entry"

Response

2.3 Open the Dataplex UI and go to Entry Groups and Entries

Conclusion

As data stewards and data owners, it is our responsibility to define the business metadata for the data that serves our business needs. By following the outlined steps for Tags, Tag Templates, and Entry Groups in Google’s Data Catalog, we can effectively enhance data governance, ensuring the proper management and contextualization of our valuable data assets.

