Data Governance and Data Management in GCP: Dataplex Lakes, Zones, Assets

Mahendran
4 min read · Jan 5, 2024

What is Dataplex?

Dataplex is a data fabric that unifies distributed data and automates data management and governance for that data.

Why use Dataplex?

Enterprises have data that is distributed across data lakes, data warehouses, and data marts. Using Dataplex, you can do the following:

  • Discover data
  • Curate data
  • Unify data without any data movement
  • Organize data based on your business needs
  • Centrally manage, monitor, and govern data

Glossary:

Lake: A Lake is a data domain or a business unit, for example Customer, Product, Retail, Sales, or Finance.

Zone: A Zone is a subdomain within a lake. Its type identifies whether the data is Raw or Curated.

Asset: An Asset maps data stored in Cloud Storage or BigQuery into a zone. You can map data stored in separate Google Cloud projects as assets into a single zone within a lake.

Entity: Represents metadata for structured and semi-structured data (table) and unstructured data (fileset).
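
The hierarchy above also shows up in resource names: an asset lives under a zone, which lives under a lake. A minimal sketch of how those names nest, following the path layout used by the REST URLs later in this article (the function names and argument values are hypothetical):

```shell
# Build the full resource name for each level of the Dataplex hierarchy.
# Arguments: project ID, location, lake ID, then zone ID and asset ID as needed.
lake_name()  { printf 'projects/%s/locations/%s/lakes/%s\n' "$1" "$2" "$3"; }
zone_name()  { printf '%s/zones/%s\n' "$(lake_name "$1" "$2" "$3")" "$4"; }
asset_name() { printf '%s/assets/%s\n' "$(zone_name "$1" "$2" "$3" "$4")" "$5"; }

# Hypothetical example values.
asset_name my-project us-west1 sales-lake raw-zone orders-bucket
```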

Prerequisites

1. Enable the Dataplex API
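
The API can be enabled from the console or from the command line. A small sketch using gcloud, wrapped in a helper function whose name is hypothetical; it assumes your account is allowed to enable services on the project:

```shell
# Enable the Dataplex API for the given project.
# $1 = project ID
enable_dataplex_api() {
  gcloud services enable dataplex.googleapis.com --project "$1"
}

# Usage: enable_dataplex_api <PROJECT_ID>
```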

2. Create a bucket

gcloud storage buckets create gs://<BUCKET_NAME>/ \
--project <PROJECT_ID> \
--default-storage-class STANDARD

3. Create a Dataproc Metastore

The gRPC endpoint must be enabled for the Metastore instance.
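
The article does not show the creation command itself. One possible sketch, assuming the `--endpoint-protocol` flag of `gcloud metastore services create` is available in your gcloud version; the helper name is hypothetical and all other settings (tier, version, network) are left at their defaults:

```shell
# Create a Dataproc Metastore service that exposes a gRPC endpoint.
# $1 = metastore ID, $2 = location (for example us-west1)
create_grpc_metastore() {
  gcloud metastore services create "$1" \
    --location="$2" \
    --endpoint-protocol=grpc
}

# Usage: create_grpc_metastore <METASTORE_ID> us-west1
```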

Get the name of the Metastore

gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(name)"

4. Add Permissions

You can either add the permissions to a custom role, if you have created one, or grant a predefined role to the service account.

4.1 Add a role to the service account

gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member='serviceAccount:<SERVICE-ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com' \
--role='roles/dataplex.admin'

4.2 Add permissions to the custom role

List the permissions included in the predefined Dataplex Admin role:

gcloud iam roles describe roles/dataplex.admin

4.3 Add these permissions to the custom role

Add the listed permissions to the role definition file, then update the custom role. For instance, if ROLE_ID is dataproc_sa_role:

gcloud iam roles update \
--project=<PROJECT_ID> <ROLE_ID> \
--file=dataproc_sa_role.yaml
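
As an alternative to editing the YAML file, permissions can be appended directly with the `--add-permissions` flag. A sketch under the assumption that only a handful of Dataplex permissions are needed; the permission list and the helper name are illustrative, so copy the real list from the describe output:

```shell
# Hypothetical subset of permissions taken from roles/dataplex.admin.
DATAPLEX_PERMISSIONS="dataplex.lakes.create,dataplex.lakes.get,dataplex.zones.create,dataplex.assets.create"

# Append the permissions to an existing custom role.
# $1 = project ID, $2 = custom role ID
add_dataplex_permissions() {
  gcloud iam roles update "$2" \
    --project="$1" \
    --add-permissions="$DATAPLEX_PERMISSIONS"
}

# Usage: add_dataplex_permissions <PROJECT_ID> dataproc_sa_role
```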

5. Create an API key if you’re using the REST API (see How to create an API Key)

Lake

1. Create a Lake

gcloud dataplex lakes create <LAKE_ID> \
--location=us-west1 \
--labels='user=JohnSmith,appid=APP12345' \
--metastore-service=projects/<PROJECT_ID>/locations/<LOCATION_ID>/services/<METASTORE_ID>

1.1. Describe a Lake

gcloud dataplex lakes describe <LAKE_ID> \
--location us-west1 --format="get(name)"

2. Zone

Create a Zone

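2.1. Save the request body as zone.json

The assetStatus field is output-only and set by the service, so it can be omitted from the request.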

{
  "displayName": "customer_raw_data_set",
  "type": "RAW",
  "resourceSpec": {
    "locationType": "SINGLE_REGION"
  },
  "description": "First Zone under the lake",
  "assetStatus": {
    "activeAssets": 1,
    "securityPolicyApplyingAssets": 0
  },
  "discoverySpec": {
    "csvOptions": {
      "delimiter": ",",
      "headerRows": 1
    },
    "schedule": "0 1 * * *"
  },
  "labels": {
    "appid": "app12345",
    "user": "mahen"
  }
}

2.2. REST API to create a zone

<API_KEY> is required (see How to create an API Key).

curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-goog-api-key: <API_KEY>" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @zone.json \
"https://dataplex.googleapis.com/v1/projects/<PROJECT_ID>/locations/<LOCATION_ID>/lakes/<LAKE_ID>/zones?zoneId=input-data-zone"
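
If you prefer not to manage an API key, a zone can also be created with gcloud. The sketch below is an assumed equivalent of the REST request, not the method used in this article; the helper name is hypothetical, the schedule value is an example, and the flags should be checked against `gcloud dataplex zones create --help` for your version:

```shell
# Create a RAW, single-region zone with CSV discovery enabled.
# $1 = zone ID, $2 = lake ID, $3 = location
create_raw_zone() {
  gcloud dataplex zones create "$1" \
    --lake="$2" \
    --location="$3" \
    --type=RAW \
    --resource-location-type=SINGLE_REGION \
    --discovery-enabled \
    --discovery-schedule="0 1 * * *" \
    --csv-delimiter="," \
    --csv-header-rows=1
}

# Usage: create_raw_zone input-data-zone <LAKE_ID> us-west1
```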

2.3. Describe the zone created

gcloud dataplex zones describe <ZONE_ID> \
--lake <LAKE_ID> \
--location us-west1
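
Zone creation through the REST API is asynchronous, so it can help to wait until the zone is ready before adding assets. A rough polling sketch built on the describe command, assuming the zone reports a `state` field that becomes `ACTIVE`; the function name, the attempt count, and the `POLL_SECONDS` variable are hypothetical:

```shell
# Poll the zone until its state is ACTIVE or the attempts run out.
# $1 = zone ID, $2 = lake ID, $3 = location, $4 = max attempts (default 10)
wait_for_zone_active() {
  attempts="${4:-10}"
  while [ "$attempts" -gt 0 ]; do
    state="$(gcloud dataplex zones describe "$1" \
      --lake "$2" --location "$3" --format='value(state)')"
    if [ "$state" = "ACTIVE" ]; then
      echo "zone $1 is ACTIVE"
      return 0
    fi
    attempts=$((attempts - 1))
    # Wait between checks; POLL_SECONDS defaults to 5.
    sleep "${POLL_SECONDS:-5}"
  done
  echo "zone $1 did not become ACTIVE" >&2
  return 1
}

# Usage: wait_for_zone_active <ZONE_ID> <LAKE_ID> us-west1
```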

Asset

3. Create an Asset

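3.1. Save the request body as asset.json

The discoveryStatus, resourceStatus, and securityStatus fields are output-only and set by the service, so they can be omitted from the request.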

{
  "description": "First asset",
  "discoverySpec": {
    "csvOptions": {
      "delimiter": ",",
      "headerRows": 1
    },
    "enabled": true,
    "schedule": "16 * * * *"
  },
  "discoveryStatus": {
    "message": "discoveryStatus",
    "state": "SCHEDULED",
    "stats": {
      "dataItems": 0,
      "dataSize": 0,
      "filesets": 0,
      "tables": 0
    }
  },
  "displayName": "input_data_asset",
  "labels": {
    "appid": "app12345",
    "user": "mahendran"
  },
  "resourceSpec": {
    "name": "projects/{project-id}/buckets/{bucket-name}",
    "readAccessMode": "DIRECT",
    "type": "STORAGE_BUCKET"
  },
  "resourceStatus": {
    "message": "resourceStatus",
    "state": "READY"
  },
  "securityStatus": {
    "state": "READY",
    "message": "securityStatus"
  }
}

3.2. Create an Asset

<API_KEY> is required (see How to create an API Key).

curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-goog-api-key: <API_KEY>" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @asset.json \
"https://dataplex.googleapis.com/v1/projects/<PROJECT_ID>/locations/<LOCATION_ID>/lakes/<LAKE_ID>/zones/<ZONE_ID>/assets?assetId=<ASSET_ID>"
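
The same asset can likely be created with gcloud instead of REST. This is an assumed equivalent rather than the approach shown in this article; the helper name is hypothetical and the flags should be confirmed with `gcloud dataplex assets create --help`:

```shell
# Attach a Cloud Storage bucket to a zone as an asset, with hourly discovery.
# $1 = asset ID, $2 = zone ID, $3 = lake ID, $4 = location,
# $5 = project ID that owns the bucket, $6 = bucket name
create_bucket_asset() {
  gcloud dataplex assets create "$1" \
    --zone="$2" \
    --lake="$3" \
    --location="$4" \
    --resource-type=STORAGE_BUCKET \
    --resource-name="projects/$5/buckets/$6" \
    --discovery-enabled \
    --discovery-schedule="16 * * * *"
}

# Usage: create_bucket_asset <ASSET_ID> <ZONE_ID> <LAKE_ID> us-west1 <PROJECT_ID> <BUCKET_NAME>
```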

3.3. Describe the Asset

gcloud dataplex assets describe <ASSET_ID> \
--lake <LAKE_ID> \
--zone <ZONE_ID> \
--location us-west1

Delete Lake, Zone and Asset

Delete the Asset

gcloud dataplex assets delete <ASSET_ID> \
--lake <LAKE_ID> \
--zone <ZONE_ID> \
--location us-west1

Delete the Zone

gcloud dataplex zones delete <ZONE_ID> \
--lake <LAKE_ID> \
--location us-west1

Delete the Lake

gcloud dataplex lakes delete <LAKE_ID> --location us-west1
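
The order matters: the asset goes first, then the zone, then the lake. A small sketch that chains the three commands so a failure stops the teardown; the function name is hypothetical, and `--quiet` is added to skip the confirmation prompts:

```shell
# Delete an asset, its zone, and its lake, in that order.
# Stops at the first command that fails.
# $1 = asset ID, $2 = zone ID, $3 = lake ID, $4 = location
teardown_dataplex() {
  gcloud dataplex assets delete "$1" --zone "$2" --lake "$3" --location "$4" --quiet &&
  gcloud dataplex zones delete "$2" --lake "$3" --location "$4" --quiet &&
  gcloud dataplex lakes delete "$3" --location "$4" --quiet
}

# Usage: teardown_dataplex <ASSET_ID> <ZONE_ID> <LAKE_ID> us-west1
```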

Delete the Metastore

Dataproc Metastore is an expensive service, so it’s best to delete it when not in use.

gcloud metastore services delete <METASTORE_ID> \
--location=us-west1

Conclusion

Dataplex’s ability to manage and govern data without moving it across projects is a superpower-like advantage in modern data handling.

This capability not only streamlines operations but also enhances data utilization and security, contributing to more efficient and insightful data-driven decision-making processes within enterprises.

