Data Governance and Data Management in GCP: Dataplex Lakes, Zones, Assets
What is Dataplex?
Dataplex is a data fabric that unifies distributed data and automates data management and governance for that data.
Why do you use Dataplex?
Enterprises have data that is distributed across data lakes, data warehouses, and data marts. Using Dataplex, you can do the following:
- Discover data
- Curate data
- Unify data without any data movement
- Organize data based on your business needs
- Centrally manage, monitor, and govern data
Glossary:
Lake: A lake represents a data domain or business unit, for example Customer, Product, Retail, Sales, or Finance.
Zone: A zone is a subdomain within a lake. Its type indicates whether the data is raw or curated.
Asset: An asset maps data stored in Cloud Storage or BigQuery into a zone. Data stored in separate Google Cloud projects can be mapped as assets into a single zone within a lake.
Entity: Represents metadata for structured and semi-structured data (table) and unstructured data (fileset).
Prerequisites
1. Enable the Dataplex API
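This step can also be done from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The helper name enable_dataplex_apis is illustrative, and enabling metastore.googleapis.com at the same time is an assumption based on step 3 needing Dataproc Metastore.

```shell
# Enable the APIs this walkthrough relies on.
# enable_dataplex_apis is an illustrative helper name, not part of gcloud.
enable_dataplex_apis() {
  # $1: project ID
  gcloud services enable \
    dataplex.googleapis.com \
    metastore.googleapis.com \
    --project "$1"
}
```

Call it as enable_dataplex_apis <PROJECT_ID>.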
2. Create a bucket
gcloud storage buckets create gs://<BUCKET_NAME>/ \
--project <PROJECT_ID> \
--default-storage-class STANDARD
3. Create a Dataproc Metastore
The gRPC endpoint must be enabled on the Metastore instance.
Get the name of the Metastore
gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(name)"
4. Add Permissions
You can either add the permissions to a custom role, if you have created one, or grant a role to the service account.
4.1 Add a role to the service account
gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member='serviceAccount:<SERVICE-ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com' \
--role='roles/dataplex.admin'
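4.2 Add permissions to the custom role
List the permissions included in roles/dataplex.admin:
gcloud iam roles describe roles/dataplex.admin
4.3 Add these permissions to the custom role
Copy the permissions into the role definition file, then update the custom role. In this example the ROLE_ID is dataproc_sa_role.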
gcloud iam roles update \
--project=<PROJECT_ID> <ROLE_ID> \
--file=dataproc_sa_role.yaml
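The file passed to --file is a YAML role definition. The sketch below writes one from a list of permission names read on standard input, one per line. The helper name write_role_yaml, the description text, and the GA stage are illustrative assumptions, not values from the original setup.

```shell
# Write a custom-role definition for `gcloud iam roles update --file`.
# Reads one permission per line on stdin; blank lines are skipped.
# write_role_yaml is an illustrative helper name.
write_role_yaml() {
  # $1: output file, $2: role title
  {
    printf 'title: %s\n' "$2"
    printf 'description: Dataplex permissions for the service account\n'
    printf 'stage: GA\n'
    printf 'includedPermissions:\n'
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r perm || [ -n "$perm" ]; do
      if [ -n "$perm" ]; then
        printf '%s %s\n' '-' "$perm"
      fi
    done
  } > "$1"
}
```

Feed it the permission names copied from the describe output in 4.2, for example write_role_yaml dataproc_sa_role.yaml "Dataproc SA role" < permissions.txt, where permissions.txt is a hypothetical file holding one permission per line.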
Lake
1. Create a Lake
gcloud dataplex lakes create <LAKE_ID> \
--location=us-west1 \
--labels='user=johnsmith,appid=app12345' \
--metastore-service=projects/<PROJECT_ID>/locations/<LOCATION_ID>/services/<METASTORE_ID>
1.1. Describe a Lake
gcloud dataplex lakes describe <LAKE_ID> \
--location us-west1 --format="get(name)"
Zone
2. Create a Zone
2.1. Save the request body as zone.json
{
"displayName": "customer_raw_data_set",
"type": "RAW",
"resourceSpec": {
"locationType": "SINGLE_REGION"
},
"description": "First zone under the lake",
"assetStatus": {
"activeAssets": 1,
"securityPolicyApplyingAssets": 0
},
"discoverySpec": {
"csvOptions": {
"delimiter": ",",
"headerRows": 1
},
"schedule": "0 1 * * *"
},
"labels": {
"appid": "app12345",
"user": "mahen"
}
}
2.2. Call the REST API to create the zone
An <API_KEY> is required (see "How to create an API Key").
curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-goog-api-key: <API_KEY>" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @zone.json \
"https://dataplex.googleapis.com/v1/projects/<PROJECT_ID>/locations/<LOCATION_ID>/lakes/<LAKE_ID>/zones?zoneId=input-data-zone"
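The zone ID is passed as the zoneId query parameter, so a malformed ID only surfaces as an API error. The sketch below checks the ID locally and builds the request URL. Both function names are illustrative. The ID rule used here (lowercase letters, digits, and hyphens; starts with a letter; ends with a letter or digit; at most 63 characters) is an assumption taken from the Dataplex API reference and should be verified there.

```shell
# Return success if $1 looks like a valid Dataplex resource ID.
# The rule is an assumption; check it against the API reference.
is_valid_dataplex_id() {
  [ "${#1}" -le 63 ] || return 1
  printf '%s\n' "$1" | grep -Eq '^[a-z]([a-z0-9-]*[a-z0-9])?$'
}

# Print the REST URL for creating a zone.
zone_create_url() {
  # $1: project, $2: location, $3: lake ID, $4: new zone ID
  is_valid_dataplex_id "$4" || { echo "invalid zone ID: $4" >&2; return 1; }
  printf 'https://dataplex.googleapis.com/v1/projects/%s/locations/%s/lakes/%s/zones?zoneId=%s\n' \
    "$1" "$2" "$3" "$4"
}
```

The output of zone_create_url can be passed to curl as the final argument in place of the hand-written URL.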
2.3. Describe the zone created
gcloud dataplex zones describe <ZONE_ID> \
--lake <LAKE_ID> \
--location us-west1
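The REST create call returns a long-running operation, so the zone may still be in the CREATING state when the curl command returns. The following sketch polls the describe command until the state is ACTIVE. The helper names wait_until_active and zone_state are illustrative, and the attempt count and delay are arbitrary choices.

```shell
# Poll a command until it prints ACTIVE; give up after a number of attempts.
# wait_until_active is an illustrative helper, not a gcloud feature.
wait_until_active() {
  # $1: max attempts, $2: seconds between attempts, rest: command printing the state
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    state="$("$@")" || state="UNKNOWN"
    [ "$state" = "ACTIVE" ] && return 0
    echo "attempt $i: state=$state" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Print only the state field of a zone.
zone_state() {
  # $1: zone ID, $2: lake ID, $3: location
  gcloud dataplex zones describe "$1" --lake "$2" --location "$3" \
    --format="value(state)"
}
```

For example, wait_until_active 30 10 zone_state <ZONE_ID> <LAKE_ID> us-west1 checks every 10 seconds for up to 30 attempts.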
Asset
3. Create an Asset
3.1. Save the request body as asset.json
{
"description": "First asset",
"discoverySpec": {
"csvOptions": {
"delimiter": ",",
"headerRows": 1
},
"enabled": true,
"schedule": "16 * * * *"
},
"discoveryStatus": {
"message": "discoveryStatus",
"state": "SCHEDULED",
"stats": {
"dataItems": 0,
"dataSize": 0,
"filesets": 0,
"tables": 0
}
},
"displayName": "input_data_asset",
"labels": {
"appid": "app12345",
"user": "mahendran"
},
"resourceSpec": {
"name": "projects/<PROJECT_ID>/buckets/<BUCKET_NAME>",
"readAccessMode": "DIRECT",
"type": "STORAGE_BUCKET"
},
"resourceStatus": {
"message": "resourceStatus",
"state": "READY"
},
"securityStatus": {
"state": "READY",
"message": "securityStatus"
}
}
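The resourceSpec.name value depends on whether the asset points at a Cloud Storage bucket or a BigQuery dataset. A small sketch that builds the name for either type follows. The function name asset_resource_name is illustrative, and the BigQuery form (projects/.../datasets/...) is taken from the Dataplex API reference rather than from the request body above.

```shell
# Print the resourceSpec.name value for an asset.
# asset_resource_name is an illustrative helper name.
asset_resource_name() {
  # $1: STORAGE_BUCKET or BIGQUERY_DATASET, $2: project, $3: bucket or dataset
  case "$1" in
    STORAGE_BUCKET)   printf 'projects/%s/buckets/%s\n' "$2" "$3" ;;
    BIGQUERY_DATASET) printf 'projects/%s/datasets/%s\n' "$2" "$3" ;;
    *) echo "unsupported asset type: $1" >&2; return 1 ;;
  esac
}
```

The "type" field in the request body must match the form of the name, so a BigQuery asset uses "type": "BIGQUERY_DATASET" together with the datasets form.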
3.2. Call the REST API to create the asset
An <API_KEY> is required (see "How to create an API Key").
curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-goog-api-key: <API_KEY>" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d @asset.json \
"https://dataplex.googleapis.com/v1/projects/<PROJECT_ID>/locations/<LOCATION_ID>/lakes/<LAKE_ID>/zones/<ZONE_ID>/assets?assetId=<ASSET_ID>"
3.3. Describe the Asset
gcloud dataplex assets describe <ASSET_ID> \
--lake <LAKE_ID> \
--zone <ZONE_ID> \
--location us-west1
Delete Lake, Zone and Asset
Delete the Asset
gcloud dataplex assets delete <ASSET_ID> \
--lake <LAKE_ID> \
--zone <ZONE_ID> \
--location us-west1
Delete the Zone
gcloud dataplex zones delete <ZONE_ID> \
--lake <LAKE_ID> \
--location us-west1
Delete the Lake
gcloud dataplex lakes delete <LAKE_ID> --location us-west1
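The three deletions above run from the innermost resource outward: asset, then zone, then lake. The sketch below chains them so that a failure stops the sequence before a parent resource is touched. The function name teardown_lake is illustrative, and --quiet is added on the assumption that the confirmation prompts should be skipped.

```shell
# Delete an asset, its zone, and its lake in dependency order.
# Stops at the first failure so a parent is never deleted before its child.
# teardown_lake is an illustrative helper name.
teardown_lake() {
  # $1: asset ID, $2: zone ID, $3: lake ID, $4: location
  gcloud dataplex assets delete "$1" --zone "$2" --lake "$3" --location "$4" --quiet &&
  gcloud dataplex zones delete "$2" --lake "$3" --location "$4" --quiet &&
  gcloud dataplex lakes delete "$3" --location "$4" --quiet
}
```

Call it as teardown_lake <ASSET_ID> <ZONE_ID> <LAKE_ID> us-west1.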
Delete the Metastore
Dataproc Metastore is an expensive service, so delete the instance when it is not in use.
gcloud metastore services delete <METASTORE_ID> \
--location=us-west1
Conclusion
Dataplex's ability to manage and govern data without moving it across projects is a significant advantage in modern data handling.
It streamlines operations and supports better data utilization and security, which helps enterprises make more efficient, better-informed decisions.