Dataproc Metastore: Create A Fully Managed Hive Metastore on GCP
Dataproc Metastore is a fully managed, highly available, autohealing, serverless Apache Hive Metastore (HMS) that runs on Google Cloud.
It helps you manage your data lake metadata and provides interoperability between the various data processing engines and tools.
Dataplex: To associate a metastore with a Dataplex lake, you need a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher).
A metastore is required to use the Data Exploration Workbench. Once a Dataproc Metastore service instance is associated with your Dataplex lake, you can access Dataplex metadata through the Hive Metastore in Spark queries.
Learn how to set up Dataplex and create lakes, zones, and assets.
Dataproc Metastore services take about 10 minutes to create.
1. Prerequisites
- Enable the Dataproc Metastore API
- Grant one of the following roles:
roles/metastore.editor
or roles/metastore.admin
Refer to the Dataproc Metastore IAM roles documentation.
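The API can be enabled from the console, or with a single gcloud command (the project ID below is a placeholder):

```shell
# Enable the Dataproc Metastore API for your project.
# my-project-id is a placeholder; substitute your own project ID.
gcloud services enable metastore.googleapis.com --project=my-project-id
```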
1.1. Add the role to the service account
gcloud projects add-iam-policy-binding PROJECT_ID \
--member=PRINCIPAL \
--role=METASTORE_ROLE
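For example, granting roles/metastore.admin to a service account looks like this (the project and service account names are hypothetical):

```shell
# Grant roles/metastore.admin to a service account.
# Project ID and service account name are placeholders; replace with your own.
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:dataproc-sa@my-project-id.iam.gserviceaccount.com" \
  --role="roles/metastore.admin"
```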
1.2. Add permissions to the custom role. For instance, to add the roles/metastore.admin permissions to the custom role attached to a service account, first list them:
gcloud iam roles describe roles/metastore.admin
1.3. Update the custom role with these permissions
gcloud iam roles update <ROLE_ID> \
--project=<PROJECT_ID> \
--file=dataproc_sa_role.yaml
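A minimal sketch of what dataproc_sa_role.yaml might contain; the permission list below is an illustrative subset, not the complete roles/metastore.admin set:

```shell
# Write an illustrative custom-role definition to dataproc_sa_role.yaml.
# The includedPermissions are a sample subset; in practice, copy the full
# list printed by `gcloud iam roles describe roles/metastore.admin`.
cat > dataproc_sa_role.yaml <<'EOF'
title: Dataproc SA Custom Role
description: Custom role for the Dataproc service account
stage: GA
includedPermissions:
- metastore.services.get
- metastore.services.list
- metastore.databases.get
- metastore.tables.get
EOF
```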
2. Create a Dataproc Metastore service
2.1. Metastore with Thrift
gcloud metastore services create <METASTORE_ID> \
--location=us-west1 \
--port=9083 \
--endpoint-protocol=thrift \
--database-type=mysql \
--hive-metastore-version=3.1.2 \
--data-catalog-sync \
--release-channel=stable \
--labels "name=metastore_label" \
--hive-metastore-configs="hive.metastore.warehouse.dir=gs://{BUCKET_NAME}/{KEY}/hive-warehouse"
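A common next step is attaching the Thrift metastore to a Dataproc cluster so that Spark and Hive jobs resolve tables against it. The cluster, project, and service names below are placeholders:

```shell
# Create a Dataproc cluster that uses the managed metastore.
# Cluster, project, and metastore names are placeholders.
gcloud dataproc clusters create my-cluster \
  --region=us-west1 \
  --dataproc-metastore=projects/my-project-id/locations/us-west1/services/my-metastore
```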
2.2. Metastore with gRPC
gcloud metastore services create <METASTORE_ID> \
--location=us-west1 \
--endpoint-protocol=grpc \
--database-type=mysql \
--hive-metastore-version=3.1.2 \
--data-catalog-sync \
--release-channel=stable \
--labels "name=metastore_label" \
--hive-metastore-configs="hive.metastore.warehouse.dir=gs://{BUCKET_NAME}/{KEY}/hive-warehouse"
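The gRPC-enabled service is what a Dataplex lake expects. A sketch of associating it at lake creation time (lake, project, and service names are placeholders):

```shell
# Associate the gRPC metastore with a new Dataplex lake.
# Lake, project, and metastore names are placeholders.
gcloud dataplex lakes create my-lake \
  --location=us-west1 \
  --metastore-service=projects/my-project-id/locations/us-west1/services/my-metastore
```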
3. Describe the Metastore
gcloud metastore services describe <METASTORE_ID> \
--location us-west1
3.1. To get the endpointUri
gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(endpointUri)"
3.2. To get the name
gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(name)"
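The `--format` flag makes these values easy to capture in a script. For example, a Thrift endpoint can be stored in a variable and passed to a Spark job (the metastore name and job file are hypothetical):

```shell
# Capture the Thrift endpoint and reuse it as the Hive metastore URI
# for a Spark job. Metastore name and my_job.py are placeholders.
ENDPOINT=$(gcloud metastore services describe my-metastore \
  --location=us-west1 --format="get(endpointUri)")
spark-submit \
  --conf "spark.hadoop.hive.metastore.uris=${ENDPOINT}" \
  my_job.py
```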
Conclusion
In summary, Dataproc Metastore offers robust functionality through multiple endpoint protocols, Thrift and gRPC, ensuring seamless compatibility with popular data processing engines such as Spark, Hive, and Presto. While the Thrift endpoint is widely supported by these engines, a Dataproc Metastore with a gRPC endpoint is required to integrate with a Dataplex lake.