Dataproc Metastore: Create A Fully Managed Hive Metastore on GCP

Mahendran
3 min read · Jan 8, 2024

Dataproc Metastore is a fully managed, highly available, autohealing, serverless Apache Hive metastore (HMS) that runs on Google Cloud.

It helps you manage your data lake metadata and provides interoperability between the various data processing engines and tools.

Dataplex: You need to have a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher) associated with the Dataplex lake.

A metastore is required for using the Data Exploration Workbench functionality. You can access Dataplex metadata using Hive Metastore in Spark queries by associating a Dataproc Metastore service instance with your Dataplex lake.

Learn how to set up Dataplex and create lakes, zones, and assets.

Dataproc Metastore services take about 10 minutes to create.

1. Prerequisites

  1. Enable the Dataproc Metastore API (see the command below).
  2. Grant roles/metastore.editor or roles/metastore.admin to the principal. Refer to Dataproc Metastore IAM roles.
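
To enable the API from the command line, a minimal gcloud example:

# Enable the Dataproc Metastore API for the current project
gcloud services enable metastore.googleapis.com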

1.1 Add the role to the service account

gcloud projects add-iam-policy-binding PROJECT_ID \
--member=PRINCIPAL \
--role=METASTORE_ROLE
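
For instance, with a hypothetical project my-project and service account dataproc-sa, granting roles/metastore.admin looks like this:

# Grant the Metastore admin role to a service account (illustrative names)
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com" \
--role="roles/metastore.admin"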

1.2 Add permissions to the custom role. For instance, add the roles/metastore.admin permissions to the custom role attached to a service account.

gcloud iam roles describe roles/metastore.admin
Permissions under roles/metastore.admin

1.3 Update the custom role with these permissions

gcloud iam roles update <ROLE_ID> \
--project=<PROJECT_ID> \
--file=dataproc_sa_role.yaml
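
Here, dataproc_sa_role.yaml follows the standard gcloud role-definition format. A minimal sketch, assuming an illustrative subset of the permissions listed by the describe command above:

# dataproc_sa_role.yaml (illustrative subset; verify with gcloud iam roles describe)
title: Dataproc SA Role
description: Custom role for the Dataproc service account
stage: GA
includedPermissions:
- metastore.services.create
- metastore.services.get
- metastore.services.list
- metastore.services.update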

2. Create a Dataproc Metastore service

2.1 Metastore with Thrift

gcloud metastore services create <METASTORE_ID> \
--location=us-west1 \
--port=9083 \
--endpoint-protocol=thrift \
--database-type=mysql \
--hive-metastore-version=3.1.2 \
--data-catalog-sync \
--release-channel=stable \
--labels "name=metastore_label" \
--hive-metastore-configs="hive.metastore.warehouse.dir=gs://{BUCKET_NAME}/{KEY}/hive-warehouse"
Metastore Configuration with Thrift endpoint

2.2 Metastore with gRPC

gcloud metastore services create <METASTORE_ID> \
--location=us-west1 \
--endpoint-protocol=grpc \
--database-type=mysql \
--hive-metastore-version=3.1.2 \
--data-catalog-sync \
--release-channel=stable \
--labels "name=metastore_label" \
--hive-metastore-configs="hive.metastore.warehouse.dir=gs://{BUCKET_NAME}/{KEY}/hive-warehouse"
Metastore Configuration with gRPC endpoint
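
Either service can then be attached to a Dataproc cluster at creation time; a minimal sketch, assuming a hypothetical cluster name in the same region:

# Attach the metastore to a new Dataproc cluster (illustrative cluster name)
gcloud dataproc clusters create my-cluster \
--region=us-west1 \
--dataproc-metastore=projects/<PROJECT_ID>/locations/us-west1/services/<METASTORE_ID>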

3. Describe the Metastore

gcloud metastore services describe <METASTORE_ID> \
--location us-west1

3.1 To get the endpointUri

gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(endpointUri)"
Endpoint URI
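
For a Thrift service, this URI is what clients set as hive.metastore.uris; a minimal sketch, assuming a placeholder endpoint host, for pointing spark-sql at the metastore:

# Point a Spark SQL session at the Thrift endpoint (placeholder host)
spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://<ENDPOINT_HOST>:9083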

3.2 To get the name

gcloud metastore services describe <METASTORE_ID> \
--location us-west1 --format="get(name)"
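
This full resource name is what you pass when associating the service with a Dataplex lake; a sketch, assuming a hypothetical lake ID and the gRPC-enabled service created above:

# Associate the metastore with a Dataplex lake (illustrative lake ID)
gcloud dataplex lakes create my-lake \
--location=us-west1 \
--metastore-service=projects/<PROJECT_ID>/locations/us-west1/services/<METASTORE_ID>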

Conclusion

In summary, Dataproc Metastore supports both Thrift and gRPC endpoints, ensuring compatibility with popular data processing engines such as Spark, Hive, and Presto. While a Thrift endpoint works for these engines, a gRPC endpoint (Hive Metastore version 3.1.2 or higher) is required if you want to associate the service with a Dataplex lake.

References

  1. Hive Metastore
  2. Dataproc Metastore IAM roles
  3. Data Catalog sync
  4. How to search with Data Catalog
