What are partitions and buckets in Hive?

12 months ago

Benjamin Taylor

Partitioning and bucketing in Hive are two techniques used to enhance query performance and manage data.

Partitioning: Partitioning divides a table's data into groups based on the values of one or more columns, and Hive stores each partition in its own directory. When a query filters on the partition columns, Hive scans only the matching partitions instead of the whole table, which improves query performance. Partitions can be based on a single column or a combination of multiple columns. In Hive, you specify partition columns when creating a table using the PARTITIONED BY clause, and specify the partition values when loading data using the PARTITION keyword.
Bucketing: Bucketing divides a table's data into a fixed number of buckets by applying a hash function to a chosen column; each row goes to the bucket given by the hash value modulo the number of buckets. When the column's values hash uniformly, rows are spread roughly evenly across the buckets, which helps with sampling and with joins on the bucketed column, since such queries can work on individual buckets rather than the full table. In Hive, you specify the bucket column and the number of buckets using the CLUSTERED BY clause when creating a table, and then load data into the buckets using the INSERT OVERWRITE TABLE … CLUSTER BY … statement.
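
The two techniques can be combined on one table. The following HiveQL is a minimal sketch, not a definitive setup: the table names (page_views, raw_page_views), the column names, the partition values, and the bucket count of 8 are all hypothetical and chosen only for illustration. It also assumes a Hive version that enforces bucketing on insert automatically; older releases may need hive.enforce.bucketing set to true, or the manual CLUSTER BY approach described above.

```sql
-- Hypothetical table: partitioned by date and country, bucketed by user_id.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
-- Partition columns are declared here, not in the column list above.
PARTITIONED BY (view_date STRING, country STRING)
-- Each row is assigned to one of 8 buckets by hashing user_id.
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- Static partition load: the PARTITION keyword names the target partition.
-- raw_page_views is an assumed staging table with matching columns.
INSERT OVERWRITE TABLE page_views
PARTITION (view_date = '2024-01-15', country = 'US')
SELECT user_id, url, view_time
FROM raw_page_views
WHERE view_date = '2024-01-15' AND country = 'US';

-- Filtering on the partition columns lets Hive read only that partition.
SELECT COUNT(*)
FROM page_views
WHERE view_date = '2024-01-15' AND country = 'US';

-- Sampling reads a single bucket out of the eight.
SELECT *
FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 8 ON user_id);
```

Partition columns suit low-cardinality values that queries filter on often, such as a date, while bucket columns suit high-cardinality values such as an ID, where creating one directory per value would be impractical.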