What are partitions and buckets in Hive?
Partitioning and bucketing in Hive are two techniques used to enhance query performance and manage data.
- Partitioning: Partitioning is a technique for grouping and storing the rows of a table according to the values of one or more columns. Each distinct combination of partition values is stored separately, so a query that filters on the partition columns only needs to scan the matching partitions, which improves query performance. Partitions can be based on a single column or a combination of multiple columns. In Hive, you specify partition columns when creating a table using the PARTITIONED BY clause, and specify the partition values when loading data using the PARTITION keyword.
- Bucketing: Bucketing is a technique for grouping the rows of a table based on the result of a hash function applied to a chosen column. Each row is assigned to one of a fixed number of buckets, so the data is spread fairly evenly across buckets when the column values are well distributed. This makes sampling cheaper and can speed up joins on the bucketed column, because only the relevant buckets need to be read. In Hive, you specify the bucket column and the number of buckets using the CLUSTERED BY … INTO … BUCKETS clause when creating a table, and then load data into the buckets using the INSERT OVERWRITE TABLE … CLUSTER BY … statement.
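
To make the two ideas concrete, here is a minimal Python sketch of how a row could be mapped to a partition directory and to a bucket. It is an illustration under stated assumptions, not Hive's actual code: the table path, the column names and the helper functions are hypothetical, the partition layout assumes the usual key=value subdirectory convention, and the bucket calculation assumes an integer column that hashes to its own value, with the bucket taken as the non-negative hash modulo the bucket count. Hive versions may differ in the hash function they use.

```python
def partition_path(table_dir, partition_values):
    """Build the subdirectory path for one partition (hypothetical helper).

    Assumes the key=value directory convention, one level per partition column,
    in the order the columns are given.
    """
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_for_int(value, num_buckets):
    """Pick a bucket index for an integer column value (hypothetical helper).

    Assumption: the integer hashes to itself, and the bucket is the
    non-negative hash modulo the number of buckets.
    """
    non_negative_hash = value & 0x7FFFFFFF  # clear the sign bit of a 32-bit hash
    return non_negative_hash % num_buckets


if __name__ == "__main__":
    # A query filtering on year and month only needs to read this directory.
    print(partition_path("/warehouse/sales", {"year": 2024, "month": 1}))

    # With 4 buckets, user ids are spread across bucket indexes 0..3.
    for user_id in (7, 8, 9, 10):
        print(user_id, "->", bucket_for_int(user_id, 4))
```

The point of the sketch is the difference in how the two techniques divide the data: a partition is chosen by the literal column values, so the number of partitions grows with the number of distinct values, while a bucket is chosen by a hash, so the number of buckets stays fixed at whatever was declared for the table.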