How to perform data aggregation operations in Pig?

1 year ago

Sophia Anderson

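In Pig, data aggregation is typically done in two steps: the GROUP operator (with a BY clause) collects records that share a key, and a FOREACH ... GENERATE statement then applies an aggregate function to each group. Here is a simple example: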

Suppose we have a dataset containing names and ages, and we want to group the data by name and calculate the average age for each name.

-- Load the dataset
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- Group by name and compute the average age
grouped_data = GROUP data BY name;
result = FOREACH grouped_data GENERATE group AS name, AVG(data.age) AS avg_age;

-- Output the result
DUMP result;

The script first loads the dataset with LOAD, then groups it by name with GROUP. Each tuple of grouped_data has two fields: group, which holds the name, and data, a bag containing all the original tuples that share that name. The FOREACH ... GENERATE statement then applies AVG to the age values in each bag and stores the results in a new relation, result. Finally, DUMP prints the results to the console.
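
To make the intermediate structure more concrete, here is a minimal sketch that inspects the grouped relation with DESCRIBE before aggregating. The three sample rows in the comments are hypothetical and only serve as an illustration; they are not part of the original example.

```pig
-- Assumed contents of input.txt (hypothetical sample rows):
--   alice,30
--   bob,25
--   alice,40
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
grouped_data = GROUP data BY name;

-- Print the schema of the grouped relation:
-- a 'group' field (the name) plus a bag named 'data'
DESCRIBE grouped_data;

-- With the sample rows above, grouped_data would hold tuples shaped like:
--   (alice,{(alice,30),(alice,40)})
--   (bob,{(bob,25)})
-- The order of tuples inside a bag is not guaranteed.

result = FOREACH grouped_data GENERATE group AS name, AVG(data.age) AS avg_age;
DUMP result;
-- For this sample, alice averages to 35.0 and bob to 25.0
```

Running DESCRIBE while developing a script is a cheap way to confirm field names such as group and data before writing the FOREACH step.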

In addition to AVG, Pig provides other built-in aggregate functions such as SUM, MIN, MAX, and COUNT. They are applied to a grouped relation in the same way, so you can choose whichever function fits the aggregation you need.
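
As a rough sketch of how these functions combine, the script below computes several aggregates per name in a single FOREACH, and then uses GROUP ... ALL to aggregate over the whole relation. It reuses the input.txt layout from the example above; the relation and field names (stats, num_records, overall, and so on) are chosen here for illustration only.

```pig
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- Several aggregates per name in one pass over the grouped data
grouped_data = GROUP data BY name;
stats = FOREACH grouped_data GENERATE
    group AS name,
    COUNT(data) AS num_records,   -- number of tuples in the group's bag
    SUM(data.age) AS total_age,
    MIN(data.age) AS min_age,
    MAX(data.age) AS max_age;
DUMP stats;

-- Aggregate over the entire relation by placing every tuple in one group
all_rows = GROUP data ALL;
overall = FOREACH all_rows GENERATE AVG(data.age) AS overall_avg_age;
DUMP overall;
```

GROUP ... ALL is the usual way to get a single summary row, since aggregate functions in Pig operate on bags rather than directly on a relation.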