Pandas Drop Duplicate Rows - drop_duplicates() Function

Pandas drop_duplicates() Function Syntax

The Pandas drop_duplicates() function removes duplicate rows from a DataFrame. Its syntax is as follows:

drop_duplicates(self, subset=None, keep="first", inplace=False)
  • subset: a column label or sequence of labels used to identify duplicate rows. By default, all columns are used.
  • keep: allowed values are {'first', 'last', False}; the default is 'first'. With 'first', every duplicate row except the first occurrence is dropped. With 'last', every duplicate row except the last occurrence is dropped. With False, all duplicate rows are dropped.
  • inplace: if True, the source DataFrame is modified and None is returned. By default, the source DataFrame is left unchanged and a new DataFrame is returned.
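The effect of the keep parameter can be previewed with the related DataFrame.duplicated() method, which returns a boolean mask instead of dropping anything. The following is a minimal sketch, assuming the same sample data used in the examples of this article; the names df and masks are illustrative only.

```python
import pandas as pd

# Sample data assumed for illustration (same values as the examples in this article).
df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# duplicated() marks with True the rows that drop_duplicates() would drop
# when called with the same keep argument.
masks = {keep: df.duplicated(keep=keep).tolist() for keep in ('first', 'last', False)}

for keep, mask in masks.items():
    print(f'keep={keep!r}: {mask}')
```

Rows 0 and 1 are identical, so 'first' flags row 1, 'last' flags row 0, and False flags both.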

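Pandas Drop Duplicate Rows Examples

Let's look at some examples of dropping duplicate rows from a DataFrame object.

1. Drop duplicate rows and keep the first occurrence

This is the default behavior when no arguments are passed.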

import pandas as pd

d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}

source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)

# keep first duplicate row
result_df = source_df.drop_duplicates()
print('Result DataFrame:\n', result_df)

Output:

Source DataFrame:
    A  B  C
0  1  2  3
1  1  2  3
2  1  2  4
3  2  3  5
Result DataFrame:
    A  B  C
0  1  2  3
2  1  2  4
3  2  3  5

Rows 0 and 1 of the source DataFrame are duplicates. The first occurrence is kept and the remaining duplicate is dropped.
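Note that the result keeps the original index labels (0, 2, 3) rather than renumbering them. If a contiguous index is wanted, one option is to chain reset_index(drop=True); the sketch below also uses the ignore_index argument, on the assumption that pandas 1.0 or later is installed. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# Default behavior: surviving rows keep their original index labels.
kept_labels = df.drop_duplicates().index.tolist()

# Option 1: renumber afterwards with reset_index(drop=True).
renumbered = df.drop_duplicates().reset_index(drop=True)

# Option 2 (assumes pandas >= 1.0): let drop_duplicates() renumber directly.
renumbered_alt = df.drop_duplicates(ignore_index=True)

print(kept_labels)
print(renumbered)
```

Both options produce the same three rows with index labels 0, 1, and 2.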

2. Drop duplicates and keep the last row

result_df = source_df.drop_duplicates(keep='last')
print('Result DataFrame:\n', result_df)

Output:

Result DataFrame:
    A  B  C
1  1  2  3
2  1  2  4
3  2  3  5

The row at index 0 is dropped from the output, and the last duplicate row, at index 1, is kept.

3. Drop all duplicate rows from the DataFrame

result_df = source_df.drop_duplicates(keep=False)
print('Result DataFrame:\n', result_df)

Output:

Result DataFrame:
    A  B  C
2  1  2  4
3  2  3  5

In the result DataFrame, the duplicate rows 0 and 1 are both dropped.
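Because keep=False discards every copy of a duplicated row, it can be useful to inspect what is about to be removed. A minimal sketch, assuming the same sample data, uses DataFrame.duplicated(keep=False) as a boolean mask; the names dropped and kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# True for every row that has at least one identical copy elsewhere.
is_dup = df.duplicated(keep=False)

dropped = df[is_dup]   # rows that drop_duplicates(keep=False) removes
kept = df[~is_dup]     # rows that survive

print('Dropped rows:\n', dropped)
print('Kept rows:\n', kept)
```

Here kept holds the same rows as source_df.drop_duplicates(keep=False).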

4. Identify duplicate rows based on specific columns

import pandas as pd

d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}

source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)

result_df = source_df.drop_duplicates(subset=['A', 'B'])
print('Result DataFrame:\n', result_df)

Output:

Source DataFrame:
    A  B  C
0  1  2  3
1  1  2  3
2  1  2  4
3  2  3  5
Result DataFrame:
    A  B  C
0  1  2  3
3  2  3  5

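Columns 'A' and 'B' are used to identify duplicate rows, so rows 0, 1, and 2 count as duplicates of one another. Rows 1 and 2 are therefore dropped from the output.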
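The subset and keep parameters can also be combined. The sketch below, again assuming the same sample data, keeps the last row of each ('A', 'B') group instead of the first; the variable name last_per_group is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# Rows 0, 1 and 2 share the same ('A', 'B') values; keep='last' retains row 2.
last_per_group = df.drop_duplicates(subset=['A', 'B'], keep='last')

print(last_per_group)
```

The result contains rows 2 and 3, so the value of 'C' for the first group is 4 rather than 3.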

5. Drop duplicate rows in place

source_df.drop_duplicates(inplace=True)
print(source_df)

Output:

   A  B  C
0  1  2  3
2  1  2  4
3  2  3  5
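Since inplace=True makes the call return None, assigning its result back to the variable is a common mistake that replaces the DataFrame with None. A minimal sketch of the behavior, assuming the same sample data; the name returned is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# With inplace=True the DataFrame itself is modified and nothing is returned.
returned = df.drop_duplicates(inplace=True)

print(returned)   # None
print(len(df))    # 3 rows remain after the duplicate is dropped
```

To keep the original data untouched, omit inplace and assign the returned DataFrame to a new name instead.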
