Pandas Drop Duplicate Rows - drop_duplicates() Function
Pandas drop_duplicates() Function Syntax
The Pandas drop_duplicates() function removes duplicate rows from a DataFrame. Its syntax is:
drop_duplicates(self, subset=None, keep="first", inplace=False)
- subset: column label or sequence of labels to consider when identifying duplicate rows. By default, all columns are used.
- keep: allowed values are {'first', 'last', False}; the default is 'first'. If 'first', all duplicate rows except the first occurrence are deleted. If 'last', all duplicate rows except the last occurrence are deleted. If False, every duplicate row is deleted.
- inplace: if True, the source DataFrame is modified and None is returned. By default, the source DataFrame remains unchanged and a new DataFrame instance is returned.
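As a quick check on the defaults listed above, the following minimal sketch calls drop_duplicates() with and without explicit arguments and confirms that the source DataFrame is left untouched when inplace is False. The DataFrame df and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'A': [1, 1, 2], 'B': [5, 5, 6]})

# Calling with no arguments ...
implicit = df.drop_duplicates()
# ... should match spelling out the documented defaults
explicit = df.drop_duplicates(subset=None, keep='first', inplace=False)

print(implicit.equals(explicit))  # True
print(len(df), len(implicit))     # 3 2 (the source still has all 3 rows)
```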
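Pandas Drop Duplicate Rows Examples
Let's look at some examples of dropping duplicate rows from a DataFrame object.
1. Drop Duplicates and Keep the First Row
This is the default behavior when no arguments are passed.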
import pandas as pd
d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}
source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)
# keep first duplicate row
result_df = source_df.drop_duplicates()
print('Result DataFrame:\n', result_df)
Output:
Source DataFrame:
A B C
0 1 2 3
1 1 2 3
2 1 2 4
3 2 3 5
Result DataFrame:
A B C
0 1 2 3
2 1 2 4
3 2 3 5
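Rows 0 and 1 of the source DataFrame are duplicates. The first occurrence is kept and the remaining duplicate is dropped.
2. Drop Duplicates and Keep the Last Row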
result_df = source_df.drop_duplicates(keep='last')
print('Result DataFrame:\n', result_df)
Output:
Result DataFrame:
A B C
1 1 2 3
2 1 2 4
3 2 3 5
The row at index 0 is dropped from the output, and the last duplicate row, at index 1, is kept.
3. Delete All Duplicate Rows from the DataFrame
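Note that the result keeps the original index labels (1, 2 and 3 here). If consecutive labels are wanted, one option is to chain reset_index(drop=True), sketched below on the same sample data. This step is an addition for illustration and is not part of the original example.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# Keep the last duplicate; the surviving rows retain labels 1, 2, 3
result_df = source_df.drop_duplicates(keep='last')
print(result_df.index.tolist())  # [1, 2, 3]

# Renumber the rows from 0 and discard the old labels
renumbered_df = result_df.reset_index(drop=True)
print(renumbered_df.index.tolist())  # [0, 1, 2]
```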
result_df = source_df.drop_duplicates(keep=False)
print('Result DataFrame:\n', result_df)
Output:
Result DataFrame:
A B C
2 1 2 4
3 2 3 5
In the result DataFrame, both duplicate rows, 0 and 1, have been removed.
4. Identify Duplicate Rows Based on Specific Columns
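To see which rows keep=False is about to remove, one approach is the related DataFrame.duplicated() method, which returns a boolean mask. The sketch below uses the same sample data; the names mask, dropped_df and kept_df are chosen here for illustration.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# True for every row that has an identical copy elsewhere in the frame
mask = source_df.duplicated(keep=False)
print(mask.tolist())  # [True, True, False, False]

# Rows that keep=False would drop, and the rows that survive
dropped_df = source_df[mask]
kept_df = source_df[~mask]
print(kept_df.equals(source_df.drop_duplicates(keep=False)))  # True
```

Inspecting dropped_df before deleting anything can be a useful sanity check on larger data sets.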
import pandas as pd
d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}
source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)
result_df = source_df.drop_duplicates(subset=['A', 'B'])
print('Result DataFrame:\n', result_df)
Output:
Source DataFrame:
A B C
0 1 2 3
1 1 2 3
2 1 2 4
3 2 3 5
Result DataFrame:
A B C
0 1 2 3
3 2 3 5
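Only columns 'A' and 'B' are used to identify duplicate rows, so rows 0, 1 and 2 count as duplicates of each other. The first one is kept, and rows 1 and 2 are removed from the output.
5. Remove Duplicate Rows In Place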
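The subset and keep parameters can be combined to choose which row of each group survives. As a minimal sketch on the same sample data (the variable names last_df and single_df are illustrative), the calls below keep the last row per ('A', 'B') pair and then pass a single column label instead of a list.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# Same subset, but keep the last row of each ('A', 'B') group
last_df = source_df.drop_duplicates(subset=['A', 'B'], keep='last')
print(last_df.index.tolist())  # [2, 3]
print(last_df['C'].tolist())   # [4, 5]

# A single column label can be passed without wrapping it in a list
single_df = source_df.drop_duplicates(subset='A')
print(single_df.index.tolist())  # [0, 3]
```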
source_df.drop_duplicates(inplace=True)
print(source_df)
Output:
A B C
0 1 2 3
2 1 2 4
3 2 3 5
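Because inplace=True returns None, assigning the result back to the variable is an easy slip to make. The sketch below rebuilds the sample data and shows both the return value and the modified DataFrame; the name return_value is chosen for illustration.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# With inplace=True the method modifies source_df and returns None
return_value = source_df.drop_duplicates(inplace=True)
print(return_value)              # None
print(source_df.index.tolist())  # [0, 2, 3]

# So avoid writing: source_df = source_df.drop_duplicates(inplace=True)
# which would rebind source_df to None.
```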