Fill Missing Values in R Using the fill() Function from tidyr

1 year ago

科, 雅


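Missing data, or missing values, are records that are absent from a variable. If they are not handled properly, they cause serious problems during data modeling. Above all, most algorithms do not cope well with missing data.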

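There are many ways to handle missing data in R. You can delete the affected records, but keep in mind that doing so discards information and may cost you some modeling power. Alternatively, you can fill the gaps with the mean or the median of the data. In this article, we will use the tidyr package to fill missing values in R.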
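As a quick illustration of the alternatives mentioned above (deleting records, or filling with the mean or median), here is a minimal base R sketch. The `scores` vector is made up for this example and is not part of the article's data.

```r
# Hypothetical scores with gaps (illustrative values only)
scores <- c(86, NA, 88, NA, 86)

# Option 1: drop the missing records (their information is lost)
complete <- scores[!is.na(scores)]

# Option 2: replace each NA with the mean or the median of the observed values
mean_filled   <- ifelse(is.na(scores), mean(scores, na.rm = TRUE), scores)
median_filled <- ifelse(is.na(scores), median(scores, na.rm = TRUE), scores)

print(complete)
print(mean_filled)
print(median_filled)
```

Note that `na.rm = TRUE` is needed here; without it, `mean()` and `median()` return NA whenever the input contains a missing value.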

tidyr is an R package that provides many functions to help you tidy your data. The better the data quality, the better the model!

1. Missing data in R

Missing values can be denoted in several forms, such as NA and NaN.
A missing value is a missing record in a variable. It can be a single value or an entire row.
Missing values can occur in both numerical and categorical data.
R offers many methods to deal with missing data.
The tidyr package helps fill missing data using a top-down or a bottom-up approach.
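The points above can be checked directly in base R. The following is a small sketch with made-up vectors `x` and `y`; it shows how `is.na()` and `is.nan()` treat the two markers.

```r
# Made-up vectors for illustration
x <- c(1, NA, NaN, 4)      # numeric data with NA and NaN
y <- c("a", NA, "c")       # categorical (character) data with NA

na_flags  <- is.na(x)      # TRUE for both NA and NaN
nan_flags <- is.nan(x)     # TRUE only for NaN
chr_flags <- is.na(y)      # NA also occurs in character data

print(na_flags)
print(nan_flags)
print(chr_flags)
print(sum(na_flags))       # number of missing entries in x
```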

2. The tidyr package in R

The tidyr package is used to clean raw data in R.
It offers functions for cleaning, organizing, filling missing values and more.
We will be using tidyr with the %>% pipe operator.

To install the tidyr package in R, run the following code.

#Install tidyr package

install.packages('tidyr')


#Load the library

library(tidyr)

package ‘tidyr’ successfully unpacked and MD5 sums checked

When the installation succeeds, R prints a confirmation message like the one above.

3. Create a data frame

We need a simple sample data frame that contains missing values. It will let us use tidyr's fill function to fill in the missing data.

#Create a dataframe

a <- c('A','B','C','D','E','F','G','H','I','J')
b <- c('Roger','Carlo','Durn','Jessy','Mounica','Rack','Rony','Saly','Kelly','Joseph')
c <- c(86,NA,NA,NA,88,NA,NA,86,NA,NA)

df <- data.frame(a,b,c)
df

   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA

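We now have a data frame, but it has many missing values. In cases like this, where a large share of the data is missing, you can use the fill function to replace each missing entry with a neighboring value.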
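Before filling anything, it can help to see how many values are missing and where they are. This is a small base R sketch that rebuilds the same data frame and inspects it; the names `na_counts` and `na_rows` are introduced here for illustration only.

```r
# Rebuild the sample data frame from this section
a <- c('A','B','C','D','E','F','G','H','I','J')
b <- c('Roger','Carlo','Durn','Jessy','Mounica','Rack','Rony','Saly','Kelly','Joseph')
c <- c(86,NA,NA,NA,88,NA,NA,86,NA,NA)
df <- data.frame(a, b, c)

# Number of missing values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# Row positions of the missing values in column c
na_rows <- which(is.na(df$c))
print(na_rows)
```

Here only column c has gaps: 7 of its 10 entries are NA.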

4. Two different approaches

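As mentioned earlier, you can fill in the missing data. The fill function offers two approaches, chosen by the direction of filling:

Up: each missing value is filled with the next valid value below it, so values travel from the bottom up. Set .direction = 'up'.
Down: each missing value is filled with the last valid value above it, so values travel from the top down. Set .direction = 'down'.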

Not clear yet?

Don't worry. We will walk through some examples that illustrate this, and you will see how it works.

5. Filling missing values: 'up'

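In this example, we have a data frame with 3 columns and 10 records. Before you use the fill function to handle missing data, there are a few things you need to be sure of.

Sometimes, when data is collected, a value is entered only once to stand for several records because they all share it.
For example: when collecting ages, if 10 people are all 25 years old, you might record 25 only for the last person to indicate that all 10 are 25.
Note that you will not run into this situation often. The point is that when you do, you can use the fill function to handle it.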
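The age example above can be sketched in a few lines. The `ages` data frame is hypothetical and only mirrors the scenario described: ten people, with 25 recorded for the last person alone.

```r
library(tidyr)

# Hypothetical data: the age 25 is written only on the last of 10 records
ages <- data.frame(person = 1:10, age = c(rep(NA, 9), 25))

# Fill upward so the single recorded value spreads to the rows above it
ages_filled <- fill(ages, age, .direction = "up")
print(ages_filled)
```

This only makes sense because we know the blank entries were meant to repeat the recorded value. If the blanks were truly unknown ages, filling them this way would invent data.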

#Dataframe

   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA


#Create a new data frame by filling missing values (up)
df1 <- df %>% fill(c, .direction = 'up')
df1

   a       b  c
1  A   Roger 86
2  B   Carlo 88
3  C    Durn 88
4  D   Jessy 88
5  E Mounica 88
6  F    Rack 86
7  G    Rony 86
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA

You can see that the fill function filled the missing values from the bottom up.

Notice that the last two rows are still NA. With the direction set to 'up', each NA takes the next valid value below it. Rows 9 and 10 have no valid value below them, so there is nothing to fill them with.
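If you want the trailing NA values filled as well, one option that goes slightly beyond this walkthrough is the 'updown' direction of fill, which fills upward first and then downward. A minimal sketch, assuming the same values as column c (the data frame name `scores` is made up here):

```r
library(tidyr)

# Same values as column c of the sample data frame
scores <- data.frame(c = c(86, NA, NA, NA, 88, NA, NA, 86, NA, NA))

# 'updown' fills bottom-up first, then fills what is left top-down
filled <- fill(scores, c, .direction = "updown")
print(filled$c)
```

Rows 1 to 8 get the same values as with 'up', and rows 9 and 10 then receive 86 from row 8.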

6. Filling missing values: 'down'

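Here we will use the 'down' approach to fill the missing values in the data. Always make sure you understand the assumptions mentioned in the previous section, so that you know what you are doing and what result to expect.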

#Data


   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA


#Create a new data frame by filling missing values (down, the top-down approach)

df1 <- df %>% fill(c, .direction = 'down')
df1

   a       b  c
1  A   Roger 86
2  B   Carlo 86
3  C    Durn 86
4  D   Jessy 86
5  E Mounica 88
6  F    Rack 88
7  G    Rony 88
8  H    Saly 86
9  I   Kelly 86
10 J  Joseph 86

Here, no missing values remain. The first row holds a valid value, 86, and the fill function carries each valid value down into the NA entries that follow it until it reaches the next valid value: 86 fills rows 2 to 4, 88 fills rows 6 and 7, and 86 fills rows 9 and 10.
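To make the logic of both directions concrete, here is a rough base R sketch of the same idea. The helpers `fill_down` and `fill_up` are hypothetical names written for this illustration; they are not how tidyr implements fill.

```r
# Hypothetical helper: carry the last valid value forward into the NAs after it
fill_down <- function(x) {
  last_seen <- NA
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      x[i] <- last_seen      # stays NA if no valid value has been seen yet
    } else {
      last_seen <- x[i]      # remember the most recent valid value
    }
  }
  x
}

# Filling up is the same walk performed on the reversed vector
fill_up <- function(x) rev(fill_down(rev(x)))

scores <- c(86, NA, NA, NA, 88, NA, NA, 86, NA, NA)
down_result <- fill_down(scores)
up_result   <- fill_up(scores)
print(down_result)
print(up_result)
```

The two results match the 'down' and 'up' outputs shown in sections 6 and 5, including the two NA values left at the end when filling up.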

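7. Summary

Filling missing values is an important step when you analyze any data that has gaps. It may seem a little tricky at first, but read through this article once or twice and it will become clear. It is not hard to understand!

I hope this method helps you in your future assignments. That's all for now. Happy R! 🙂

More reading: fill function in R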