How to draw samples with the sample() function in R

2 years ago

宇, 华

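Let's look at sample(), one of the most frequently used functions in R. Sampling is among the most common operations an analyst performs. To study and understand data, working with a sample is sometimes the most practical approach, especially when the data set is large.

R provides the standard sample() function for drawing samples from a data set. Many business and data analysis problems call for sampling, and sample() can draw random values either with or without replacement, as the examples below show.

Let's get started!

Syntax of the sample() function in R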

sample(x, size, replace = FALSE, prob = NULL)

x – a vector of elements to choose from, or a single positive integer n (treated as 1:n).
size – the number of items to draw.
replace – whether to sample with replacement (default FALSE).
prob – an optional vector of probability weights, one per element of x.
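
As a small illustration of the arguments listed above, the sketch below passes x first as a single integer and then as a character vector. The variable names perm and draws are made up for this example.

```r
# x given as a single positive integer n is treated as the range 1:n
perm <- sample(5)          # a random permutation of 1, 2, 3, 4, 5
sort(perm)                 # always 1 2 3 4 5, whatever order was drawn

# arguments can be passed by name, which makes the call easier to read
draws <- sample(x = c("a", "b", "c"), size = 2, replace = FALSE)
length(draws)              # 2
```

One thing to watch for: because a single number is expanded to 1:n, sample(10) draws from 1 to 10 rather than returning 10.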

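Sampling with replacement

You may be wondering what "sampling with replacement" means.

When you draw a sample from a vector or data set with replace=TRUE (or T), the function allows the same value to be drawn more than once.

The example below makes this clear.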

# with only a range, sample() returns a random permutation of 1 to 5
x <- sample(1:5)
# print the sample
x
# Output (varies between runs): 3 2 1 5 4


# the range is 1 to 5 and the number of values drawn is 3
x <- sample(1:5, 3)
# print the sample (3 values)
x
# Output: 2 4 5


# the range is 1 to 5 but 6 values are requested
x <- sample(1:5, 6)
# this fails because the range holds only 5 numbers (1:5)
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'

# replace = TRUE (or T) allows repeated values, so 6 draws from 1 to 5 succeed.
# Here 2 and 4 are repeated.
x <- sample(1:5, 6, replace = TRUE)
x
# Output: 2 4 2 2 4 3
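
To check for repeats without reading the output by eye, something like the following sketch may help. The seed value and the name with_repl are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is repeatable

# 6 draws from 5 values: at least one value must appear twice
with_repl <- sample(1:5, 6, replace = TRUE)
anyDuplicated(with_repl) > 0   # TRUE

# table() counts how often each value was drawn
table(with_repl)
```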

Sampling without replacement in R

Here we sample without replacement. With replace=F, which is also the default, no value can be drawn more than once.

# sample without replacement
x <- sample(1:8, 7, replace = FALSE)
x
# Output: 4 1 6 5 3 2 7

# asking for 9 values from a population of 8 fails
x <- sample(1:8, 9, replace = FALSE)
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'


# here the sample size equals the size of the range, so the result is a permutation
x <- sample(1:5, 5, replace = FALSE)
x
# Output: 5 4 1 3 2
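
If a script should keep running when the requested size is too large, one option is to catch the error. The sketch below assumes that is the goal; no_repl and msg are names invented for this example.

```r
# without replacement every drawn value is distinct
no_repl <- sample(1:8, 7, replace = FALSE)
anyDuplicated(no_repl) == 0    # TRUE

# tryCatch() captures the error raised when size exceeds the population
msg <- tryCatch(
  sample(1:8, 9, replace = FALSE),
  error = function(e) conditionMessage(e)
)
msg  # the error message as a character string
```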

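Sampling with set.seed()

Each time you draw a sample, the result is random and therefore changes from run to run. If you want the same sample every time, use the set.seed() function.

set.seed() fixes the state of the random number generator: after calling it with a given value, the random draws that follow produce the same sequence each time.

The example below returns the same random sample on every run.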

# set the seed
set.seed(5)
# draw a random sample with replacement
sample(1:5, 4, replace = TRUE)
# Output: 2 3 1 3

# resetting the same seed reproduces the same draw
set.seed(5)
sample(1:5, 4, replace = TRUE)
# Output: 2 3 1 3

set.seed(5)
sample(1:5, 4, replace = TRUE)
# Output: 2 3 1 3
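
The repeatability can also be checked in code rather than by comparing printed output. This is a minimal sketch; first and second are illustrative names.

```r
set.seed(5)
first <- sample(1:5, 4, replace = TRUE)

set.seed(5)                       # reset to the same seed
second <- sample(1:5, 4, replace = TRUE)

identical(first, second)          # TRUE
```

Note that the seed has to be set again before each draw that should match. A second sample() call made without resetting the seed continues the random sequence from where the first call left off.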

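Taking a sample from a data set

In this section we draw a sample from a data set in RStudio.

The code below picks 10 rows from the built-in ToothGrowth data set and displays them. The same approach gives you a sample of any size you need.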

# draw 10 random row numbers from the 'ToothGrowth' data set
idx <- sample(1:nrow(ToothGrowth), 10)
idx
# Output: 53 12 16 26 37 27  9 22 28 10

# display the 10 sampled rows
ToothGrowth[idx, ]
# Output:
#     len supp dose
# 53 22.4   OJ  2.0
# 12 16.5   VC  1.0
# 16 17.3   VC  1.0
# 26 32.5   VC  2.0
# 37  8.2   OJ  0.5
# 27 26.7   VC  2.0
# 9   5.2   VC  0.5
# 22 18.5   VC  2.0
# 28 21.5   VC  2.0
# 10  7.0   VC  0.5
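
If you sample rows from several data sets, the two steps above could be wrapped in a small function. In the sketch below, sample_rows() is a hypothetical helper written for this illustration, not a base R function.

```r
# sample_rows() is a hypothetical helper, not part of base R
sample_rows <- function(data, n) {
  # seq_len(nrow(data)) is 1, 2, ..., nrow(data)
  idx <- sample(seq_len(nrow(data)), n)
  # drop = FALSE keeps the result a data frame even with one column
  data[idx, , drop = FALSE]
}

tooth_sample <- sample_rows(ToothGrowth, 10)
nrow(tooth_sample)   # 10
```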

Taking a sample from a data set with set.seed()

In this section we combine set.seed() with sample() so that the rows drawn from a data set are the same on every run.

# set the seed
set.seed(10)
# draw 10 random row numbers from the iris data set
x <- sample(1:nrow(iris), 10)
x
# Output: 137  74 112  72  88  15 143 149  24  13

# display the 10 sampled rows
iris[x, ]
# Output:
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 137          6.3         3.4          5.6         2.4  virginica
# 74           6.1         2.8          4.7         1.2 versicolor
# 112          6.4         2.7          5.3         1.9  virginica
# 72           6.1         2.8          4.0         1.3 versicolor
# 88           6.3         2.3          4.4         1.3 versicolor
# 15           5.8         4.0          1.2         0.2     setosa
# 143          5.8         2.7          5.1         1.9  virginica
# 149          6.2         3.4          5.4         2.3  virginica
# 24           5.1         3.3          1.7         0.5     setosa
# 13           4.8         3.0          1.4         0.1     setosa

Running this code repeatedly returns the same rows every time, because set.seed() fixes the random sequence before each draw.

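Generating a random sample with sample() in R

A small problem will help make the idea concrete.

Problem: a gift shop decides to surprise one of its customers with a present. It has collected a list of customer names, and the task is to pick one name from that list at random.

Hint: use the sample() function to generate a random sample.

As the code below shows, each run picks a random name from the list.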

# store the customer names once, then draw one name per call
customers <- c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie')
sample(customers, 1)
# Output: "Rossie"
sample(customers, 1)
# Output: "Jolie"

sample(customers, 1)
# Output: "jack"

sample(customers, 1)
# Output: "Edwards"

sample(customers, 1)
# Output: "Kyle"
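
Suppose, purely as an assumption for illustration, that the shop wanted three winners instead of one. Raising size and leaving replace at its default would then guarantee three different names. The variable winners is made up for this sketch.

```r
customers <- c("jack", "Rossie", "Kyle", "Edwards", "Joseph",
               "Paloma", "Kelly", "Alok", "Jolie")

# size = 3 without replacement: three different customers win
winners <- sample(customers, 3)
winners
```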

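Sampling with probabilities

The examples so far showed how to generate random samples and how to pull rows out of a data set.

R also lets you assign a probability to each value, which is useful when the outcomes are not equally likely. A simple example shows how this works.

Imagine a company that makes 10 watches, where each watch has a 20% chance of being defective. The code below simulates this.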

# each draw is 'Good' with probability 0.80 and 'Defective' with probability 0.20
sample(c('Good','Defective'), size = 10, replace = TRUE, prob = c(.80, .20))
# Output:
# "Good"      "Good"      "Good"      "Defective" "Good"      "Good"
# "Good"      "Good"      "Defective" "Good"

You can also try other probability settings, as below.

sample(c('Good','Defective'), size = 10, replace = TRUE, prob = c(.60, .40))
# Output:
# "Good"      "Defective" "Good"      "Defective" "Defective" "Good"
# "Good"      "Good"      "Defective" "Good"
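
With only 10 draws the observed share of defective watches can sit well away from the probability you set. The sketch below uses a much larger sample so that the shares settle near the weights. The sample size n and the seed are arbitrary choices for this illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

n <- 10000
watches <- sample(c("Good", "Defective"), size = n, replace = TRUE,
                  prob = c(0.80, 0.20))

# share of each label among the n draws; close to 0.80 and 0.20
prop.table(table(watches))
```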

Wrapping up

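In this tutorial you learned how to draw samples from data sets, vectors, and lists, with or without replacement. When you need the same sample on every run, set.seed() is the function to use.

Try sampling from the various data sets that ship with R, or import a CSV file and sample from it with adjusted probabilities.

Further reading: R documentation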