【Question title】: I need to create a variable that selects, among some specific columns in a dataset, the one that is closest to another specific column
【Posted】: 2019-11-11 17:46:08
【Question description】:

I have a dataset similar to this one:

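# The fourth value of `a` is assumed: as posted, `a` had only three values, which makes data.frame() fail.
data <- data.frame(a = c(33, 44, 55, 66), b = c(99, 77, NA, 66),
      var1 = c(1, 2, 3, NA), var2 = c(5, 6, NA, 7), var3 = c(8, 9, 10, NA), x = c(6, 5, 4, 3))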

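I need to create a column that, for each row, outputs the value among columns var1, var2 and var3 that is closest to the value in column x, ignoring NAs in var1:var3.

Something like: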

closest_x
  5
  6
  3
  7

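In my actual problem I have many more columns than this, so I would like to use starts_with to select the columns to compare with x (the ones named "var1" etc. above).

I tried creating columns holding the absolute difference between the x column and each "var" column, and then I tried something like: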

data %>% mutate(pmin = pmin(starts_with("var")))

mutate(data, C = pmin(starts_with("var")))

and also

data %>% with(pmin(starts_with("var")))

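It says the variable context is not set. Beyond that, it would be better if I did not have to create many extra variables holding these absolute differences and could get the value closest to the x column directly.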
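
On the error itself: starts_with() is a tidyselect helper, so it only resolves to columns inside a selecting function such as select() or across(); called bare inside pmin() it has nothing to select from. As a minimal base R sketch of the "absolute difference plus pmin" idea, the columns can be picked by name prefix and passed to pmin() one argument per column. The names `abs_diff` and `min_dist` are illustrative, and the fourth value of `a` is an assumption so the sample data builds. Note that this returns only the smallest distance, not the value that produced it.

```r
# Sample data from the question; the fourth value of `a` is an assumption
# made so that every column has four rows.
data <- data.frame(a = c(33, 44, 55, 66), b = c(99, 77, NA, 66),
                   var1 = c(1, 2, 3, NA), var2 = c(5, 6, NA, 7),
                   var3 = c(8, 9, 10, NA), x = c(6, 5, 4, 3))

# Select the comparison columns by name prefix.
cols <- grep("^var", names(data), value = TRUE)

# Absolute distance of every selected column to x (still a data frame).
abs_diff <- abs(data[cols] - data$x)

# pmin() expects one argument per column, so splice the columns in;
# na.rm = TRUE ignores missing distances within each row.
min_dist <- do.call(pmin, c(as.list(abs_diff), na.rm = TRUE))
min_dist
```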

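I found something very close to what I need in this post: Closest value to a specific column in R

However, I do not know how to apply something similar to my problem, because I have many more columns and I only want to select those that start with a specific word.

EDIT: I need NAs in the variables being compared with "x" to be ignored.

EDIT 2: The code for my real dataset used to run fine. Now I tried to run it again and it no longer works. I tried to find out what changed, including whether any package had changed, but that does not seem to be the case.

Below is code that generates a small sample of my real data. Instead of var1, var2, etc. I have ideolparty_A:ideolparty_I, and instead of x (the variable to compare against) I have ideol_self.

The max.col solution worked until a few months ago, with the following code: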


temp_df <- -abs(cses_pr[cols] - cses_pr$ideol_self)
cses_pr$closest <- cses_pr[cols][cbind(1:nrow(cses_pr), 
                                       max.col(replace(temp_df, is.na(temp_df), -Inf)))]

But now it produces the following error: Error: Subscript `cbind(...)` is a matrix, it must be of type logical. This happens before I can run the last line of code:

cses_pr <- cses_pr %>% mutate(cong_closest = abs(closest - ideol_self))

cses_pr <- structure(list(election = c("PER_2000", "PER_2006", "PER_2006", 
"USA_2008", "MEX_2012", "ROU_1996", "MEX_2012", "TWN_2008", "USA_1996", 
"PER_2016", "ARG_2015", "FRA_2012", "MEX_2012", "SRB_2012", "USA_1996", 
"ROU_2014", "ROU_2004", "ROU_2009", "RUS_2000", "ROU_2014", "CHL_1999", 
"BRA_2006", "RUS_2004", "BRA_2002", "TWN_2012", "MEX_2012", "TWN_2008", 
"SRB_2012", "USA_2004", "BRA_2002", "PER_2000", "USA_2008", "ARG_2015", 
"FRA_2012", "PHL_2016", "TWN_2012", "LTU_1997", "URY_2009", "BRA_2006", 
"PER_2006", "MEX_2012", "CHL_1999", "BRA_2010", "PER_2016", "MEX_2000", 
"BRA_2002", "PER_2011", "ROU_2009", "FRA_2012", "TWN_2012", "FRA_2002", 
"PER_2000", "CHL_1999", "PER_2011", "MEX_2006", "ROU_2009", "ROU_1996", 
"BRA_2014", "ROU_1996", "ROU_2014", "ROU_2014", "FRA_2012", "PER_2016", 
"MEX_2006", "USA_2012", "ROU_2009", "ROU_2009", "BRA_2014", "KEN_2013", 
"PHL_2016", "BLR_2001", "BRA_2006", "PER_2016", "FRA_2012", "CHL_2005", 
"CHL_2009", "LTU_1997", "RUS_2000", "ROU_2014", "TWN_2012", "BRA_2006", 
"USA_2008", "USA_2004", "MEX_2012", "ROU_2004", "TWN_2012", "BRA_2014", 
"USA_2008", "TWN_2004", "PER_2000", "MEX_2006", "PHL_2004", "BRA_2002", 
"PER_2011", "CHL_2005", "PER_2006", "RUS_2000", "ARG_2015", "BRA_2010", 
"TWN_2012", "MEX_2006", "ARG_2015", "BRA_2014", "TWN_2004", "BRA_2006", 
"PER_2016", "PHL_2016", "URY_2009", "RUS_2000", "PER_2006", "FRA_2002", 
"BRA_2002", "KEN_2013", "RUS_2004", "PER_2006", "TWN_2012", "PER_2011", 
"PHL_2010", "PER_2006", "FRA_2012", "PHL_2016", "MEX_2000", "RUS_2000", 
"TWN_2004", "BRA_2002", "ARG_2015", "FRA_2012"), ideol_self = c(10, 
NA, 0, 6, 10, NA, 5, 5, 8, 2, 5, 5, 3, NA, 3, 5, 5, 10, 5, NA, 
10, 3, 6, 6, NA, NA, 5, 10, 5, 5, NA, NA, NA, 2, 5, NA, 10, 8, 
5, 6, 10, 5, 10, 0, 10, 3, NA, 9, 5, NA, 10, 6, 5, 7, NA, 6, 
NA, NA, NA, 9, NA, 2, 9, 10, 10, NA, 5, 7, NA, 8, NA, 8, NA, 
5, 6, 0, 6, 0, 7, NA, NA, 3, 2, NA, 7, NA, 4, 1, 4, NA, 6, 6, 
NA, 4, NA, 10, 5, 9, NA, NA, 1, 5, NA, 5, 3, 7, 3, 3, 0, 8, 4, 
0, 5, 6, 5, NA, 6, 10, NA, 7, 7, NA, 3, NA, NA, 4, 1), ideolparty_A = c(5, 
5, 0, 7, 10, NA, NA, 5, NA, 2, 3, 2, 9, 9, NA, 9, 0, 10, NA, 
NA, NA, 6, 7, 2, NA, 9, NA, 8, 7, 6, 5, NA, NA, 0, 8, NA, NA, 
2, NA, 5, 10, NA, 0, NA, 0, 4, NA, 8, 2, NA, 5, 3, NA, 3, 10, 
6, NA, NA, NA, 2, NA, 4, 10, 0, 10, NA, 10, NA, NA, 6, NA, 4, 
NA, 3, 10, 10, NA, NA, 1, NA, NA, 6, 10, NA, 3, NA, NA, 1, 2, 
NA, 8, 6, 3, 3, NA, 7, NA, 9, 6, NA, 10, 4, NA, 3, 7, 6, 5, 3, 
NA, 1, 7, 1, 10, 7, NA, NA, 0, 0, 2, 1, 9, NA, NA, NA, 8, 5, 
1), ideolparty_B = c(9, 5, 10, 5, 1, NA, NA, 5, NA, 7, 6.5, 8, 
1, 5, NA, 5, 10, 0, NA, NA, NA, 6, 2, 7, NA, 9, NA, 6, 5, 4, 
8, NA, NA, 10, 10, NA, NA, 9, NA, 4, 10, NA, 10, NA, 0, 6, NA, 
9, 5, NA, 10, 0, NA, 5, 6, 3, NA, NA, NA, 9, NA, 8, 6, 0, 0, 
NA, 0, NA, NA, 7, NA, 2, NA, 7, 8, 10, NA, NA, 10, NA, NA, 4, 
4, NA, 8, NA, NA, 10, 8, NA, 4, 7, NA, 5, NA, 8, NA, 2.5, 7, 
NA, 0, 8.5, NA, 5, 1, 8, 4, 10, NA, 10, 10, 6, 4, 0, NA, NA, 
4, 10, 0, 8, 1, NA, NA, NA, 10, 8.5, 8), ideolparty_C = c(7, 
7, 10, NA, 1, NA, NA, NA, NA, 2, 5, 3, 0, 0, NA, 8, 10, 0, NA, 
NA, NA, 6, 2, 0, NA, 2, NA, 2, NA, 4, 4, NA, NA, 7, NA, NA, 10, 
5, NA, 4, 0, NA, 7, 0, 10, 2, NA, 9, 10, NA, 3, NA, NA, 5, 10, 
7, NA, NA, NA, 3, NA, 10, 0, 10, NA, NA, 10, NA, NA, NA, NA, 
8, NA, 8, 6, 5, 8, NA, NA, NA, NA, NA, 9, NA, 9, NA, NA, NA, 
7, NA, 5, 6, NA, 7, NA, 0, NA, 4, 3, NA, 0, 4, NA, 6, 7, 0, NA, 
10, NA, 1, 5, NA, 8, 0, NA, NA, 7, 10, 8, 10, NA, NA, NA, NA, 
NA, 6, 10), ideolparty_D = c(7, 6, NA, NA, NA, NA, NA, NA, NA, 
5, NA, 3, 9, 6, NA, NA, 0, 0, NA, NA, NA, 6, 4, 8, NA, 9, NA, 
5, NA, 4, 3, NA, NA, 4, 3, NA, 4, NA, NA, 1, 10, NA, NA, NA, 
10, 7, NA, 3, 2, NA, 7, 0, NA, 6, 7, 0, NA, NA, NA, 2, NA, 2, 
9, 0, NA, NA, 5, NA, NA, 7, NA, 6, NA, 3, 10, 5, 6, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, 5, 5, NA, 7, NA, 0, NA, 
NA, NA, NA, 0, NA, NA, 4, 10, 8, 5, 10, NA, 1, 9, 2, 2, 5, NA, 
NA, 10, 10, NA, 1, 0, NA, NA, NA, NA, NA, 0), ideolparty_E = c(5, 
5, 0, NA, 1, NA, NA, NA, NA, NA, NA, 5, 0, NA, NA, 9, 10, 10, 
NA, NA, NA, 6, 4, NA, NA, 2, NA, 1, NA, NA, 4, NA, NA, 5, 3, 
NA, 8, NA, NA, 0, 0, NA, 10, NA, 0, NA, NA, 6, 5, NA, NA, 0, 
NA, 5, 5, NA, NA, NA, NA, 3, NA, NA, NA, 0, NA, NA, 5, NA, NA, 
7, NA, 4, NA, 4, 5, 2, 6, NA, 10, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 3, NA, 2, 4, NA, 7, NA, 8, NA, 5, NA, NA, 0, 7, NA, 3, 
5, NA, 4, 3, NA, 2, 1, NA, NA, 10, NA, NA, 5, 0, 0, 2, 9, NA, 
NA, NA, NA, 4, 8), ideolparty_F = c(7, 5, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 5, 0, 4, NA, 1, 10, NA, NA, NA, NA, 6, 4, NA, 
NA, 8, NA, 7, NA, NA, 6, NA, NA, 5, 4, NA, NA, NA, NA, NA, 10, 
NA, NA, NA, 0, NA, NA, NA, 5, NA, NA, 3, NA, 7, 8, NA, NA, NA, 
NA, 2, NA, 5, 6, 0, NA, NA, NA, NA, NA, 6, NA, 8, NA, 6, 1, NA, 
NA, NA, 6, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 5, 5, NA, 
10, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, 7, 3, 3, NA, NA, 1, 
7, NA, NA, 5, NA, NA, 2, 5, NA, 1, 2, NA, NA, NA, NA, NA, 2), 
    ideolparty_G = c(NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 7, 
    NA, 0, 7, NA, NA, NA, 0, NA, NA, NA, NA, NA, 7, NA, 2, NA, 
    0, NA, 4, NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, 0, NA, NA, 
    NA, NA, 6, NA, 8, NA, NA, 2, NA, NA, NA, 8, NA, NA, NA, NA, 
    NA, NA, NA, 4, 0, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, 0, NA, 0, NA, NA, 0, 10, NA, NA, 
    NA, 2, NA, NA, NA, 1, 3, 6, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, 10, 8, NA), ideolparty_H = c(NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, 
    NA, NA, NA, 0, NA, NA, NA, NA, NA, 0, NA, NA, 1, NA, NA, 
    NA, NA, 5, NA, NA, NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 
    0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, 5, 3, NA, 0, 7, NA, NA, NA, NA, NA, NA, NA, 
    NA, 8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 9, NA), ideolparty_I = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10, NA, 2, 
    NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, 7, NA, NA, NA, NA, NA, NA, 10, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    9, NA, NA, NA, 4, NA, NA, NA, NA, 5, NA, NA, NA, 1, NA, NA, 
    NA, NA, NA, NA, 4, NA, NA, 2, NA, NA, NA, NA, 6, NA)), row.names = c(NA, 
-127L), class = c("tbl_df", "tbl", "data.frame"))

【Comments】:

    Tags: r dplyr


    【Solution 1】:

    Here is a vectorized way using max.col

    cols <- grep("^var", names(data))
    data$closest_x <- data[cols][cbind(1:nrow(data), 
                          max.col(-abs(data[cols] - data$x)))]
    
    #   a  b var1 var2 var3  x closest_x
    #1 33 99   24   15   45 11        15
    #2 44 77   12   30   27 22        27
    #3 55 66   76   20   15 33        20
    

    Or using apply

    data$closest_x <- apply(data, 1, function(p) 
                      p[cols][which.min(abs(p[cols] - p["x"]))])
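
One caveat with the apply version: apply() first converts the whole data frame to a matrix, so a single character column (such as `election` in the real data) turns every value into a string and the subtraction fails with "non-numeric argument to binary operator". A minimal sketch that avoids this, assuming it is enough to look only at the selected columns, converts just those columns and loops over rows. The `id` column is hypothetical and only stands in for a non-numeric column; rows with no comparable value get NA.

```r
# The question's sample columns plus a hypothetical character column `id`,
# standing in for a column such as `election` in the real data.
data <- data.frame(id = c("r1", "r2", "r3", "r4"),
                   var1 = c(1, 2, 3, NA), var2 = c(5, 6, NA, 7),
                   var3 = c(8, 9, 10, NA), x = c(6, 5, 4, 3))

cols <- grep("^var", names(data), value = TRUE)
m <- as.matrix(data[cols])   # numeric matrix built from the var* columns only

closest_x <- vapply(seq_len(nrow(m)), function(i) {
  k <- which.min(abs(m[i, ] - data$x[i]))   # which.min() skips NAs
  # A row where every distance is NA yields integer(0); return NA for it.
  if (length(k) == 0) NA_real_ else unname(m[i, k])
}, numeric(1))

data$closest_x <- closest_x
data$closest_x
```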
    

    If there are NA values in the data, we can replace them with -Inf and then subset

    temp_df <- -abs(data[cols] - data$x)
    data$closest_x <- data[cols][cbind(1:nrow(data), 
                       max.col(replace(temp_df, is.na(temp_df), -Inf)))]
    

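    【Discussion】:

    • Thank you. The first solution works, but I forgot to mention that I have missing values that I want to be ignored. Is that hard to do, some kind of "na.rm" thing? I am trying to combine that solution with this one: max.col with NA removal. But some kind of na.rm would be better, so I do not mess with the dataset. The second option did not work ("non-numeric argument to binary operator").
    • @GuilhermePiresArbache Yes, we can replace them with -Inf and then subset. Can you check the updated answer?
    • I tried to use this code again because I am resuming the project that used it, but unfortunately it no longer works on my real dataset. Please see the edit above. Is there any fix? Any help is appreciated.
    • @GuilhermePiresArbache It is hard to remember the context of an answer after 2 years. Change cses_pr to a data frame and try the answer again: cses_pr <- data.frame(cses_pr)
    • Yes, that is why I tried to put everything in my new EDIT. Anyway, I can't believe that was the problem! For some reason, some other transformation had turned it into something other than a data.frame. Thank you very much!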
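
The fix from the comments works because a base data.frame accepts a two-column numeric matrix as a subscript, while a tibble rejects it, as the error message indicates. A minimal sketch that does not depend on the class of the input is to do the matrix indexing on a plain numeric matrix instead. The helper name `closest_value` is hypothetical and not part of the original answer. Two further choices here are assumptions of this sketch: ties.method = "first" is used because max.col() breaks ties at random by default, and rows where the target is NA are set to NA rather than returning the first column's value.

```r
# Hypothetical helper (not from the original answer): coerce the selected
# columns to a plain numeric matrix, so the matrix subscript is handled by
# base R whether `df` is a data.frame or a tibble.
closest_value <- function(df, cols, target) {
  m <- as.matrix(as.data.frame(df)[cols])
  score <- -abs(m - df[[target]])          # larger score = closer to target
  score[is.na(score)] <- -Inf              # NAs can never win
  best <- max.col(score, ties.method = "first")
  out <- m[cbind(seq_len(nrow(m)), best)]
  out[is.na(df[[target]])] <- NA           # no target, no closest value
  unname(out)
}

# First three rows of the question's real-data sample, columns A to D.
small <- data.frame(election = c("PER_2000", "PER_2006", "PER_2006"),
                    ideol_self = c(10, NA, 0),
                    ideolparty_A = c(5, 5, 0), ideolparty_B = c(9, 5, 10),
                    ideolparty_C = c(7, 7, 10), ideolparty_D = c(7, 6, NA))
cols <- grep("^ideolparty_", names(small), value = TRUE)
small$closest <- closest_value(small, cols, "ideol_self")
small$closest
```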
    【Solution 2】:

    A "tidy" approach

    A more "tidy" solution could look like this.

    data %>%
    
        # reshape data to long format w/ row numbers
        mutate(row = row_number()) %>%
        gather(col, val, starts_with('var')) %>%
    
        # compute the minimum difference row-by-row
        group_by(row) %>%
        summarize(closest_to_x = val[which.min(abs(val - x))]) %>%
    
        # the next two lines just take the new column and paste it back onto the original data
        select(closest_to_x) %>%
        bind_cols(data, .)
    

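    It is a bit verbose, but I find it fairly readable (YMMV, of course). I am not sure about performance. It uses neither max.col() nor pmin(); instead it relies on reshaping the data into a "tidy" format, where the values of all the columns you care about go into a single val column.

    【Discussion】:

    • Thank you. Is there a way to do this while ignoring NAs (please check my edit)?
    • which.min() ignores NAs by default. This code still works on your edited data.
    • Sorry for taking so long to reply. This code does not work on my real data, though (see the edit above). It does work on the sample data. I get messages such as "Error: Can't recycle ..1 (size 4) to match ..2 (size 38390)".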
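
A possible alternative that stays in dplyr but skips the reshape and the bind_cols() step is a row-wise mutate. This is only a sketch, not the answerer's method, and it assumes dplyr 1.0.0 or later, where rowwise() and c_across() are available. It also lets starts_with() be used directly, as the question asked. The fourth value of `a` is again an assumption so the sample data builds.

```r
library(dplyr)

# Sample data from the question; the fourth value of `a` is assumed.
data <- data.frame(a = c(33, 44, 55, 66), b = c(99, 77, NA, 66),
                   var1 = c(1, 2, 3, NA), var2 = c(5, 6, NA, 7),
                   var3 = c(8, 9, 10, NA), x = c(6, 5, 4, 3))

result <- data %>%
  rowwise() %>%
  mutate(closest_x = {
    v <- c_across(starts_with("var"))   # this row's var* values
    k <- which.min(abs(v - x))          # which.min() skips NAs
    # If every distance is NA there is no closest value for this row.
    if (length(k) == 0) NA_real_ else v[[k]]
  }) %>%
  ungroup()

result$closest_x
```

Because each row is processed in place, the new column always has one value per input row, so nothing has to be pasted back onto the original data afterwards.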