【Question title】: Subset of data frame based on the value in a column or neighboring columns
【Posted】: 2017-12-14 07:23:40
【Question】:

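I want to find which column contains the most 1s. The number 1 can occur only once per row. Once the column with the highest number of 1s is found, the script should also check the neighboring columns (+1 / -1), and if either of them contains a 1, it should be selected as well. All of those rows should be kept by the subset.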

Here is part of the original data:

structure(list(   `10` = c(0, 0, 0, 0),  `34` = c(0, 0, 0, 0),
                  `59` = c(0, 0, 0, 0),  `84` = c(0, 0, 0, 0),
                 `110` = c(0, 0, 0, 0), `134` = c(0, 0, 0, 0),
                 `165` = c(0, 0, 0, 0), `199` = c(0, 0, 0, 0),
                 `234` = c(0, 0, 0, 0),
                 `257` = c(0.0160178986200301, 0, 0.0409772658686249, 0.0289710439505515),
                 `362` = c(0.0679054515644214, 0.126933274414494, 0.0855598028367368, 0.0596214721268868),
                 `433` = c(0.490914059297718, 0.604765061128296, 0.813348757670254, 1),
                 `506` = c(1, 1, 1, 0.971410482822965),
                 `581` = c(0.198244295668807, 0.234158197083517, 0.269655970224324, 0.195318383259472),
                 `652` = c(0.271177756524115, 0.223018854028576, 0.301352982597324, 0.142584385725234),
                 `733` = c(0.212426561005602, 0.212778023272942, 0.228513228045468, 0),
                 `818` = c(0.213816778248395, 0.168570481661511, 0.264465345538678, 0),
                 `896` = c(0.137102063123377, 0, 0.320234382858867, 0),
                 `972` = c(0.108932231179123, 0, 0.179106729705261, 0),
                `1039` = c(0.101762535865555, 0, 0, 0),
                   EOD = c("Peter", "Peter", "Peter", "Peter"),
               Complex = c("FT team", "FT team", "FT team", "FT team")),
          .Names = c("10", "34", "59", "84", "110", "134", "165", "199",
                     "234", "257", "362", "433", "506", "581", "652", "733",
                     "818", "896", "972", "1039", "EOD", "Complex"),
          row.names = c("Peter_1_Rep_1_E", "Peter_1_Rep_2_E",
                        "Peter_1_Rep_3_E", "Peter_1_Rep_4_E"),
          class = "data.frame")

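As you can see in the original data, column 506 should be selected as the one containing the most 1s, and the data should be subset based on it. However, the output will be exactly the same as the input, because in this data the neighboring fraction (-1, column 433) also contains a 1. This is a simple example.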
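
The count can be checked on just the two relevant columns. This is a minimal sketch: the values are copied from the `433` and `506` columns above, and the name `d` is only an illustrative assumption.

```r
# Two columns copied from the example above; `d` is a hypothetical name.
d <- data.frame(
  `433` = c(0.490914059297718, 0.604765061128296, 0.813348757670254, 1),
  `506` = c(1, 1, 1, 0.971410482822965),
  check.names = FALSE          # keep the numeric-looking column names as-is
)

n_ones <- colSums(d == 1)      # number of 1s in each column
n_ones
names(which.max(n_ones))       # name of the column with the most 1s
rowSums(d == 1)                # number of 1s in each row
```

Here `506` holds three 1s and its left neighbor `433` holds the remaining one, which is why no row is dropped.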

The situation can be more complicated, as in this case:

structure(list(    `10` = c(0, 0, 0, 0, 0, 0, 0, 0),
                   `34` = c(0, 0, 0, 0, 0, 0, 0, 0),
                   `59` = c(0, 0, 0, 0, 0, 0, 0, 0),
                   `84` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `110` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `134` = c(0.168783347110543, 0, 0.382618775924215, 0, 0.530638724516877, 0, 0.169526042048202, 0),
                  `165` = c(1, 0.36380544964196, 1, 0.13979454361738, 1, 0.239652477288689, 1, 0.240341578327444),
                  `199` = c(0.355158938904336, 1, 0.646724265971128, 1, 0.582637073151552, 1, 0.20319390520841, 1),
                  `234` = c(0.0963628165627114, 0.575436312346942, 0.229853828180188, 0.433555069046817, 0.247567185011894, 0.508529485059242, 0.138356164383562, 0.389880251276011),
                  `257` = c(0, 0.17393595585728, 0, 0.127787133715056, 0, 0.117147323350173, 0, 0),
                  `362` = c(0, 0, 0, 0.0919333108790839, 0, 0, 0, 0),
                  `433` = c(0, 0, 0, 0.0745570899292691, 0, 0, 0, 0),
                  `506` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `581` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `652` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `733` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `818` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `896` = c(0, 0, 0, 0, 0, 0, 0, 0),
                  `972` = c(0, 0, 0, 0, 0, 0, 0, 0),
                 `1039` = c(0, 0, 0, 0, 0, 0, 0, 0),
                    EOD = c("Paul", "Paul", "Paul", "Paul", "Paul", "Paul", "Paul", "Paul"),
                Complex = c("GG Team", "GG Team", "GG Team", "GG Team", "GG Team", "GG Team", "GG Team", "GG Team")),
          .Names = c("10", "34", "59", "84", "110", "134", "165", "199", "234", "257", "362", "433", "506", "581", "652", "733", "818", "896", "972", "1039", "EOD", "Complex"),
          row.names = c("PaulG_1_Rep_1_E", "Paul_1_Rep_1_E", "PaulN_1_Rep_2_E", "PaulG_1_Rep_2_E", "Paul_1_Rep_3_E", "PaulC_1_Rep_3_E", "PaulC_1_Rep_4_E", "Paul_1_Rep_4_E"),
          class = "data.frame")

Here, two columns contain the same number of 1s. In that case, the column with the larger column sum (colsum) should be selected.
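
The tie-break rule can be sketched on the two tied columns alone. The values are copied from the `165` and `199` columns above; the names `d`, `tied`, and `chosen` are illustrative assumptions, not part of the original post.

```r
# The two tied columns from the example above; `d` is a hypothetical name.
d <- data.frame(
  `165` = c(1, 0.36380544964196, 1, 0.13979454361738,
            1, 0.239652477288689, 1, 0.240341578327444),
  `199` = c(0.355158938904336, 1, 0.646724265971128, 1,
            0.582637073151552, 1, 0.20319390520841, 1),
  check.names = FALSE
)

n_ones <- colSums(d == 1)                        # both columns hold four 1s
tied   <- names(n_ones)[n_ones == max(n_ones)]   # every column with the max count
# Among the tied columns, pick the one with the larger column sum.
chosen <- tied[which.max(colSums(d[, tied, drop = FALSE]))]
chosen
```

With these values the column sum of `199` is larger than that of `165`, so `199` would be the selected column.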

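【Comments】:

  • There are no rows in your data that contain more than one 1. You can check this with rowSums(d == 1)
  • Sorry, I worded that badly. The data is fine. Once it finds the column with the most 1s, it should check whether the neighboring fractions of that column contain a 1 in any rows other than the ones already selected
  • I have edited the whole question. Please see the updated version.
  • Please edit your question and include your expected output and the rows you want to filter out.
  • Done. Note that I am also looking at the qsec (-1) column, but in this case it does not contain any 1s

Tags: r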


【Solution 1】:

Let df1 be your input:

df_num <- df1[,sapply(df1,is.numeric)]            # keep only numeric columns to build filter
n1 <- colSums(df_num == 1)                        # number of 1s per column
i  <- which(n1 == max(n1))                        # index of cols with max 1s
if(length(i) > 1){
  max_cs <- which.max(colSums(df_num[,i]))        # index of col with max colsum among results
  i <- i[max_cs]                                  # our column index
}
cols   <- seq(max(i - 1, 1), min(i + 1, ncol(df_num)))     # chosen column plus neighbours, clamped to valid indices
filter <- rowSums(df_num[, cols, drop = FALSE] == 1) > 0   # TRUE if the chosen column or any neighbour is 1

df1[filter,] # your result

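In both of your examples, all rows are kept.

【Comments】:

  • This looks fine until the maximum number of 1s is found in the first or last column; then it runs into a problem. It tries to check the neighboring columns, but there is nothing on one side. Error in rowSums(data[, seq(j - 1, j + 1)] == 1) : error in evaluating the argument 'x' in selecting a method for function 'rowSums': Error in [.data.frame(data, , seq(j - 1, j + 1)) : undefined columns selected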
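
The edge case in the comment above comes from an unclamped `seq(j - 1, j + 1)`. A minimal sketch of one way to guard against it follows; the helper name `subset_by_peak` and the toy data are made up for illustration and are not from the original answer.

```r
# Hypothetical helper: keep rows that have a 1 in the peak column or in a
# direct neighbour of it, clamping the neighbour range at both edges.
subset_by_peak <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]      # numeric columns only
  n1  <- colSums(num == 1)                               # 1s per column
  i   <- which(n1 == max(n1))                            # candidate peak columns
  if (length(i) > 1) {
    i <- i[which.max(colSums(num[, i, drop = FALSE]))]   # tie: larger column sum
  }
  cols <- seq(max(i - 1, 1), min(i + 1, ncol(num)))      # stay within 1..ncol
  keep <- rowSums(num[, cols, drop = FALSE] == 1) > 0    # 1 in peak or neighbour
  df[keep, , drop = FALSE]
}

# Toy data (made up): the peak is the FIRST numeric column, so there is
# no left neighbour to look at.
toy <- data.frame(x = c(1, 1, 0.2, 0.5),
                  y = c(0.3, 0.1, 1, 0.4),
                  z = c(0, 0, 0, 1),
                  grp = "a")
res <- subset_by_peak(toy)
res
```

In this toy case rows 1 to 3 are kept (a 1 in `x` or in its right neighbor `y`), and row 4 is dropped because its only 1 sits two columns away from the peak. Using `drop = FALSE` keeps the intermediate results as data frames even when a single column is selected, so `rowSums` and `colSums` still receive two-dimensional input.
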
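【Solution 2】:

I would use the tidyverse to convert the data to long format, then pull in the column sums to determine which column comes first (the one with the largest sum):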

library(tidyverse)

# add rownames to the data frame
df2$id  <- rownames(df2)

# make a data frame of each column's sum
thecolsums  <- colSums(df2[,map_lgl(df2, is.numeric)]) %>% 
  enframe(name = "colname", value = "colsum")

# change the data frame to long format
dflong  <- df2 %>% 
  mutate(rowid = row_number()) %>% 
  gather(colname, val, -rowid)

# which column has the first 1 value
whichcol  <- dflong %>% 
  group_by(colname) %>% 
  filter(val ==1) %>% 
  summarize(
    firstone = min(rowid, na.rm = TRUE)
  ) %>% 
  left_join(thecolsums, by = 'colname') %>% 
  filter(colsum == max(colsum)) %>% 
  pluck('colname')

# what's the numerical index of the column
whichcolindex  <- which(names(df2) == whichcol)

# get previous and next columns if they exist
prevcolindex  <- ifelse(whichcolindex <= 1, F, whichcolindex -1)
nextcolindex  <- ifelse(whichcolindex == ncol(df2) , F, whichcolindex +1)

# do the previous and next columns have 1s in them?
prevcolhasone  <- any(df2[,prevcolindex] == 1)
nextcolhasone  <- any(df2[,nextcolindex] == 1)

# create a vector with 1, 2 or 3 column indexes
finalindex  <- c(
    prevcolindex[prevcolhasone]
  , whichcolindex
  , nextcolindex[nextcolhasone]
)

# subset the original data frame, only preserving the columns in question
results  <- df2[, finalindex]
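
Note that `results` above keeps columns, whereas the question asks for rows. A minimal base-R sketch of turning the selected column indexes into a row filter follows; the `df2` and `finalindex` values here are toy stand-ins made up for illustration, not the output of the code above.

```r
# Toy stand-ins (made up) for `df2` and `finalindex` from the answer above.
df2 <- data.frame(p = c(0.2, 1, 0.1),
                  q = c(1, 0.4, 0.3),
                  r = c(0, 0, 0.5))
finalindex <- c(1, 2)   # e.g. the chosen column and one neighbour

# Keep every row that has a 1 in any of the selected columns.
keep <- rowSums(df2[, finalindex, drop = FALSE] == 1) > 0
rows_kept <- df2[keep, , drop = FALSE]
rows_kept
```

With these toy values the first two rows are kept and the third is dropped, since it has no 1 in either selected column.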

【Comments】: