【Question title】: How to add probability values for a data frame column, by category, to a new column in the same data frame?
【Posted】: 2018-01-02 19:44:45
【Question description】:

I am having trouble with a data frame. Say I have a data frame in which one column contains values ranging from 0 to 100000. An example:

                     TCGA.CG.4462
 ENSG00000000003       4.7574661
 ENSG00000000005       0.0000000
 ENSG00000000419       24.1066335
 ENSG00000000457       2.7631012
 ENSG00000000460       0.8928772

I want to add a new column that holds, for each row, the probability of the category its value falls into. The 5 categories are:

  • non_expressed [0]
  • low_expressed ]0,1]
  • normal_expressed ]1,10]
  • high_expressed ]10,100]
  • very_high_expressed > 100

So, for example, the values I want to add in the new column are:

  • non_expressed: 0.2
  • low_expressed: 0.2
  • normal_expressed: 0.4
  • high_expressed: 0.2
  • very_high_expressed: 0.0

So that my data frame becomes:

                     TCGA.CG.4462     Prob
 ENSG00000000003       4.7574661      0.4
 ENSG00000000005       0.0000000      0.2
 ENSG00000000419       24.1066335     0.2
 ENSG00000000457       2.7631012      0.4
 ENSG00000000460       0.8928772      0.2

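I have tried many different approaches, but none has worked so far. I thought an if condition would be the best way to solve this, but it raises an error because the condition has length > 1. Can anyone tell me the best way to solve this problem?

【Question discussion】:

    Tags: r dataframe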


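    【Solution 1】:

    We can use cut to find the intervals and label them with the desired probabilities. Because the probabilities contain duplicates, a warning about duplicated factor levels may appear (as in the output below); it can be ignored here. See the demonstration below: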

    library(data.table)
    cut(df1$TCGA.CG.4462, breaks = c(-Inf, 0, 1, 10, 100, Inf), include.lowest = TRUE)
    # [1] (1,10]   [-Inf,0] (10,100] (1,10]   (0,1]   
    # Levels: [-Inf,0] (0,1] (1,10] (10,100] (100, Inf]
    
    df1[, prob := as.numeric(as.character(cut(TCGA.CG.4462, 
                                              breaks = c(-Inf, 0, 1, 10, 100, Inf), 
                                              include.lowest = TRUE,
                                              labels = c(0.2, 0.2, 0.4, 0.2, 0.0))))]
    
    # Warning message:
    #   In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
    #                 else paste0(labels,  : duplicated levels in factors are deprecated
    
    df1
    #              genes TCGA.CG.4462 prob
    # 1: ENSG00000000003    4.7574661  0.4
    # 2: ENSG00000000005    0.0000000  0.2
    # 3: ENSG00000000419   24.1066335  0.2
    # 4: ENSG00000000457    2.7631012  0.4
    # 5: ENSG00000000460    0.8928772  0.2
    

    Using base R (no packages):

    df1 <- within(df1, prob <- as.numeric(as.character(cut(TCGA.CG.4462, 
                                                           breaks = c(-Inf, 0, 1, 10, 100, Inf), 
                                                           include.lowest = TRUE,
                                                           labels = c(0.2, 0.2, 0.4, 0.2, 0.0)))))
    
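The probabilities above are typed in by hand. If, as the example in the question suggests, each probability is simply the share of rows that fall into a category, it can be computed from the data instead. The following is a minimal base R sketch under that assumption; the data frame `df` and the names `bins` and `share` are illustrative and not part of the original answer.

```r
# Example data rebuilt from the question (df is an illustrative name)
df <- data.frame(
  genes = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
            "ENSG00000000457", "ENSG00000000460"),
  TCGA.CG.4462 = c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772)
)

# Bin the values into right-closed intervals:
# (-Inf,0], (0,1], (1,10], (10,100], (100, Inf]
bins <- cut(df$TCGA.CG.4462, breaks = c(-Inf, 0, 1, 10, 100, Inf))

# Share of rows per bin; empty bins are kept with a share of 0
share <- prop.table(table(bins))

# Look up each row's share through the integer code of its bin
df$prob <- as.numeric(share[as.integer(bins)])
df
```

Because the default interval labels are unique, this variant does not rely on duplicated factor labels.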

    Data:

    library(data.table)
    df1 <- fread('ENSG00000000003       4.7574661
                 ENSG00000000005       0.0000000
                 ENSG00000000419       24.1066335
                 ENSG00000000457       2.7631012
                 ENSG00000000460       0.8928772', header = FALSE)
    colnames(df1) <- c("genes", "TCGA.CG.4462")
    

    Edit: adding a third column, here a column named "third" that holds the value 1.

    With the data.table package:

    df1[, `:=` ( prob = as.numeric(as.character(cut(TCGA.CG.4462, 
                                              breaks = c(-Inf, 0, 1, 10, 100, Inf), 
                                              include.lowest = TRUE,
                                              labels = c(0.2, 0.2, 0.4, 0.2, 0.0)))),
                 third = 1)]
    

    Base R:

    # wrap the assignments in braces; within() returns a modified copy
    df1 <- within(df1, {
      prob <- as.numeric(as.character(cut(TCGA.CG.4462, 
                                          breaks = c(-Inf, 0, 1, 10, 100, Inf), 
                                          include.lowest = TRUE,
                                          labels = c(0.2, 0.2, 0.4, 0.2, 0.0))))
      third <- 1
    })
    
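If the third column should hold the category name rather than a constant, the same cut call can be given the category names as labels. This is a hedged base R sketch, not part of the original answer: `df`, `category` and `prob_by_category` are illustrative names, and the probabilities are the ones listed in the question.

```r
# Example data rebuilt from the question (df is an illustrative name)
df <- data.frame(
  genes = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
            "ENSG00000000457", "ENSG00000000460"),
  TCGA.CG.4462 = c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772)
)

# Category names, ordered to match the intervals produced by the breaks
categories <- c("non_expressed", "low_expressed", "normal_expressed",
                "high_expressed", "very_high_expressed")

# Label each value with the name of its interval
df$category <- cut(df$TCGA.CG.4462,
                   breaks = c(-Inf, 0, 1, 10, 100, Inf),
                   labels = categories)

# Named vector mapping each category to its probability (values from the question)
prob_by_category <- setNames(c(0.2, 0.2, 0.4, 0.2, 0.0), categories)

# Look up the probability by category name
df$prob <- unname(prob_by_category[as.character(df$category)])
df
```

Keeping the labels unique and mapping them to probabilities in a second step also sidesteps the duplicated-levels warning.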

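    【Discussion】:

    • Thanks, Sathish! This is a really nice approach, and it works well!
    • Is it possible to use a similar approach to add a third column with the corresponding category?
    【Solution 2】: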

    Here is another data.table solution. It uses a lookup table and an update in a non-equi join:

    library(data.table)
    # create lookup table
    lookup <- data.table(
      expression = c("non", "low", "normal", "high", "very_high"),
      Prob = c(0.2, 0.2, 0.4, 0.2, 0.0),
      lower = c(-Inf, 0, 10^(0:2))
    )
    lookup[, upper := shift(lower, type = "lead", fill = Inf)][]
    
       expression Prob lower upper
    1:        non  0.2  -Inf     0
    2:        low  0.2     0     1
    3:     normal  0.4     1    10
    4:       high  0.2    10   100
    5:  very_high  0.0   100   Inf
    
    # update in a non-equi join
    # note the left open intervals
    setDT(DT)[lookup, on = .(TCGA.CG.4462 > lower, TCGA.CG.4462 <= upper), 
       `:=`(expression = expression, Prob = Prob)][]
    
                row.id TCGA.CG.4462 expression Prob
    1: ENSG00000000003    4.7574661     normal  0.4
    2: ENSG00000000005    0.0000000        non  0.2
    3: ENSG00000000419   24.1066335       high  0.2
    4: ENSG00000000457    2.7631012     normal  0.4
    5: ENSG00000000460    0.8928772        low  0.2
    
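For readers who are not using data.table, the same left-open interval lookup can be sketched in base R with findInterval. This is an alternative technique, not the non-equi join shown above; the vector names are illustrative, and the probabilities are taken from the question.

```r
# Values from the example data
x <- c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772)

# Finite lower bounds of the intervals (0,1], (1,10], (10,100], (100, Inf)
lower <- c(0, 1, 10, 100)
expression_label <- c("non", "low", "normal", "high", "very_high")
prob_lookup <- c(0.2, 0.2, 0.4, 0.2, 0.0)

# left.open = TRUE makes the intervals open on the left and closed on the right,
# so a value of exactly 0 gets index 0; adding 1 turns the result into a row index
idx <- findInterval(x, lower, left.open = TRUE) + 1L

result <- data.frame(TCGA.CG.4462 = x,
                     expression = expression_label[idx],
                     Prob = prob_lookup[idx])
result
```

The index vector `idx` plays the role of the join here: it points each value at the matching row of the lookup vectors.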

    Data

    library(data.table)
    DT <- fread(
      "row.id                TCGA.CG.4462
     ENSG00000000003       4.7574661
      ENSG00000000005       0.0000000
      ENSG00000000419       24.1066335
      ENSG00000000457       2.7631012
      ENSG00000000460       0.8928772"
    )
    

    【Discussion】:
