Create a new data frame based on another data frame: answers

【Question title】: Create a new data frame based on another dataframe
【Posted】: 2014-01-21 15:37:43
【Question】:

I am trying to use a huge data frame (180000 x 400) to compute another, smaller one.

I have the following data frame:

df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))

df1
  LOCAT START  END CODE1 CODE2
1     1   120  150     1     1
2     2   345  390     1     0
3     3   765  802     0     0
4     4  1045 1120     1     0
5     5  1347 1436     0    -1
6     6  1879 1935     0    -1

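This is only a sample. The real data frame has 180000 rows and more than 400 columns. For each CODE column, I need to build a new data frame that describes every run of consecutive "1" or "-1" values, giving the run's location (first and last LOCAT), its size (END of the last row minus START of the first row), and its value (POS for 1, NEG for -1).

For CODE1 it would look like this: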

   LOCAT SIZE VALUE
1 1 to 2  270   POS
2 4 to 4   75   POS

And for CODE2 like this:

   LOCAT SIZE VALUE
1 1 to 1   30   POS
2 5 to 6  588   NEG

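Unfortunately, I still have no idea how to do this. I have tried several pieces of code to build a function that does it automatically, but I keep getting lost or stuck in loops, and nothing seems to work.

Any help would be greatly appreciated. Thanks in advance.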

【Comments】:

Tags: r dataframe formula calculus

【Solution 1】:

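The code below gives you the answer in exactly the format you wanted, except that I split your "LOCAT" column into two columns named "Starts" and "Stops". It works on your whole data frame, so you do not need to repeat it by hand for each CODE column (CODE1, CODE2, etc.).

It assumes that the only non-CODE columns are named "LOCAT", "START", and "END".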

# need package "plyr"
library("plyr")

# test2 is the example data frame that you gave in the question
test2 <- data.frame(
    "LOCAT"=1:6,
    "START"=c(120,345,765, 1045, 1347, 1879),
    "END"=c(150,390,802,1120,1436, 1935),
    "CODE1"=c(1,1,0,1,0,0),
    "CODE2"=c(1,0,0,0,-1,-1)
    )

codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column

sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
    rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
    stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
    starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
    ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
    ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)

    # The next 3 lines calculate the equivalent of your SIZE column
    sizeX1 <- x[ssX[,"Starts"],"START"]
    sizeX2 <- x[ssX[,"Stops"],"END"]
    sizeX <- sizeX2 - sizeX1

    sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
    return(sssX) #Added in EDIT

}

answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1

Hope this helps, and good luck!

【Discussion】:

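Wow... that makes it so easy. I still need to take a closer look at your sss function, but the result is great, and it is convenient to have all the codes in the same data frame. I will try applying everything to the big data frame and let you know if anything goes wrong. Thanks again.
No problem! Note that I just edited the answer so that the function sss returns sssX ... otherwise running sss(test4[test4[,"Code"]==1,]) (for example) would not visibly return anything. The end product is the same, though.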

【Solution 2】:

Use run-length encoding to find the groups in which CODE1 takes the same value.

rle_of_CODE1 <- rle(df1$CODE1)
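
As a quick illustration of what `rle()` returns, here is a small sketch on the CODE1 values from the example. The vector is typed in directly so the snippet runs on its own, and `code1` and `runs` are illustrative names.

```r
# CODE1 column from the example data frame, entered directly
code1 <- c(1, 1, 0, 1, 0, 0)

runs <- rle(code1)
runs$lengths  # 2 1 1 2 : length of each run
runs$values   # 1 0 1 0 : value repeated in each run

# inverse.rle() rebuilds the original vector from the encoding
identical(inverse.rle(runs), code1)  # TRUE
```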

For convenience, find the runs whose value is non-zero, along with the lengths of those runs.

CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]

Ignore the parts of df1 where CODE1 is zero.

df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)

Define a group for each of the contiguous blocks that we found with rle.

df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
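
For the example data, the grouping step works out as in the short sketch below. The run lengths are typed in directly, and `group` is an illustrative name.

```r
# Lengths of the non-zero runs of CODE1 in the example data
n <- c(2, 1)

# One group id per run, repeated once for each row in that run
group <- rep(seq_along(n), times = n)
group  # 1 1 2
```

Rows 1 and 2 of the subset fall in group 1, and the row with LOCAT 4 falls in group 2.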

Use ddply (from the plyr package) to get summary statistics for each group.

library(plyr) # provides ddply() and .()
summarised_by_CODE1 <- ddply(
  df1_with_nonzero_CODE1, 
  .(GROUP), 
  summarise, 
  MinOfLOCAT = min(LOCAT), 
  MaxOfLOCAT = max(LOCAT),
  SIZE       = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
  rle_of_CODE1$values[CODE1_is_nonzero] == 1, 
  "POS", 
  "NEG"
)
summarised_by_CODE1
##   GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1     1          1          2  270   POS
## 2     2          4          4   75   POS

Now repeat for CODE2.
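
One way to avoid repeating the steps by hand is to wrap them in a function and apply it to every CODE column. The following is a rough sketch in base R and is not part of the original answer. `summarise_runs` is a hypothetical helper name. The sketch assumes that every CODE column has a name starting with "CODE", and that START and END increase down the rows, so that the size of a run is the END of its last row minus the START of its first row.

```r
# Example data from the question
df1 <- data.frame(
  LOCAT = 1:6,
  START = c(120, 345, 765, 1045, 1347, 1879),
  END   = c(150, 390, 802, 1120, 1436, 1935),
  CODE1 = c(1, 1, 0, 1, 0, 0),
  CODE2 = c(1, 0, 0, 0, -1, -1)
)

# Hypothetical helper: summarise the non-zero runs of one code column
summarise_runs <- function(df, code_col) {
  r      <- rle(df[[code_col]])
  stops  <- cumsum(r$lengths)          # last row of each run
  starts <- c(1, head(stops, -1) + 1)  # first row of each run
  keep   <- r$values != 0              # ignore runs of zeros
  data.frame(
    CODE       = rep(code_col, sum(keep)),
    MinOfLOCAT = df$LOCAT[starts[keep]],
    MaxOfLOCAT = df$LOCAT[stops[keep]],
    SIZE       = df$END[stops[keep]] - df$START[starts[keep]],
    VALUE      = ifelse(r$values[keep] == 1, "POS", "NEG"),
    stringsAsFactors = FALSE
  )
}

# Assumed naming convention: all code columns start with "CODE"
code_cols <- grep("^CODE", names(df1), value = TRUE)
result <- do.call(rbind, lapply(code_cols, summarise_runs, df = df1))
result
```

On the example data this should give four rows: two POS runs for CODE1 (sizes 270 and 75), then one POS run (size 30) and one NEG run (size 588) for CODE2.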

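【Discussion】:

Thanks for your help. I like the way you simplified everything and how it all makes sense. Once I had worked through it, I realised it is not that hard. I accepted rbatt's answer because he managed to apply what I wanted to all the columns at once, but I will probably use your suggestions in a formula to apply to all the columns. I did not know the function rle, and I will definitely use it more often. Thanks again. Cheers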