Character-location identity to create a new variable: answers

【Question title】: Character-location identity to create a new variable
【Posted】: 2015-09-29 13:29:09
【Question description】:

Let's start with some sample data:

A <- c(1:5)
score_one <- c(123.5, 223.1, 242.2, 351.8, 123.1)
score_two <- c(324.2, 568.2, 124.9, 323.1, 213.4)
score_three <- c(553.1, 412.3, 435.7, 523.1, 365.4)
score_four <- c(123.2, 225.1, 243.6, 741.1, 951.2)


df1 <- data.frame(A, score_one, score_two, score_three, score_four)

library(dplyr)
library(tidyr)

df2 <- df1 %>% 
  group_by(A) %>% 
  mutate_each(funs(substr(.,1,1))) %>%                
  ungroup %>%
  gather(variable, type, -c(A)) %>%                     
  select(-variable) %>%
  mutate(type = paste0("type_",type),
         value = 1) %>%
  group_by(A,type) %>%                                     
  summarise(value = sum(value)) %>% 
  ungroup %>%
  spread(type, value, fill=0) %>%                       
  inner_join(df1, by=c("A")) %>%                            
  select(A, starts_with("score_"), starts_with("type_"))

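This adds one summary variable (type_n) for each unique first digit found in the score_ columns and counts, per row, how often that first digit occurs.

So in the first row we see type_1 == 2, because two of the corresponding score_ values have 1 as their first digit.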
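As a quick check of that count, here is a minimal base R sketch for the first row only. The vector `row1` is just the first row of `df1` typed out by hand; it is not part of the original post.

```r
# First row of df1 from the question, typed out by hand
row1 <- c(score_one = 123.5, score_two = 324.2,
          score_three = 553.1, score_four = 123.2)

# Leading character of each score
first_digit <- substr(as.character(row1), 1, 1)

# Frequency of each leading digit in this row
digit_counts <- table(first_digit)
digit_counts["1"]  # the count behind type_1 for row 1
```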

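Problem statement
Now we want to introduce a new variable for each type_n column.

It checks whether the type_n value is > 0.
If it is, we look at the corresponding score_ column(s), that is, the scores whose first digit is n.
For those scores we check whether the digit after the decimal point is >= 2.
If one or all of the corresponding scores in the row have a decimal digit >= 2, we assign the value 1.
If all of the corresponding scores in the row have a decimal digit < 2, we assign the value 0.
Likewise, if type_n == 0, we assign 0.
Let's say we name this variable $type_n_G2.

The desired output was shown as an image in the original post.

Take type_1_G2 as an example:

We have type_1 == 2.
The matching values are in score_one and score_four.
Both values after the decimal point are >= 2, so we assign type_1_G2 == 1.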
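The rule can be sketched for a single row as follows. `type_g2` is a hypothetical helper written for illustration (it does not appear in the question or the answers), and it assumes every score has exactly one decimal place, as in the sample data.

```r
# Hypothetical helper: 1 if any score starting with `digit` has a first
# decimal >= 2, otherwise 0. Assumes exactly one decimal place per score.
type_g2 <- function(scores, digit) {
  lead <- substr(as.character(scores), 1, 1)
  hits <- scores[lead == as.character(digit)]
  if (length(hits) == 0) return(0L)         # type_n == 0, so assign 0
  first_decimal <- round(hits * 10) %% 10   # e.g. 123.5 -> 1235 -> 5
  as.integer(any(first_decimal >= 2))
}

row1 <- c(score_one = 123.5, score_two = 324.2,
          score_three = 553.1, score_four = 123.2)

type_g2(row1, 1)  # score_one (.5) and score_four (.2) both qualify
type_g2(row1, 5)  # only 553.1 starts with 5, and its decimal is 1
type_g2(row1, 2)  # no score in this row starts with 2
```

Using `any()` reflects the "one or all" wording: a single qualifying score is enough to set the flag.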

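【Question comments】:

I don't understand what the desired output is. There is too much code and wording here for me to see what you are actually trying to achieve.
In your example I don't understand why you picked score_one and score_four. Since you are evaluating type_1, shouldn't it be score_one only?
We want to check score_one and score_four because both of them start with 1.
The values in the dataset you provided don't match your image. score_four in row 1 is 123.1 rather than 123.2, score_one in the second row is 223.7 rather than 223.1, and so on.
Downvoted. Please make a correct input example and an expected output that covers all the edge cases, so that answers can be validated against it.

Tags: r dplyr tidyr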

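【Solution 1】:

In my opinion the complicated construction of df2 is not necessary. Reshaping df1 into long format is a better starting point and gets to the desired end result in fewer steps.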
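To illustrate why the long format helps, here is a minimal base R sketch that needs no packages. It is not the answer's code; the names `long` and `counts` are chosen for this example only.

```r
# Sample data copied from the question
df1 <- data.frame(
  A           = 1:5,
  score_one   = c(123.5, 223.1, 242.2, 351.8, 123.1),
  score_two   = c(324.2, 568.2, 124.9, 323.1, 213.4),
  score_three = c(553.1, 412.3, 435.7, 523.1, 365.4),
  score_four  = c(123.2, 225.1, 243.6, 741.1, 951.2)
)

# Stack the four score columns: one row per (A, variable) pair
long <- data.frame(
  A        = rep(df1$A, times = ncol(df1) - 1),
  variable = rep(names(df1)[-1], each = nrow(df1)),
  value    = unlist(df1[-1], use.names = FALSE)
)

# The type label now lives next to the value it was derived from
long$type <- paste0("type_", substr(as.character(long$value), 1, 1))

# Counting per A and type is a single cross-tabulation
counts <- table(long$A, long$type)
counts["1", "type_1"]
```

Once each value carries its own type label, both the type_n counts and the type_n_G2 flags become grouped summaries over the same table.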

An approach using the data.table package:

library(data.table)
# melting the original dataframe 'df1' to a long format datatable
dt <- melt(setDT(df1), "A")

# creating two type variables & a logical vector indicating whether
# the decimal for a specific type is equal or above .2
dt[, `:=` (type1=paste0("type_",substr(value,1,1)),
           type2=paste0("type_",substr(value,1,1),"_g2"))
   ][, g2 := +(+(value - floor(value) >= 0.2)==1), .(A,type1)]

# creating separate wide datatables for the variable & two type columns
dt1 <- dcast(dt, A ~ variable)
dt2 <- dcast(dt, A ~ type1)
dt3 <- dcast(dt, A ~ type2, fun=sum, value.var="g2")[, lapply(.SD, function(x) +(x>=1)), A]

# two options for merging the wide datatables together into one
dtres <- dt1[dt2[dt3, on = "A"], on = "A"]
dtres <- Reduce(function(...) merge(..., all = TRUE, by = "A"), list(dt1, dt2, dt3))

# or in one go without creating intermediate datatables
dtres <- dcast(dt, A ~ variable)[dcast(dt, A ~ type1)[dcast(dt, A ~ type2, fun=sum, value.var = "g2")[, lapply(.SD, function(x) +(x>=1)) , A], on = "A"], on = "A"]

This results in:

> dtres
   A score_one score_two score_three score_four type_1 type_2 type_3 type_4 type_5 type_7 type_9 type_1_g2 type_2_g2 type_3_g2 type_4_g2 type_5_g2 type_7_g2 type_9_g2
1: 1     123.5     324.2       553.1      123.2      2      0      1      0      1      0      0         1         0         0         0         0         0         0
2: 2     223.1     568.2       412.3      225.1      0      2      0      1      1      0      0         0         0         0         1         1         0         0
3: 3     242.2     124.9       435.7      243.6      1      2      0      1      0      0      0         1         1         0         1         0         0         0
4: 4     351.8     323.1       523.1      741.1      0      0      2      0      1      1      0         0         0         1         0         0         0         0
5: 5     123.1     213.4       365.4      951.2      1      1      1      0      0      0      1         0         1         1         0         0         0         1

The same approach can be translated to a dplyr/tidyr implementation as follows:

library(dplyr)
library(tidyr)

df <- df1 %>% gather(variable, value,-A) %>%
  mutate(type1 = paste0("type_",substr(value,1,1)),
         type2 = paste0("type_",substr(value,1,1),"_g2")) %>%
  group_by(A,type1) %>%
  mutate(g2 = +(+(value - floor(value) >= 0.2)==1),
         type1n = n()) %>%
  ungroup()

d1 <- df %>% select(1:3) %>% spread(variable, value)
d2 <- df %>% group_by(A, type1) %>% tally() %>% spread(type1, n, fill=0)
d3 <- df %>% group_by(A, type2) %>% summarise(g = any(g2==1)) %>% spread(type2, g, fill=0)

dfres <- left_join(d1, d2, by = "A") %>% left_join(., d3, by = "A")

Which gives the same result:

> dfres
  A score_one score_two score_three score_four type_1 type_2 type_3 type_4 type_5 type_7 type_9 type_1_g2 type_2_g2 type_3_g2 type_4_g2 type_5_g2 type_7_g2 type_9_g2
1 1     123.5     324.2       553.1      123.2      2      0      1      0      1      0      0         1         0         0         0         0         0         0
2 2     223.1     568.2       412.3      225.1      0      2      0      1      1      0      0         0         0         0         1         1         0         0
3 3     242.2     124.9       435.7      243.6      1      2      0      1      0      0      0         1         1         0         1         0         0         0
4 4     351.8     323.1       523.1      741.1      0      0      2      0      1      1      0         0         0         1         0         0         0         0
5 5     123.1     213.4       365.4      951.2      1      1      1      0      0      0      1         0         1         1         0         0         0         1
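One detail worth checking in both versions: `value - floor(value) >= 0.2` compares binary floating-point numbers, so a score printed as x.2 can land just below 0.2 after the subtraction. In the output above, 324.2 is the only type_3 value in row 1, yet type_3_g2 is 0 there. The sketch below shows the effect and one possible alternative; it assumes IEEE-754 doubles and exactly one decimal place per score, and it is a suggestion rather than part of the original answer.

```r
x <- c(324.2, 123.1)

# Comparison as written in the answer
naive <- x - floor(x) >= 0.2
naive[1]   # FALSE on IEEE-754 doubles: 324.2 - 324 is slightly below 0.2

# Alternative: move the first decimal into the units place, round, then test
first_decimal <- round(x * 10) %% 10
robust <- first_decimal >= 2
robust[1]  # TRUE
```

If the scores can have more than one decimal place, the rounding step would need to be adapted.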

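【Comments】:

Great, efficient code, and no need for df2. The second solution, using dplyr/tidyr, works well. With the data.table method, however, I get an error at dt2 and dt3.
@lukeg It works for me. Which version of data.table are you using?

【Solution 2】:

Here is a vectorized attempt that first melts and then dcasts the data using the data.table package. It needs some polishing, but I don't have time right now.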

library(data.table) # v >= 1.9.6
# melt and order by "A" 
temp <- setorder(melt(df2, id = 1:5), A)

# Create the "type_n_G2" column names
temp$Var <- paste0(temp$variable, "_G2")

# Selecting only the "score_one", "score_two", "score_three" and "score_four"
indx1 <- indx2 <- temp[2:5]

# Finding the first integer within each number
indx2[] <- sub("(^.{1}).*", "\\1", as.matrix(indx2))

# The workhorse: simultaneously compare `indx2` against `type_n` and extract decimals
indx3 <- indx1 * (indx2 == as.numeric(sub(".*_", "", temp$variable))) - floor(indx1)

# Compare the result against 0.2, sum the rows and see if any is greater than 0
temp$res<- +(rowSums(indx3 >= 0.2) > 0)

# Convert back to wide format
dcast(temp, A ~ Var, value.var = "res")
#   A type_1_G2 type_2_G2 type_3_G2 type_4_G2 type_5_G2 type_7_G2 type_9_G2
# 1 1         1         0         0         0         0         0         0
# 2 2         0         1         0         1         1         0         0
# 3 3         1         1         0         1         0         0         0
# 4 4         0         0         1         0         0         0         0
# 5 5         0         1         1         0         0         0         1

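Now you can cbind the result to df2 (this does not exactly match your result, because the data you provided does not match either).

【Comments】:

Thanks for the comments explaining the method in detail.

【Solution 3】:

Here is an attempt that converts your data to long format so that the type variable is kept for each value. That makes it easier, in a second step, to work out which groups contain a first decimal >= 2.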

library(dplyr)  # group_by() and mutate() below come from dplyr
library(tidyr)

#transform df1 to the long format
df <- df1 %>% gather(key, value, -A)

#calculate the type for each line
#this can be done by extracting the first digit and pasting
#"type_" in front of it
df$type <- as.factor(paste("type",sapply(strsplit(as.character(df$value),""),function(x) x[[1]]),sep="_"))

 #expand the levels to add missing types
levels(df$type) <- c(levels(df$type),setdiff(paste("type",1:9,sep="_"),levels(df$type)))

#create a new column that holds the first decimal
#I assumed there was only one decimal for each number 
#but you can adapt this
df$first_decimal <- as.numeric(sapply(strsplit(as.character(df$value),"[.]"),function(x) x[[2]]))

#group by A and type; if any first_decimal is 2 or greater,
#G2 will be set to one for that group
df <- df %>% group_by(A,type) %>% mutate(G2=any(first_decimal>=2)*1)

#create a type_G2 column to hold the final column labels
df$type_G2 <- paste0(df$type,"_G2")

#this cbind creates the final result
cbind(df1,as.data.frame.matrix(table(df[,c("A","type")])),spread(unique(df[,c("A","type_G2","G2")]),key=type_G2,value=G2,drop=FALSE,fill=0)[,-1])

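Breaking down the final cbind:

df1 is the original data frame.

as.data.frame.matrix(table(df[,c("A","type")])) is a data frame holding the count for each type.

spread(unique(df[,c("A","type_G2","G2")]),key=type_G2,value=G2,drop=FALSE,fill=0)[,-1] holds the type_G2 information. I take unique() of that subset of df because it contains redundant rows (for example, type_1_G2 is the same for the values 123.5 and 123.1 in the first row).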
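The counting step of that cbind can be tried in isolation. The sketch below uses a small hand-typed table (`long`, holding the leading digits of rows 1 and 2 of the question's data) and declares all nine factor levels up front, which is a simplification of how the answer extends the levels.

```r
# Leading digits of the scores in rows 1 and 2 of the question's data
long <- data.frame(
  A    = c(1, 1, 1, 1, 2, 2, 2, 2),
  type = factor(
    paste0("type_", c(1, 3, 5, 1, 2, 5, 4, 2)),
    levels = paste0("type_", 1:9)  # unused levels become all-zero columns
  )
)

# Cross-tabulate A against type, then turn the table into a wide data frame
wide <- as.data.frame.matrix(table(long[, c("A", "type")]))
wide["1", "type_1"]
```

Because `type` is a factor with every level declared, digits that never occur still get a column, so the result has the same shape regardless of the data.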

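【Comments】:

【Solution 4】:

Disclaimer: after reading the question again, my answer is wrong (or at least the result is overly complicated); it only applies in case you wish to compare the decimal value with the number of occurrences of each first digit.

If you do want to compare the score decimals against each type_N value in the row, here is one way to do it. Hopefully someone smarter here can improve on it: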

decimalscores <- (df2[grepl("score_*",colnames(df2))] - floor(df2[grepl("score_*",colnames(df2))]))*10 # Get the decimal, as per the sample only one digit 
typesindex <- as.numeric(sub("type_","",colnames(df2[grepl("type_*",colnames(df2))]))) # get  the type_"n" columns names to reuse later
res <- t(sapply(1:nrow(df2),function(x) { # loop over the dataframe rows
    sapply(typesindex,function(y) { # For each type index  
        colname <- paste0("type_",y)
        cmptype <- unlist(unname(df2[x,colname]))
        # create the result if type_n is above 0 
        ifelse(cmptype > 0,
               any(unlist(unname(decimalscores[x,])) >= cmptype)+0L, # If one score is above the value return 1
               0) # Else return 0
     })
  }))
colnames(res) <- paste0("type_",typesindex,"_G2") # Name the resulting columns by adding _G2 to output
res <- as.data.frame(res) # turn matrix into dataframe
df3 <- cbind(df2,res) # bind them to get expected output

I hope the comments explain enough; let me know if anything is unclear.

【Comments】: