Summing rows with particular value in grouped data in R: answers

【Question title】: Summing rows with particular value in grouped data in R
【Posted】: 2020-04-08 00:39:54
【Question description】:

I have a dataset called "Area":

House_No. Info_On_Area
1a        Names of neighbouringhouse in 100m  1b   1c    1d    1e 
1a        Area of neighbouringhouse  in 100m  500  1000  1500  300
1a        Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h
1a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000
2a        Names of neighbouringhouse in 100m  2b   2c    2d    2e 
2a        Area of neighbouringhouse  in 100m  500  1000  1500  300
2a        Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h
2a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000

I want to create a data frame in which the table looks like this:

House_No. Area of neighbouringhouse in 100m Area of neighbouringhouse  in 300m

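I used dplyr to group by house number (CT %>% group_by(House_No.)) and tried to use rowSums. However, I get an error saying that the information is not numeric. I think this is because the numbers inside the row values need to be converted to numeric, but I don't know how to do that. I'm stuck at this stage and can't move forward.

I did look into similar questions, but they do not seem to deal with a data frame where the values to be summed sit inside the row values, e.g. Sum rows in data.frame or matrix, Sum by Rows in R.

Any help would be much appreciated! Thank you :)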

【Comments】:

Please provide a sample of your data via dput(head(df, 10)).

Tags: r dplyr rows grouped-table

【Solution 1】:

Use stringr::str_extract_* to retrieve the numbers, then use pivot_wider to spread the result into one column per distance.

library(tidyverse)
df %>%  
   #extract everything up to 1+ digits followed by m
   mutate(flag = str_extract(Info_On_Area,'.*\\d+m'), 
          #extract any 1 or more digits followed by space or at the end
          SumArea = map_dbl(Info_On_Area, ~sum(as.numeric(str_extract_all(.x, '\\d+(?=\\s|$)', simplify = TRUE))))) %>% 
   filter(str_detect(Info_On_Area, 'Area')) %>% 
   #As suggested by @Uwe
   pivot_wider(id_cols = House_No., names_from = flag, values_from = SumArea)

# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <dbl>                                <dbl>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300
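
To see what the two regular expressions pick up, it can help to run them on single strings. The following is a minimal base R sketch rather than the answer's own code: it swaps in regmatches() and sub() for the stringr calls, and the names area_row, name_row, and sum_trailing_numbers are hypothetical, chosen only for illustration.

```r
# Base R sketch; `area_row`, `name_row`, and `sum_trailing_numbers` are illustrative names
area_row <- "Area of neighbouringhouse  in 100m  500  1000  1500  300"
name_row <- "Names of neighbouringhouse in 100m  1b   1c    1d    1e"

# Sum the runs of digits that are followed by whitespace or the end of the string
sum_trailing_numbers <- function(x) {
  hits <- regmatches(x, gregexpr("\\d+(?=\\s|$)", x, perl = TRUE))[[1]]
  sum(as.numeric(hits))
}

# Label: everything up to and including the digits followed by "m"
flag <- sub("(.*\\d+m).*", "\\1", area_row, perl = TRUE)

flag                            # "Area of neighbouringhouse  in 100m"
sum_trailing_numbers(area_row)  # 3300
sum_trailing_numbers(name_row)  # 0, because nothing matches and the sum is empty
```

The lookahead keeps the "100" in "100m" out of the sum, and a string without matching numbers sums to 0 rather than raising an error.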

Data

df <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))

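【Comments】:

Thank you very much, and sorry for my late reply. I can get the data frame with the areas as separate columns; however, the answer I get is "0". What am I doing wrong here?
You're welcome. Hmm, it is hard to tell without the data, but does it work on df, and are your areas also of the form Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000? Finally, do you have the latest version of tidyverse?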

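【Solution 2】:

The difficulty here is that the information is presented in a mixture of wide and long format. Info_On_Area is a character column that contains the variable name as well as an arbitrary number of values separated by blanks. So Info_On_Area needs to be split in two steps: first to extract the variable name, and second to extract the numbers so that they can be converted to numeric and summed.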

Fortunately, the OP is only interested in the area information, which simplifies things.
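
The two-step split can be tried on a single string first. This is a minimal base R sketch under the assumption that every label ends in "00m", as in the sample data; the names x, parts, var, val, vals, and total are hypothetical and not part of the solutions below.

```r
# Base R sketch; `x`, `parts`, `var`, `val`, `vals`, and `total` are illustrative names
x <- "Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"

# Step 1: split right after "00m" using a zero-width lookbehind (needs perl = TRUE)
parts <- strsplit(x, "(?<=00m)", perl = TRUE)[[1]]
var <- parts[1]   # the variable name
val <- parts[2]   # the remaining text holding the numbers

# Step 2: pull out every run of digits, convert to integer, and sum
vals <- as.integer(regmatches(val, gregexpr("\\d+", val))[[1]])
total <- sum(vals)

var    # "Area of neighbouringhouse  in 300m"
total  # 6300
```

Splitting first means the second step can use the plain pattern "\\d+", because the "300" of the label is no longer part of the text being searched.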

1. tidyverse approach

library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
area %>% 
  filter(Info_On_Area %>% str_detect("^Area")) %>% 
  separate(Info_On_Area, c("var", "val"), sep = "(?<=00m)") %>% 
  mutate(Area = map_int(val, ~ str_extract_all(. , "\\d+") %>% unlist() %>% as.integer() %>% sum())) %>%
  pivot_wider(id_cols = House_No., names_from = var, values_from = Area)

# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <int>                                <int>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300

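There is one row for each House_No.. ~~This differs from A. Suliman's solution, which shows two rows for each House_No.~~ (this is no longer the case in the edited version of A. Suliman's answer). Other differences include the use of the separate() and pivot_wider() functions, a regular expression with the lookbehind "(?<=00m)", and applying filter() as the first step in the pipe.

2. data.table approach

For the sake of completeness, here is a data.table solution as well: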

library(data.table)
library(magrittr)
library(stringr)  # provides str_extract_all(), which is used below
setDT(area)[Info_On_Area %like% "^Area", 
            c(.(House_No.= House_No.), tstrsplit(Info_On_Area, "(?<=00m)", perl = TRUE))][
              , str_extract_all(V3, "\\d+") %>% unlist() %>% as.integer() %>% sum(), by = .(House_No., V2)][
                , dcast(.SD, House_No. ~ V2, value.var = "V1")]

   House_No. Area of neighbouringhouse  in 100m Area of neighbouringhouse  in 300m
1:        1a                               3300                               6300
2:        2a                               3300                               6300

Data

area <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))

【Comments】: