【Question title】: R fill a matrix by row using matrix row and column names
【Posted】: 2019-03-15 15:50:23
【Question description】:

I have a dataset that looks like this:

set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 2)
year <- rep(c(1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000), 2)
value <- sample(1:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, value))
rm(origin, year, value)

Then I have two lists.

The first is a list of countries by region, built with the ISOcodes library, as follows:

library("ISOcodes")
list.continent <- list(asia = c("Central Asia", "Eastern Asia", "South-eastern Asia", "Southern Asia", "Western Asia"),
             africa = c("Northern Africa", "Sub-Saharan Africa", "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa"),
             europe = c("Eastern Europe", "Northern Europe", "Channel Islands", "Southern Europe", "Western Europe"),
             oceania = c("Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
             northamerica = c("Northern America"),
             latinamerica = c("South America", "Central America", "Caribbean"))

country.list.continent <- sapply(list.continent, function(item) {    
    region <- subset(UN_M.49_Regions, Name %in% item)
    sub <- subset(UN_M.49_Countries, Code %in% unlist(strsplit(region$Children, ", ")))
    return(sub$ISO_Alpha_3)
}, simplify = FALSE)
rm(list.continent)

And a list of years:

year.list <- levels(as.factor(unique(test.df$year)))

I want to fill a matrix with computed numbers, each corresponding to one region in one specific year. The matrix is as follows:

ncol <- length(year.list)
nrow <- length(country.list.continent)

matrix.extraction <- matrix(, nrow = nrow, ncol = ncol)
rownames(matrix.extraction) <- names(country.list.continent)
colnames(matrix.extraction) <- year.list

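To do my computation, I use a loop so that I can subset the dataset, which is too big otherwise. The loop runs over the years (equivalent to colnames(matrix.extraction)). The idea is to compute, for each year, what each country represents in value (as a percentage). The computing part is simple enough and works well. My problem arises when I need to assign the values to each row.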

for(i in 1:length(colnames(matrix.extraction))){
    ### I subset and compute what I want
    table.temp <- test.df %>%
                subset(year == colnames(matrix.extraction)[i]) %>%
                group_by(origin) %>%
                summarise(value = sum(value, na.rm = TRUE))
    table.temp$percent <-  prop.table(table.temp$value)
    ### then I need to attribute the wanted values
    matrix.extraction["ROWNAME",i]  <- table.temp %>% 
                                subset(origin %in% country.list.continent$"ROWNAME") %>% 
                                summarise(. ,sum = sum(percent)))
}

I really don't know how to do this.

The expected result is a matrix like this:

             1998 2000
asia         here   NA
africa         NA   NA
europe         NA   NA
oceania        NA   NA
northamerica   NA   NA
latinamerica   NA   NA

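where "here" in [1,1] is replaced by the sum of the values, for the year given by the column name, of every country in the region given by the row name.

Any help would be much appreciated.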

【Question comments】:

  • @RonakShah The question has been edited

Tags: r for-loop matrix-multiplication


【Solution 1】:

Using a double sapply, we can loop over every combination of year.list and country.list.continent and compute the sum of value for each combination:

sapply(year.list, function(x) sapply(names(country.list.continent), function(y) {
     with(test.df, sum(value[origin %in% country.list.continent[[y]] & year == x]))
 }))

#              1998  2000
#asia         21759 20059
#africa           0     0
#europe       39700 35981
#oceania          0     0
#northamerica 21347 17324
#latinamerica 10847  8672
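
The question asks for each region's share of the yearly total, written into a preallocated matrix, rather than raw sums. A minimal base R sketch of that, addressing cells by row and column name, might look like this. The objects toy.df and toy.regions are made-up stand-ins for test.df and country.list.continent, and value is assumed to be numeric already.

```r
# Toy data standing in for test.df (values are made up for illustration)
toy.df <- data.frame(
  origin = c("DEU", "JPN", "USA", "DEU", "JPN", "USA"),
  year   = c(1998, 1998, 1998, 2000, 2000, 2000),
  value  = c(30, 50, 20, 10, 10, 80),
  stringsAsFactors = FALSE
)

# Hypothetical region lookup standing in for country.list.continent
toy.regions <- list(asia = "JPN", europe = "DEU",
                    northamerica = "USA", oceania = character(0))

years <- as.character(sort(unique(toy.df$year)))

# Preallocate with dimnames so that cells can be addressed by name
shares <- matrix(NA_real_, nrow = length(toy.regions), ncol = length(years),
                 dimnames = list(names(toy.regions), years))

for (yr in years) {
  in.year    <- toy.df[toy.df$year == as.numeric(yr), ]
  year.total <- sum(in.year$value)
  for (region in names(toy.regions)) {
    in.region <- in.year$origin %in% toy.regions[[region]]
    # Share of this region in the yearly total; 0 if no country matches
    shares[region, yr] <- sum(in.year$value[in.region]) / year.total
  }
}

print(shares)
```

Because the row and column names are used as indices, the fill order does not depend on how the regions or years happen to be sorted.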

If we are interested in a tidyverse solution:

library(tidyverse)

crossing(x = year.list, y = names(country.list.continent)) %>%
     mutate(sum = map2_dbl(x, y, ~ 
               test.df %>% 
                 filter(year == .x & origin %in% country.list.continent[[.y]]) %>%
                 summarise(total = sum(value)) %>%
                 pull(total)))

#    x     y              sum
#   <chr> <chr>        <dbl>
# 1 1998  africa           0
# 2 1998  asia         21759
# 3 1998  europe       39700
# 4 1998  latinamerica 10847
# 5 1998  northamerica 21347
# 6 1998  oceania          0
# 7 2000  africa           0
# 8 2000  asia         20059
# 9 2000  europe       35981
#10 2000  latinamerica  8672
#11 2000  northamerica 17324
#12 2000  oceania          0
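
The result above is in long format, while the question asks for a region-by-year matrix. One way to reshape it in base R is tapply, sketched here on a made-up data frame long that only mimics the column names x, y and sum from the output above.

```r
# Made-up long-format result with the same column names as above
long <- data.frame(
  x   = c("1998", "1998", "2000", "2000"),
  y   = c("asia", "europe", "asia", "europe"),
  sum = c(5, 7, 11, 13)
)

# Rows come from y (region), columns from x (year); each cell holds one value
wide <- tapply(long$sum, list(long$y, long$x), sum)
print(wide)
```

The returned object is a plain matrix with dimnames, so it can be indexed as wide["asia", "2000"].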

You have stored the numbers as factors in test.df, so we need to convert them to actual numbers. Run the following before applying the methods above.

test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x)))
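
The detour through as.character matters because as.numeric applied directly to a factor returns the internal level codes, not the labels. A small illustration with a made-up factor f is given below. Note also that since R 4.0.0 strings are no longer converted to factors by default, so on a recent R these columns may be character rather than factor; the same conversion works in either case.

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("1998", "2000", "1998"))

codes  <- as.numeric(f)                # level codes: 1 2 1
labels <- as.numeric(as.character(f))  # labels as numbers: 1998 2000 1998
viaLev <- as.numeric(levels(f))[f]     # same result, converting each level once

print(codes)
print(labels)
print(viaLev)
```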

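【Comments】:

  • Neither of these solutions works for me. R tells me "'sum' not meaningful for factors".
  • @TeYaP That is because your numbers are stored as factors. Convert them to numbers with test.df[-1] <- lapply(test.df[-1], function(x) as.numeric(as.character(x))) and then try again.
【Solution 2】: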

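We can do this in the tidyverse. Convert the named list to a two-column dataset (enframe or stack), then do a full_join with 'test.df' after filtering it to keep only the 'year' values that are in 'year.list'. Group by 'name' and 'year', get the sum of 'value', and spread it to 'wide' format: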

library(tidyverse)
enframe(country.list.continent, value = "origin") %>%
   unnest %>%
   full_join(test.df %>% 
   filter(year %in% year.list)) %>%
   group_by(name, year) %>% 
   summarise(value = sum(value, na.rm = TRUE)) %>% 
   spread(year, value, fill = 0) %>%
   select(-4)
# A tibble: 6 x 3
# Groups:   name [6]
#  name         `1998` `2000`
#  <chr>         <dbl>  <dbl>
#1 africa            0      0
#2 asia          33038  18485
#3 europe        36658  35874
#4 latinamerica  14323  14808
#5 northamerica  15697  27405
#6 oceania           0      0

Or, in base R, this can be done by stacking the list into a two-column data.frame, merging it with 'test.df' after subsetting, and using xtabs to create the table:

xtabs(value ~ ind + year, merge(stack(country.list.continent), 
  subset(test.df, year %in% year.list), by.x = "values", by.y = "origin"))
#            year
#ind             1998  2000
#  asia         33038 18485
#  africa           0     0
#  europe       36658 35874
#  oceania          0     0
#  northamerica 15697 27405
#  latinamerica 14323 14808
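
To see what each step of this base R pipeline produces without needing ISOcodes, here is a small sketch on made-up data. The objects toy.regions and toy.df are hypothetical stand-ins for country.list.continent and test.df.

```r
# Hypothetical stand-ins for country.list.continent and test.df
toy.regions <- list(asia = c("JPN", "KOR"), europe = "DEU")
toy.df <- data.frame(
  origin = c("DEU", "JPN", "KOR", "DEU", "JPN"),
  year   = c(1998, 1998, 1998, 2000, 2000),
  value  = c(10, 20, 30, 40, 50)
)

# stack() turns the named list into two columns:
# values (the country code) and ind (the region, as a factor)
lookup <- stack(toy.regions)

# Attach the region to every row of the data via the country code
merged <- merge(lookup, toy.df, by.x = "values", by.y = "origin")

# Sum value over every region/year combination
result <- xtabs(value ~ ind + year, data = merged)
print(result)
```

The result is a table with regions as rows and years as columns, and it can be indexed by name, for example result["asia", "1998"].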

Data

test.df <- data.frame(origin, year, value)

【Comments】:
