R：每组时间序列的平均年数答案

【问题标题】：R: Average years in time series per groupR：每组时间序列的平均年数
【发布时间】：2021-05-16 13:57:37
【问题描述】：

尊敬的社区，

我正在与 R 合作，并在 20 年内寻找双边出口的时间序列数据趋势。由于数据在不同年份之间波动很大（而且不是 100% 可靠），我更愿意使用四年平均数据（而不是单独查看每一年）来分析主要出口随着时间的推移，合作伙伴发生了变化。我有以下 数据集，称为 GrossExp3，涵盖了 15 个报告国在（1998 年至 2019 年）之间的所有年份对所有可用伙伴国家的双边出口（以 1000 美元为单位） . 它涵盖以下四个变量： Year, ReporterName (= exporter) , PartnerName (= export destination), 'TradeValue in 1000 USD' (= 出口到目的地的价值) PartnerName 列还包括一个名为“All”的条目，它是记者每年所有出口的总和。

这是我的数据摘要

> summary(GrossExp3)
      Year      ReporterName       PartnerName        TradeValue in 1000 USD
 Min.   :1998   Length:35961       Length:35961       Min.   :       0      
 1st Qu.:2004   Class :character   Class :character   1st Qu.:      39      
 Median :2009   Mode  :character   Mode  :character   Median :     597      
 Mean   :2009                                         Mean   :  134370      
 3rd Qu.:2014                                         3rd Qu.:   10090      
 Max.   :2018                                         Max.   :47471515

我的目标是返回一个表格，该表格显示每个出口商对出口目的地的贸易总额占该时期出口总额的百分比。我希望获得以下时期的平均数据，而不是每一年：2000-2003、2004-2007、2008-2011、2012-2015、2016-2019。

我尝试了什么 我当前的代码（在这个神奇社区的支持下创建如下：（目前，它分别显示每年的数据，但我需要标题中的平均数据）

# install packages
library(data.table)
library(dplyr)
library(tidyr)
library(stringr)
library(plyr)
library(visdat)

# set working directory
setwd("C:/R/R_09.2020/Other Indicators/Bilateral Trade Shift of Partners")

# load data

# create a file path SITC 3 
path1 <- file.path("SITC Rev 3_Data from 1998.csv")

# load cvs data table, call "SITC3" 
SITC3 <- fread(path1, drop = c(1,9,11,13))

# prepare data (SITC3) for analysis
# Filter for GROSS EXPORTS SITC3 (Gross exports = Exports that include intermediate products)
GrossExp3 <- SITC3 %>%
  filter(TradeFlowName == "Gross Exp.", PartnerISO3 != "All", Year != 2019) %>%  # filter for gross exports, remove "All", remove 2019
  select(Year, ReporterName, PartnerName, `TradeValue in 1000 USD`) %>%
  arrange(ReporterName, desc(Year))
# compare with old subset
summary(GrossExp3)
summary(SITC3)

# calculate percentage of total
GrossExp3Main <- GrossExp3 %>%
  group_by(Year, ReporterName) %>%
  add_tally(wt = `TradeValue in 1000 USD`, name = "TotalValue") %>%
  mutate(Percentage = 100 * (`TradeValue in 1000 USD` / TotalValue)) %>%
  arrange(ReporterName, desc(Year), desc(Percentage))
head(GrossExp3Main, n = 20)

# print tables in separate sheets to get an overview about hierarchy of export partners and development over time
SpreadExpMain <- GrossExp3Main %>%
  select(Year, ReporterName, PartnerName, Percentage) %>%
  spread(key = Year, value = Percentage) %>%
  arrange(ReporterName, desc(`2018`))
View(SpreadExpMain) # shows whole table

这是我的数据头

> head(GrossExp3Main, n = 20)
# A tibble: 20 x 6
# Groups:   Year, ReporterName [7]
    Year ReporterName PartnerName   `TradeValue in 100~ TotalValue Percentage
   <int> <chr>        <chr>                       <dbl>      <dbl>      <dbl>
 1  2018 Angola       China                   24517058.  42096736.      58.2 
 2  2018 Angola       India                    3768940.  42096736.       8.95
 3  2017 Angola       China                   19487067.  34904881.      55.8 
 4  2017 Angola       India                    2890061.  34904881.       8.28
 5  2016 Angola       China                   13923092.  28057500.      49.6 
 6  2016 Angola       India                    1948845.  28057500.       6.95
 7  2016 Angola       United States            1525650.  28057500.       5.44
 8  2015 Angola       China                   14320566.  33924937.      42.2 
 9  2015 Angola       India                    2676340.  33924937.       7.89
10  2015 Angola       Spain                    2245976.  33924937.       6.62
11  2014 Angola       China                   27527111.  58672369.      46.9 
12  2014 Angola       India                    4507416.  58672369.       7.68
13  2014 Angola       Spain                    3726455.  58672369.       6.35
14  2013 Angola       China                   31947235.  67712527.      47.2 
15  2013 Angola       India                    6764233.  67712527.       9.99
16  2013 Angola       United States            5018391.  67712527.       7.41
17  2013 Angola       Other Asia, ~            4007020.  67712527.       5.92
18  2012 Angola       China                   33710030.  70863076.      47.6 
19  2012 Angola       India                    6932061.  70863076.       9.78
20  2012 Angola       United States            6594526.  70863076.       9.31

我不确定我到此为止的结果是否正确？另外，我还有以下问题：

您对如何使用 R 打印漂亮的表格有什么建议吗？
如何更好地将百分比数据四舍五入到逗号后面的一个数字？

由于我在一周内一直被这些问题困扰，我将非常感谢有关如何解决问题的任何建议！

祝你周末愉快，一切顺利，

我喜欢

** 编辑** 这是一些示例数据

dput(head(GrossExp3Main, n = 20))
structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L), ReporterName = c("Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola"), PartnerName = c("China", 
"India", "United States", "Spain", "South Africa", "Portugal", 
"United Arab Emirates", "France", "Thailand", "Canada", "Indonesia", 
"Singapore", "Italy", "Israel", "United Kingdom", "Unspecified", 
"Namibia", "Uruguay", "Congo, Rep.", "Japan"), `TradeValue in 1000 USD` = c(24517058.342, 
3768940.47, 1470132.736, 1250554.873, 1161852.097, 1074137.369, 
884725.078, 734551.345, 649626.328, 647164.297, 575477.283, 513982.584, 
468914.918, 452453.482, 425616.975, 423008.886, 327921.516, 320586.229, 
299119.102, 264671.779), TotalValue = c(42096736.31, 42096736.31, 
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31, 
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31, 
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31, 
42096736.31, 42096736.31, 42096736.31), Percentage = c(58.2398078593471, 
8.9530467213552, 3.49227247731025, 2.97066942147468, 2.75995765667944, 
2.55159298119945, 2.10164767046284, 1.74491281127062, 1.54317504144777, 
1.53732653342598, 1.3670353890672, 1.22095589599877, 1.11389850877492, 
1.07479467925527, 1.01104506502775, 1.00484959899258, 0.778971352043039, 
0.761546516668669, 0.710551762961598, 0.62872279943737)), row.names = c(NA, 
-20L), groups = structure(list(Year = 2018L, ReporterName = "Angola", 
    .rows = structure(list(1:20), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = 1L, class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))
>

【问题讨论】：

请使用 dput(head(GrossExp3Main, n = 20)) 发布您的数据的 sn-p。然后，其他用户可以轻松地将其复制粘贴以读入 R。
打印表格请查看haozhu233.github.io/kableExtra/awesome_table_in_pdf.pdf。关于格式化的问题，请查看库scales(scales.r-lib.org)。
您能以dput 格式发布示例数据吗？请使用dput(GrossExp3) 的输出编辑问题。或者，如果 dput(head(GrossExp3, 20)) 的输出太大。
非常感谢您的快速cmets！我刚刚用 dput 添加了一些示例数据
也感谢您对漂亮的 Latex 表格文档的参考。看起来真的很厉害！！

标签： r time-series average

【解决方案1】：

要执行您想要的操作，需要一个额外的变量来将年份组合在一起。我使用cut 来做到这一点。

library(dplyr)
# Define the cut breaks and labels for each group
# The cut define by the starting of each group and when using cut function
# I would use param right = FALSE to have the desire cut that I want here.
year_group_break <- c(2000, 2004, 2008, 2012, 2016, 2020)
year_group_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")

data %>%
  # create the year group variable
  mutate(year_group = cut(Year, breaks = year_group_break,
    labels  = year_group_labels,
    include.lowest = TRUE, right = FALSE)) %>%
  # calculte the total value for each Reporter + Partner in each year group
  group_by(year_group, ReporterName, PartnerName) %>%
  summarize(`TradeValue in 1000 USD` = sum(`TradeValue in 1000 USD`),
    .groups = "drop") %>%
  # calculate the percentage value for Partner of each Reporter/Year group
  group_by(year_group, ReporterName) %>%
  mutate(Percentage = `TradeValue in 1000 USD` / sum(`TradeValue in 1000 USD`)) %>%
  
  ungroup()

样本输出

   year_group ReporterName PartnerName          `TradeValue in 1000 USD` Percentage
   <fct>      <chr>        <chr>                                   <dbl>      <dbl>
 1 2016-2019  Angola       Canada                                647164.    0.0161 
 2 2016-2019  Angola       China                               24517058.    0.609  
 3 2016-2019  Angola       Congo, Rep.                           299119.    0.00744
 4 2016-2019  Angola       France                                734551.    0.0183 
 5 2016-2019  Angola       India                                3768940.    0.0937 
 6 2016-2019  Angola       Indonesia                             575477.    0.0143 
 7 2016-2019  Angola       Israel                                452453.    0.0112 
 8 2016-2019  Angola       Italy                                 468915.    0.0117 
 9 2016-2019  Angola       Japan                                 264672.    0.00658
10 2016-2019  Angola       Namibia                               327922.    0.00815
11 2016-2019  Angola       Portugal                             1074137.    0.0267 
12 2016-2019  Angola       Singapore                             513983.    0.0128 
13 2016-2019  Angola       South Africa                         1161852.    0.0289 
14 2016-2019  Angola       Spain                                1250555.    0.0311 
15 2016-2019  Angola       Thailand                              649626.    0.0161 
16 2016-2019  Angola       United Arab Emirates                  884725.    0.0220 
17 2016-2019  Angola       United Kingdom                        425617.    0.0106 
18 2016-2019  Angola       United States                        1470133.    0.0365 
19 2016-2019  Angola       Unspecified                           423009.    0.0105 
20 2016-2019  Angola       Uruguay                               320586.    0.00797

【讨论】：

非常感谢！！您的方法看起来非常合乎逻辑且很有帮助！但是，当我应用代码时，我收到一条错误消息。知道为什么会这样吗？ > rlang::last_error() 必须按.data 中的变量分组。 * 未找到列 year_group。 * 未找到列ReporterName。
只需确保将我的数据替换为您的实际变量，参考您环境中的数据框 - 可以是您问题中的GrossExp3Main。