[Question title]: How to reshape a character column into two columns (Date and Text) in R?
[Posted]: 2021-03-30 09:55:03
[Question description]:

I have the following character string:

cal = "\n \n21/01/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n21/01/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n03/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n17/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n11/03/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n11/03/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n24/03/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n25/03/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n22/04/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n22/04/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n12/05/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n10/06/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in the Netherlands\n        \n \n10/06/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in the Netherlands\n        \n \n23/06/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n24/06/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n22/07/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n22/07/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n09/09/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n09/09/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n22/09/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n23/09/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n06/10/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n28/10/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n28/10/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n \n10/11/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n01/12/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n \n02/12/2021\n\n        \nGeneral Council meeting of the ECB in Frankfurt\n        \n \n16/12/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n16/12/2021\n\n        \nPress conference following the Governing Council meeting of the ECB in Frankfurt\n        \n"
 cal_flat = gsub("\n", " ", cal)  # flattened copy; cal itself is kept as-is because the answers below split on "\n"


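As you can see, the text contains both dates and event descriptions. What I would like to do is turn it into two columns: "Date" and "Event".

This would be the result (only the first rows are shown for simplicity):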

Date                    Event

21/01/2021        Governing Council of the ECB: monetary policy meeting in Frankfurt
21/01/2021        Press conference following the Governing Council meeting of the ECB...
03/02/2021        Governing Council of the ECB: non-monetary policy meeting in Frankfurt
17/02/2021        Governing Council of the ECB: non-monetary policy meeting in Frankfurt
11/03/2021        Governing Council of the ECB: monetary policy meeting in Frankfurt        
...

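I have tried many functions that reshape a corpus into sentences, as well as functions that extract dates, but I could not make it work. For example:

library(stringr)   # str_extract_all
library(dplyr)     # %>%
library(anytime)
anydate(str_extract_all(cal, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]]) %>% as.data.frame()

# it gives me back a lot of NAs, I don't know why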

[1] NA           NA           "2021-03-02" NA           "2021-11-03" "2021-11-03" NA          
 [8] NA           NA           NA           "2021-12-05" "2021-10-06" "2021-10-06" NA          
[15] NA           NA           NA           "2021-09-09" "2021-09-09" NA           NA          
[22] "2021-06-10" NA           NA           "2021-10-11" "2021-01-12" "2021-02-12" NA          
[29] NA          

Can anyone help me?

Thank you!

[Comments]:

    Tags: r dataframe dplyr reshape


    [Solution 1]:

    You can use str_match_all to extract the data that follows a specific pattern.

    library(stringr)
    
    tmp <- data.frame(str_match_all(trimws(gsub('\\s+', ' ', cal)), 
                      '(\\d+/\\d+/\\d+)\\s([A-Za-z:\\s-]+)')[[1]][, -1])
    tmp$X2 <- trimws(tmp$X2)
    tmp
    
    #           X1                                                                                     X2
    #1  21/01/2021                     Governing Council of the ECB: monetary policy meeting in Frankfurt
    #2  21/01/2021       Press conference following the Governing Council meeting of the ECB in Frankfurt
    #3  03/02/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
    #4  17/02/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
    #5  11/03/2021                     Governing Council of the ECB: monetary policy meeting in Frankfurt
    #6  11/03/2021       Press conference following the Governing Council meeting of the ECB in Frankfurt
    #7  24/03/2021                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
    #...
    #...
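
    A possible follow-up, sketched under assumptions: the columns above are still called X1 and X2, and the dates are still character strings. The NAs in the question look consistent with the dates being read month-first (03/02/2021 came back as 2021-03-02, while 21/01/2021 came back as NA), so stating the day/month/year format explicitly should avoid that. The two-row `tmp` below is a hypothetical stand-in for the real result.

```r
# Hypothetical two-row stand-in for the tmp data frame produced above
tmp <- data.frame(
  X1 = c("21/01/2021", "03/02/2021"),
  X2 = c("Governing Council of the ECB: monetary policy meeting in Frankfurt",
         "Governing Council of the ECB: non-monetary policy meeting in Frankfurt"),
  stringsAsFactors = FALSE
)

# Give the columns the names asked for in the question
names(tmp) <- c("Date", "Event")

# Parse day/month/year explicitly instead of letting a parser guess the order
tmp$Date <- as.Date(tmp$Date, format = "%d/%m/%Y")

tmp
```

    With an explicit format, a day value above 12 can no longer be mistaken for a month.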
    

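    [Comments]:

      [Solution 2]:

      Using read.table, we can split at \n. strip.white=TRUE drops the elements that contain only whitespace. The resulting pattern is now date - event - date ..., which we can conveniently convert row by row into a matrix.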

      r <- setNames(data.frame(matrix(
        read.table(text=cal, sep="\n", row.names=NULL, strip.white=TRUE)[,1], 
        ncol=2, byrow=TRUE)), c("date", "event"))
      r$date <- as.Date(r$date, "%d/%m/%Y")  ## format to date
      

      Result

      r
      #          date                                                                                  event
      # 1  2021-01-21                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 2  2021-01-21       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 3  2021-02-03                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 4  2021-02-17                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 5  2021-03-11                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 6  2021-03-11       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 7  2021-03-24                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 8  2021-03-25                                        General Council meeting of the ECB in Frankfurt
      # 9  2021-04-22                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 10 2021-04-22       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 11 2021-05-12                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 12 2021-06-10               Governing Council of the ECB: monetary policy meeting in the Netherlands
      # 13 2021-06-10 Press conference following the Governing Council meeting of the ECB in the Netherlands
      # 14 2021-06-23                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 15 2021-06-24                                        General Council meeting of the ECB in Frankfurt
      # 16 2021-07-22                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 17 2021-07-22       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 18 2021-09-09                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 19 2021-09-09       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 20 2021-09-22                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 21 2021-09-23                                        General Council meeting of the ECB in Frankfurt
      # 22 2021-10-06                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 23 2021-10-28                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 24 2021-10-28       Press conference following the Governing Council meeting of the ECB in Frankfurt
      # 25 2021-11-10                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 26 2021-12-01                 Governing Council of the ECB: non-monetary policy meeting in Frankfurt
      # 27 2021-12-02                                        General Council meeting of the ECB in Frankfurt
      # 28 2021-12-16                     Governing Council of the ECB: monetary policy meeting in Frankfurt
      # 29 2021-12-16       Press conference following the Governing Council meeting of the ECB in Frankfurt
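
      The same date - event alternation can also be reached without read.table. This is only a sketch of an alternative, not the answer's method: it splits with strsplit, trims each piece, and drops the empty ones. `cal_sample` is a shortened, hypothetical stand-in for `cal`, and `r2` is just an illustrative name.

```r
# Shortened, hypothetical sample in the same layout as cal
cal_sample <- "\n \n21/01/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n03/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n"

# Split on newlines, trim each piece, and drop the pieces that are now empty
parts <- trimws(strsplit(cal_sample, "\n", fixed = TRUE)[[1]])
parts <- parts[nzchar(parts)]

# What remains alternates date, event, date, event, ... so fill a matrix row by row
m <- matrix(parts, ncol = 2, byrow = TRUE)
r2 <- data.frame(date  = as.Date(m[, 1], format = "%d/%m/%Y"),
                 event = m[, 2],
                 stringsAsFactors = FALSE)
r2
```

      This relies on every date being followed by exactly one event line; an odd number of non-empty pieces would misalign the matrix.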
      

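      [Comments]:

        [Solution 3]:
        library(dplyr)
        library(stringr)
        library(tidyr)   # separate() comes from tidyr, not dplyr
        
        # Split cal into one string per entry; trimws() removes the padding before the first date and after the last event
        x = unlist(str_split(trimws(cal),"\n\\s{2,}\n\\s\n"))
        y = data.frame(x, stringsAsFactors = FALSE)
        # Split each entry into its date and its event text
        y %>% separate(x,c("Date","Event"),"\n\n\\s{2,}\n") 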
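
        For readers without stringr and tidyr, the two-stage idea (first split into entries, then split each entry into date and event) can be sketched in base R. This is an illustration under assumptions, not the answer's own code: `cal_sample` is a shortened, hypothetical stand-in for `cal`, and `records`, `fields`, and `out` are illustrative names.

```r
# Shortened, hypothetical sample in the same layout as cal
cal_sample <- "\n \n21/01/2021\n\n        \nGoverning Council of the ECB: monetary policy meeting in Frankfurt\n        \n \n03/02/2021\n\n        \nGoverning Council of the ECB: non-monetary policy meeting in Frankfurt\n        \n"

# Stage 1: one string per entry (trimws removes the leading and trailing padding)
records <- strsplit(trimws(cal_sample), "\n\\s{2,}\n\\s\n", perl = TRUE)[[1]]

# Stage 2: split each entry at the whitespace block between date and event
fields <- strsplit(records, "\n\n\\s{2,}\n", perl = TRUE)

# Bind the date/event pairs into a two-column data frame
out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(out) <- c("Date", "Event")
out
```

        Both separators depend on the exact whitespace layout of the input, so they would need adjusting if the padding changes.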
        

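        [Comments]:

        • Please add a description of this code. Explain what it does and how it solves the problem.
        • Thank you very much, every answer is great. I accepted the answer whose code I am most familiar with. Thanks, everyone!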