【Question title】:Parse text file with separator to create dataframe in R?
【Posted】:2021-05-18 15:11:05
【Question】:

Hi, I have a text file that looks like this:

[1] "Development Name - Woodstock Terrace"                   
[2] "Location - 920 Trinity Avenue, Bronx 10456"             
[3] "Number of Apts. - 319"                                  
[4] "Type of Project - Co-op"                                
[5] "Development Name - York Hill Apartments"                
[6] "Location - 1540 York Avenue, New York 10028"            
[7] "Number of Apts. - 296"                                  
[8] "Type of Project - Co-op"

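I would like a data frame with columns for Development Name, Location, Number of Apts., and Type of Project. Each new row starts with a new Development Name. The actual file has a few hundred lines.

I'm not sure how to go about this. Maybe use " - " as the separator for read_delim? Please help!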

【Question comments】:

  • Split the column on " - ", then reshape from long to wide.

Tags: r dataframe parsing text
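The comment's suggestion can be sketched in base R roughly as follows. This is only one possible reading of it: the names `key`, `value`, `record`, `long` and `wide` are made up for illustration, and the record counter assumes that every record begins with a "Development Name" line, as the question states.

```r
# Example input copied from the question
input <- c("Development Name - Woodstock Terrace",
           "Location - 920 Trinity Avenue, Bronx 10456",
           "Number of Apts. - 319",
           "Type of Project - Co-op",
           "Development Name - York Hill Apartments",
           "Location - 1540 York Avenue, New York 10028",
           "Number of Apts. - 296",
           "Type of Project - Co-op")

# Split each line at the first " - " only ("Co-op" has no surrounding spaces)
key   <- sub(" - .*$", "", input)
value <- sub("^.*? - ", "", input, perl = TRUE)

# Assumption: a new record starts at every "Development Name" line
record <- cumsum(key == "Development Name")
long   <- data.frame(record, key, value, stringsAsFactors = FALSE)

# Long to wide: one row per record, one column per key
wide <- reshape(long, idvar = "record", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))  # drop the "value." prefix
wide$record <- NULL
rownames(wide) <- NULL

# Turn "319" and "296" into integers, keep the rest as character
wide <- type.convert(wide, as.is = TRUE)
```

Numbering records with `cumsum()` rather than by position means a record with a missing field should still line up, with `NA` in the missing column.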


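【Solution 1】:

Assuming the input shown reproducibly in the Note at the end, we convert it to DCF format by replacing " - " with ": " and inserting a newline before each "Development" line. We then read it with read.dcf, convert the result to a data frame and fix the column types.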

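library(magrittr)

input %>%
  sub(" - ", ": ", .) %>%                  # "Field - value" becomes "Field: value"
  sub("^(Development)", "\n\\1", .) %>%    # blank line before each new record
  textConnection %>%
  read.dcf %>%                             # character matrix, one row per record
  as.data.frame %>%
  type.convert(as.is = TRUE)               # e.g. "319" becomes integer 319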

giving:

      Development Name                         Location Number of Apts. Type of Project
1    Woodstock Terrace  920 Trinity Avenue, Bronx 10456             319           Co-op
2 York Hill Apartments 1540 York Avenue, New York 10028             296           Co-op
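To see why read.dcf can parse the result, it may help to look at the intermediate text. The sketch below repeats the two `sub()` calls without the pipe and prints what they produce; the variable name `dcf_lines` is made up for illustration.

```r
input <- c("Development Name - Woodstock Terrace",
           "Location - 920 Trinity Avenue, Bronx 10456",
           "Number of Apts. - 319",
           "Type of Project - Co-op",
           "Development Name - York Hill Apartments",
           "Location - 1540 York Avenue, New York 10028",
           "Number of Apts. - 296",
           "Type of Project - Co-op")

# Same two substitutions as in the pipeline, applied step by step
dcf_lines <- sub(" - ", ": ", input)                     # first " - " only
dcf_lines <- sub("^(Development)", "\n\\1", dcf_lines)   # newline before each record

# Print the text that is handed to read.dcf
writeLines(dcf_lines)
```

The printed text consists of "Field: value" lines, with an empty line in front of each "Development Name" line. That is the layout DCF expects: one field per line, records separated by blank lines.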

Note

input <- c("Development Name - Woodstock Terrace", "Location - 920 Trinity Avenue, Bronx 10456", 
"Number of Apts. - 319", "Type of Project - Co-op", "Development Name - York Hill Apartments", 
"Location - 1540 York Avenue, New York 10028", "Number of Apts. - 296", 
"Type of Project - Co-op")

【Comments】:

    【Solution 2】:

    Read your text into a data frame with a single column. Let's name the column X1:

    library(tidyverse)  # provides tibble, stringr, dplyr and tidyr, all used below

    df=tibble(X1=c("Development Name - Woodstock Terrace",   
                   "Location - 920 Trinity Avenue",          
                   "Number of Apts. - 319",                  
                   "Type of Project - Co-op",                
                   "Development Name - York Hill Apartments",
                   "Location - 1540 York Avenue",            
                   "Number of Apts. - 296",                  
                   "Type of Project - Co-op"))
    

    Create the column-name and value vectors and put them into a new data frame:

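    # The four patterns are recycled along df$X1, so this relies on every
    # record containing all four fields in exactly this order
    ColumnNames=c("Development Name - ","Location - ","Number of Apts. - ","Type of Project - ")
    Columns=str_match(df$X1,ColumnNames)%>%str_remove(' - ')
    Values=str_remove_all(df$X1,ColumnNames)
    df0=tibble(Cols=Columns,Vals=Values)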
    

    Pivot the new data frame wider; see also the pivot_wider issue "Values in `values_from` are not uniquely identified; output will contain list-cols":

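    df1=df0%>%
      group_by(Cols)%>%
      mutate(row = row_number())%>%   # n-th occurrence of a field belongs to the n-th record
      ungroup()%>%
      pivot_wider(id_cols=row,names_from = Cols,values_from=Vals)%>%   # row (not Columns) identifies each record
      select(-row)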
    
    > df1
    # A tibble: 2 x 4
      `Development Name`   Location           `Number of Apts.` `Type of Project`
      <chr>                <chr>              <chr>             <chr>            
    1 Woodstock Terrace    920 Trinity Avenue 319               Co-op            
    2 York Hill Apartments 1540 York Avenue   296               Co-op   
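Numbering records with `row_number()` within each field works only while every record has every field. As a hedged variant, the sketch below splits the column with tidyr's `separate()` and numbers records by counting "Development Name" lines instead. The input is hypothetical: the second record deliberately lacks a "Number of Apts." line, and the column name `record` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: the second record has no "Number of Apts." line
df <- tibble(X1 = c("Development Name - Woodstock Terrace",
                    "Location - 920 Trinity Avenue",
                    "Number of Apts. - 319",
                    "Type of Project - Co-op",
                    "Development Name - York Hill Apartments",
                    "Location - 1540 York Avenue",
                    "Type of Project - Co-op"))

df1 <- df %>%
  # Split at the first " - " only; anything after it stays in Vals
  separate(X1, into = c("Cols", "Vals"), sep = " - ", extra = "merge") %>%
  # Assumption: each record starts with a "Development Name" line
  mutate(record = cumsum(Cols == "Development Name")) %>%
  pivot_wider(id_cols = record, names_from = Cols, values_from = Vals) %>%
  select(-record)
```

With this input `df1` has two rows, and the missing apartment count in the second row is filled with `NA` rather than shifting values between records.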
    

    【Comments】:
