【Question title】: Convert data with one column and multiple rows into multi column multi row data
【Posted】: 2017-12-01 09:33:25
【Question】:

I have the output of web-scraped data in R, which looks like this:

Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3

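Some names may not have an email address or a location. I want to convert the data above into tabular format. The output should look like this: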

Name      Email           City/Town
Name1   email1@xyz.com  Location1
Name2   email2@abc.com  Location2
Name3   email3@pqr.com  Location3
Name4                   Location4
Name5   email5@abc.com  

【Question comments】:

  • The source data looks like this: Name1 Email: email1@abc.com City/Town: Location1 Name2 Email: email2@xyz.com City/Town: Location2 Name3 Email: email3@pqr.com City/Town: Location3
  • Can you provide a reproducible example?

Tags: r reshape


【Solution 1】:

Using:

txt <- readLines(txt)

library(data.table)
library(zoo)

dt <- data.table(txt = txt)

dt[!grepl(':', txt), name := txt
   ][, name := na.locf(name)
     ][grepl('^Email:', txt), email := sub('Email: ','',txt)
       ][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
         ][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')]

gives:

    name          email city_town
1: Name1 email1@xyz.com Location1
2: Name2 email2@abc.com Location2
3: Name3 email3@pqr.com Location3
4: Name4                Location4
5: Name5 email5@abc.com          

This also works with real names. Using @UweBlock's data, you get:

                  name          email city_town
1:            John Doe email1@xyz.com Location1
2: Save the World Fund email2@abc.com Location2
3:     Best Shoes Ltd. email3@pqr.com Location3
4:              Mother                Location4
5:                Jane email5@abc.com

With multiple keys per section (again using @UweBlock's data):

                  name                          email             city_town
1:            John Doe email1@xyz.com, email1@abc.com             Location1
2: Save the World Fund                 email2@abc.com             Location2
3:     Best Shoes Ltd.                 email3@pqr.com             Location3
4:              Mother                                Location4, everywhere
5:                Jane                 email5@abc.com

Data used:

txt <- textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com")

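【Comments】:

    【Solution 2】:

    Insert \nName: before each name and then read the result with read.dcf. (If the data comes from a file, replace textConnection(Lines) in the first line of code with the file name, e.g. "myfile.dat".) No packages are used.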

    L <- trimws(readLines(textConnection(Lines)))
    ix <- !grepl(":", L)
    L[ix] <- paste("\nName:", L[ix])
    read.dcf(textConnection(L))
    

    Using the input shown in the Note at the end, this gives:

         Name    Email            City/Town  
    [1,] "Name1" "email1@xyz.com" "Location1"
    [2,] "Name2" NA               "Location2"
    [3,] "Name3" "email3@pqr.com" NA         
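
    Because read.dcf() returns a character matrix rather than a table-like object, one possible follow-up step (not part of the original answer) is to convert the result to a data frame. A minimal sketch, assuming the same sample input as this answer; the names m and df are illustrative only:

```r
# Same sample input as used in this answer
Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"

L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)                 # lines without a colon are treated as names
L[ix] <- paste("\nName:", L[ix])     # the blank line starts a new DCF record
m <- read.dcf(textConnection(L))     # character matrix, one row per record

# Convert the matrix to a data frame; fields that were absent stay NA
df <- as.data.frame(m, stringsAsFactors = FALSE)
str(df)
```

    From here the usual data frame tools apply, for example replacing NA with empty strings if the blank cells from the question are wanted.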
    

    Note: The input used. It is slightly modified from the question to show that the approach works when Email or City/Town is missing:

    Lines <- "Name1
    Email: email1@xyz.com
    City/Town: Location1
    Name2
    City/Town: Location2
    Name3
    Email: email3@pqr.com"
    

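    【Comments】:

    • All the other answers are solving the Y problem; this one solves X!
    • Nice answer! I guess read.dcf is not a well-known function (even though it is available in base R).
    • A nice demonstration of the benefit of learning base R. If the parameter all = TRUE is used with read.dcf(), the solution can also handle duplicate entries, e.g., multiple email addresses as in the txt2 sample data set from Solution 3.
    【Solution 3】: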
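
    The all = TRUE suggestion from the last comment could look roughly like the following. This is a sketch under assumptions, not code from the thread: it uses a shortened version of the txt2 data, and the final collapsing step with toString() is an illustrative choice for flattening fields that occur more than once.

```r
# Shortened sample with repeated keys (based on txt2 from Solution 3)
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com",
          "City/Town: Location1", "Mother", "City/Town: Location4",
          "City/Town: everywhere", "Jane", "Email: email5@abc.com")

L <- trimws(txt2)
ix <- !grepl(":", L)                 # lines without a colon are names
L[ix] <- paste("\nName:", L[ix])     # start a new DCF record at each name

# all = TRUE gathers repeated fields instead of keeping only the last one;
# the result is a data frame rather than a character matrix
res <- read.dcf(textConnection(L), all = TRUE)

# Fields with repeated entries may come back as list columns, so collapse
# every cell to a single comma-separated string (missing values become "")
res[] <- lapply(res, function(col) {
  vapply(col, function(x) toString(x[!is.na(x)]), character(1),
         USE.NAMES = FALSE)
})
res
```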

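    The input data poses several challenges:

    • The data is given as a plain character vector rather than as a data.frame with predefined columns.
    • Some of the rows consist of key/value pairs separated by ": ".
    • The other rows serve as section headers. All key/value pairs in the rows below a header belong to that section until the next header is reached.

    The code below relies on only two assumptions:

    1. A key/value pair contains one and only one ": ".
    2. Section headers contain no ": " at all.

    Multiple keys within one section, e.g., several rows with email addresses, are handled by specifying toString() as the aggregation function of dcast().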
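
    These two assumptions can be checked before reshaping. The following is a minimal base R sketch added for illustration (it is not part of the original answer; n_sep and is_header are hypothetical helper names):

```r
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
         "Name4", "City/Town: Location4")

# Count occurrences of the ": " separator in each line
n_sep <- lengths(regmatches(txt, gregexpr(": ", txt, fixed = TRUE)))

# Both assumptions together: every line has either zero separators
# (section header) or exactly one (key/value pair)
stopifnot(all(n_sep %in% c(0L, 1L)))

# Lines without a separator are the section headers
is_header <- n_sep == 0L
txt[is_header]
```

    If the check fails, for instance because a value itself contains ": ", the splitting step would need a different rule, such as splitting only at the first separator.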

    library(data.table)
    # coerce to data.table
    data.table(text = txt)[
      # split key/value pairs in columns
      , tstrsplit(text, ": ")][
        # pick section headers and create new column 
        is.na(V2), Name := V1][
          # fill in Name into the rows below
          , Name := zoo::na.locf(Name)][
            # reshape key/value pairs from long to wide format using Name as row id
            !is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
    
        Name City/Town          Email
    1: Name1 Location1 email1@xyz.com
    2: Name2 Location2 email2@abc.com
    3: Name3 Location3 email3@pqr.com
    4: Name4 Location4             NA
    5: Name5        NA email5@abc.com
    

    Data

    txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2", 
    "Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com", 
    "City/Town: Location3", "Name4", "City/Town: Location4", "Name5", 
    "Email: email5@abc.com")
    

    Or, try it with more "realistic" names:

    txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund", 
    "Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
    "City/Town: Location3", "Mother", "City/Town: Location4", "Jane", 
    "Email: email5@abc.com")
    

    which results in:

                      Name City/Town          Email
    1:     Best Shoes Ltd. Location3 email3@pqr.com
    2:                Jane        NA email5@abc.com
    3:            John Doe Location1 email1@xyz.com
    4:              Mother Location4             NA
    5: Save the World Fund Location2 email2@abc.com
    

    Or, with multiple keys per section:

    txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", 
    "Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
    "City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
    "Email: email5@abc.com")
    
                      Name             City/Town                          Email
    1:     Best Shoes Ltd.             Location3                 email3@pqr.com
    2:                Jane                                       email5@abc.com
    3:            John Doe             Location1 email1@xyz.com, email1@abc.com
    4:              Mother Location4, everywhere                               
    5: Save the World Fund             Location2                 email2@abc.com
    

    【Comments】:

      【Solution 4】:

      Using dplyr and tidyr, tested on both data sets provided: @Jaap's txt and @UweBlock's txt1.

      library(dplyr)
      library(tidyr)
      
      # data_frame(txt = txt1) %>%     
      data_frame(txt = txt) %>% 
        mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
               rn = row_number()) %>% 
        separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
        spread(key = mytype, value = mytext) %>% 
        select(-rn) %>% 
        fill(Name) %>% 
        group_by(Name) %>% 
        fill(1:2, .direction = "down") %>% 
        fill(1:2, .direction = "up") %>% 
        unique() %>% 
        ungroup() %>% 
        select(3:1)
      
      # # A tibble: 5 x 3
      #     Name           Email `City/Town`
      #    <chr>           <chr>       <chr>
      # 1  Name1  email1@xyz.com   Location1
      # 2  Name2  email2@abc.com   Location2
      # 3  Name3  email3@pqr.com   Location3
      # 4  Name4            <NA>   Location4
      # 5  Name5  email5@abc.com        <NA>
      

      Notes:

      • See here for why we need rn (spread() needs a unique identifier for each row).
      • Hopefully someone will suggest better/simpler code using only the tidyverse.

      【Comments】:

      • Simpler: data_frame(text = txt) %>% separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% spread(var, val) %>% select(4:2) where txt is the character vector or a path. Or use data_frame(text = read_lines(txt)) instead of separate_rows.
      • @alistaire This does not work with txt2; it fails with a duplicate rows error. Maybe add rn? Also, maybe post it as a new answer, or I can add it to mine?
      • You could group and summarise with toString, e.g. data_frame(text = txt2) %>% #separate_rows(text, sep = '\n') %>% separate(text, c('var', 'val'), sep = ': ', fill = 'left') %>% mutate(entry = cumsum(is.na(var)), var = coalesce(var, 'Name')) %>% group_by(entry, var) %>% summarise(val = toString(val)) %>% spread(var, val) %>% ungroup() %>% select(4:2) or add an index for the keys, but I don't love either option. If you'd like, go ahead and add it.
      【Solution 5】:

      Benchmark:

      Code:

      txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", 
                "Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", 
                "City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
                "Email: email5@abc.com")
      
      library(microbenchmark)
      library(data.table)
      library(dplyr)
      library(tidyr)
      
      microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
                                                       ][is.na(V2), Name := V1
                                                         ][, Name := zoo::na.locf(Name)
                                                           ][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
                     ans.zx8754 = data_frame(txt = txt2) %>% 
                       mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
                              rn = row_number()) %>% 
                       separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
                       spread(key = mytype, value = mytext) %>% 
                       select(-rn) %>% 
                       fill(Name) %>% 
                       group_by(Name) %>% 
                       fill(1:2, .direction = "down") %>% 
                       fill(1:2, .direction = "up") %>% 
                       unique() %>% 
                       ungroup() %>% 
                       select(3:1),
                     ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
                                                       ][, name := zoo::na.locf(name)
                                                         ][grepl('^Email:', txt), email := sub('Email: ','',txt)
                                                           ][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
                                                             ][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
                     ans.G.Grothendieck = {
                       L <- trimws(readLines(textConnection(txt2)))
                       ix <- !grepl(":", L)
                       L[ix] <- paste("\nName:", L[ix])
                       read.dcf(textConnection(L))},
                     times = 1000)
      

      Results:

      Unit: microseconds
                     expr       min         lq       mean     median        uq        max neval  cld
                  ans.uwe  4243.754  4885.4765  5305.8688  5139.0580  5390.360  92604.820  1000   c 
               ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088  1000    d
                 ans.jaap  2153.521  2488.0665  2788.8250  2640.1580  2773.150  91862.177  1000  b  
       ans.G.Grothendieck   266.268   304.0415   332.6255   331.8375   349.797    721.261  1000 a   
      

      【Comments】:
