【Question Title】: Convert unstructured csv file to a data frame
【Posted】: 2016-02-16 14:06:44
【Question】:

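I am learning R for text mining. I have a TV programme schedule in CSV form. Programmes usually start at 06:00 AM and run until 05:00 AM the next day, which is called a broadcast day. For example, the programmes for 15 November 2015 start at 06:00 AM and end at 05:00 AM the next day.

Here is sample code showing what the schedule looks like: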

 read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)

Its output looks like this:

  V1|V2
Sunday |  
01-Nov-15 |       
6 | Tom  
some information about the program |       
23.3 |  Jerry  
some information about the program |       
5 | Avatar  
some information about the program |       
5.3 | Panda  
some information about the program |       
Monday  |       
02-Nov-15|       
6 | Jerry  
some information about the program |      
6.25 | Panda  
some information about the program |      
23.3 | Avatar  
some information about the program |       
7.25 |   Tom  
some information about the program |      

I want to convert the data above into a data.frame of the following form:

Date            |Program|Synopsis
2015-11-1 06:00 |Tom    | some information about the program
2015-11-1 23:30 |Jerry  | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda  | some information about the program
2015-11-2 06:00 |Jerry  | some information about the program
2015-11-2 06:25 |Panda  | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom    | some information about the program

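I would appreciate any suggestions or tips on the functions or packages I should look at.

【Comments】:

  • @akrun No, it is a simple csv file. I only added '|' to show the column separation.
  • Thanks for your message. It looks like you already have a solution, so I didn't try this one.

Tags: r dataframe reshape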


【Solution 1】:

An alternative solution using data.table:

library(data.table)
library(zoo)
library(splitstackshape)

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

wd <- weekdays(as.Date("2015-11-01") + 0:6, abbreviate = FALSE) # full weekday names (locale-dependent; assumes an English locale)

DT <- DT[, temp := tv %chin% wd
         ][, day := tv[temp], by = 1:nrow(DT)
           ][, day := na.locf(day)
             ][, temp := NULL
               ][, idx := rleid(day)
                 ][, date := tv[2], by = idx
                   ][, .SD[-c(1,2)], by = idx]

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]

DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)
     ][, .(datetime, Program, Info)]

The result:

> DT
              datetime Program                               Info
1: 2015-11-01 06:00:00     Tom some information about the program
2: 2015-11-01 23:30:00   Jerry some information about the program
3: 2015-11-02 05:00:00  Avatar some information about the program
4: 2015-11-02 06:00:00     Tom some information about the program
5: 2015-11-02 23:30:00   Jerry some information about the program
6: 2015-11-03 05:00:00  Avatar some information about the program

Explanation:

1: Read the data, convert it to a data.table and remove the trailing |

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

2: Extract the weekdays into a new column

wd <- weekdays(as.Date("2015-11-01") + 0:6, abbreviate = FALSE) # a vector with the full weekday names (locale-dependent)
DT[, temp := tv %chin% wd
   ][, day := tv[temp], by = 1:nrow(DT)
     ][, day := na.locf(day)
       ][, temp := NULL]

3: Create an index per day and a column holding the date

DT[, idx := rleid(day)][, date := tv[2], by = idx]

4: Remove the unnecessary rows

DT <- DT[, .SD[-c(1,2)], by = idx]

5: Split the time and programme name into separate rows and create a label column

DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]

6: Reshape into wide format using the 'rowid' function from the development version of data.table

DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]

7: Create a datetime column and move the late-night times to the next day

DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)]

8: Keep the needed columns

DT <- DT[, .(datetime, Program, Info)]
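
The datetime logic of step 7 can be tried in isolation. The following is a minimal base R sketch, assuming the simpler rule that any slot before 06:00 belongs to the next calendar day (step 7 above additionally requires the time to be earlier than the previous row). The names `to_broadcast_time` and `cutoff` are hypothetical and only serve this illustration, and the time zone is fixed to UTC to keep the output reproducible.

```r
# Hypothetical helper (not part of the answer): combine a Date with an
# h.m number such as 23.3 (meaning 23:30) and roll early-morning slots
# over to the next calendar day.
to_broadcast_time <- function(date, hm, cutoff = 6) {
  hours   <- floor(hm)
  minutes <- round(100 * (hm - hours))   # 0.3 -> 30 minutes
  stamp <- as.POSIXct(
    sprintf("%s %02d:%02d", format(date), as.integer(hours), as.integer(minutes)),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
  # Assumption: every slot before `cutoff` o'clock is part of the next day.
  stamp + 24 * 60 * 60 * (hours < cutoff)
}

day <- as.Date("2015-11-01")
res <- to_broadcast_time(day, c(6, 23.3, 5, 5.3))
print(format(res, "%Y-%m-%d %H:%M"))
```

With the sample slots 6, 23.3, 5 and 5.3, the last two land on 2015-11-02, which matches the layout requested in the question.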

【Comments】:

    【Solution 2】:

    It's a bit messy, but it seems to work:

    df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
    cat(txt)
    Sys.setlocale("LC_TIME", "English") # if needed
    weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
    days <- split(df, cumsum(df$V1 %in% weekdays))
    lapply(days, function(dayDF) {
      tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
      tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
      tmp[, 2] <- as.numeric(tmp[, 2])
      tmp[, 5] <- NULL
      idx <- c(FALSE, diff(tmp[, 2])<0)
      tmp[idx, 1] <- tmp[idx, 1] + 1
      return(tmp)
    }) -> days
    days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)  
    names(days) <- c("Date", "Synopsis", "Program")
    rownames(days) <- NULL
    days[, c(1, 3, 2)]
    #                  Date Program                            Synopsis
    # 1 2015-11-01 06:00:00     Tom  some information about the program
    # 2 2015-11-01 23:30:00   Jerry  some information about the program
    # 3 2015-11-02 05:00:00  Avatar  some information about the program
    # 4 2015-11-02 06:00:00     Tom  some information about the program
    # 5 2015-11-02 23:30:00   Jerry  some information about the program
    # 6 2015-11-03 05:00:00  Avatar  some information about the program
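
The grouping step in the code above, `split(df, cumsum(df$V1 %in% weekdays))`, relies on a running count of header rows. A small sketch on a toy vector (the values `"A"`, `"a1"` and so on are made up for illustration) shows how that count turns into one group per header:

```r
# Toy input: header rows ("A", "B"), each followed by its member rows.
x <- c("A", "a1", "a2", "B", "b1")

is_header <- x %in% c("A", "B")   # TRUE wherever a new block starts
group_id  <- cumsum(is_header)    # running count of headers seen so far
groups    <- split(x, group_id)   # one list element per block

print(group_id)
print(groups)
```

Each header increments the counter, so every row up to the next header shares its id. In the answer, the weekday names play the role of the headers.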
    

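    【Comments】:

    • Thanks for the code. As described in the post, the broadcast day also contains the programmes before 5 AM of the next day.
    • You're welcome. Where is the broadcast day in your example?
    • The "Date" column in the solution contains different dates and times. As shown in my example, the schedule for 1 November also includes the programmes of 2 November up to 05:00 AM.
    【Solution 3】:

    1) This sets up some functions and then consists of four transform(...) %>% subset(...) snippets linked together with magrittr pipes. We assume DF is the output of the read.table call in the question.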

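    First, load the zoo package to get access to na.locf. Define a Lead function which shifts each element by 1 position. Also define a datetime function which converts a date plus an h.m number to a date-time.

    Now convert the dates to "Date" class. Rows that are not dates become NA. Use Lead to shift this vector by 1 position and keep the rows where the shifted value is NA, which effectively removes the weekday rows. Then use na.locf to fill in the dates and keep only the rows with a duplicated date, which effectively removes the rows that contain only a date. Next set Program to V2 and Synopsis to V1, except that V1 must be shifted with Lead because the Synopsis is on the second row of each pair. Keep only the odd-positioned rows. Finally, generate the datetime and select the desired columns.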

    library(magrittr)
    library(zoo) # needed for na.locf
    
    Lead <- function(x, fill = NA) c(x[-1], fill)  # shift one position toward the front, pad the end with fill
    datetime <- function(date, time) {
                  time <- as.numeric(time)
                  as.POSIXct(sprintf("%s %02d:%02d", date, as.integer(floor(time)), as.integer(round(100 * (time %% 1)))), format = "%Y-%m-%d %H:%M") + 
                          24 * 60 * 60 * (time < 6) # add day if time < 6
    }
    
    DF %>% 
    
       transform(date = as.Date(V1, "%d-%b-%y")) %>% 
       subset(Lead(is.na(date), TRUE)) %>%   # rm weekday rows
    
       transform(date = na.locf(date)) %>%  # fill in dates
       subset(duplicated(date)) %>% # rm date rows
    
       transform(Program = V2, Synopsis = Lead(V1)) %>% 
       subset(c(TRUE, FALSE)) %>%  # keep odd positioned rows only
    
       transform(Date = datetime(date, V1)) %>% 
       subset(select = c("Date", "Program", "Synopsis"))
    

    giving:

                     Date Program                            Synopsis
    1 2015-11-01 06:00:00     Tom  some information about the program
    2 2015-11-01 23:30:00   Jerry  some information about the program
    3 2015-11-02 05:00:00  Avatar  some information about the program
    4 2015-11-02 06:00:00     Tom  some information about the program
    5 2015-11-02 23:30:00   Jerry  some information about the program
    6 2015-11-03 05:00:00  Avatar  some information about the program
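
The two row-removal tricks described above can be followed on a toy vector. This is a minimal sketch, assuming ISO-formatted dates to stay independent of the locale. The fill-forward step uses a base R `cumsum` index as a stand-in for `zoo::na.locf` (it assumes the first element is not NA), and the values in `v1` are made up for illustration.

```r
Lead <- function(x, fill = NA) c(x[-1], fill)   # same definition as in (1)

# Toy column: weekday row, date row, then time/synopsis rows.
v1 <- c("Sunday", "2015-11-01", "6", "synopsis",
        "Monday", "2015-11-02", "6", "synopsis")

date <- as.Date(v1, format = "%Y-%m-%d")   # non-dates become NA
keep <- Lead(is.na(date), TRUE)            # FALSE only where the next row is a date
v1   <- v1[keep]                           # weekday rows are gone
date <- date[keep]

# Base R stand-in for zoo::na.locf: carry the last non-NA date forward.
filled <- date[!is.na(date)][cumsum(!is.na(date))]

rows <- v1[duplicated(filled)]             # drops the first row per date (the date row)
print(rows)
```

A weekday row is always followed by a date row, so it is the only kind of row whose successor is not NA. After filling, the date row is the first occurrence of each date, so `duplicated` removes exactly that row.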
    

    2) dplyr This uses dplyr and the datetime function above. We could have replaced transform and subset in (1) with dplyr's mutate and filter, and Lead with lead, but for variety we do it another way:

    library(dplyr)
    library(zoo) # na.locf
    
    DF %>%
       mutate(date = as.Date(V1, "%d-%b-%y")) %>%
       filter(lead(is.na(date), default = TRUE)) %>% # rm weekday rows
       mutate(date = na.locf(date)) %>% # fill in dates
       group_by(date) %>%
       mutate(Program = V2, Synopsis = lead(V1)) %>%
       slice(seq(2, n(), by = 2)) %>%
       ungroup() %>%
       mutate(Date = datetime(date, V1)) %>%
       select(Date, Program, Synopsis)
    

    giving:

    Source: local data frame [6 x 3]
    
                     Date Program                            Synopsis
                   (time)   (chr)                               (chr)
    1 2015-11-01 06:00:00     Tom  some information about the program
    2 2015-11-01 23:30:00   Jerry  some information about the program
    3 2015-11-02 05:00:00  Avatar  some information about the program
    4 2015-11-02 06:00:00     Tom  some information about the program
    5 2015-11-02 23:30:00   Jerry  some information about the program
    6 2015-11-03 05:00:00  Avatar  some information about the program
    

    3) data.table This also uses na.locf from zoo and the datetime function defined in (1).

    library(data.table)
    library(zoo)
    
    dt <- data.table(DF)
    dt <- dt[, date := as.Date(V1, "%d-%b-%y")][
              shift(is.na(date), type = "lead", fill = TRUE)][, # rm weekday rows
              date := na.locf(date)][duplicated(date)][,  # fill in dates & rm date rows
              Synopsis := shift(V1, type = "lead")][seq(1, .N, 2)][, # align Synopsis
              c("Date", "Program") := list(datetime(date, V1), V2)][, 
              list(Date, Program, Synopsis)]
    

    giving:

    > dt
                      Date Program                            Synopsis
    1: 2015-11-01 06:00:00     Tom  some information about the program
    2: 2015-11-01 23:30:00   Jerry  some information about the program
    3: 2015-11-02 05:00:00  Avatar  some information about the program
    4: 2015-11-02 06:00:00     Tom  some information about the program
    5: 2015-11-02 23:30:00   Jerry  some information about the program
    6: 2015-11-03 05:00:00  Avatar  some information about the program
    

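    Update: Simplified (1) and added (2) and (3).

    【Comments】:

    • Thanks for the wonderful explanation and code. The date in row 3 should be 2 November and the date in row 6 should be 3 November, as described in my example.
    • Thanks a lot. Could you explain how you fixed it? Sorry for being so naive.
    • Have simplified (1) and added a dplyr solution (2).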