【Question title】: How to read a file with a specific format in R?
【Posted】: 2015-12-11 16:44:56
【Question】:

I want to read a file in which each line is one record containing a date, some text, and numbers. Example:

Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328

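There is no single delimiter character (as there would be in a CSV file), but the format can be described precisely in terms of tabs, spaces, and literal text: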

%DATESTRING%\tUptime: %uptime%  Threads: %threads%  Questions: %questions%  Slow queries: %slow%  Opens: %opens%  Flush tables: %flush%  Open tables: %otables%  Queries per second avg: %qps%

Is there a function that takes a description of the format plus a file and fills a data.frame with the data?

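【Comments】:

  • I've had good luck putting it into Excel, fixing it where needed, and then saving it as CSV.
  • @rawr The fact that the column names are embedded in each row's record is nonstandard for the fixed-width files I've worked with...
  • @MichaelChirico Yes, you're right
  • @rawr I still think that approach works: read it as fixed width, then strip out the column names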

Tags: r data-import fileparsing


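【Solution 1】:

tidyr has some utility functions that may be useful here, though I wouldn't be surprised if there are more specialized tools built for this job.

We start by loading the data, in this example from a string: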

raw <- 'Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2     Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328'

df <- read.csv(textConnection(raw), header = FALSE, stringsAsFactors = FALSE)

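I used read.csv here so that the result is already a data frame (the sample contains no commas, so each line lands in a single column, V1), but you could also just use readLines and put the lines into a data frame yourself.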
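The readLines route mentioned above might look like the following minimal sketch. The names `sample_text` and `df_lines` are illustrative only, and the sample is cut down to two records to keep the block self-contained.

```r
# Two records from the sample data, joined with a newline.
# `sample_text` is an illustrative stand-in for `raw`.
sample_text <- paste(
  "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314",
  "Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320",
  sep = "\n"
)

# Read the text line by line, then close the connection explicitly.
con <- textConnection(sample_text)
lines <- readLines(con)
close(con)

# One character column named V1, the same shape read.csv gives for this input.
df_lines <- data.frame(V1 = lines, stringsAsFactors = FALSE)
str(df_lines)
```

For a file on disk, the textConnection step would be replaced by passing the file path straight to readLines.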

Then we process it:

library(tidyr)
> processed <- df %>% extract(V1,
  c("Date", "Uptime", "Threads", "Questions"),
  "(.*) *Uptime: (\\d+) *Threads: (\\d+) *Questions: (\\d+)")
> processed
                              Date Uptime Threads Questions
1 Fri Dec 11 12:40:01 CET 2015     108491       2    576603
2 Fri Dec 11 12:50:01 CET 2015     109090       2    580407
3 Fri Dec 11 13:00:01 CET 2015     109690       2    583895
4 Fri Dec 11 13:10:01 CET 2015     110290       1    586891
5 Fri Dec 11 13:20:01 CET 2015     110890       2    590871

It should be clear from here how to extract the remaining columns.
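As one possible way to carry that through, the sketch below extends the same extract() call to all nine fields. The column names with underscores and the name `QPS` are my own choices, not part of the original answer. The pattern uses a lazy `(.*?)` followed by `\\s+` so that the date is captured without trailing whitespace, and `convert = TRUE` asks extract() to convert the numeric-looking columns. It assumes tidyr is installed.

```r
library(tidyr)

# Two records from the sample data in a one-column data frame,
# the same shape as `df` above.
df_all <- data.frame(
  V1 = c(
    "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314",
    "Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320"
  ),
  stringsAsFactors = FALSE
)

# Illustrative column names, one per capture group in the pattern.
all_cols <- c("Date", "Uptime", "Threads", "Questions", "Slow_queries",
              "Opens", "Flush_tables", "Open_tables", "QPS")

# One capture group per field; \\s+ tolerates tabs or runs of spaces.
pattern <- paste0(
  "^(.*?)\\s+Uptime: (\\d+)\\s+Threads: (\\d+)\\s+Questions: (\\d+)",
  "\\s+Slow queries: (\\d+)\\s+Opens: (\\d+)\\s+Flush tables: (\\d+)",
  "\\s+Open tables: (\\d+)\\s+Queries per second avg: ([0-9.]+)\\s*$"
)

processed_all <- extract(df_all, V1, into = all_cols, regex = pattern,
                         convert = TRUE)
str(processed_all)
```

The Date column stays a character string here; turning it into a date-time is a separate step that depends on locale and time zone handling.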


【Solution 2】:

Two more options:

txt <- "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328"

## first just tack on the date label
txt <- gsub('^', 'Date: ', readLines(textConnection(txt)))


Option 1

## split each line into "label: value" pieces on runs of two or more whitespace characters
sp <- strsplit(txt, '\\s{2,}')
## keep only the value part of each piece
out <- lapply(sp, function(x) gsub('([\\w ]+:)\\s+(.*)$', '\\2', x, perl = TRUE))
## bind the rows together and take the column names from the labels of the first line
dd <- setNames(do.call('rbind.data.frame', out),
               gsub('([\\w ]+):\\s+(.*)$', '\\1', sp[[1]], perl = TRUE))
## convert every column except Date to numeric
dd[, -1] <- lapply(dd[, -1], function(x) as.numeric(as.character(x)))
dd


Option 2: this one uses the yaml package, but it is more straightforward and does the type conversion for you

## put each "label: value" pair on its own line so that every record is a small YAML document
yml <- gsub('\\s{2,}', '\n', txt)
do.call('rbind.data.frame', lapply(yml, yaml::yaml.load))

#                    Date Uptime Threads Questions Slow queries Opens Flush tables
# 1 Fri Dec 11 12:40:01 CET 2015 108491       2    576603           10  2238            1
# 2 Fri Dec 11 12:50:01 CET 2015 109090       2    580407           10  2253            1
# 3 Fri Dec 11 13:00:01 CET 2015 109690       2    583895           10  2268            1
# 4 Fri Dec 11 13:10:01 CET 2015 110290       1    586891           10  2279            1
# 5 Fri Dec 11 13:20:01 CET 2015 110890       2    590871           10  2292            1
#   Open tables Queries per second avg
# 1           7                  5.314
# 2           6                  5.320
# 3           8                  5.323
# 4           6                  5.321
# 5           5                  5.328
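Neither option consumes the %name% template from the question directly. As a rough base R sketch of that idea (not an existing function; `read_by_template` is a hypothetical helper written for this illustration), the template can be turned into a regular expression: literal text is escaped, any run of whitespace becomes `\s+`, and each placeholder becomes a lazy capture group. It assumes placeholder names consist of letters only and that no field value contains the literal text that follows it.

```r
# Hypothetical helper, not part of base R or any package.
read_by_template <- function(lines, template) {
  # Placeholder names in order of appearance: "%uptime%" -> "uptime".
  placeholders <- regmatches(template, gregexpr("%[A-Za-z]+%", template))[[1]]
  field_names <- gsub("%", "", placeholders, fixed = TRUE)

  # Escape regex metacharacters in the literal parts of the template.
  pattern <- gsub("([.|()\\\\^{}+$*?\\[\\]])", "\\\\\\1", template, perl = TRUE)
  # A run of whitespace in the template matches any run of whitespace, tabs included.
  pattern <- gsub("\\s+", "\\\\s+", pattern, perl = TRUE)
  # Each placeholder becomes a lazy capture group.
  pattern <- gsub("%[A-Za-z]+%", "(.*?)", pattern)
  pattern <- paste0("^\\s*", pattern, "\\s*$")

  matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  matched <- lengths(matches) > 0
  if (!any(matched)) stop("no line matched the template")

  # Drop the full-match element and keep the groups; unmatched lines are skipped.
  fields <- do.call(rbind, lapply(matches[matched], function(m) m[-1]))
  out <- as.data.frame(fields, stringsAsFactors = FALSE)
  names(out) <- field_names
  # Let R guess the column types (integer, double, or character).
  out[] <- lapply(out, type.convert, as.is = TRUE)
  out
}

# The template from the question; "\t" is a real tab inside an R string.
template <- "%DATESTRING%\tUptime: %uptime%  Threads: %threads%  Questions: %questions%  Slow queries: %slow%  Opens: %opens%  Flush tables: %flush%  Open tables: %otables%  Queries per second avg: %qps%"

log_lines <- c(
  "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314",
  "Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320"
)

parsed <- read_by_template(log_lines, template)
str(parsed)
```

Because whitespace is matched loosely, the same template should work whether the date is followed by a tab or by several spaces. Lines that do not fit the template are dropped silently, which may or may not be what you want for a log file.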
    
