Extracting a table (and other information) from a text file in R: answers

【Question title】: Extracting a table (and other information) from a text file in R
【Posted】: 2021-06-27 22:49:26
【Question】:

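I am trying to use R to extract the data table, plus some other information, from historical Met Office data, but despite spending a whole evening on StackOverflow I am still having trouble.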

For example, here is the data for sunny (maybe??) Lowestoft:

Lowestoft / Lowestoft Monckton Ave from Sept 2007
Location 654300E 294600N 25m amsl to July 2007 
& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by  ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1914   1    5.2     0.7    ---     52.0    ---
   1914   2    9.2     3.5    ---     28.0    ---
   1914   3   ---     ---     ---     ---     ---
   1914   4   12.9     5.3    ---     18.0    ---
   ...
   2020  11   12.5*    6.1*      0*   31.9*   73.7*  Provisional
   2020  12    7.7*    2.9*      6*  105.8*   50.5*  Provisional
   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional
   2021   2    7.9*    2.4*      9*   48.6*   84.7*  Provisional

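The best I have managed so far is to use sed (outside R) to remove the * and # flags, but importing the result with read.table(lowestoftdata.text, skip = 8, col.names = c("year","month","max_temp", "min_temp", "frost", "rainfall", "sunshine")) fails when it reaches the rows from 2020 onwards, which carry a trailing "Provisional" marker. It would also be very handy to extract the latitude and longitude values. These are usually on line 2, but they can be on line 3 if, like Lowestoft, the station moved at some point, and my very limited regex knowledge (plus the moving target) has let me down.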

My pseudocode approach would be:

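Identify the line containing the latitude and longitude, and parse that line to extract those variables
Identify the first line that starts with a number, and run read.table from that line onwards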
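
The two steps above could be sketched roughly as follows. This is only an illustration on a few lines copied from the sample, not a tested solution for every station file. The column name "status" for the trailing "Provisional" marker is an assumption, as is the idea that the first line starting with a four-digit year is the first data row.

```r
# A few header and data lines copied from the sample above; a real file
# would be read with readLines() instead.
lines <- c(
  "Lowestoft / Lowestoft Monckton Ave from Sept 2007",
  "Location 654300E 294600N 25m amsl to July 2007 ",
  "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl",
  "Estimated data is marked with a * after the value.",
  "   yyyy  mm   tmax    tmin      af    rain     sun",
  "              degC    degC    days      mm   hours",
  "   1914   1    5.2     0.7    ---     52.0    ---",
  "   2021   2    7.9*    2.4*      9*   48.6*   84.7*  Provisional"
)

# Step 1: take the first line that mentions both Lat and Lon, wherever it sits.
latlon_line <- grep("Lat .* Lon ", lines, value = TRUE)[1]
m <- regmatches(latlon_line,
                regexec("Lat (-?[0-9.]+) Lon (-?[0-9.]+)", latlon_line))[[1]]
lat <- as.numeric(m[2])
lon <- as.numeric(m[3])

# Step 2: find the first line that starts with a four-digit year.
first_data <- grep("^\\s*[0-9]{4}\\s", lines)[1]

# fill = TRUE pads rows that lack the trailing marker; comment.char = ""
# stops read.table from treating a '#' flag as the start of a comment.
dat <- read.table(text = lines[first_data:length(lines)],
                  fill = TRUE, na.strings = "---", comment.char = "",
                  stringsAsFactors = FALSE,
                  col.names = c("year", "month", "max_temp", "min_temp",
                                "frost", "rainfall", "sunshine", "status"))
```

Columns that contain flagged values such as 7.9* come back as character here, so the flags still need stripping before conversion to numeric.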

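...but, given my limited experience of handling anything other than nicely formatted CSV files, turning this into practice is proving challenging, so any suggestions on where to start would be gratefully received.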

【Comments】:

This is a fixed-width format. Perhaps use utils::read.fwf.

Tags: r regex text text-extraction data-extraction

【Solution 1】:

Here is one way to handle the request to "parse" the header text:

metadata <- 
 readLines(url("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt"), n=9)
> metadata
[1] "Lowestoft / Lowestoft Monckton Ave from Sept 2007"                                                                                        
[2] "Location 654300E 294600N 25m amsl to July 2007 "                                                                                          
[3] "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl"                                                                         
[4] "Estimated data is marked with a * after the value."                                                                                       
[5] "Missing data (more than 2 days missing in month) is marked by  ---."                                                                      
[6] "Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder."
[7] "   yyyy  mm   tmax    tmin      af    rain     sun"                                                                                       
[8] "              degC    degC    days      mm   hours"  

                                                                               

> sub( "Location (\\d+[EW]) (\\d+[NS])(.+$)", "\\1,\\2", metadata[2])
[1] "654300E,294600N"

I needed to lay a "ruler" over the data to work out the positions and widths for the read.fwf approach.

> paste( rep("123456789",6), 1:6, collapse="", sep="")
[1] "123456789112345678921234567893123456789412345678951234567896"
> metadata[9]
[1] "   1914   1    5.2     0.7    ---     52.0    ---"
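
The ruler idea can also be wrapped in a small helper. The sketch below is only an illustration: ruler() is a hypothetical function name, and the assumption that each value is right-aligned under the last character of its header label comes from looking at the sample, not from any Met Office specification.

```r
# Hypothetical helper: print a two-row ruler (tens digits, then units digits)
# above a line of text so that column positions can be read off by eye.
ruler <- function(x) {
  pos   <- seq_len(nchar(x))
  tens  <- paste(ifelse(pos %% 10 == 0, (pos %/% 10) %% 10, " "), collapse = "")
  units <- paste(pos %% 10, collapse = "")
  cat(tens, units, x, sep = "\n")
  invisible(c(tens, units))
}

header <- "   yyyy  mm   tmax    tmin      af    rain     sun"
ruler(header)

# Assumption: the labels are right-aligned over the values, so the last
# character of each label marks the end position of that field.
g    <- gregexpr("[a-z]+", header)[[1]]
ends <- as.integer(g + attr(g, "match.length") - 1)
ends
```

In the sample rows the * flag sits one character to the right of these end positions, so a field that should keep its flag needs to extend one character further.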

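The result is character. You will need some further processing to strip the asterisks before using as.numeric; I illustrate that with one column. You can use metadata[7] (the header line) to set the column names.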

 widths=c(3,4,4,7,8,7,10,7)

 dat=read.fwf( "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt", widths = widths , skip=8, colClasses="character", header=FALSE)
Warning message:
In readLines(file, n = thisblock) :
  incomplete final line found on 'https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt'
 tail(dat)
#---------------------
      V1   V2   V3      V4       V5      V6         V7      V8
1269     2020    9    19.6 *   11.5 *       0*   97.1*   168.6
1270     2020   10    14.2 *    9.0 *       0*   85.7*    58.8
1271     2020   11    12.5 *    6.1 *       0*   31.9*    73.7
1272     2020   12     7.7 *    2.9 *       6*  105.8*    50.5
1273     2021    1     5.8 *    1.2 *     1 0*   78.6*    49.4
1274     2021    2     7.9 *    2.4 *       9*   48.6*    84.7
#----------------
head(dat)
   V1   V2   V3      V4       V5      V6         V7     V8
1     1914    1     5.2      0.7     ---      52.0     ---
2     1914    2     9.2      3.5     ---      28.0     ---
3     1914    3    ---      ---      ---      ---      ---
4     1914    4    12.9      5.3     ---      18.0     ---
5     1914    5    13.7      7.2     ---      38.0     ---
6     1914    6    16.2     10.4     ---      38.0     ---

summary(as.numeric(sub("[*]","", dat$V8)))
#--------------------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   11.0    70.3   136.3   136.1   189.9   314.4     157
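
In the tail(dat) output above, the frost value 10* appears split into "1" and "0*", which suggests the widths cut through that field. A possible adjustment is sketched below on four rows copied from the question. It assumes that every value ends under the last character of its header label, with an optional one-character flag directly after it; the column names and the clean() helper are made up for the example.

```r
# Rows copied from the question (one complete, one all-missing, two flagged).
sample_rows <- c(
  "   1914   1    5.2     0.7    ---     52.0    ---",
  "   1914   3   ---     ---     ---     ---     ---",
  "   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional",
  "   2021   2    7.9*    2.4*      9*   48.6*   84.7*  Provisional"
)

# Fields end at columns 7, 11, 19, 27, 35, 43 and 51 (value plus flag).
widths <- c(7, 4, 8, 8, 8, 8, 8)

con <- textConnection(sample_rows)
dat <- read.fwf(con, widths = widths, colClasses = "character",
                comment.char = "",   # keep '#' flags instead of treating them as comments
                col.names = c("year", "month", "tmax", "tmin",
                              "af", "rain", "sun"))
close(con)

# Hypothetical helper: drop blanks and the * / # flags, turn "---" into NA,
# then convert to numeric.
clean <- function(x) {
  x <- gsub("[*# ]", "", x)
  x[x == "---"] <- NA
  as.numeric(x)
}
dat[] <- lapply(dat, clean)
dat
```

Anything beyond column 51, such as the "Provisional" marker, is simply ignored by read.fwf because it lies outside the listed widths.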

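There is also ?readr::read_fwf, which has some advantages. For one, it lets you specify the fixed-width fields by position rather than by width. I find that easier, especially if you use my ad hoc "ruler".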
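
A rough sketch of the position-based variant, using two rows copied from the question. The start and end positions are read off the ruler under the assumption that each field includes its one-character flag; the column names are made up, and I() is used to pass literal text, which needs a reasonably recent readr.

```r
library(readr)

# Two rows copied from the question, joined into one literal string.
sample_text <- paste(
  "   1914   1    5.2     0.7    ---     52.0    ---",
  "   1914   4   12.9     5.3    ---     18.0    ---",
  "   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional",
  sep = "\n"
)

# Start and end columns (inclusive) for each field, flags included.
pos <- fwf_positions(
  start     = c(1,  8, 12, 20, 28, 36, 44),
  end       = c(7, 11, 19, 27, 35, 43, 51),
  col_names = c("year", "month", "tmax", "tmin", "af", "rain", "sun")
)

# Everything is read as character; the flags still have to be stripped
# before converting to numeric.
dat <- read_fwf(I(sample_text), col_positions = pos,
                col_types = cols(.default = col_character()),
                na = "---")
dat
```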

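【Comments】:

The ruler trick is brilliant. I have ticked Sirius's answer because it seemed the most intuitive to me, but this one is also very good, and between the two I have got it working nicely. Thank you for taking the time to help me.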

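【Solution 2】:

Here is another attempt:

Cleaning this up takes a handful of different steps.

First, deal with the two-line header (these are always a pain). There may be simpler solutions, but at some point you just need to get the job done.

I merge the two lines into one and use the slightly longer texts as column names.

The cleanup step before reading the data is a little cryptic: it strips anything that is not a digit, a dash or an asterisk from the end of each line. (This trims the trailing text comments, which would otherwise confuse the field parsing in fread, which is very fast.)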


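library(data.table)
library(purrr)     # pmap_chr() and the %>% pipe
library(readr)     # read_file()
library(stringr)   # str_match(), str_trim()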

raw.text <- read_file("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt")

lat.long <- as.numeric( str_match( raw.text, "Lat (\\d+\\.\\d+) Lon (\\d+\\.\\d+)" )[,-1] )

m <- regexpr( " +yyyy.*hours", raw.text )

headertext <- substr( raw.text, m, m+attr(m,"match.length")-1 )
header.lines <- strsplit( headertext, "\\r?\\n" )[[1]]
header.lines <- sub( "^\\s+", "", header.lines )
header.fields2 <- strsplit( header.lines, "\\s+" )
header.fields2[[2]] <- c( "", "", header.fields2[[2]] )

header.fields <- pmap_chr( header.fields2, paste, collapse=" " ) %>% str_trim

## some cleanup:
text.to.read <- substring( raw.text, m+attr(m,"match.length") )

## This next line matches anything that is not a digit (\\d) and not a dash (\\-) and not a star (\\*) until the end of the line, $. It's the enclosing (?m: ... ) that changes $ to match end of line, and not end of string as usual.
text.to.read2 <- gsub( "(?m:([^\\d\\-\\*]*)$)", "", text.to.read, perl=TRUE )

## by now a simple fread will do the rest for us
d <- fread( text=text.to.read2, fill=TRUE, header=FALSE, na.strings="---" )
setnames(d, header.fields)

d

Output:


      yyyy mm tmax degC tmin degC af days rain mm sun hours
   1: 1914  1       5.2       0.7    <NA>    52.0      <NA>
   2: 1914  2       9.2       3.5    <NA>    28.0      <NA>
   3: 1914  3      <NA>      <NA>    <NA>    <NA>      <NA>
   4: 1914  4      12.9       5.3    <NA>    18.0      <NA>
   5: 1914  5      13.7       7.2    <NA>    38.0      <NA>
  ---                                                      
1270: 2020 10     14.2*      9.0*      0*   85.7*     58.8*
1271: 2020 11     12.5*      6.1*      0*   31.9*     73.7*
1272: 2020 12      7.7*      2.9*      6*  105.8*     50.5*
1273: 2021  1      5.8*      1.2*     10*   78.6*     49.4*
1274: 2021  2      7.9*      2.4*      9*   48.6*     84.7*
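
To see what the cleanup regular expression does in isolation, it can be tried on two of the provisional rows from the question. This is just a small demonstration of the same pattern; the variable names are made up.

```r
# Two provisional rows copied from the question, joined by a newline.
x <- paste(
  "   2020  12    7.7*    2.9*      6*  105.8*   50.5*  Provisional",
  "   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional",
  sep = "\n"
)

# (?m: ... ) makes $ match at the end of every line, so the trailing run of
# characters that are not digits, dashes or stars is removed line by line.
cleaned <- gsub("(?m:([^\\d\\-\\*]*)$)", "", x, perl = TRUE)

# Each line now ends at the last value (including its * flag).
strsplit(cleaned, "\n")[[1]]
```

Note that the flags inside the table are untouched, which is why the columns in the output above are still character.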

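【Comments】:

Works beautifully, and I can even figure out the logic behind it. Adding mutate_all(funs(str_replace(., "[*#]", ""))) cleaned the data into what I need. Thank you for taking the time to teach me this, much appreciated.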
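
For readers on a current dplyr, where funs() is deprecated, roughly the same flag-stripping step can be written with across(). The sketch below uses a tiny made-up data frame; the tmax and rain values are taken from the table above, while the sun column with a # flag is invented purely to show that both flags are handled.

```r
library(dplyr)
library(stringr)

# Tiny illustrative data frame (the sun values are made up for the example).
d <- data.frame(
  tmax = c("7.7*", "5.8*"),
  rain = c("105.8*", "78.6*"),
  sun  = c("60.1#", NA),
  stringsAsFactors = FALSE
)

# Remove the first * or # in every column, then convert to numeric.
d_clean <- d %>%
  mutate(across(everything(), ~ as.numeric(str_replace(.x, "[*#]", ""))))

d_clean
```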