Web Scrape from plain text into R (answers)

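【Question title】: Web Scrape from plain text into R
【Posted】: 2016-05-01 07:02:34
【Question description】:

I need to search http://www.bls.gov/schedule/schedule/2007/2007_sched.htm for every date whose "Release Name" column contains "Employment Situation". The web scraping output should be: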

Jan.  5, Feb.  2, 2007, March  9, April  6, May  4, June  1, 2007
July  6, 2007, Aug.  3, Sept.  7, Oct.  5, Nov.  2, 2007, Dec.  7  
#year can be ignored/omitted

To achieve the same thing for http://www.bls.gov/schedule/news_release/2015_sched.htm, I used the following:

library(rvest)
pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# target only  <td> elements under bodytext div
body <- html_nodes(pg, "div#bodytext")

# use this new set of nodes and a relative XPath to get initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up and make dates
nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

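This stores the list of dates in nfpdates. I tried to adapt that code to http://www.bls.gov/schedule/schedule/2007/2007_sched.htm, but failed. The problem is that the two URLs store the information in different formats. Given that the 2007 page stores the information as plain text rather than as an HTML table, how can I extract the dates from that URL? Thank you.

【Question discussion】:

Tags: r web-scraping text

【Solution 1】:

This is not a complete solution, but it does extract the requested lines containing "Employment Situation" from the web page. The text you are after sits inside pre tags. The page has 4 such sections (the 3rd and 4th are empty).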

library(rvest)
url <- "http://www.bls.gov/schedule/schedule/2007/2007_sched.htm"
# the schedule is plain text inside <pre> blocks, so select those nodes
body <- html_nodes(read_html(url), "pre")
# text <- html_text(body[1])  # would use only the first <pre> block
# html_text() is vectorised over the node set, so no explicit loop is needed
text <- html_text(body)  # looks at all <pre> blocks
# create one character vector holding every captured line
table1 <- unlist(strsplit(text, "\n"))
# keep only the lines that contain the search string
employ <- table1[grepl("The Employment Situation", table1, fixed = TRUE)]

The final result is:

[1] "The Employment Situation for December 2006 \tJan. 5, 2007\t 8:30 am \r"
[2] "The Employment Situation for January 2007 \tFeb. 2, 2007\t 8:30 am \r"

...

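From here, strsplit, gsub, and grep are needed to clean up each line and isolate the desired text. If you still run into trouble, that is probably a separate Stack Overflow question focused on extracting the date from each line. Good luck.

【Discussion】: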
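
As a rough illustration of that clean-up step, here is a minimal sketch in base R. It assumes each matched line has the tab-separated shape shown in the result above (release name, then date, then time); the sample lines are hand-written stand-ins rather than freshly scraped output, and the names fields, raw_dates, and release_dates are made up for this example. Month names are matched on their first three letters because the page mixes abbreviations such as "Feb." and "Sept." with full names such as "March".

```r
# Hand-written sample lines in the same shape as the scraped output (an assumption)
employ <- c(
  "The Employment Situation for January 2007 \tFeb.  2, 2007\t 8:30 am \r",
  "The Employment Situation for February 2007 \tMarch  9, 2007\t 8:30 am \r"
)

# Split every line on tabs; the release date is assumed to be the second field
fields <- strsplit(employ, "\t", fixed = TRUE)
raw_dates <- vapply(fields, function(x) x[2], character(1))

# Trim the ends and collapse runs of whitespace to a single space
raw_dates <- gsub("\\s+", " ", trimws(raw_dates))

# Pull out month name, day, and year; the optional "." covers "Feb.", "Sept.", etc.
parts <- regmatches(
  raw_dates,
  regexec("^([A-Za-z]+)\\.? ([0-9]+), ([0-9]{4})$", raw_dates)
)
month_name <- vapply(parts, function(p) p[2], character(1))
day        <- as.integer(vapply(parts, function(p) p[3], character(1)))
year       <- as.integer(vapply(parts, function(p) p[4], character(1)))

# month.abb is always English ("Jan", "Feb", ...), so this avoids locale issues
month_num <- match(substr(month_name, 1, 3), month.abb)

release_dates <- as.Date(sprintf("%04d-%02d-%02d", year, month_num, day))
release_dates
```

With the real scrape, employ would come from the grepl step above instead of the sample vector. Lines that do not fit the assumed three-field layout would make the regexec step return an empty match, so they should be checked for before the vapply calls.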