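[Posted]: 2016-05-01 07:02:34
[Question]:
I need to search http://www.bls.gov/schedule/schedule/2007/2007_sched.htm for every date whose entry under the "Release Name" column contains "Employment Situation". The scraped output should be: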
Jan. 5, Feb. 2, 2007, March 9, April 6, May 4, June 1, 2007
July 6, 2007, Aug. 3, Sept. 7, Oct. 5, Nov. 2, 2007, Dec. 7
#year can be ignored/omitted
To get the same result for http://www.bls.gov/schedule/news_release/2015_sched.htm, I use the following:
library(rvest)
pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")
# target only <td> elements under bodytext div
body <- html_nodes(pg, "div#bodytext")
# use this new set of nodes and a relative XPath to get initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
# clean up and make dates
nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
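This stores the list of dates in nfpdates. I tried to adapt this code to work for http://www.bls.gov/schedule/schedule/2007/2007_sched.htm, but failed. The problem is that the two URLs store the information in different formats. Given that the 2007 page stores the information as plain text rather than in an HTML table, how can I extract the dates from that URL? Thanks.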
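To make the plain-text case concrete, here is a minimal sketch of one possible approach: split the text into lines, keep the lines that mention "Employment Situation", and pull the leading date off each with a regular expression. The `sample_text` lines, the helper name `extract_es_dates`, and the assumption that each schedule entry sits on one line beginning with its date are all hypothetical; the real 2007 page may be laid out differently.

```r
# Hypothetical sample of the plain-text layout (an assumption, not copied
# from the real 2007 page): one release per line, date first.
sample_text <- c(
  "Jan. 5        8:30 am   Employment Situation for December 2006",
  "Jan. 10      10:00 am   U.S. Import and Export Price Indexes",
  "Feb. 2, 2007  8:30 am   Employment Situation for January 2007"
)

# Hypothetical helper: return the leading date of every line that mentions
# the release name, optionally dropping a trailing ", YYYY".
extract_es_dates <- function(lines, drop_year = TRUE) {
  # keep only the lines for the release of interest
  hits <- grep("Employment Situation", lines, fixed = TRUE, value = TRUE)
  # month name (optionally abbreviated with a dot), day, optional year
  pattern <- "^\\s*[A-Z][a-z]+\\.?\\s+\\d{1,2}(,\\s*\\d{4})?"
  dates <- trimws(regmatches(hits, regexpr(pattern, hits, perl = TRUE)))
  if (drop_year) dates <- sub(",\\s*\\d{4}$", "", dates, perl = TRUE)
  dates
}

print(extract_es_dates(sample_text))
```

To try this on the real page, the lines would have to come from the page text first, for example by passing the result of `html_text()` on the relevant node through `strsplit(x, "\n")`. Which node holds the schedule text is not known from the question, so that selector would need to be checked against the page source.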
[Comments]:
Tags: r web-scraping text