How to use R (RCurl/XML packages) to scrape options data from Yahoo? Answers

【Question title】: How to use R (RCurl/XML packages) to scrape options data from Yahoo?
【Posted】: 2011-04-25 01:11:51
【Question description】:

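Basically, I want to scrape some options data from Yahoo! Finance every day. I have been kicking the tires using (1) as an example. However, it hasn't quite worked out, because I'm unfamiliar with HTML.

(1) Scraping html tables into R data frames using the XML package

As an example, I'd like to scrape and collect the following options chain: http://finance.yahoo.com/q/op?s=MNTA&m=2011-05

Here is what I have tried so far. The last two lines don't work, because I'm unclear what class I should be looking for. Any help would be great. Thanks.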

library(RCurl)
library(XML)

theurl <- "http://finance.yahoo.com/q/op?s=MNTA&m=2011-05"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

tablehead <- xpathSApply(pagetree, "//*/table[@class='yfnc_datamodoutline1']/tr/th", xmlValue)

results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)


【Question discussion】:

Tags: xml r web-scraping finance

【Solution 1】:

I'm assuming you want the information in the two tables for calls and puts. Here is a simple way to get it using the XML package:

library(XML)

url  = "http://finance.yahoo.com/q/op?s=MNTA&m=2011-05"
# extract all tables on the page as a list of data frames
tabs = readHTMLTable(url, stringsAsFactors = FALSE)

# locate tables containing call and put information
# (positions 11 and 15 were found by inspecting the page by hand)
call_tab = tabs[[11]]
put_tab  = tabs[[15]]

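I found the positions of the two tables by manual inspection. If the positions vary across the pages you are parsing, you may want to determine them programmatically, using the table length or some other text criterion.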
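One way to select tables by a text criterion instead of by position is sketched below. It is only an illustration: `pick_tables` is a hypothetical helper, and the column names "Strike" and "Symbol" are assumptions about what the option tables' headers look like, so check them against the actual output of readHTMLTable before relying on this. The toy list stands in for that output, including a NULL entry for a table that could not be parsed.

```r
# Hypothetical helper: keep only the tables whose header contains
# every expected column name. The default names are assumptions.
pick_tables <- function(tabs, required = c("Strike", "Symbol")) {
  is_match <- vapply(tabs, function(tab) {
    is.data.frame(tab) && all(required %in% names(tab))
  }, logical(1))
  tabs[is_match]
}

# Toy stand-in for the list returned by readHTMLTable()
tabs <- list(
  NULL,
  data.frame(Strike = c("10.00", "12.50"), Symbol = c("A", "B"),
             stringsAsFactors = FALSE),
  data.frame(x = 1),
  data.frame(Strike = "10.00", Symbol = "C", stringsAsFactors = FALSE)
)

option_tabs <- pick_tables(tabs)
length(option_tabs)
```

Matching on header names keeps working if the page gains or loses unrelated layout tables, which is exactly the case where fixed positions such as 11 and 15 would break.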

Edit: The two tables you are interested in both have cellpadding = 3. You can use this to extract the two tables directly with the following code:

# parse url into html tree
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# find all table nodes with attribute cellpadding = 3
tab_nodes = xpathApply(doc, "//table[@cellpadding = '3']")

# parse the two nodes into tables (assumes exactly two nodes matched)
tabs = lapply(tab_nodes, readHTMLTable)
names(tabs) = c("calls", "puts")

The result, tabs, is a list containing the two tables.
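The same attribute-based selection can be tried on a small inline page so that it runs without network access. This is a minimal sketch: the HTML below is made up for illustration and is not the Yahoo markup, and it assumes the XML package is installed.

```r
library(XML)

# Made-up page: three tables, two of which have cellpadding="3"
html <- '
<html><body>
  <table cellpadding="0"><tr><td>navigation</td></tr></table>
  <table cellpadding="3">
    <tr><th>Strike</th><th>Last</th></tr>
    <tr><td>10.00</td><td>7.10</td></tr>
    <tr><td>12.50</td><td>4.80</td></tr>
  </table>
  <table cellpadding="3">
    <tr><th>Strike</th><th>Last</th></tr>
    <tr><td>10.00</td><td>0.05</td></tr>
    <tr><td>12.50</td><td>0.20</td></tr>
  </table>
</body></html>'

# parse the string (not a file or URL) into an internal document
doc <- htmlParse(html, asText = TRUE)

# select only the table nodes whose cellpadding attribute equals 3
tab_nodes <- getNodeSet(doc, "//table[@cellpadding = '3']")

# convert each selected node into a data frame
tabs <- lapply(tab_nodes, readHTMLTable, stringsAsFactors = FALSE)
names(tabs) <- c("calls", "puts")
```

Naming the list elements only makes sense when exactly two nodes match, so checking length(tab_nodes) before the names() assignment is a cheap safeguard on a real page.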

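【Discussion】:

Thanks a lot! This works much better than my other attempts at fixing the getOptionChain() command in the quantmod package.
@James What problem are you having with getOptionChain? I tried getOptionChain('MNTA') and it returns the same results as the parser defined here.
getOptionChain() fails for option chains with only one strike, and I couldn't find a clean solution. Try getOptionChain('ACUR') and you'll see an error about incorrect dimensions.
Follow-up: how would you go about storing this data in R on a daily basis? Trying to manipulate a bunch of these
@James Do you mean pulling this information every day, or pulling several days of history?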