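【Posted】: 2018-03-24 13:36:22
【Question】:
Does anyone have experience scraping data with R?
I want to extract the corresponding data for a given set of stocks. This is how I did it: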
library(XML)

stocks <- c("AAPL", "MSFT")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))

  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)

# MOVE STOCK ID TO FIRST COLUMN
# (parentheses matter: 1:ncol(x)-1 evaluates to 0:(ncol(x)-1))
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
However, the problem is that I get the data in the wrong format:
   stock  data1       data2        data3        data4  data5        data6   data7          data8   data9         data10           data11        data12
1  AAPL   Index       DJIA S&P500  P/E          16.13  EPS (ttm)    10.22   Insider Own    0.06%   Shs Outstand  5.09B            Perf Week     -7.35%
2  AAPL   Market Cap  839.87B      Forward P/E  12.50  EPS next Y   13.20   Insider Trans  -7.80%  Shs Float     5.07B            Perf Month    -4.38%
3  AAPL   Income      53.13B       PEG          1.38   EPS next Q   2.71    Inst Own       63.20%  Short Float   1.16%            Perf Quarter  -5.40%
4  AAPL   Sales       239.18B      P/S          3.51   EPS this Y   10.80%  Inst Trans     0.98%   Short Ratio   1.60             Perf Half Y   7.53%
5  AAPL   Book/sh     27.42        P/B          6.02   EPS next Y   14.97%  ROA            13.80%  Target Price  192.54           Perf Year     17.05%
6  AAPL   Cash/sh     15.15        P/C          10.89  EPS next 5Y  11.68%  ROE            37.40%  52W Range     138.62 - 183.50  Perf YTD      -2.54%
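What I want is for each stock to appear only once, as a row name, with the data labels as column names and the columns holding the corresponding values...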
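For illustration, here is a minimal base R sketch of the reshaping described above. It assumes the odd-numbered `data` columns hold labels and the even-numbered ones hold the matching values, as in the printed output. The helper name `to_wide`, the trimmed four-column sample, and the `DEMO` ticker with its placeholder values are all made up for this example. They are not part of the original code.

```r
# Hypothetical, trimmed stand-in for `stockdata`: label/value pairs alternate
# across columns. AAPL values come from the output above; DEMO is a placeholder.
stockdata <- data.frame(
  stock = c("AAPL", "AAPL", "AAPL", "DEMO", "DEMO", "DEMO"),
  data1 = c("P/E", "Forward P/E", "P/B", "P/E", "Forward P/E", "P/B"),
  data2 = c("16.13", "12.50", "6.02", "1.00", "3.00", "5.00"),
  data3 = c("EPS (ttm)", "EPS next Y", "EPS next Y",
            "EPS (ttm)", "EPS next Y", "EPS next Y"),
  data4 = c("10.22", "13.20", "14.97%", "2.00", "4.00", "6.00%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per stock, one column per label.
to_wide <- function(df, id = "stock") {
  cols <- setdiff(names(df), id)
  label_cols <- cols[seq(1, length(cols), by = 2)]
  value_cols <- cols[seq(2, length(cols), by = 2)]

  # Stack all label/value pairs column by column into a long table.
  long <- data.frame(
    id    = rep(as.character(df[[id]]), times = length(label_cols)),
    label = unlist(lapply(df[label_cols], as.character), use.names = FALSE),
    value = unlist(lapply(df[value_cols], as.character), use.names = FALSE),
    stringsAsFactors = FALSE
  )

  # The page repeats some labels (e.g. "EPS next Y"), so make them unique
  # within each stock; the second occurrence becomes "EPS next Y.1".
  long$label <- ave(long$label, long$id, FUN = make.unique)

  # Fill a stock-by-label character matrix using (row name, column name) pairs.
  ids <- unique(long$id)
  labels <- unique(long$label)
  wide <- matrix(NA_character_, nrow = length(ids), ncol = length(labels),
                 dimnames = list(ids, labels))
  wide[cbind(long$id, long$label)] <- long$value

  as.data.frame(wide, stringsAsFactors = FALSE)
}

wide <- to_wide(stockdata)
print(wide)
```

The values stay as character strings here (for example "839.87B" or "14.97%" in the full data). Converting suffixes and percentages to numbers would be a separate step.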
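【Comments】:
-
Your question is similar to stackoverflow.com/questions/40490717/… (though not an exact duplicate). Check there and see whether it helps. Suggestion: use rvest for scraping.
-
@mysteRious, I did see that particular post before posting, but it didn't help me get any further with my problem... :(
-
Oh, I finally understand what you're trying to do. I'll work on it.
-
@mysteRious The goal is to get the data into some kind of standard wide format... to make the data analysis easier...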
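As a rough illustration of the rvest suggestion in the comments, the sketch below reads a table with `html_table()`. It parses a small inline HTML snippet rather than the live page, and the class name `snapshot` is invented for the example. A selector for the real page would have to be found by inspecting its HTML. The `convert` argument assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded quote page;
# the cell values are taken from the output shown in the question.
page <- minimal_html('
  <table class="snapshot">
    <tr><td>P/E</td><td>16.13</td><td>EPS (ttm)</td><td>10.22</td></tr>
    <tr><td>PEG</td><td>1.38</td><td>EPS next Q</td><td>2.71</td></tr>
  </table>
')

# Select the table with a CSS selector instead of a positional index such as
# tableNodes[[9]], and keep every cell as text: no header row, no type guessing.
cells <- html_table(html_element(page, "table.snapshot"),
                    header = FALSE, convert = FALSE)
cells <- as.data.frame(cells, stringsAsFactors = FALSE)
print(cells)
```

Selecting by a CSS selector tends to be less brittle than counting tables, since a positional index breaks as soon as the page adds or removes a table. The result still has labels and values in alternating columns, so it needs reshaping before analysis.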
Tags: r web-scraping rvest quantmod quandl