[Posted]: 2020-07-12 02:54:22
[Question]:
Consider this simple example:
library(rvest)
library(tidyverse)
library(dplyr)
library(lubridate)
library(tibble)
mytib <- tibble(mylink = c('https://en.wikipedia.org/wiki/List_of_software_bugs',
'https://en.wikipedia.org/wiki/Software_bug'))
mytib <- mytib %>% mutate(html.data = map(mylink, ~read_html(.x)))
> mytib
# A tibble: 2 x 2
mylink html.data
<chr> <list>
1 https://en.wikipedia.org/wiki/List_of_software_bugs <xml_dcmn>
2 https://en.wikipedia.org/wiki/Software_bug <xml_dcmn>
> mytib$html.data[1]
[[1]]
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<title> ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-List_of_software_b ...
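As you can see, my tibble correctly contains the HTML of the two Wikipedia pages whose URLs are stored in the mylink column. The problem is that I cannot save this hard-won scrape to disk. A simple write_csv fails: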
> mytib %>% write_csv('mydata.csv')
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) :
Don't know how to handle vector of type list.
Writing to rds does not work either: the file is written, but the documents are unusable after reading it back.
mytib %>% write_rds('mydata.rds')
test <- read_rds('mydata.rds')
test$html.data[1]
> test$html.data[1]
[[1]]
Error in doc_type(x) : external pointer is not valid
What should I do? In which format should I store my data? Thanks!
[Discussion]:
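One possible approach, offered as a sketch rather than a definitive answer: the parsed documents returned by read_html hold external pointers to in-memory structures, and those pointers do not survive a round trip through an rds file. Converting each document to a plain character string with as.character() before saving, and re-parsing with read_html() after loading, avoids the problem. The inline HTML strings, the placeholder links, and the html.text column name below are assumptions made so the example runs without network access; they are not part of the original question.

```r
library(xml2)
library(tibble)

# Assumption: small inline HTML strings stand in for the scraped Wikipedia pages.
pages <- c(
  "<html><head><title>Page A</title></head><body><p>first</p></body></html>",
  "<html><head><title>Page B</title></head><body><p>second</p></body></html>"
)

mytib <- tibble(
  mylink    = c("a", "b"),               # placeholder links (hypothetical)
  html.data = lapply(pages, read_html)   # list column of parsed documents
)

# Serialize each parsed document to text; character vectors can be saved safely.
mytib$html.text <- vapply(mytib$html.data, as.character, character(1))
mytib$html.data <- NULL                  # drop the pointer-backed column

path <- tempfile(fileext = ".rds")
saveRDS(mytib, path)

# After loading, re-parse the stored text to get usable documents again.
restored <- readRDS(path)
restored$html.data <- lapply(restored$html.text, read_html)

# Query the re-parsed documents to check that they are valid.
titles <- vapply(
  restored$html.data,
  function(doc) xml_text(xml_find_first(doc, "//title")),
  character(1)
)
print(titles)
```

Because html.text is an ordinary character column, the same tibble (without the list column) can also be written with write_csv, although rds avoids any quoting issues with large HTML strings.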
Tags: html r web-scraping purrr