Posted: 2021-04-06 13:04:30
Problem description:
I'm new to R and web scraping. For practice, I'm trying to scrape book titles from a mock website with multiple pages ('http://books.toscrape.com/catalogue/page-1.html') and then compute some metrics from the titles. There are 20 books per page and 50 pages. I've managed to scrape the first 20 books and compute their metrics, but I'd like to compute the metrics for all 1000 books on the site.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I'd like the result to have length 1000 instead of 20, so that I can run the same metric-computing code on all 1000 books rather than just 20.
Code:
library(rvest)  # provides read_html(), html_nodes(), html_attr(); also re-exports %>%

url <- 'http://books.toscrape.com/catalogue/page-1.html'
titles <- url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title')
titles
What is the best way to scrape every book on the site so the vector has length 1000 instead of 20? Thanks in advance.
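A minimal sketch of one common approach, assuming the site's `page-<n>.html` URL pattern holds for all 50 pages (the `scrape_page` helper name is introduced here for illustration, it is not part of the original code):

```r
library(rvest)  # read_html(), html_nodes(), html_attr(); re-exports %>%

# Hypothetical helper: build the URL for one catalogue page and
# return the 20 book titles on that page as a character vector.
scrape_page <- function(page_number) {
  url <- sprintf('http://books.toscrape.com/catalogue/page-%d.html', page_number)
  url %>%
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
}

# Apply the helper to all 50 pages and flatten the per-page
# vectors into a single vector of 1000 titles.
titles <- unlist(lapply(1:50, scrape_page))
length(titles)
```

For a real crawl it is polite to add a short `Sys.sleep()` between requests; books.toscrape.com is a practice site, so throttling matters less here.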
Tags: r web-scraping