【Posted】: 2021-11-22 09:17:25
【Question】:
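I am completely new to web scraping and may well be drowning in a teacup. I would like to automate the following:
- run the following query on etsy.com
https://www.etsy.com/search?q=Christmas+candle&order=most_relevant&view_type=gallery
i.e. simply look up "Christmas candle" on Etsy
- then retrieve the title and the description of each product separately, ideally passing the number of result pages to include as an input to my function or pipeline.
I had a look at the basic example at
https://github.com/dmi3kno/polite
but when I try to adapt it to my needs (see the reprex at the end of the post), it returns exactly... nothing!
Can anyone point me in the right direction? Many thanks!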
library(polite)
library(rvest)
session <- bow("https://www.cheese.com/by_type", force = TRUE)
result <- scrape(session, query=list(t="semi-soft", per_page=100)) %>%
  html_node("#main-body") %>%
  html_nodes("h3") %>%
  html_text()
result
#> [1] "3-Cheese Italian Blend" "Abbaye de Citeaux"
#> [3] "Abbaye du Mont des Cats" "Adelost"
#> [5] "ADL Brick Cheese" "Ailsa Craig"
#> [7] "Airedale" "Aisy Cendre"
#> [9] "Alpe di Frabosa" "Alpine Gold"
#> [11] "Alta Badia" "Amablu Blue cheese"
#> [13] "Ameribella" "American Cheese"
#> [15] "Ami du Chambertin" "Amsterdammer (British Columbia)"
#> [17] "Amul Pizza Mozzarella Cheese" "Anthotyro Fresco"
#> [19] "Aphrodite Haloumi " "Appalachian"
#> [21] "Applewood Smoked Chevre" "Ardrahan"
#> [23] "Armenian String Cheese" "Aromes au Gene de Marc"
#> [25] "Asher Blue" "Asiago Pressato DOP"
#> [27] "Aura" "Azeitao"
#> [29] "Baby Swiss" "Baluchon"
#> [31] "Bandal" "Basajo"
#> [33] "Basils Original Rauchkäse" "Baskeriu"
#> [35] "Basket Cheese" "Bassigny au porto"
#> [37] "Beaumont" "Beemster 2% Milk"
#> [39] "Bel Paese" "Bergere Bleue"
#> [41] "Bermuda Triangle" "Beyaz Peynir"
#> [43] "Bica de Queijo" "Bierkase"
#> [45] "Bijou" "Blarney Castle"
#> [47] "Bleu Bénédictin" "Bleu d'Auvergne"
#> [49] "Bleu Des Causses" "Bleu L'Ermite"
#> [51] "Blue Benedictine" "Blue Lupine"
#> [53] "Blue Rathgore" "Blue Vein (Australian)"
#> [55] "Blue Vein Cheese" "Blue Yonder"
#> [57] "Bocconcini" "Boivin Marbled Cheddar"
#> [59] "Bossa" "Boulder Chevre"
#> [61] "Brewer's Gold" "Brie de Melun"
#> [63] "Brillat-Savarin" "Brin"
#> [65] "Brin d'Amour" "Bruder Basil"
#> [67] "Brunost" "Brutal Blue"
#> [69] "Burwash Rose" "Buttercup"
#> [71] "Butterkase" "Buttermilk Blue Affinee"
#> [73] "Buttermilk Gorgonzola" "Caciobarricato"
#> [75] "Cacio De Roma®" "Caciotta"
#> [77] "Caciotta Al Tartufo" "Cacow Belle"
#> [79] "Calenzana (Calinzanincu)" "Cambozola Grand Noir"
#> [81] "Cameo" "Cana de Cabra"
#> [83] "Cape Vessey" "Capra al Fieno"
#> [85] "Capra Nouveau" "Cardo "
#> [87] "Carr Valley Glacier Wildfire Blue" "Casatica"
#> [89] "Casciotta di Urbino" "Cashel Blue"
#> [91] "Castelo Branco" "Castle Blue"
#> [93] "Celtic Promise" "Chabichou du Poitou"
#> [95] "Charolais" "Chaumes"
#> [97] "Chevre" "Chevre en Marinade"
#> [99] "Chile Caciotta" "Chile Jack"
## My naive attempt to adapt the code to etsy.com fails miserably
session_etsy <- bow("https://www.etsy.com", force = TRUE)
result_etsy <- scrape(session_etsy, query=list(t="Christmas candle", per_page=100)) %>%
  html_node("#main-body") %>%
  html_nodes("h3") %>%
  html_text()
result_etsy
#> character(0)
Created on 2021-09-30 by the reprex package (v2.0.1)
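
For comparison with the attempt above, here is a minimal sketch of how the call could be pointed at the search URL from the top of the post. It rests on a few guesses about why the attempt returns character(0): the query is sent to the site root rather than to /search, it reuses the cheese.com parameters t and per_page instead of q, and #main-body is an id from the cheese.com page. The helper names etsy_queries and scrape_titles, the page parameter for paging, and the h3 selector are all assumptions for illustration, not anything confirmed for Etsy. Descriptions would probably have to come from the individual listing pages, which this sketch does not cover.

```r
# Build one query list per results page.
# ASSUMPTION: "page" is the paging parameter; q, order and view_type
# are copied from the search URL quoted in the question.
etsy_queries <- function(term, n_pages) {
  lapply(seq_len(n_pages), function(p) {
    list(q = term, order = "most_relevant", view_type = "gallery", page = p)
  })
}

# Fetch each results page politely and collect the text of matching nodes.
# ASSUMPTION: the "h3" selector is a placeholder; inspect the live page
# to find the element that actually holds the product titles.
scrape_titles <- function(term, n_pages, selector = "h3") {
  # bow() reads robots.txt for the host and sets a crawl delay
  session <- polite::bow("https://www.etsy.com/search")
  titles <- character(0)
  for (query in etsy_queries(term, n_pages)) {
    page <- polite::scrape(session, query = query)
    # skip pages for which no document came back
    if (is.null(page)) next
    nodes <- rvest::html_elements(page, selector)
    titles <- c(titles, rvest::html_text2(nodes))
  }
  titles
}
```

A call such as scrape_titles("Christmas candle", n_pages = 2) would then request the first two result pages. Whether anything comes back depends on the site's robots.txt, on how it responds to automated requests, and on the selector matching the current markup.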
【Comments】:
-
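If you check their terms of service, you are not allowed to do this. Someone may still be able to help you, but you should state clearly up front that this goes against the site's rules.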
-
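Well, I got it working. In my view this is a bit of a grey area. I mean, a website can write whatever it wants in its ToS, but does that mean it has the right to enforce it? See medium.com/@tjwaterman99/web-scraping-is-now-legal-6bf0e5730a78. I do not think I am doing anything illegal.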
Tags: r web-scraping rvest