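【Question title】: Crawl multiple pages from a website (BeautifulSoup, Requests, Python3)
【Posted】: 2016-04-25 19:06:44
【Question】:

I would like to know how to crawl several different pages of one website with Beautiful Soup / Requests without having to repeat my code over and over.

Below is my current code, which crawls the tourist attractions of certain cities: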

RegionIDArray = [187147,187323,186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1,700,30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)

                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")

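So far, everything works as expected. As a next step, in addition to the tourist attractions, I want to crawl the most popular museums of these cities.

To do that, I have to modify the request by changing the c parameter so that it fetches all the required museums: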

r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) +"-g" + str(reg) + "-oa" + str(page) + ".html")

So my code now looks like this:

RegionIDArray = [187147,187323,186338]
museumIDArray = [47,49]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

for reg in RegionIDArray:
    for page in range(1,700,30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(museumIDArray) +"-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)

        g_data = soup.find_all("div", {"class": "element_wrap"})

        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)

                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]) + " | " + "Art: Museum ")

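This does not seem to be quite right. The output I get does not include all the museums and tourist attractions for some of the cities.

Can anyone help me with this? Thanks for any feedback.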

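【Question comments】:

  • Your code would raise an error. Also, what is dict in your code, apart from shadowing a Python builtin?
  • @PadraicCunningham What do you mean by "shadowing a python builtin"? Sorry if I am getting on your nerves, but I am still a beginner.
  • dict is a Python type/function, and it is best to avoid shadowing it, i.e. using the same name for a variable as a builtin type. Can you add a link and explain exactly what you want to parse from it?
  • @PadraicCunningham Here is the link: tripadvisor.de/… From this link I want to parse the titles of the individual items, such as Musée d'Orsay or Musée du Louvre.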
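
To illustrate the comment about shadowing, here is a minimal sketch. The two functions shadowed and not_shadowed are hypothetical names used only for this demonstration; they are not part of the question's code.

```python
def shadowed():
    # The local variable named dict hides the built-in dict type.
    dict = {187147: 'Paris'}
    try:
        # This now tries to call the dictionary object, not the built-in.
        return dict(a=1)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


def not_shadowed():
    # A different variable name leaves the built-in dict usable.
    dct = {187147: 'Paris'}
    return dict(a=1), dct[187147]


print(shadowed())
print(not_shadowed())
```

The code in the question still runs with a variable called dict, because it never needs the builtin; the problem only shows up once something in the same scope calls dict(...).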

Tags: python-3.x request beautifulsoup web-crawler


【Solution 1】:

All of the names are in anchor tags inside the divs with the property_title class.

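for reg in RegionIDArray:
    # Request each category ID on its own; str(museumIDArray) would put
    # the literal text "[47, 49]" into the URL.
    for cat in museumIDArray:
        for page in range(1, 700, 30):
            r = requests.get("https://www.tripadvisor.de/Attractions-c" + str(cat) + "-g" + str(reg) + "-oa" + str(page) + ".html")
            soup = BeautifulSoup(r.content, "html.parser")

            # The names are the text of the anchors inside div.property_title.
            for item in (a.text.strip() for a in soup.select("div.property_title a")):
                if item not in already_printed:
                    already_printed.add(item)
                    print("POI: " + str(item) + " | " + "Location: " + str(dct[reg]) + " | " + "Art: Museum ")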
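
A note on the c parameter: concatenating str(museumIDArray) into the URL, as the question's code does, inserts the text representation of the whole list rather than a single ID. Below is a rough sketch of building one URL per category instead. build_urls and BASE are hypothetical names introduced for this example, and the URL pattern is simply the one from the question; whether the site serves every such URL is not something this sketch checks.

```python
BASE = "https://www.tripadvisor.de"


def build_urls(region_ids, category_ids, offsets):
    """Return one listing URL per (region, category, offset) combination."""
    urls = []
    for reg in region_ids:
        for cat in category_ids:
            for page in offsets:
                urls.append(
                    "{}/Attractions-c{}-g{}-oa{}.html".format(BASE, cat, reg, page)
                )
    return urls


# What string concatenation with a list produces for the c parameter.
list_as_text = "Attractions-c" + str([47, 49])
print(list_as_text)

# One URL per category and page offset for a single region.
urls = build_urls([187147], [47, 49], range(1, 700, 30))
print(urls[0])
print(len(urls))
```

No request is sent here; the resulting list could be fed to requests.get one URL at a time.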

It would be better to get the links from the pagination div:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


RegionIDArray = [187147, 187323, 186338]
museumIDArray = [47, 49]
dct = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()

def get_names(soup):
    # Print every title that has not been printed yet.
    # reg is read from the module-level loop below.
    for item in (a.text.strip() for a in soup.select("div.property_title a")):
        if item not in already_printed:
            already_printed.add(item)
            print("POI: {} | Location: {} | Art: Museum ".format(item, dct[reg]))

base = "https://www.tripadvisor.de"
for reg in RegionIDArray:
    # Request each category ID on its own instead of formatting the
    # whole list into the URL.
    for cat in museumIDArray:
        r = requests.get("https://www.tripadvisor.de/Attractions-c{}-g{}-oa.html".format(cat, reg))
        soup = BeautifulSoup(r.content, "html.parser")

        # get links to all next pages.
        all_pages = (urljoin(base, a["href"]) for a in soup.select("div.unified.pagination a.pageNum.taLnk")[1:])
        # use helper function to print the names.
        get_names(soup)

        # visit all remaining pages.
        for url in all_pages:
            soup = BeautifulSoup(requests.get(url).content, "html.parser")
            get_names(soup)
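
The urljoin call above is what turns the href values of the pagination links into full URLs. A small sketch of that step on its own follows; the href values are made up for the example and are not real TripAdvisor paths.

```python
from urllib.parse import urljoin

base = "https://www.tripadvisor.de"

# Hypothetical href values of the kind a pagination link might carry.
hrefs = [
    "/Attractions-example-page2.html",                            # root-relative
    "https://www.tripadvisor.de/Attractions-example-page3.html",  # already absolute
]

# Root-relative paths are resolved against base; absolute URLs pass through.
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

Because absolute URLs are returned unchanged, the same call works regardless of which form the site uses in its links.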

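【Discussion】:

  • Thank you very much for your feedback. But now I get the following error message: Traceback (most recent call last): File "C:/Users/Raju/Desktop/Scripts/nnnn.py", line 25, in <module> get_names(soup) File "C:/Users/Raju/Desktop/Scripts/nnnn.py", line 15, in get_names print("POI: {} | Location: {} | " + "Art: Museum ".format(item.dict[reg])) AttributeError: 'str' object has no attribute 'dict' Can you help me? What is wrong?
  • @SeriousRuffy, are you using dict?
  • @Padriac It should be dct, as you wrote in the code above. I just tried it with "dict" as well. I still get the same error message.
  • @SeriousRuffy, there was a period where there should have been a comma; it should run fine now.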
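
For anyone hitting the same traceback, here is a short sketch of what goes wrong in that print line. The sample values for item and reg are placeholders chosen for the demonstration.

```python
dct = {187147: 'Paris'}
item, reg = "Louvre", 187147

# 1) With a period, item.dict[reg] asks the string for an attribute named dict.
try:
    "POI: {} | Location: {}".format(item.dict[reg])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# 2) .format() is applied before +, so here it only touches the last literal
#    and the placeholders in the first literal stay unfilled.
left = "POI: {} | Location: {} | " + "Art: Museum ".format(item, dct[reg])
print(left)

# With a comma and a single string literal, both placeholders are filled.
line = "POI: {} | Location: {} | Art: Museum ".format(item, dct[reg])
print(line)
```

The second point is separate from the period/comma typo: even with the comma in place, splitting the template across two concatenated literals would leave the braces in the output.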