How to crawl multiple pages/cities from a website (BeautifulSoup, Requests, Python3)

【Question title】: How to Crawl Multiple pages/cities from a website (BeautifulSoup,Requests,Python3)
【Posted】: 2016-02-20 07:11:11
【Question】:

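I would like to know how to scrape several different pages/cities from one website with Beautiful Soup/Requests without repeating my code over and over.

Here is my current code: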

import json
import urllib.request

from bs4 import BeautifulSoup

Region = "Marrakech"
Spider = 20

def trade_spider(max_pages):
    page = -1

    partner_ID = 2
    location_ID = 25

    already_printed = set()

    while page <= max_pages:
        page += 1
        response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) +"&page=" + str(page))
        jsondata = json.loads(response.read().decode("utf-8"))
        format = (jsondata['activities'])
        g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
        soup = BeautifulSoup(g_data)



        hallo = soup.find_all("article", {"class": "activity-card"})

        for item in hallo:
            headers = item.find_all("h3", {"class": "activity-card"})
            for header in headers:
                header_final = header.text.strip()
                if header_final not in already_printed:
                    already_printed.add(header_final)

            deeplinks = item.find_all("a", {"class": "activity"})
            for t in set(t.get("href") for t in deeplinks):
                deeplink_final = t
                if deeplink_final not in already_printed:
                    already_printed.add(deeplink_final)

            end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
            print(end_final)

trade_spider(int(Spider))

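Ideally, my goal is to crawl multiple cities/regions from one specific website.

Right now I can do this manually by repeating my code, crawling each city separately, and then concatenating the resulting dataframes, but that seems very unpythonic. Does anyone have a faster way or any suggestions?

I tried adding a second city to my Region variable, but it does not work:

Region = "Marrakech","London"

Can anyone help me with this? Any feedback is appreciated.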

【Question comments】:

Have you tried using a for loop outside the while loop to iterate over multiple regions?

Tags: python json beautifulsoup web-crawler

【Solution 1】:

Region = ["Marrakech","London"]

Put the while loop inside a for loop, and reset page to -1 at the start of each iteration.

for reg in Region:
   page = -1

When requesting the URL, replace Region with reg.
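
To make the replacement concrete, here is a minimal sketch of how the request URLs could be built for each region and page. It only constructs the URLs and makes no network requests. The helper name `build_search_url` is hypothetical, and `urllib.parse.urlencode` is swapped in for the string concatenation used in the thread because it also escapes city names that contain spaces. The sketch also shows why the tuple `Region = "Marrakech","London"` from the question fails: `str()` turns the whole tuple into a single query value.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.jsox.com/s/search.json"  # endpoint taken from the question


def build_search_url(region, page):
    """Hypothetical helper: build the search URL for one region and one page."""
    return BASE_URL + "?" + urlencode({"q": region, "page": page})


# str() on a tuple yields its repr, so the original attempt put
# "('Marrakech', 'London')" into the query instead of a single city name.
broken_query = str(("Marrakech", "London"))

regions = ["Marrakech", "London"]
urls = []
for reg in regions:          # outer loop: one pass per region
    for page in range(3):    # inner loop restarts at page 0 for every region
        urls.append(build_search_url(reg, page))

for url in urls:
    print(url)
```

The nested loops play the same role as the for/while pair in the answer: the page counter starts over whenever the outer loop moves on to the next region.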

import json
import urllib.request

from bs4 import BeautifulSoup

Region = ["Marrakech","London"]
Spider = 20

def trade_spider(max_pages):

    # Unused in this snippet; kept from the question.
    partner_ID = 2
    location_ID = 25
    already_printed = set()
    for reg in Region:
        # Restart the page counter for every region.
        page = -1
        # page is incremented before each request, so this covers pages 0..max_pages + 1.
        while page <= max_pages:
            page += 1
            response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) +"&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            # The code treats 'activities' as a string of HTML.
            activities = jsondata['activities']
            g_data = activities.strip("'<>()[]\"` ").replace('\'', '\"')
            # Name the parser explicitly so the result does not depend on which parsers are installed.
            soup = BeautifulSoup(g_data, "html.parser")

            hallo = soup.find_all("article", {"class": "activity-card"})

            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)

                deeplinks = item.find_all("a", {"class": "activity"})
                for t in set(t.get("href") for t in deeplinks):
                    deeplink_final = t
                    if deeplink_final not in already_printed:
                        already_printed.add(deeplink_final)

                # Prints the most recently found header and deeplink.
                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                print(end_final)

trade_spider(int(Spider))
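
If the results need to be kept per city rather than only printed (the question mentions concatenating them afterwards), one option is to collect them in a dictionary keyed by region. The sketch below is an illustration under stated assumptions, not the answerer's code: `crawl_regions` and `fetch_page` are hypothetical names, and `fetch_page` stands in for the download-and-parse step. It is replaced here by a fake function that returns made-up rows, so the example runs without network access.

```python
def crawl_regions(regions, max_pages, fetch_page):
    """Collect unique (header, deeplink) pairs for each region.

    fetch_page(region, page) is a hypothetical callable that returns a list
    of (header, deeplink) tuples for a single result page.
    """
    results = {}
    for reg in regions:
        seen = set()    # de-duplicate within one region
        rows = []
        for page in range(max_pages + 1):   # pages 0..max_pages
            for row in fetch_page(reg, page):
                if row not in seen:
                    seen.add(row)
                    rows.append(row)
        results[reg] = rows
    return results


def fake_fetch_page(region, page):
    """Stand-in for the real request: returns made-up data for illustration."""
    return [
        (region + " tour " + str(page), "/" + region.lower() + "/" + str(page)),
        (region + " tour 0", "/" + region.lower() + "/0"),  # repeated on every page
    ]


results = crawl_regions(["Marrakech", "London"], 2, fake_fetch_page)
for reg, rows in results.items():
    print(reg, len(rows))
```

Unlike the single `already_printed` set in the answer, this sketch de-duplicates per region, so an entry that appears under two cities is kept for both. Passing the fetch step in as an argument also keeps the looping logic testable without touching the website.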

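【Comments】:

Thank you for your feedback, but I did not quite understand. Could you elaborate if possible? Where should I replace Region with reg when requesting the URL? Sorry if I am getting on your nerves; I am still a beginner.
@SeriousRuffy In this line: 'response = urllib.request.urlopen("jsox.com/s/search.json?q=" + str(Region) +"&page=" + str(page))'
Thank you very much. I appreciate it.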