【Question Title】: Python3 - web scraping zomato - multiple pages
【Posted】: 2019-03-31 07:53:13
【Question】:

I can't seem to generate output for anything beyond page 1 (one page has 15 restaurants, and that's all I get — just 15 results). It looks like the output from the first page is being replaced by that of the second page, and so on.

I tried adding a page range to the scrape, but it still returns only 15 results (scraping just one page).

import requests
import pandas
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

for num in range(1,5):
    url = 'https://www.zomato.com/auckland/restaurants?gold_partner=1&page={}'.format(num)
response = requests.get(url,headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")

top_rest = soup.find_all("div",attrs={"class": "col-s-16 search_results mbot"})
list_tr = top_rest[0].find_all("div",attrs={"class": "js-search-result-li even status 1"})
list_rest =[]

for tr in list_tr:
    dataframe ={}
    dataframe["1.rest_name"] = (tr.find("a",attrs={"class": "result-title hover_feedback zred bold ln24 fontsize0"})).text.replace('\n', ' ')
    dataframe["2.rest_address"] = (tr.find("div",attrs={"class": "col-m-16 search-result-address grey-text nowrap ln22"})).text.replace('\n', ' ')
    list_rest.append(dataframe)
    list_rest

df = pandas.DataFrame(list_rest)
df.to_csv("zomato_res26.csv",index=False)

I expected an output list of 40+ restaurants with their names and locations, but so far I only seem to get the 15 restaurants of a single page.

【Comments】:

    Tags: python-3.x pandas web-scraping beautifulsoup python-requests


    【Solution 1】:

    Fix the indentation: create the list list_rest outside the loop and append to it inside the loop. Also, change the output encoding to encoding='utf-8-sig' so the characters present are handled correctly. You can get the number of pages with int(soup.select_one('.pagination-number b:last-child').text).

    I also added requests.Session() to reuse the connection.

    import requests
    import pandas
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    
    list_rest =[]
    
    with requests.Session() as s:
        for num in range(1,5):
            url = 'https://www.zomato.com/auckland/restaurants?gold_partner=1&page={}'.format(num)
            response = s.get(url,headers=headers)
            content = response.content
            soup = BeautifulSoup(content,"html.parser")
    
            top_rest = soup.find_all("div",attrs={"class": "col-s-16 search_results mbot"})
            list_tr = top_rest[0].find_all("div",attrs={"class": "js-search-result-li even status 1"})
    
            for tr in list_tr:
                dataframe ={}
                dataframe["1.rest_name"] = (tr.find("a",attrs={"class": "result-title hover_feedback zred bold ln24 fontsize0"})).text.replace('\n', ' ')
                dataframe["2.rest_address"] = (tr.find("div",attrs={"class": "col-m-16 search-result-address grey-text nowrap ln22"})).text.replace('\n', ' ')
                list_rest.append(dataframe)
    
    df = pandas.DataFrame(list_rest)
    df.to_csv(r"zomato_res26.csv", sep=',', encoding='utf-8-sig',index = False )
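    The page-count trick above hinges on the `.pagination-number b:last-child` selector. A minimal sketch of how it works, using invented markup that mimics a "Page 1 of 4" widget (the real Zomato HTML may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup for demonstration only.
html = '<div class="pagination-number">Page <b>1</b> of <b>4</b></div>'
soup = BeautifulSoup(html, "html.parser")

# `b:last-child` matches the last <b> element inside the widget,
# which holds the total page count; int() converts its text.
num_pages = int(soup.select_one(".pagination-number b:last-child").text)
print(num_pages)  # 4
```

    BeautifulSoup delegates CSS selectors to the soupsieve library, so pseudo-classes such as `:last-child` work with `select_one` even under the built-in `html.parser`.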
    

    If you want to loop over all pages and use faster selectors with list comprehensions:

    import requests
    import pandas
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    
    list_rest =[]
    
    def getInfo(soup):
        names = [item.text.strip() for item in soup.select('.result-title')]
        addresses =  [item.text.strip() for item in soup.select('.search-result-address')]
        row = list(zip(names, addresses))
        return row
    
    with requests.Session() as s:   
            url = 'https://www.zomato.com/auckland/restaurants?gold_partner=1&page={}'
            response = s.get(url.format(1),headers=headers)
            content = response.content
            soup = BeautifulSoup(content,"lxml")
            numPages = int(soup.select_one('.pagination-number b:last-child').text)
            list_rest.append(getInfo(soup))
    
            if numPages > 1:
                for page in range(2, numPages + 1):
                    response = s.get(url.format(page),headers=headers)
                    content = response.content
                    soup = BeautifulSoup(content,"lxml")
                    list_rest.append(getInfo(soup))
    
    final_list = [item for sublist in list_rest for item in sublist]
    df = pandas.DataFrame(final_list, columns = ['1.rest_name', '2.rest_address'])
    df.to_csv(r"zomato_res26.csv", sep=',', encoding='utf-8-sig',index = False )
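    Since getInfo returns one list of rows per page, list_rest ends up as a list of lists, and the final_list comprehension flattens it before building the DataFrame. A small standalone sketch of that flattening idiom, with toy data standing in for the scraped rows:

```python
from itertools import chain

# Toy per-page rows standing in for the scraped (name, address) tuples.
pages = [[("A", "addr1"), ("B", "addr2")], [("C", "addr3")]]

# The nested comprehension used above: outer loop over pages, inner over rows.
flat = [row for page in pages for row in page]

# itertools.chain.from_iterable is an equivalent, often clearer alternative.
flat_chain = list(chain.from_iterable(pages))

print(flat == flat_chain)  # True
print(len(flat))           # 3
```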
    

    【Discussion】:

    • Thank you so much for the reply — I tested the code and it worked! I'll need to study it carefully to understand how it's structured. Thanks again!
    【Solution 2】:

    What if you don't know the last page number? The script below handles the pagination for you: it parses the last page number and then loops through the pages to collect each restaurant's name and its phone number.

    import pandas
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.zomato.com/auckland/restaurants?gold_partner=1&page="
    
    def get_content(session,link):
        session.headers["User-Agent"] = "Mozilla/5.0"
        response = session.get(link)
        soup = BeautifulSoup(response.text,"lxml")
        dataframe = []
        last_page = soup.select_one(".pagination-number b:nth-of-type(2)").text
        for item_url in range(1,int(last_page)+1):
            res = session.get(f"{link}{item_url}")
            sauce = BeautifulSoup(res.text,"lxml")
            for elem in sauce.select(".search-card"):
                d = {}
                d['name'] = elem.select_one("a[data-result-type='ResCard_Name']").get_text(strip=True)
                d['phone'] = elem.select_one("a.res-snippet-ph-info").get("data-phone-no-str")
                dataframe.append(d)
    
        return dataframe
    
    if __name__ == '__main__':
        with requests.Session() as session:
            item = get_content(session,url)
            df = pandas.DataFrame(item)
            df.to_csv("zomato_res26.csv",index=False)
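    The per-card extraction above combines a CSS attribute selector (`a[data-result-type='ResCard_Name']`) with Tag.get() to read a data attribute. A minimal sketch on invented card markup (attribute names follow the selectors above; the live site's HTML may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical search-card markup for demonstration only.
html = """
<div class="search-card">
  <a data-result-type="ResCard_Name">Cafe Example</a>
  <a class="res-snippet-ph-info" data-phone-no-str="09 123 4567">Call</a>
</div>
"""
card = BeautifulSoup(html, "html.parser").select_one(".search-card")

# get_text(strip=True) trims surrounding whitespace from the link text;
# Tag.get() returns the attribute value (or None if it is absent).
name = card.select_one("a[data-result-type='ResCard_Name']").get_text(strip=True)
phone = card.select_one("a.res-snippet-ph-info").get("data-phone-no-str")
print(name, phone)  # Cafe Example 09 123 4567
```

    Using .get() rather than bracket indexing avoids a KeyError when a card is missing its phone attribute, which is worth keeping in mind on real pages.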
    

    【Discussion】:

    • I'll definitely implement this, since I don't think I'll always be able to determine the last page. Thanks again!