【Question Title】: How do I fix the code to scrape the Zomato website?
【Posted】: 2020-03-26 01:05:13
【Question】:

I wrote this code, but running the last line raises "IndexError: list index out of range". How can I fix this?

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)

    content = response.content
    soup = BeautifulSoup(content,"html.parser")

    top_rest = soup.find_all("div",attrs={"class": "sc-bblaLu dOXFUL"})
    list_tr = top_rest[0].find_all("div",attrs={"class": "sc-gTAwTn cKXlHE"})

    list_rest = []
    for tr in list_tr:
        dataframe = {}
        dataframe["rest_name"] = (tr.find("div",attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
        dataframe["rest_address"] = (tr.find("div",attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
        dataframe["cuisine_type"] = (tr.find("div",attrs={"class":"nowrap grey-text"})).text.replace('\n', ' ')
        list_rest.append(dataframe)
    list_rest

【Comments】:

    Tags: python python-3.x web-scraping data-science web-scraping-language


    【Solution 1】:

    You get this error because top_rest is empty when you try to take its first element with "top_rest[0]". The reason is that the class you are targeting is dynamically generated: if you refresh the page, the same div in the same position will not carry the same class name, so your scrape comes back empty.
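    Because of this, it helps to guard before indexing rather than letting `top_rest[0]` raise. A minimal sketch of that defensive pattern, run against a hypothetical stand-in snippet (the `sc-abc123` / `card` class names below are made up for illustration, not taken from Zomato):

    ```python
    from bs4 import BeautifulSoup

    # Stand-in markup: Zomato's real class names are generated per page load,
    # so the hard-coded "sc-bblaLu dOXFUL" lookup can silently return nothing.
    html = '<div class="sc-abc123 xYz"><div class="card">A</div></div>'
    soup = BeautifulSoup(html, "html.parser")

    top_rest = soup.find_all("div", attrs={"class": "sc-bblaLu dOXFUL"})
    if not top_rest:
        # Fall back to a looser match: any div with a class starting with "sc-"
        top_rest = soup.find_all("div", class_=lambda c: c and c.startswith("sc-"))

    # Check for emptiness before indexing instead of letting it raise IndexError
    cards = top_rest[0].find_all("div", attrs={"class": "card"}) if top_rest else []
    print([c.text for c in cards])
    ```

    The fallback predicate is just one option; any looser selector that survives the class-name churn would serve the same purpose.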

    An alternative is to grab all the divs and then narrow down to the elements you want, keeping in mind the dynamic class-naming scheme, which means you may get different results from one request to the next:

    import requests
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
    
    content = response.content
    soup = BeautifulSoup(content,"html.parser")
    
    top_rest = soup.find_all("div")
    list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
    list_tr
    

    【Comments】:

    • I tried this, and then tried to access a list element, i.e. one restaurant's information, with a loop, using the code in the edit "after the one you helped answer". Sorry if this seems basic, but I'm new to web scraping and just decided to take on a real-life project.
    【Solution 2】:

    I recently did a project that had me researching the Zomato website for Manila, Philippines. I used the geopy library to get the latitude and longitude of Manila City, then used that information to scrape restaurant details. ADD: you can get your own API key on the Zomato website and make up to 1000 calls a day.

    # Use geopy library to get the latitude and longitude values of Manila City.
    from geopy.geocoders import Nominatim
    
    address = 'Manila City, Philippines'
    geolocator = Nominatim(user_agent = 'Makati_explorer')
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of Manila City are {}, {}.'.format(latitude, longitude))
    
    # Use Zomato's API to make calls. `foursquare_venues` is assumed to be a
    # DataFrame of venue names and coordinates built in an earlier step.
    import requests
    import numpy as np
    import pandas as pd

    headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}
    venues_information = []
    
    for index, row in foursquare_venues.iterrows():
        print("Fetching data for venue: {}".format(index + 1))
        venue = []
        url = ('https://developers.zomato.com/api/v2.1/search?q={}' + 
              '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])
        try:
            result = requests.get(url, headers = headers).json()
        except requests.RequestException:
            print("There was an error...")
            continue  # skip this venue, otherwise `result` is stale or undefined
        try:
    
            if (len(result['restaurants']) > 0):
                venue.append(result['restaurants'][0]['restaurant']['name'])
                venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
                venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
                venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
                venue.append(result['restaurants'][0]['restaurant']['price_range'])
                venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
                venue.append(result['restaurants'][0]['restaurant']['location']['address'])
                venues_information.append(venue)
            else:
                venues_information.append(np.zeros(7))  # one placeholder per DataFrame column
        except:
            pass
    
    ZomatoVenues = pd.DataFrame(venues_information, 
                                      columns = ['venue', 'latitude', 
                                                 'longitude', 'price_for_two', 
                                                 'price_range', 'rating', 'address'])
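
    One fragile spot in the URL construction above is that the restaurant name is interpolated raw, so spaces or accented characters produce an invalid query string. A small sketch of building the same URL with proper encoding (the restaurant name and coordinates here are made-up sample values):

    ```python
    from urllib.parse import urlencode

    def build_search_url(name, lat, lng,
                         base="https://developers.zomato.com/api/v2.1/search"):
        # urlencode percent-escapes spaces and non-ASCII characters in the
        # restaurant name, which raw .format() interpolation does not.
        params = {"q": name, "start": 0, "count": 1,
                  "lat": lat, "lon": lng, "sort": "real_distance"}
        return "{}?{}".format(base, urlencode(params))

    url = build_search_url("Café Adriatico", 14.5547, 121.0244)
    print(url)
    ```

    The resulting string can be passed to `requests.get` exactly as in the loop above.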
    

    【Comments】:

      【Solution 3】:

      Using Web Scraping Language, I can write this:

      GOTO https://www.zomato.com/bangalore/top-restaurants
      EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]', 
               'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]', 
               'cuisine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]
      

      This will iterate over every record element with the class bke1zw-1 eMsYsc and pull each restaurant's information.
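
      For comparison, the same extraction can be sketched in plain Python with BeautifulSoup. It is run here against an inline stand-in for the page markup: the restaurant values are made up, and the class names are the ones from the snippet above, which may have changed on Zomato's side:

      ```python
      from bs4 import BeautifulSoup

      # Inline stand-in for one record element of the listing page
      html = '''
      <div class="bke1zw-1 eMsYsc">
        <div class="res_title zblack bold nowrap">Truffles</div>
        <div class="nowrap grey-text fontsize5 ttupper">Koramangala</div>
        <div class="nowrap grey-text">Burgers</div>
      </div>'''
      soup = BeautifulSoup(html, "html.parser")

      records = []
      # Iterate over each record element, mirroring the IN clause above
      for card in soup.find_all("div", attrs={"class": "bke1zw-1 eMsYsc"}):
          records.append({
              "rest_name": card.find("div", attrs={"class": "res_title zblack bold nowrap"}).text,
              "rest_address": card.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"}).text,
              "cuisine_type": card.find("div", attrs={"class": "nowrap grey-text"}).text,
          })
      print(records)
      ```

      Passing a multi-class string like "nowrap grey-text" to `attrs` makes BeautifulSoup match the exact class attribute value, so the address div (which has extra classes) does not shadow the cuisine div.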

      【Comments】:
