【Question title】: Web-scraping script for booking.com doesn't work
【Posted】: 2021-05-18 12:54:41
【Question description】:

I wrote a script to scrape the hotel name, rating and perks for each hotel on this page: link

Here is my script:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)


soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href']  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url',  href=True)]
 

root_url = 'https://www.booking.com/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]



pointforts = []
hotels = []
notes = []

for url in urls1: 
    # reuse the browser headers here too; booking.com may block the default user agent
    results = requests.get(url, headers=headers)

    soup = BeautifulSoup(results.text, "html.parser")

    try :
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)

    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)

    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')



data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})


#print(data.head(20))

data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')

It worked: I made a loop to scrape all the hotel links, and then scraped the ratings and perks for each of those hotels. But I was getting duplicates, so instead of: links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]

I used: links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)], as you can see in the script above.

But now it no longer works: I only get Nan. Before, when I had the duplicates, there were some Nan values, but most rows had the perks and ratings I wanted. I don't understand why.
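For reference, another way to deal with duplicate links, independent of which CSS classes Booking.com uses, is to keep the broad find_all and de-duplicate the resulting list while preserving order (the hrefs below are hypothetical examples):

```python
# Hypothetical list of scraped hrefs containing a duplicate
links1 = ['/hotel/fr/a.html', '/hotel/fr/a.html', '/hotel/fr/b.html']

# dict keys are unique and (since Python 3.7) keep insertion order,
# so this drops duplicates without reordering the links
links1 = list(dict.fromkeys(links1))
print(links1)  # ['/hotel/fr/a.html', '/hotel/fr/b.html']
```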

Here is the HTML for the hotel link:

hotellink

Here is the HTML for getting the name (after getting the link, the script goes to that link):

namehtml

Here is the HTML for getting all the perks associated with a hotel (as with the name, the script goes to the link scraped earlier):

perkshtml

And here are my results...

output

【Question comments】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:

    The href values on that site contain newline characters: one at the start, and some partway through as well. So when you concatenated them with root_url, you were not getting valid URLs.

    The fix is to remove all the newlines. Since each href already starts with /, the trailing / can also be dropped from root_url; alternatively, you can use urllib.parse.urljoin().
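As a quick illustration of why the join failed (the href value below is a shortened, hypothetical example, not the exact string Booking.com returns):

```python
from urllib.parse import urljoin

# Hypothetical href as scraped: a leading newline and one in the middle
href = '\n/hotel/fr/elyseesunion.fr.html\n?aid=304142'

# Naive concatenation keeps the newlines and yields an invalid URL
bad = 'https://www.booking.com/' + href

# Stripping the newlines first, then joining, gives a valid absolute URL
clean = href.replace('\n', '')
good = urljoin('https://www.booking.com/', clean)
print(good)  # https://www.booking.com/hotel/fr/elyseesunion.fr.html?aid=304142
```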

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
        'Referer': 'https://www.espncricinfo.com/',
        'Upgrade-Insecure-Requests': '1',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
    }
    
    url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
    
    results = requests.get(url0, headers = headers)
    soup = BeautifulSoup(results.text, "html.parser")
    
    links1 = [a['href'].replace('\n','')  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url',  href=True)]
    root_url = 'https://www.booking.com'
    urls1 = [f'{root_url}{i}' for i in links1]
    
    pointforts = []
    hotels = []
    notes = []
    
    for url in urls1: 
        results = requests.get(url, headers=headers)
        soup = BeautifulSoup(results.text, "html.parser")
    
        try:
            div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
            pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
            pointforts.append(pointfort)
        except:
            pointforts.append('Nan')
    
        try:    
            note = soup.find('div', class_ = 'bui-review-score__badge').text
            notes.append(note)
        except:
            notes.append('Nan')
        
        try:
            hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
            hotels.append(hotel)
        except:
            hotels.append('Nan')
    
    
    data = pd.DataFrame({
        'Notes' : notes,
        'Points fort' : pointforts,
        'Nom' : hotels})
    
    #print(data.head(20))
    data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
    

    This gives you an output CSV file that starts:

    Notes;Points fort;Nom
     8,3 ;['Parking (fee required)', 'Free WiFi Internet Access Included', 'Family Rooms', 'Airport Shuttle', 'Non Smoking Rooms', '24 hour Front Desk', 'Bar'];Elysées Union
     8,4 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', '24 hour Front Desk', 'Rooms/Facilities for Disabled'];Hyatt Regency Paris Etoile
     8,3 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', 'Restaurant', '24 hour Front Desk', 'Bar'];Pullman Paris Tour Eiffel
     8,7 ;['Free WiFi Internet Access Included', 'Non Smoking Rooms', 'Restaurant', '24 hour Front Desk', 'Rooms/Facilities for Disabled', 'Elevator', 'Bar'];citizenM Paris Gare de Lyon
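A small follow-up sketch (not part of the original answer): the Notes column comes out as padded strings with a French decimal comma, like ' 8,3 ', so a little cleanup is needed before doing anything numeric with it:

```python
import pandas as pd

# Sample values as they appear in the scraped CSV (hypothetical rows)
data = pd.DataFrame({'Notes': [' 8,3 ', ' 8,4 ', 'Nan']})

# Strip the padding, swap the decimal comma for a point, and coerce
# anything unparseable (the 'Nan' placeholder rows) to a real NaN
data['Notes'] = pd.to_numeric(
    data['Notes'].str.strip().str.replace(',', '.', regex=False),
    errors='coerce',
)
print(data['Notes'].tolist())  # [8.3, 8.4, nan]
```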
    

    【Comments】:

    • Thank you very much!! :) That was subtle, well done