【Posted】:2021-05-18 12:54:41
【Question】:
I wrote a script to scrape the hotel name, rating, and perks for each hotel listed on this page: link
Here is my script:
import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []
results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)]
root_url = 'https://www.booking.com/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]
pointforts = []
hotels = []
notes = []
for url in urls1:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')
    try:
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')
data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})
#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')
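The bare `except:` clauses in the script turn every failure into 'Nan' without saying which lookup failed or why. As a minimal sketch of how the reason could be surfaced, the helper below wraps one extraction and returns the failure message alongside the default. `extract_or_reason` is a hypothetical name, not part of the original script, and the `missing` variable only stands in for a `soup.find(...)` call that returned `None`.

```python
def extract_or_reason(label, extractor, default="Nan"):
    """Run extractor(); on a lookup failure return (default, reason)."""
    try:
        return extractor(), None
    except (AttributeError, TypeError, KeyError, IndexError) as exc:
        # Keep the exception type and message so the cause stays visible.
        return default, f"{label}: {type(exc).__name__}: {exc}"


# Stand-in for soup.find(...) returning None when the element is absent.
missing = None
value, reason = extract_or_reason("note", lambda: missing.text)
print(value, reason)
```

In the scraping loop, the same call could wrap each of the three lookups, for example `extract_or_reason("note", lambda: soup.find('div', class_='bui-review-score__badge').text)`, and printing `reason` together with `results.status_code` would show whether the page or the selector is at fault.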
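The script worked: I wrote a loop that collects all the hotel links and then scrapes the rating and perks for each of those hotels. But I was getting duplicates, so instead of:
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
I used: links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)], as you can see in the script above.
But now it no longer works: I get only 'Nan'. Before, when I still had duplicates, I got a few 'Nan' values, but most rows had the perks and ratings I wanted. I don't understand why.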
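One possible way to drop the duplicates while keeping the broader `find_all('a', href=True)` selector is to normalize and de-duplicate the hrefs after collecting them. The sketch below is not from the original script: `build_hotel_urls` is a hypothetical helper, and it assumes the duplicates are the same hotel path repeated, possibly with surrounding whitespace or a `#fragment` appended. The question does not confirm that this is the case for this page.

```python
from urllib.parse import urldefrag, urljoin


def build_hotel_urls(hrefs, root="https://www.booking.com/"):
    """Return absolute, de-duplicated URLs in first-seen order."""
    seen = {}
    for href in hrefs:
        # Assumption: an href may carry stray whitespace or newlines.
        cleaned = href.strip()
        if not cleaned:
            continue
        # urljoin avoids the double slash that string concatenation
        # produces when root ends with "/" and the href starts with "/".
        absolute, _fragment = urldefrag(urljoin(root, cleaned))
        seen.setdefault(absolute, None)
    return list(seen)


# Illustrative hrefs, not taken from the real page.
sample = [
    "/hotel/fr/a.html",
    "\n/hotel/fr/a.html\n",
    "/hotel/fr/a.html#tab-reviews",
    "/hotel/fr/b.html",
]
print(build_hotel_urls(sample))
```

Printing the raw `links1` list with `repr()` before building the URLs is a quick way to check whether the hrefs returned by the narrower class selector differ from the ones that worked.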
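Here is the HTML for the hotel links:
Here is the HTML used to get the name (after collecting the links, the script visits each one):
Here is the HTML used to get all the perks for a hotel (as with the name, the script visits the links I scraped earlier):
Here are my results...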
【Comments】:
Tags: python python-3.x web-scraping beautifulsoup