BeautifulSoup struggling to scrape the listings detail page

【Question title】: BeautifulSoup struggling to scrape the listings detail page
【Posted】: 2020-06-27 13:27:28
【Question】:

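I am still new to the Python world. I am trying to build a scraper that will be useful for my daily work, but I am stuck at one specific point:

My goal is to scrape a real estate website. I am using BeautifulSoup, and I can get the parameters on the listings page without any problem. But when I go into the listing detail page, I cannot scrape any data.

My code: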

from bs4 import BeautifulSoup
import requests

url = "https://timetochoose.co.ao/?search-listings=true"

headers = {'User-Agent': 'whatever'}

response = requests.get(url, headers=headers)

print(response)

data = response.text

print(data)

soup = BeautifulSoup(data, 'html.parser')

anuncios = soup.find_all("div", {"class": "grid-listing-info"})

for anuncios in anuncios:
    titles = anuncios.find("a",{"class": "listing-link"}).text
    location = anuncios.find("p",{"class": "location muted marB0"}).text
    link = anuncios.find("a",{"class": "listing-link"}).get("href")
    anuncios_response = requests.get(link)
    anuncios_data = anuncios_response.text
    anuncios_soup = BeautifulSoup(anuncios_data, 'html.parser')
    conteudo = anuncios_soup.find("div", {"id":"listing-content"}).text


    print("Título", titles, "\nLocalização", location, "\nLink", link, "\nConteudo", conteudo)

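For example, I get nothing in the "conteudo" variable. I tried to get other data from the detail page, such as the price or the number of rooms, but it always fails and I only get "None".

I have been looking for an answer since yesterday afternoon, but I cannot find where I am going wrong. I can get the parameters on the top-level page without any problem, but it fails as soon as I reach the listing detail page.

If anyone can point out what I am doing wrong, I would be very grateful. Thank you in advance for taking the time to read my question.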

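【Comments】:

Have you disabled JavaScript in your browser and reloaded the page to see whether the content is retrieved dynamically from somewhere else?
I don't think this is a dynamic-content problem. I just found that if I try it like this: for div in soup.find_all('div', id='listing-content'): print(div.text) it works... but this: conteudo = anuncios_soup.find("div", {"id":"listing-content"}).text does not! I am getting more and more confused.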
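
One possible reading of the difference described in the comment above (not confirmed in the thread): find() returns None when nothing matches, so chaining .text onto it raises an error, while find_all() returns an empty list and a loop over it silently does nothing. The sketch below uses a made-up error page in place of a real response to show both behaviours and a simple guard.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a response that lacks the expected element,
# e.g. an error page served instead of the listing detail page.
error_page = "<html><body><h1>Gone</h1></body></html>"
soup = BeautifulSoup(error_page, "html.parser")

# find() returns None when nothing matches, so chaining .text would raise AttributeError.
content = soup.find("div", {"id": "listing-content"})
print(content)  # None

# find_all() returns an empty list instead, so a for loop over it never runs its body.
matches = soup.find_all("div", id="listing-content")
print(len(matches))  # 0

# Checking for None before reading the text avoids the crash.
text = content.get_text(strip=True) if content is not None else ""
print(repr(text))  # ''
```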

Tags: python web-scraping beautifulsoup

【Solution 1】:

To get the correct page, you need to set the User-Agent HTTP header.

For example:

import requests
from bs4 import BeautifulSoup


main_url = 'https://timetochoose.co.ao/?search-listings=true'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}


def print_info(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    print(soup.select_one('#listing-content').get_text(strip=True, separator='\n'))


soup = BeautifulSoup(requests.get(main_url, headers=headers).content, 'html.parser')
for a in soup.select('a.listing-featured-image'):
    print(a['href'])
    print_info(a['href'])
    print('-' * 80)

Prints:

https://timetochoose.co.ao/listings/loja-rua-rei-katiavala-luanda/
Avenida brasil , Rua katiavala
Maculusso
Loja com 90 metros quadrados
2 andares
1 wc
Frente a estrada
Arrendamento  mensal 500.000 kz Negociável
--------------------------------------------------------------------------------
https://timetochoose.co.ao/listings/apertamento-t3-rua-cabral-montcada-maianga/
Apartamento T3 maianga
1  suíte com varanda
2 quartos com varanda
1 wc
1 sala comum grande
1 cozinha
Tanque de  agua
Predio limpo
Arrendamento 350.000  akz Negociável
--------------------------------------------------------------------------------

...and so on.

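【Comments】:

Thanks, your suggestion works well. I am still trying to find where I went wrong, though. I updated the user agent and ran more tests, and I found that the problem in my code starts here: anuncios_response = requests.get(link) - when I print it, it returns 410, so of course everything after it fails.
@theprodigy83 Yes, you need to specify the headers in anuncios_response = requests.get(link, headers=<headers here>) to get the correct response.
Ah, I see. I forgot to pass the headers as a request argument.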
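
The fix discussed in this thread could be applied to the original loop roughly as follows. This is only a sketch: the selectors and the User-Agent string are taken from the question and the answer, while the helper names (parse_listings, parse_detail, scrape), the use of requests.Session to send the same headers on every request, and the 30-second timeout are assumptions added for illustration.

```python
import requests
from bs4 import BeautifulSoup

MAIN_URL = "https://timetochoose.co.ao/?search-listings=true"
# User-Agent string copied from the answer above; any browser-like value may work.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"}


def parse_listings(html):
    """Return (title, location, link) tuples found in the search results HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for info in soup.find_all("div", {"class": "grid-listing-info"}):
        anchor = info.find("a", {"class": "listing-link"})
        place = info.find("p", {"class": "location muted marB0"})
        if anchor is None:
            continue  # skip cards without a link instead of crashing
        location = place.get_text(strip=True) if place is not None else ""
        results.append((anchor.get_text(strip=True), location, anchor.get("href")))
    return results


def parse_detail(html):
    """Return the text of #listing-content, or None if the element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", {"id": "listing-content"})
    if content is None:
        return None
    return content.get_text(separator="\n", strip=True)


def scrape(main_url=MAIN_URL):
    """Hypothetical driver: fetch the results page, then each detail page."""
    with requests.Session() as session:
        # Headers set on the session are sent with every request made through it,
        # including the detail-page requests that returned 410 in the question.
        session.headers.update(HEADERS)
        response = session.get(main_url, timeout=30)
        response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
        for title, location, link in parse_listings(response.text):
            detail = session.get(link, timeout=30)
            detail.raise_for_status()
            print("Título", title, "\nLocalização", location,
                  "\nLink", link, "\nConteudo", parse_detail(detail.text))
```

Calling scrape() would then perform the requests. Keeping the parsing in separate functions that take an HTML string makes it possible to check the selectors against a saved page without touching the network.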