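Web scraping with BeautifulSoup won't work: answers

【Question title】: Web scraping with BeautifulSoup won't work
【Posted】: 2020-08-01 09:47:22
【Question description】:

Ultimately, I am trying to open all the articles on a news site and then rank the top 10 words used across all of them. To do that, I first wanted to see how many articles there are so that I can iterate over them at some point; I have not really figured out yet how I want to do everything.

For this I wanted to use BeautifulSoup4. I think the class I am trying to get is generated by JavaScript, because I get nothing back. Here is my code: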

import requests
from bs4 import BeautifulSoup

url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")

print(titels)
for titel in titels:
    print(titel)

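The article title is sometimes an h2 and sometimes an h3. It always has the same class, but I cannot get anything through that class. It has a few parents whose class names are the same but with an extension such as -wrapper. I do not even know how to use the parents to get what I want, but I think those classes are JavaScript as well. There is also an href I am interested in. But again, that is probably JavaScript too, because it returns nothing.

Does anyone know how I could use BeautifulSoup to get anything at all (preferably the href, but the article title would also be fine)?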
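
For the end goal mentioned in the question (the top 10 words across all articles), here is a minimal sketch of the counting step, assuming the article text has already been scraped into plain strings. The function name top_words and the sample sentences are hypothetical and only serve as an illustration.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most common words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # \w+ matches runs of word characters; lowercase so "De" and "de" merge
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


# Hypothetical article texts standing in for scraped content
sample_articles = [
    "De trein rijdt weer",
    "De trein staat stil",
    "Weer geen trein",
]

top = top_words(sample_articles, n=3)
print(top)
```

For real articles you would probably also want to drop very common stop words before counting; that is left out here to keep the sketch short.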

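【Comments】:

If you open your URL in a browser, you can view the page source. If what you want is there, it comes from the server rather than being added by JS, so BeautifulSoup can work. If it does come from the server, I would find a suitable CSS selector, which you can try in the browser's dev tools console via $("<selector>"). Once it works in the browser, soup.select("<selector>") can take over. As far as I know, CSS selectors in BeautifulSoup give you as much functionality as its own custom find methods. The difference is that you can get help from front-end people who work with CSS.
One problem I ran into is that when opening the page you first see an Accept Cookies page. Without getting past that page, you cannot go on to fetch the articles.
@Sri Nice one! I never thought of GDPR cookies as a scraping blocker, but then again my knowledge of scraping is limited to pulling data out of web pages for unit tests. Does anyone know whether requests can be sweet-talked into accepting the GDPR cookies (probably not, but why not ask)? Or do you have to go straight to selenium?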
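
The CSS-selector suggestion in the first comment can be sketched roughly as follows. The HTML snippet is made up for illustration: only the class names ankeiler__title and ankeiler__link appear in this thread, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names mentioned in this thread
html = """
<article class="ankeiler">
  <a class="ankeiler__link" href="/binnenland/voorbeeld-1">
    <h2 class="ankeiler__title">Eerste titel</h2>
  </a>
</article>
<article class="ankeiler">
  <a class="ankeiler__link" href="/sport/voorbeeld-2">
    <h3 class="ankeiler__title">Tweede titel</h3>
  </a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector matches the title class whether it sits on an h2 or an h3
titles = [el.get_text(strip=True) for el in soup.select("article .ankeiler__title")]
print(titles)
```

Because the selector targets the class rather than the tag name, the h2 and h3 cases do not need separate handling.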

Tags: javascript python class web-scraping beautifulsoup

【Solution 1】:

If you do not want to use selenium, this worked for me. I tried it on 2 PCs with different internet connections. Could you give it a try?

from bs4 import BeautifulSoup
import requests

cookies={"pwv":"2",
"pws":"functional|analytics|content_recommendation|targeted_advertising|social_media"}

page=requests.get("https://www.ad.nl/",cookies=cookies)

soup = BeautifulSoup(page.content, 'html.parser')

articles = soup.findAll("article")

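Then follow kimbo's code to extract the h2/h3 titles.

【Comments】:

Yes, it works for me too, and it gave me all the different HTML content from the articles. Could you explain a bit more what you did with the cookies? I do not really understand what is going on there, thanks!
I like this approach @inxp! +1
The site sets these cookies once GDPR consent and a few other adblock/tracking-related checks have been accepted. We simply set these cookies ourselves and send them as part of the GET request.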
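
To make the cookie explanation a bit more concrete, here is a small sketch that builds the same request without sending it and prints the Cookie header that requests would attach. The cookie names and values are the ones from the answer above; whether the site still accepts them is an assumption that may no longer hold.

```python
import requests

# Cookie names and values copied from the answer above
cookies = {
    "pwv": "2",
    "pws": "functional|analytics|content_recommendation|targeted_advertising|social_media",
}

# Build the request without sending it, so no network access is needed
prepared = requests.Request("GET", "https://www.ad.nl/", cookies=cookies).prepare()

# The dict has been folded into a single Cookie request header
cookie_header = prepared.headers["Cookie"]
print(cookie_header)
```

So from the server's point of view the request looks like one from a browser that has already been through the consent page.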

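【Solution 2】:

As @Sri mentioned in the comments, when you open that URL a page appears where you first have to accept cookies, which requires interaction. When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).

Here is something to get you started.

(Edit: you need to run pip install selenium before running the code below)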

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
# (Selenium 4 removed find_element_by_name; find_element(By.NAME, ...) replaces it)
btn = driver.find_element(By.NAME, 'action')
btn.click()

# grab the html of the page as it is now; page_source itself does not wait,
# so add an explicit wait if content is still loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
    h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some of them had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
        print(text)

# close the browser
driver.close()

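This may or may not be exactly what you had in mind, but it is just an example of how to use selenium and Beautiful Soup. Feel free to copy/use/modify it. If you are wondering which selectors to use, read @JL Peyret's comment.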

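【Comments】:

This does work, awesome! I thought it would also be possible to write soup = BeautifulSoup(driver.lower(), "html.parser") instead of soup = BeautifulSoup(html.lower(), "html.parser"), but that does not work. Could you explain why? Thanks a lot for your answer!
driver is an object. Try running print(driver) and/or print(type(driver)). html, on the other hand, is just a str.
Thanks again. Right now I am trying to get all the hrefs in the articles, because later I want to click all of them somehow. I thought I could tweak your code a bit, but it does not seem to work. I changed h2_titles = article.findAll('h2', {'class': 'ankeiler__title'}) to links = article.findAll('a', {'class': 'ankeiler__link'}) and then tried print(links), but I think it gives me the whole HTML of the page. Do you know what I should do to get only the links, or do you perhaps know a different way for me to click these links one by one?
See stackoverflow.com/questions/19664253/….
Haha, you are right. I just posted that as a question. Thank you very much for your time and your answer!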
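
On the href question in the comments: printing the result of findAll shows the full tag objects, which is why it looks like a lot of HTML. A minimal sketch of reading only the href attribute follows. The markup and the BASE_URL constant are hypothetical; only the class name ankeiler__link comes from this thread.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed site root, used to resolve relative links
BASE_URL = "https://www.ad.nl/"

# Hypothetical markup; only the ankeiler__link class name comes from this thread
html = """
<article><a class="ankeiler__link" href="/binnenland/voorbeeld-1">Een</a></article>
<article><a class="ankeiler__link" href="https://www.ad.nl/sport/voorbeeld-2">Twee</a></article>
<article><a class="ankeiler__link">Geen href</a></article>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", class_="ankeiler__link"):
    href = a.get("href")  # None when the attribute is missing
    if href:
        # Resolve relative paths against the site root
        links.append(urljoin(BASE_URL, href))

print(links)
```

With a list of absolute URLs you could then fetch each article with requests (or driver.get in selenium) instead of clicking the links one by one.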