BeautifulSoup not catching the content

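【Question title】: BeautifulSoup not catching the content
【Posted】: 2020-10-19 15:06:33
【Question】:

I'm completely new to Python, but I've written some code with BeautifulSoup to parse content from different sites. The code should capture the <article> tags on a site or, if none are available, the <p> tags. It works in most cases, but some sites return an error even though, when I inspect the site, it contains <p> tags with content, so the code should return the text between those <p> tags.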

import requests
import sys
from bs4 import BeautifulSoup

try:
    source = requests.get('https://reactpodcast.com/episodes/96').text
except:
    print('Site does not exist')
    sys.exit()

soup = BeautifulSoup(source, 'lxml')
div_s = soup.find_all('div')
title = soup.find('title')
article = soup.find('article')

content = soup.find_all('p')
allContent = ""
for c in content:
    allContent += c.text

yt_title = soup.find('span', class_='watch-title')
yt_description = soup.find('p', attrs={'id': 'eow-description'})
try:
    if article != None:
        print(title.text)
        print(article.text)
    elif "https://www.youtube.com" in source:
        print(yt_title.text)
        print(yt_description.text)
    elif article == None:
        print(title.text)
        print(allContent)
    else:
        print('There is an error')
except:
    print('This URL is invalid')
    sys.exit()

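Does anyone have any suggestions (tips and tricks) for solving this?

Thank you in advance!

【Comments】:

Hello, and thanks for the example. It's great that you are collecting data from two sites.

Tags: python web-scraping beautifulsoup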

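【Solution 1】:

I have run into this problem before. It is probably caused by JavaScript: the page builds its content in the browser, so the HTML that requests downloads does not contain it. I suggest using Selenium to get around this; see "How to use Selenium with Python?".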
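
A minimal sketch of that suggestion, assuming the selenium package, Chrome, and a matching driver are installed. The helper names extract_text and fetch_rendered_html are made up for illustration, and the fallback from <article> to <p> mirrors what the question describes rather than the asker's exact code:

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return (title, body text), preferring <article> and falling back to all <p> tags."""
    # 'html.parser' ships with Python; 'lxml' (used in the question) also works if installed.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    title_text = title.get_text(strip=True) if title is not None else ""
    article = soup.find("article")
    if article is not None:
        return title_text, article.get_text(" ", strip=True)
    paragraphs = soup.find_all("p")
    return title_text, "\n".join(p.get_text(" ", strip=True) for p in paragraphs)


def fetch_rendered_html(url):
    """Load the page in headless Chrome so its scripts run, then return the resulting HTML."""
    # Imported here so extract_text() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Usage (starts a real browser, so it is left commented out):
# title, text = extract_text(fetch_rendered_html("https://reactpodcast.com/episodes/96"))
# print(title)
# print(text)
```

Note that page_source is read as soon as the page load finishes. If a site fills in its content later, an explicit wait (for example Selenium's WebDriverWait) may be needed before reading it.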

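【Discussion】:

Is it also a good way to fetch the tags from 100 different websites?
Selenium is slower than the normal approach (requests). Some websites use JS to generate their content, so the normal approach cannot retrieve it. Selenium lets you open the website in a browser and get all of the JS-generated content.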
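
Since Selenium is the slower option, one way to handle many sites is to try the plain download first and only fall back to a browser when the static HTML has no usable text. The sketch below is an assumption about how that could look, not something from this thread: has_content and fetch_html are hypothetical names, and the two fetchers are passed in as functions so that any requests-based or Selenium-based loader can be plugged in.

```python
from bs4 import BeautifulSoup


def has_content(html):
    """Heuristic: True if the HTML holds an <article> or <p> tag with non-empty text."""
    soup = BeautifulSoup(html, "html.parser")
    return any(tag.get_text(strip=True) for tag in soup.find_all(["article", "p"]))


def fetch_html(url, fetch_static, fetch_rendered):
    """Use the cheap static fetch when it yields text; otherwise use the rendered fetch.

    fetch_static and fetch_rendered are caller-supplied functions taking a URL
    and returning an HTML string.
    """
    html = fetch_static(url)
    if has_content(html):
        return html
    return fetch_rendered(url)


# Usage sketch (assumes requests is installed and a Selenium-based loader exists):
# import requests
# html = fetch_html(
#     "https://reactpodcast.com/episodes/96",
#     fetch_static=lambda u: requests.get(u, timeout=10).text,
#     fetch_rendered=my_selenium_loader,  # hypothetical function returning page HTML
# )
```

The heuristic is deliberately crude: a page that ships a placeholder paragraph such as "Please enable JavaScript" would still count as having content, so the check may need tightening per site.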

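【Solution 2】:

I can suggest a few improvements to your code:

Comparing an object to None with something != None is not recommended, because == and != can be overridden by the object's class. You can read about it in this article: https://realpython.com/python-is-identity-vs-equality.
It is better to write the comparison as something is not None or something is None.
Avoid using except without naming the error or exception you expect; a bare except also hides unrelated bugs. You can find some useful information here: https://www.techbeamers.com/use-try-except-python/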
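
Both points can be illustrated with a small self-contained sketch. AlwaysEqual and FakeTag are hypothetical classes invented for this example; FakeTag merely stands in for a BeautifulSoup tag, and None stands in for what soup.find() returns when nothing matches.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()
print(obj != None)      # False, although obj is clearly not None
print(obj is not None)  # True: an identity check cannot be overridden


class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup tag; only .text matters here."""

    def __init__(self, text):
        self.text = text


def tag_text(tag):
    """Return tag.text, or None when the lookup found nothing (tag is None)."""
    try:
        return tag.text
    except AttributeError:
        # Only the expected failure is handled; any other error still surfaces.
        return None


print(tag_text(FakeTag("hello")))  # hello
print(tag_text(None))              # None
```

Applied to the question's script, the same idea would mean catching requests.exceptions.RequestException around requests.get and AttributeError around the .text lookups, instead of two bare except clauses.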

Good luck!

【Discussion】: