Extracting text contained in an HTML tag with a copyright symbol © using Python 3: answers

【Question title】: Extracting texts contained in a html tag with a copyright symbol © using Python 3
【Posted】: 2018-07-13 20:09:00
【Question description】:

【Discussion】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

I finally found the solution I was looking for:

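import re
import requests
from bs4 import BeautifulSoup

URL = 'https://profile.theguardian.com/signin'
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, 'html.parser')
# The character itself is enough; encoding to UTF-8 and decoding again is a no-op
symbol = '\N{COPYRIGHT SIGN}'
pattern = re.compile(re.escape(symbol))
# Find every text node containing the symbol and print its parent tag's text
for tag in soup.find_all(string=pattern):
    copyrightTexts = tag.parent.text
    print(copyrightTexts)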

Hope this helps others. Thanks to those who tried to help.
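
The same idea can be wrapped in a function and tried on an inline HTML string, without a network request. This is only a minimal sketch: the name `copyright_texts` and the sample markup are hypothetical and not part of the original answer. It also suggests that a `&copy;` entity in the source is matched, since the parser converts it to the © character.

```python
import re
from bs4 import BeautifulSoup

# Pattern matching the copyright character (U+00A9)
COPYRIGHT = re.compile("\N{COPYRIGHT SIGN}")

def copyright_texts(html):
    """Return the text of each tag that directly wraps a text node containing ©."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches text nodes; .parent is the tag around each match
    return [node.parent.get_text(" ", strip=True)
            for node in soup.find_all(string=COPYRIGHT)]

# Hypothetical sample markup, for illustration only
sample = """
<div>
  <p class="legal">&copy; 2018 Example Ltd</p>
  <span>No symbol here</span>
  <footer>Site © Example</footer>
</div>
"""

for text in copyright_texts(sample):
    print(text)
```

Because the search is by symbol rather than by tag name or class, it does not depend on where a given page places its copyright notice.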

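【Discussion】:

【Solution 2】:

Hi, you should post sample code when submitting a question, but the following should tell you whether the copyright sign appears on a given page: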

from bs4 import BeautifulSoup
import urllib.request

masterURL = 'https://profile.theguardian.com/signin'

# Fetch and parse the page (the 'lxml' parser requires the lxml package)
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce, 'lxml')
temp = soup.prettify().encode('UTF-8')

# b'\xc2\xa9' is the UTF-8 encoding of the copyright sign (U+00A9)
if b'\xc2\xa9' in temp:
    print('Copyright sign on page')
else:
    print('No copyright sign on page')
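
The bytes `\xc2\xa9` are specific to UTF-8, which is why the check above encodes the prettified output as UTF-8 first. The sketch below illustrates that point on raw bytes: under another encoding such as Latin-1 the same character is a different byte sequence, so comparing decoded text is the more general test. The helper `has_copyright` and the two sample pages are hypothetical, not part of the original answer.

```python
COPYRIGHT = "\N{COPYRIGHT SIGN}"  # U+00A9

def has_copyright(raw, encoding="utf-8"):
    """Decode raw page bytes, then look for the character itself."""
    text = raw.decode(encoding, errors="replace")
    return COPYRIGHT in text

# Hypothetical page content, encoded two different ways
utf8_page = "<p>© 2018</p>".encode("utf-8")
latin1_page = "<p>© 2018</p>".encode("latin-1")

print(COPYRIGHT.encode("utf-8"))              # b'\xc2\xa9'
print(COPYRIGHT.encode("latin-1"))            # b'\xa9'
print(b"\xc2\xa9" in latin1_page)             # False: the UTF-8 byte test misses it
print(has_copyright(latin1_page, "latin-1"))  # True
```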

【Discussion】:

【Solution 3】:

Since the text sits in a p element with the class footer__copyright, you can do this:

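from bs4 import BeautifulSoup
import urllib.request as url

masterURL = 'https://profile.theguardian.com/signin'
# Select every <p> element whose class is footer__copyright
BeautifulSoup(url.urlopen(masterURL).read(), 'html.parser').select("p.footer__copyright")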

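【Discussion】:

Your solution is specific to this web page, but copyright information can be placed in different tags and attributes. So I want generic code that searches by the symbol.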