Extract title with BeautifulSoup: Answers

【Question Title】: Extract title with BeautifulSoup
【Posted】: 2016-06-27 15:10:37
【Question】:

I have this code:

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
raw.find_all('title', limit=1)
print (raw.find_all("title"))
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

I want to extract the page's title with BeautifulSoup, but I get this error:

Traceback (most recent call last):
  File "C:\Users\Passanova\AppData\Local\Programs\Python\Python35-32\test.py", line 8, in <module>
    raw.find_all('title', limit=1)
AttributeError: 'str' object has no attribute 'find_all'

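Any suggestions, please?

【Comments】:

Tags: python-3.x beautifulsoup

【Solution 1】:

On some pages I ran into a NoneType problem (soup.title is None when the page has no <title> tag). One suggestion is: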

from bs4 import BeautifulSoup

# `data` is the HTML of the page, as a string
soup = BeautifulSoup(data, 'html.parser')
if soup.title is not None:
    title = soup.title.string
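
The None check above can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the function name `get_title` and its `default` parameter are hypothetical, and the HTML strings are made-up samples rather than real pages.

```python
from bs4 import BeautifulSoup


def get_title(html, default=None):
    """Return the stripped <title> text, or `default` if it is missing or empty.

    `get_title` and `default` are illustrative names, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag;
    # .string is None when the tag has no single text child (e.g. it is empty).
    if soup.title is None or soup.title.string is None:
        return default
    return soup.title.string.strip()


print(get_title("<html><head><title> Example page </title></head></html>"))
print(get_title("<html><body>No title here</body></html>", default="(untitled)"))
```

Returning a default keeps callers from having to repeat the None check each time they read a title.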

【Comments】:

【Solution 2】:

It is as simple as this:

soup = BeautifulSoup(htmlString, 'html.parser')
title = soup.title.text

Here, soup.title returns a BeautifulSoup element (a Tag object) for the page's <title> element.
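
The answers in this thread use both `.text` and `.string` on the title tag. As a rough illustration of how the two differ, here is a sketch with a made-up HTML sample (not the BBC page): for a tag holding a single piece of text they agree, while for a tag with mixed children `.string` is None and `.text` joins all the text.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used only for illustration.
html = (
    "<html><head><title>US election 2016</title></head>"
    "<body><h1>BBC <b>News</b></h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# A tag that holds a single text node: both attributes give the same text.
title_text = soup.title.text
title_string = soup.title.string
print(title_text)
print(title_string)

# A tag with mixed children (text plus a nested tag):
# .string is None, while .text concatenates every text node inside the tag.
h1_string = soup.h1.string
h1_text = soup.h1.text
print(h1_string)
print(h1_text)
```

For a plain `<title>` either attribute should work; `.text` is the more forgiving choice if the tag might be empty or contain nested markup.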

【Comments】:

【Solution 3】:

You can use soup.title directly instead of soup.find_all('title', limit=1) or soup.find('title'); it gives you the title tag.

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.title
print(title)
print(title.string)
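
To make the relationship between the three lookups concrete, the following sketch compares them on a made-up HTML string (an assumption for illustration, so no network access is needed). The variable names are arbitrary.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used only for illustration.
html = "<html><head><title>Sample title</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

via_attribute = soup.title                       # shorthand attribute access
via_find = soup.find("title")                    # first matching tag, or None
via_find_all = soup.find_all("title", limit=1)   # list-like result with at most one tag

# The attribute shorthand and find() hand back the same Tag object.
print(via_attribute is via_find)
# find_all() wraps that tag in a list, so it has to be indexed first.
print(isinstance(via_find_all, list), len(via_find_all))
print(via_find_all[0] is via_find)
```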

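【Comments】:

【Solution 4】:

To navigate the soup you need a BeautifulSoup object, not a string. So remove the get_text() call on the soup.

Also, you can replace raw.find_all('title', limit=1) with find('title'), which returns the first matching tag directly instead of a one-element list.

Try this: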

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')

print(title) # Prints the tag
print(title.string) # Prints the tag string content
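
The cause of the AttributeError in the question can be checked without fetching anything. The sketch below uses a made-up HTML string in place of the downloaded page (an assumption for illustration) and shows that get_text() returns a plain str, which has no find_all method, while the BeautifulSoup object itself can be searched.

```python
from bs4 import BeautifulSoup

# Made-up sample document standing in for the downloaded page.
html = "<html><head><title>Sample title</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")  # searchable parse tree
raw = soup.get_text()                      # plain text with all markup stripped

print(type(soup).__name__)
print(type(raw).__name__)
# A str has no find_all method, hence the AttributeError in the question.
print(hasattr(raw, "find_all"))
# Searching the soup object works as expected.
print(soup.find("title").string)
```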

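【Comments】:

Some websites include the domain in the title tag, e.g. "My Title - My Website". If I don't want that, is there a better option than chopping off everything after the -? I don't want to assume there will never be a title that contains a dash.
You should ask a separate question for this, but you could try using a regex to remove all characters before the last -: re.sub('.*- *', '', title.string)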
stackoverflow.com/questions/62866238/…
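I wonder, what is this html[:60] supposed to do? :)
@BeryCZ It is a slice of the html value (the first 60 characters). It is useless here, though (just copy/pasted from the OP's code).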
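
Regarding the site-name question in the comments above: the suggested regex keeps whatever follows the last dash. If the goal is instead to keep the part before a trailing site name, one possible sketch is to split on the last " - " separator with str.rpartition. The helper name `strip_site_suffix` and the " - " separator are assumptions for illustration, and this still guesses that the last separator marks the site name, which will not hold for every site.

```python
def strip_site_suffix(title, separator=" - "):
    """Drop the text after the last separator, e.g. a trailing site name.

    Hypothetical helper; the default separator is an assumption.
    """
    head, sep, _tail = title.rpartition(separator)
    # rpartition returns ('', '', title) when the separator is absent,
    # so fall back to the unchanged title in that case.
    return head if sep else title


print(strip_site_suffix("My Title - My Website"))
print(strip_site_suffix("Well-known facts - My Website"))
print(strip_site_suffix("No separator here"))
```

Splitting on " - " with surrounding spaces, rather than on a bare dash, leaves hyphenated words inside the title intact.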