Extract text across HTML tags as a single string

【Question title】: Extract text across HTML tags as a single string
【Posted】: 2019-05-03 10:36:11
【Question】:

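I am trying to extract the text content of the HTML below as one complete sentence, but I have not been able to. I tried both BeautifulSoup.prettify() and BeautifulSoup.get_text(), but they give me 3 sentences. I want to read the HTML below as one proper sentence, like this:

Recognized by Microsoft & Google, Inc., offices.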

<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>

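【Comments】:

What is your code?
Is there a source URL? I assume the source contains other li elements with child em elements. Should this happen only once, or for a repeating pattern?

Tags: html python-3.x web-scraping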

【Solution 1】:

I don't fully understand what you need, but this should help you extract content from a website URL:

import requests
from bs4 import BeautifulSoup

# URL from which the data will be extracted
url = "https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Text file where the content will be written; "with" closes it automatically
with open("test.txt", "w", encoding="utf-8") as file:
    # Extract the content of every <p> tag; use a different tag as needed
    for link in soup.find_all("p"):
        file.write(link.get_text())
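
The same idea can be tried offline by passing an HTML string instead of fetching a URL. The following is a minimal sketch, not part of the original answer: the function name `write_paragraphs` and the sample HTML are hypothetical, and it adds whitespace normalization so that each paragraph is written on its own line.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def write_paragraphs(html: str, path: str) -> int:
    """Write the text of each <p> tag to `path`, one paragraph per line.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for p in soup.find_all("p"):
            # Collapse runs of whitespace so inline tags do not break the line
            text = " ".join(p.get_text().split())
            if text:  # skip paragraphs that contain only whitespace
                out.write(text + "\n")
                count += 1
    return count


# Made-up sample input; the last paragraph is empty and is skipped
sample_html = "<p>First <b>bold</b> paragraph.</p><p>Second\n   paragraph.</p><p> </p>"
path = os.path.join(tempfile.mkdtemp(), "test.txt")
written = write_paragraphs(sample_html, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(written, lines)
```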

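【Comments】:

Thank you. I hope my comments below explain the problem clearly. Please see the comments I added to @glhr's answer.

【Solution 2】:

You can use an HTML parser such as BeautifulSoup to extract the text without the tags (soup.text), then collapse the repeated spaces and newlines in that text: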

input_str = '''
<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)

Output:

Recognized by Microsoft & Google, Inc., offices.
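
It may be tempting to let BeautifulSoup do the cleanup with `get_text(" ", strip=True)` instead. The sketch below, which is an illustration and not part of the original answer, compares the two on the same input: with a separator, every text node is stripped and then joined with a space, so a space also appears before the comma that follows the closing `</em>`.

```python
from bs4 import BeautifulSoup

input_str = """
<li>Recognized by   
    <em>Microsoft</em> &amp; 
    <em>Google, Inc.</em>, offices.</li>
"""

soup = BeautifulSoup(input_str, "html.parser")

# Approach from the answer: concatenate all text nodes as they are,
# then collapse every run of whitespace into a single space.
normalized = " ".join(soup.get_text().split())

# Alternative: strip each text node and join the nodes with a space.
# This inserts a space at every tag boundary, including before ", offices."
with_separator = soup.get_text(" ", strip=True)

print(normalized)      # Recognized by Microsoft & Google, Inc., offices.
print(with_separator)  # Recognized by Microsoft & Google, Inc. , offices.
```

For text where punctuation directly follows an inline tag, the split-and-join form keeps the original spacing, which is probably why it suits this input better.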

Edit: based on your comments, to get a list of strings as output (one per li tag), you can do this:

input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")

result = []
for li in soup.find_all('li'):
    text = " ".join(li.text.split())
    result.append(text)

print(result)

Output:

['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']

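【Comments】:

Did my solution work for you, or is there a problem? For that input, my code outputs This is sentence one in a order This is sentence two in a order This is sentence Three in a order This is sentence four in a order.
Thanks for your reply, @glhr. Sorry that I did not explain the problem clearly. Given the HTML content <ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>, I want to get the content as separate sentences, e.g. "This is sentence one in a order", "This is sentence two in a order", "This is sentence Three in a order", "This is sentence four in a order". The problem occurs when reading sentences 3 and 4.
Do you mean you want the result as a list of strings?
It gives me one complete sentence. But I would like to extract the text so that it gives me the sentences separately.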
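Yes, each <li> as one sentence. The problem is reading the <em> tag in between. When using BeautifulSoup, it separates sentence 3 into "This is sentence", "Three", "in a order". It splits the text into 3 sentences, but it is actually one sentence.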
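
The fragments described in the last comment are what you get when iterating over the individual text nodes of an li, since the <em> tag splits its text into three nodes. A minimal sketch of the difference, using the third list item from the comment above (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

html = "<li>This is sentence <em>Three</em> in a order </li>"
li = BeautifulSoup(html, "html.parser").li

# One string per text node: the <em> tag splits the li into three nodes
fragments = list(li.stripped_strings)

# Taking the text of the whole li and normalizing whitespace gives one sentence
sentence = " ".join(li.get_text().split())

print(fragments)  # ['This is sentence', 'Three', 'in a order']
print(sentence)   # This is sentence Three in a order
```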