BeautifulSoup adding unwanted linebreaks to strings Python3.5 - Answers

【Question title】: BeautifulSoup adding unwanted linebreaks to strings Python3.5
【Posted】: 2025-12-09 05:10:01
【Question description】:

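I've been having trouble with what appear to be hidden linebreaks in strings I get from the BeautifulSoup .find function. My code scans an HTML document and pulls out the name, title, company and country as strings. I checked the types and they are strings, and when I print them individually and check their lengths, everything looks like a normal string. However, when I use them in print("%s is a %s at %s in %s" % (name,title,company,country)) or in outputWriter.writerow([name,title,company,country]) to write to a csv file, I get extra linebreaks that don't seem to exist in the strings.

What is going on? Or can anyone point me in the right direction?

I'm new to Python and don't know where to look up all the things I don't know, so I'm asking here after spending a whole day trying to fix this. I've searched Google and several other Stack Overflow posts about stripping hidden characters, but nothing seems to work.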

import csv
from bs4 import BeautifulSoup

# Open/create csvfile and prep for writing
csvFile = open("attendees.csv", 'w+', encoding='utf-8')
outputWriter = csv.writer(csvFile)

# Open HTML and Prep BeautifulSoup
html = open('WEB SUMMIT _ LISBON 2016 _ Web Summit Featured Attendees.html', 'r', encoding='utf-8')
bsObj = BeautifulSoup(html.read(), 'html.parser')
itemList = bsObj.find_all("li", {"class":"item"})

outputWriter.writerow(['Name','Title','Company','Country'])

for item in itemList:
    name = item.find("h4").get_text()
    print(type(name))
    title = item.find("strong").get_text()
    print(type(title))
    company = item.find_all("span")[1].get_text()
    print(type(company))
    country = item.find_all("span")[2].get_text()
    print(type(country))
    print("%s is a %s at %s in %s" % (name,title,company,country))
    outputWriter.writerow([name,title,company,country])

【Question comments】:

I solved my problem by trying a filter: def filter_non_printable(str): return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])
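For readers who want to try the filter from this comment, here is a minimal runnable sketch. The parameter is renamed to text so it does not shadow the built-in str, and the sample value raw_name is hypothetical, not taken from the asker's HTML. Note that the filter keeps tabs and ordinary spaces, so it removes newlines and carriage returns but does not trim surrounding spaces.

```python
def filter_non_printable(text):
    """Drop control characters (code points 0-31) except the tab character (9)."""
    return ''.join(c for c in text if ord(c) > 31 or ord(c) == 9)


# Hypothetical value resembling what get_text() might return
raw_name = "\nJane Doe\r\n"
cleaned_name = filter_non_printable(raw_name)

# repr() makes any remaining hidden characters visible
print(repr(raw_name))
print(repr(cleaned_name))
```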

Tags: python string csv beautifulsoup python-3.5

【Solution 1】:

You most likely need to strip the whitespace. Nothing in your code adds it, so it must already be in the strings:

outputWriter.writerow([name.strip(),title.strip(),company.strip(),country.strip()])
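To see why unstripped fields look like extra linebreaks in the csv file, the following sketch writes the same row twice to an in-memory buffer, once raw and once stripped. The row values and the helper name clean_row are made up for illustration; only csv.writer and str.strip come from the code above.

```python
import csv
import io


def clean_row(fields):
    """Strip leading/trailing whitespace (including newlines) from each field."""
    return [field.strip() for field in fields]


# Hypothetical values resembling what get_text() can return
row = ["\nJane Doe\n", " CEO ", "\nExample Corp", "Portugal\n"]

# Raw row: the newlines are written inside the fields, so one record spans several lines
raw_buffer = io.StringIO()
csv.writer(raw_buffer).writerow(row)
raw_output = raw_buffer.getvalue()

# Stripped row: one record, one line
clean_buffer = io.StringIO()
csv.writer(clean_buffer).writerow(clean_row(row))
clean_output = clean_buffer.getvalue()

print(repr(raw_output))
print(repr(clean_output))
```

The csv module is doing the right thing in both cases: it faithfully writes whatever is in the strings, so the fix belongs in the data, not in the writer.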

You can verify what is actually in there by looking at the repr output:

print("%r is a %r at %r in %r" % (name,title,company,country))

When you print, you see the str output, so if there is a trailing newline you may not realize it is there:

In [8]: s = "string with newline\n"

In [9]: print(s)
string with newline


In [10]: print("%r" % s)
'string with newline\n'

See also: difference-between-str-and-repr-in-python

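If the newlines are actually embedded in the body of the string rather than at its ends, strip() will not remove them and you will need to replace them, i.e. name.replace("\n", " ")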
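As a rough illustration of the embedded case, the sketch below compares replace with a split-and-join approach that also collapses runs of spaces left behind by indented HTML. The helper name collapse_whitespace and the sample string are assumptions for the example, not part of the original answer.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Hypothetical value with a newline and indentation in the middle of the text
name = "Jane\n   Doe\n"

replaced = name.replace("\n", " ")
collapsed = collapse_whitespace(name)

print(repr(replaced))   # 'Jane    Doe '
print(repr(collapsed))  # 'Jane Doe'
```

replace only swaps the newline characters, so the indentation and trailing space remain; the split-and-join version is one possible way to tidy those up as well.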

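【Comments】:

Thanks! As I said in my previous comment, I tried another solution and found that it works. I'm still not sure about the how or why of everything, but I'm slowly learning. Thanks again!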