【Posted】: 2018-01-16 19:19:21
【Question】:
I am using BeautifulSoup to parse some HTML from a text file. The text is read into a dictionary like this:
import codecs

websites = ["1"]
html_dict = {}
for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # splitlines() turns the file into a list with one string per line
        get_raw_html = out.read().splitlines()
    html_dict.update({website_id: get_raw_html})
I then parse the HTML stored in html_dict to find the text inside <p> tags:
from bs4 import BeautifulSoup

scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        # each line of the file is parsed as a separate document
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')
This is what the HTML in html_dict looks like:
<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>
The problem is that BeautifulSoup seems to treat the line break as the end of the element and ignores the second line. So when I print scrape_selected_tags, the output is:
<p>Hey, this should be scraped</p>
when I expected the full text.
How can I avoid this? I tried splitting the lines in html_dict, but that does not seem to work. Thanks in advance.
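A minimal sketch of what is probably going on, assuming the file contents match the sample above: `splitlines()` breaks the paragraph into two strings, and each string is then parsed as its own document. The names `raw`, `fragments`, `per_line`, and `whole` are made up for this illustration and do not appear in the question.

```python
from bs4 import BeautifulSoup

# Stand-in for the file contents shown in the question (an assumption).
raw = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"

# splitlines() splits the single <p> element into two independent fragments.
fragments = raw.splitlines()

# Parsing fragment by fragment, as the question's loop does: the first
# fragment has an unclosed <p>, the second only has a stray </p>.
per_line = []
for fragment in fragments:
    soup = BeautifulSoup(fragment, "html.parser")
    per_line.extend(soup.find_all("p"))

# Parsing the whole string in one call keeps the paragraph together.
whole = BeautifulSoup(raw, "html.parser").find_all("p")

print(per_line)
print(whole)
```

Seen this way, the parser is not discarding the second line; it never receives both lines in the same call.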
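【Comments】:
- Have you tried a different parser, such as lxml?
- You can also drop the second argument to BeautifulSoup, and it will automatically pick the best parser on your system.
- @ForceBru This is the recommended parser among those available.
- Remove splitlines?
- @t.m.adam When I do that, I get this error: UserWarning: "." looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.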
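The warning in the last comment is consistent with the inner loop still being in place: once `splitlines()` is removed, `raw_html` is a single string, so `for i in raw_html` yields one character at a time and BeautifulSoup is eventually handed a lone `"."`. A hedged sketch of one way around this is to read the file once and parse the whole string in a single call. The helper name `scrape_paragraphs` and the temporary file are hypothetical, used only to keep the example self-contained, and the code assumes Python 3, where the built-in `open` accepts `encoding`.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def scrape_paragraphs(path):
    """Return the text of every <p> element in the file at `path`."""
    with open(path, encoding="utf8") as handle:
        html = handle.read()  # one string for the whole file, no splitlines()
    # Parse the complete document once; do not loop over the string.
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Demo: write the sample HTML from the question to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw_html.txt")
    with open(path, "w", encoding="utf8") as handle:
        handle.write(
            "<p>Hey, this should be scraped\n"
            "but this part gets ignored for some reason.</p>"
        )
    paragraphs = scrape_paragraphs(path)

print(paragraphs)
```

The returned text still contains the newline from the file; if a single-line string is wanted, something like `" ".join(text.split())` would collapse the whitespace.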
Tags: python beautifulsoup