Scrape websites and export only the visible text to a text document, Python 3 (Beautiful Soup): answers

【Question title】: Scrape websites and export only the visible text to a text document Python 3 (Beautiful Soup)
【Posted】: 2014-09-02 02:19:43
【Question description】:

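Problem: I am trying to use Beautiful Soup to scrape only the visible text from several websites and then export all of the data to a single text file.

This file will be used as a corpus for finding collocations with NLTK. So far I am working with something like this, but any help would be much appreciated!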

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True))for s in soup.findAll('p')]
with open('thisisanew.txt','w') as file:
    for item in text:
        print(file, item)

Unfortunately, this is not working: when I export the data to a .txt file, the file is completely blank.

Any ideas?

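【Question comments】:

Timing out? Can you show what the output is when it times out? What do you mean by "export is not working"? Are there any errors? Thanks.
I figured out the "timing out" part and edited the code to reflect it! As for the "export is not working" part, I mean that it returns a blank document!
Can you fix your indentation? Thanks.
OK, sorry, I didn't see that. It should be all set now.

Tags: python python-3.x beautifulsoup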

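【Solution 1】:

print(file, item) should be print(item, file=file).

But don't name the variable file: in Python 2 that shadowed the file built-in, and although Python 3 no longer has that built-in, a name such as outfile still reads more clearly next to the file= keyword. This is better: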

with open('thisisanew.txt','w') as outfile:
    for item in text:
        print(item, file=outfile)
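
To see why the original call left the file empty, here is a minimal sketch that uses in-memory io.StringIO buffers in place of the real file and of standard output. The buffer names and the sample sentence are made up for illustration. With two positional arguments, print treats the file object as just another value to display and sends everything to standard output, so nothing reaches the file.

```python
import io
from contextlib import redirect_stdout

item = "Mozart was born in Salzburg."  # sample paragraph, invented for this demo

# Wrong: both arguments go to standard output; the "file" receives nothing.
fake_file = io.StringIO()        # stands in for the opened text file
captured_stdout = io.StringIO()  # captures what would appear on the console
with redirect_stdout(captured_stdout):
    print(fake_file, item)
wrong_file_contents = fake_file.getvalue()

# Right: the file= keyword sends the output to the file-like object.
good_file = io.StringIO()
print(item, file=good_file)
right_file_contents = good_file.getvalue()

print(repr(wrong_file_contents))
print(repr(right_file_contents))
```

The first buffer stays empty, while the second one holds the paragraph followed by a newline.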

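To fix the next problem, the data from the first URL being overwritten, you can move the file-writing code into the loop and open the file once before entering the loop: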

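import requests
from bs4 import BeautifulSoup

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

# Open the output file once, as UTF-8, before looping over the URLs.
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        # Name the parser explicitly to avoid Beautiful Soup's "no parser specified" warning.
        soup = BeautifulSoup(website.content, 'html.parser')
        # One string per <p> element, with nested tags flattened to text.
        text = [s.get_text() for s in soup.find_all('p')]
        # Write this page's paragraphs before moving on to the next URL.
        for item in text:
            print(item, file=outfile)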
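
One reason to open the file once, before the loop, rather than reopening it for every URL is that mode 'w' truncates the file each time it is opened. The sketch below illustrates this without any network access. The pages dictionary and the write_reopening, write_once and read_lines helpers are hypothetical stand-ins for the scraped paragraphs and are not part of the answer.

```python
import os
import tempfile

# Hypothetical stand-in for the paragraphs scraped from each URL.
pages = {
    "mozart": ["Mozart paragraph 1", "Mozart paragraph 2"],
    "golf": ["Golf paragraph 1"],
}

def write_reopening(path):
    """Reopen the file with 'w' for every page; each open truncates it."""
    for paragraphs in pages.values():
        with open(path, "w", encoding="utf-8") as outfile:
            for item in paragraphs:
                print(item, file=outfile)

def write_once(path):
    """Open the file once, then write every page inside the loop."""
    with open(path, "w", encoding="utf-8") as outfile:
        for paragraphs in pages.values():
            for item in paragraphs:
                print(item, file=outfile)

def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path, encoding="utf-8") as infile:
        return infile.read().splitlines()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    write_reopening(path)
    reopened_result = read_lines(path)
    write_once(path)
    once_result = read_lines(path)

print(reopened_result)  # only the last page survives
print(once_result)      # every page is present
```

If the file really must be reopened on each iteration, append mode ('a') would avoid the truncation, at the cost of keeping output from earlier runs.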

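【Comments】:

Hi Mhawke, it half worked: it only returns the data scraped from the Golf URL.
That's because you are overwriting text in the loop.
How can I avoid overwriting it in the loop? Sorry, I'm a bit green with Python.
Thanks for your help, but unfortunately when I try this code I get the same error as alex: UnicodeEncodeError: 'charmap' codec can't encode character '\u02c8' in position 34: character maps to <undefined>. Any ideas? Thanks again for your time, and please give any details you can so that I can learn!
Unless you specify otherwise, files are opened with the default encoding, which I guess is CP-1252 on your system (I could be sure if you posted the full traceback). The Unicode code point '\u02c8' has no representation in CP-1252, so it cannot be written to a file that uses the CP-1252 encoding. Your best option is to use UTF-8 for your file and open it with open('thisisanew.txt', 'w', encoding='utf-8'); see the updated answer. Here is a reference to help you learn: diveintopython3.net/files.html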
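
The encoding failure can be reproduced without touching the filesystem. This is a small sketch that assumes, as the comment above guesses, that the default encoding involved is CP-1252; the variable names are illustrative only.

```python
stress_mark = "\u02c8"  # MODIFIER LETTER VERTICAL LINE, the IPA primary stress mark

# CP-1252 has no byte for this code point, so encoding it fails.
try:
    stress_mark.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode every Unicode code point.
utf8_bytes = stress_mark.encode("utf-8")

print(cp1252_ok)
print(utf8_bytes)
```

Passing encoding='utf-8' to open makes the file object perform the second kind of encoding on every write, regardless of the platform default.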

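【Solution 2】:

There is another problem: you only collect text from the last URL, because the text variable is reassigned on every pass through the loop.

Define text as an empty list before the loop, and add the new data to it inside the loop: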

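text = []  # accumulates the paragraphs from every URL
for url in urls:  # urls is the list defined in the question
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    # Extend the list instead of rebinding it, so earlier pages are kept.
    text += [s.get_text() for s in soup.find_all('p')]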
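
The difference between rebinding the list and extending it can be seen with plain strings. In this sketch, fake_pages is a hypothetical stand-in for the paragraphs that the soup expression would return for each URL.

```python
# Hypothetical per-URL results, standing in for the scraped paragraphs.
fake_pages = [
    ["Mozart paragraph"],
    ["Golf paragraph 1", "Golf paragraph 2"],
]

# Rebinding: each iteration replaces the previous list.
for paragraphs in fake_pages:
    overwritten = [p for p in paragraphs]

# Extending: start from an empty list and add to it on each iteration.
accumulated = []
for paragraphs in fake_pages:
    accumulated += [p for p in paragraphs]

print(overwritten)
print(accumulated)
```

After the loops, overwritten holds only the Golf paragraphs, while accumulated holds the paragraphs from both pages in order.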

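【Comments】:

When I try this code, it gives me an invalid syntax error.
@user3682157 Oops, sorry, please check it again.
Ah, I just saw that it needed a bracket, thanks! But there is a new problem: I tried to combine this answer with Mhawke's, and every time I try I get the following error: UnicodeEncodeError: 'charmap' codec can't encode character '\u02c8' in position 34: character maps to <undefined>. Any idea why this happens? (I'm learning, so the more detail the better!)
@user3682157 Try specifying the encoding when you open the file: open('thisisanew.txt', 'w', encoding='utf-8').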