Reading multiple URLs from a text file and processing the web pages

【Question title】: Reading multiple urls from text file and processing web page
【Posted】: 2018-04-09 20:32:07
【Question description】:

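The input to the script is a text file containing multiple urls of web pages. The intended steps in the script are as follows:

Read a url from the text file
Strip the url so it can be used as the name of the output file (fname)
Clean the content of the url/web page using the "clean_me" function.
Write the content to the file (fname)
Repeat for each url in the input file.

This is the content of the input file urloutshort.txt: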

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

Here is the script:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname=(url.replace('http://',' '))
        fname = fname.replace ('/',' ')
        print (fname)
        cln = clean_me(page)
        with open (fname +'.txt', 'w') as outfile:
            outfile.write(cln +"\n")

Here is the error message:

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

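The problem appears to be related to reading the url(s) from the text file: if I bypass the part of the script that reads the input file and just hardcode one of the urls, the script processes the web page and saves the result to a txt file named after the url. I have searched SO on the topic but have not found a solution.

Any help with this problem would be greatly appreciated.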

【Question comments】:

Tags: python url

【Solution 1】:

The problem is in the following code:

    with open (fname +'.txt', 'w') as outfile:
        outfile.write(cln +"\n")

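fname contains "\n", which is not allowed in the name of a file to open. Each line read from the input file keeps its trailing newline, and only the copy passed to requests.get() was stripped. All you need to do is change it to this: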

    with open (fname.rstrip() +'.txt', 'w') as outfile:
        outfile.write(cln +"\n")
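To see where the "\n" comes from, here is a minimal sketch. It assumes an in-memory stand-in for the input file (io.StringIO) and a made-up example.com URL, neither of which is part of the original script:

```python
import io

# io.StringIO stands in for open('urloutshort.txt', 'r'); the URL is made up.
filein = io.StringIO("http://example.com/some/page\n")

for url in filein:
    # Iterating over a file yields each line with its trailing newline intact.
    fname = url.replace('http://', '').replace('/', ' ')
    print(repr(fname))           # the newline read from the file is still there
    print(repr(fname.rstrip()))  # rstrip() removes trailing whitespace, "\n" included
```

A hardcoded URL string has no trailing newline, which would explain why the hardcoded version of the script worked.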

Full code including the fix:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

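def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    # Remove every <script> and <style> element before extracting the text.
    for s in soup(['script', 'style']):
        s.decompose()
    # Return only after the loop, so all such elements are removed first.
    return ' '.join(soup.stripped_strings)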


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")

Hope this helps.
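As a possible variation, the file-name logic could live in one small helper that strips the line first and also replaces other characters Windows rejects in file names (for example "?" and ":"). This is only a sketch: url_to_fname is a hypothetical name, and the set of replaced characters is an assumption based on Windows naming rules, not something taken from the original script.

```python
import re

def url_to_fname(url):
    """Hypothetical helper: derive an output file name from one line of the URL file."""
    name = url.strip()                      # drop the trailing "\n" and any spaces
    name = re.sub(r'^https?://', '', name)  # drop the scheme
    # Replace characters Windows rejects in file names, plus control characters.
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', ' ', name)
    return name.strip() + '.txt'

print(url_to_fname("http://feedproxy.google.com/~r/autonews/x\n"))
```

With such a helper, the loop would call open(url_to_fname(url), 'w') rather than building the name inline.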

【Comments】:

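Changed the script as suggested. The script processes the first url as expected, but not the subsequent urls in urloutshort.txt. I changed the order of the urls in the file, but that did not change the result; the first url is processed, but not the ones after it.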
python : Traceback (most recent call last):
At line:1 char:1
+ python pages.py
+ ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage.py", line 33, in <module>
    page = requests.get(url.strip())
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 494, in request
    prep = self.prepare_request(req)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\models.py", line 305, in prepare
    self.prepare_url(url, params)
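The traceback in this comment is cut off before the exception type, so the actual cause cannot be read from it. Since it stops inside prepare_url, one guess is that a line in the file is not a usable URL. The sketch below is a defensive variant under that assumption: it strips each line once, skips empty ones, and reports a failing URL without stopping the loop. iter_urls, fetch_text and process_all are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
def iter_urls(lines):
    """Yield non-empty, whitespace-stripped URLs from an iterable of lines."""
    for line in lines:
        url = line.strip()
        if url:
            yield url

def fetch_text(url):
    """Return the page text, or None if the request fails."""
    import requests  # imported here so the helpers above work without it
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Also covers malformed URLs rejected while the request is prepared.
        print("skipping {!r}: {}".format(url, exc))
        return None
    return resp.text

def process_all(lines, fetch=fetch_text):
    """Map each URL that could be fetched to its page text."""
    results = {}
    for url in iter_urls(lines):
        text = fetch(url)
        if text is not None:
            results[url] = text
    return results
```

It could be used as `with open('urloutshort.txt') as filein: pages = process_all(filein)`. The printed message for a skipped URL should show which line requests objects to.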