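【Posted】: 2021-07-06 02:17:13
【Question】:
I am trying to scrape multiple links, extract the text found in <p> HTML tags, and write the output to different files. Each link should have its own output file. So far: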
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests

urls = ['https://link1',
        'https://link2']
url_list = list(urls)

#scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I get OSError: [Errno 22] Invalid argument: filenamehttps://link1
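The error most likely comes from substituting the whole URL into the filename: characters such as `:` and `/` are not allowed in filenames on Windows, and `/` is a path separator elsewhere. A minimal sketch of one way around this, assuming a hypothetical helper `url_to_filename` (not part of the original script) that replaces unsafe characters:

```python
import re


def url_to_filename(url, prefix="filename", suffix=".txt"):
    """Hypothetical helper: build a filename-safe string from a URL."""
    # Drop the scheme (for example "https://") if there is one.
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Replace every run of characters other than letters, digits,
    # dots, underscores and hyphens with a single underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", without_scheme).strip("_")
    return f"{prefix}_{safe}{suffix}"


print(url_to_filename("https://link1"))
```

For the placeholder URL above this prints `filename_link1.txt`. Two different URLs could in principle collapse to the same name, so a numeric index may still be the simpler choice.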
If I change my code to this:
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
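The script runs, but there is a semantic error: both output files contain the text extracted from link2. I guess the second for loop is what causes this.
I have looked on S/O for similar answers, but I couldn't figure it out.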
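One plausible reading of the symptom: `page` is reassigned on every pass of the scraping loop, so by the time the writing loop runs it only holds the text of the last URL. A hedged sketch of writing each file inside the same loop that fetches it follows. The names `fetch_paragraph_text` and `save_pages` are hypothetical, and joining only the `<p>` tags is an assumption based on the stated goal (the original calls `soup.getText()` on the whole page).

```python
import os


def fetch_paragraph_text(url):
    """Download a page and return the text of its <p> tags, one per line."""
    # Imported lazily so save_pages can be tried with a stand-in
    # fetch function, without network access or third-party packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    return '\n'.join(p.get_text() for p in soup.find_all('p'))


def save_pages(urls, out_dir=".", fetch=fetch_paragraph_text):
    """Hypothetical helper: fetch each URL and write it to its own file."""
    written = []
    for index, url in enumerate(urls):
        # Fetch and write in the same iteration, so `page` always
        # belongs to the URL whose file is being written.
        page = fetch(url)
        path = os.path.join(out_dir, 'filename{}.txt'.format(index))
        with open(path, 'w', encoding="utf8") as outfile:
            # Skip empty lines, as in the original script.
            outfile.write('\n'.join(line for line in page.split('\n') if len(line) > 0))
        written.append(path)
    return written
```

Calling `save_pages(urls)` with the list from the question would then produce `filename0.txt` and `filename1.txt`, each holding the text of its own link. The `fetch` parameter exists only so the loop can be exercised with a stand-in function.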
【Comments】:
Tags: python for-loop beautifulsoup python-3.8