[Posted]: 2021-08-12 18:04:05
[Question]:
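I have a script that is supposed to take a list of websites and search each one for email addresses (see the code below). Whenever an error occurs, for example "website forbidden" or "service temporarily unavailable", the script starts over from the beginning.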
# -*- coding: utf-8 -*-
import urllib.request, urllib.error
import re
import csv
import pandas as pd
import os
import ssl

# 1: Get input file path from user '.../Documents/upw/websites.csv'
user_input = input("Enter the path of your file: ")

# If input file doesn't exist
if not os.path.exists(user_input):
    print("File not found, verify the location - ", str(user_input))


def sites(e):
    pass


while True:
    try:
        # 2. read file
        df = pd.read_csv(user_input)
        # 3. create the output csv file
        with open('Emails.csv', mode='w', newline='') as file:
            csv_writer = csv.writer(file, delimiter=',')
            csv_writer.writerow(['Website', 'Email'])
        # 4. Get websites
        for site in list(df['Website']):
            # print(site)
            gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
            req = urllib.request.Request("http://" + site, headers={
                'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
                # 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
            })
            # 5. Scrape email id
            with urllib.request.urlopen(req, context=gcontext) as url:
                s = url.read().decode('utf-8', 'ignore')
                email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
                print(email)
            # 6. Write the output
            with open('Emails.csv', mode='a', newline='') as file:
                csv_writer = csv.writer(file, delimiter=',')
                [csv_writer.writerow([site, item]) for item in email]
    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))
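If I remove this code:
def sites(e):
    pass
while True:
the script stops as soon as an error occurs.
If an error occurs on the website's side, the script should not stop; it should move on and keep searching the remaining sites.
I have been searching online for a while and have read several posts, but I seem to be looking in the wrong place, because I have not found a solution yet.
Any help would be greatly appreciated.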
[Discussion]:
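One possible way to get the behaviour the question asks for (skip a failing site and carry on with the next one) is to put the try/except inside the for loop, around the single request, instead of around the whole job inside `while True`. In the posted script the exception leaves the for loop, and `while True` then starts everything again from step 2, which also truncates Emails.csv. The following is only a minimal sketch under assumptions: `fetch_html` and `scrape_emails` are hypothetical helper names, the header set is shortened, the SSL context is left out, and `OSError` is caught because `urllib.error.URLError` and `HTTPError` are subclasses of it.

```python
import re
import urllib.error
import urllib.request

# Same pattern as in the question.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def fetch_html(site, timeout=10):
    """Hypothetical helper: download one page and return it as text."""
    req = urllib.request.Request(
        "http://" + site,
        headers={"User-Agent": "Mozilla/5.0"},  # shortened header set (assumption)
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "ignore")


def scrape_emails(sites, fetch=fetch_html):
    """Return (rows, failures); a site that fails is recorded and skipped."""
    rows, failures = [], []
    for site in sites:
        try:
            html = fetch(site)
        except OSError as exc:
            # URLError/HTTPError (403, 503, ...) and socket timeouts are OSError
            # subclasses; only this one site is skipped, the loop keeps going.
            print("Failed to open URL {0} Reason: {1}".format(site, exc))
            failures.append((site, str(exc)))
            continue
        rows.extend((site, email) for email in EMAIL_RE.findall(html))
    return rows, failures
```

The returned `rows` could then be written once at the end, for example with `csv.writer(file).writerows(rows)` after the header row, so the output file is opened a single time and is never reset by a failing site. The `fetch` parameter exists only so the loop can be tried out with a stand-in function instead of real network access.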
Tags: python-3.x urllib