[Title]: Scrape emails with python 3.x from websites
[Posted]: 2021-08-12 18:04:05
[Question description]:

I have a script that is supposed to take a list of websites and search them for email addresses (see the code below). Every time an error occurs, e.g. "the website is forbidden" or "the service is temporarily unavailable", the script restarts from the beginning.

# -*- coding: utf-8 -*-

import urllib.request, urllib.error
import re
import csv
import pandas as pd
import os
import ssl

# 1: Get input file path from user '.../Documents/upw/websites.csv'
user_input = input("Enter the path of your file: ")

# If input file doesn't exist
if not os.path.exists(user_input):
    print("File not found, verify the location - ", str(user_input))


def sites(e):
    pass


while True:
    try:
        # 2. read file
        df = pd.read_csv(user_input)

        # 3. create the output csv file
        with open('Emails.csv', mode='w', newline='') as file:
            csv_writer = csv.writer(file, delimiter=',')
            csv_writer.writerow(['Website', 'Email'])

        # 4. Get websites
        for site in list(df['Website']):
            # print(site)
            gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
            req = urllib.request.Request("http://" + site, headers={
                'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
                # 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
            })

            # 5. Scrape email id
            with urllib.request.urlopen(req, context=gcontext) as url:
                s = url.read().decode('utf-8', 'ignore')
                email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
                print(email)

                # 6. Write the output
                with open('Emails.csv', mode='a', newline='') as file:
                    csv_writer = csv.writer(file, delimiter=',')
                    [csv_writer.writerow([site, item]) for item in email]

    except urllib.error.URLError as e:
        print("Failed to open URL {0} Reason: {1}".format(site, e.reason))

If I remove this code:

def sites(e):
pass

while True

the script stops as soon as an error occurs.

If an error occurs on the web side, the script should not stop; it should continue searching the remaining sites.

I have been searching online for a while and looked at several posts, but I seem to be doing something wrong, because I have not found a solution yet.

Any help would be greatly appreciated.

[Question discussion]:

    Tags: python-3.x urllib


    [Solution 1]:

    The problem is the while True: loop. It always restarts: when an exception is raised inside the try block, control jumps to the except block, and then the loop iterates again and runs the try block from the beginning.

    When you take out the while True:, an exception stops the process entirely: the exception raised in the try block stops the try block's execution, control moves to the except block, and then execution continues with the rest of the program.

    What you want is the try block inside the loop over the websites in df['Website']. That way, if an exception is thrown, the loop moves on to the next website in the list instead of going all the way back to re-reading the data frame and restarting the loop over the websites.
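    The pattern in miniature (a toy loop with a simulated failure, purely illustrative; the site names are made up):

```python
# Toy illustration of the point above: with try/except inside the
# for loop, one failing item is skipped and the loop continues,
# instead of aborting (or restarting) the whole run.
scraped = []
for site in ["good.com", "broken.com", "also-good.com"]:
    try:
        if site == "broken.com":
            raise ValueError("could not open " + site)
        scraped.append(site)
    except ValueError as e:
        print("skipped:", e)  # handled here, loop moves to the next site

print(scraped)
```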

    # 2. read file
    df = pd.read_csv(user_input)
    
    # 3. create the output csv file
    with open('Emails.csv', mode='w', newline='') as file:
        csv_writer = csv.writer(file, delimiter=',')
        csv_writer.writerow(['Website', 'Email'])
    
    # 4. Get websites
    for site in list(df['Website']):
        try:
            # print(site)
            gcontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
            req = urllib.request.Request("http://" + site, headers={
                'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
                # 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive'
            })
    
            # 5. Scrape email id
            with urllib.request.urlopen(req, context=gcontext) as url:
                s = url.read().decode('utf-8', 'ignore')
                email = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)
                print(email)
    
                # 6. Write the output
                with open('Emails.csv', mode='a', newline='') as file:
                    csv_writer = csv.writer(file, delimiter=',')
                    [csv_writer.writerow([site, item]) for item in email]
    
        except urllib.error.URLError as e:
            print("Failed to open URL {0} Reason: {1}".format(site, e.reason))
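
    Beyond that fix, the per-site fetch could be hardened a little further. The sketch below is only an assumption-laden refactor (the helper name fetch_emails, the 10-second default timeout, and the minimal User-Agent are illustrative, not from the original): it adds a timeout so one hanging site cannot stall the run, uses ssl.create_default_context() instead of pinning TLSv1.2, and catches any per-site failure so the caller's loop keeps going.

```python
import re
import ssl
import urllib.request

# Same email pattern as in the question, compiled once.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def fetch_emails(site, timeout=10):
    """Return email-like strings found on http://<site>.

    Hypothetical helper: any network, TLS, or decode failure is
    reported and swallowed so the caller's loop can move on.
    """
    ctx = ssl.create_default_context()  # negotiates the best available TLS
    req = urllib.request.Request("http://" + site,
                                 headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=timeout) as resp:
            html = resp.read().decode('utf-8', 'ignore')
        return EMAIL_RE.findall(html)
    except Exception as e:  # URLError, timeouts, ssl errors, ...
        print("Failed to open URL {0} Reason: {1}".format(site, e))
        return []
```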
    

    [Discussion]:

    • Thank you very much! Well explained, it is much clearer to me now. It all makes sense.