【Question title】: Python scrape web query and put it into .csv
【Posted】: 2021-01-13 14:50:30
【Question description】:

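Using Python, I want to extract the following information from a website:

  1. Phone
  2. E-mail
  3. Website
  4. Main activity (the text of the li element, without its nested div): "Computer consultancy activities".

Problems:

  1. The code does not get the required information every time, because some HTML elements are sometimes missing or different, which raises an error in Python:
  • Sometimes the company website is missing from the HTML response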
<tr>
    <td class="col-1"><div class="col-1-text">Website:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
</tr>
  • Sometimes the e-mail is missing from the HTML response
<tr>
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td>
</tr>
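  2. I am not sure how to write new rows to the csv without overwriting the same row in extracted.csv over and over. I messed up the loop and I don't know how to fix it.

  3. It needs to scrape only new entries in the future (every week or so), so I think it also needs to check extracted.csv (the timestamp) each time, to avoid duplicates, before it adds a new row to extracted.csv.

HTML code structure (example):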


<table class="table-info">
    <tbody>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Business name</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">Company XYZ</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Register code:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">112233558</div>
            </td>
        </tr>
 
 
        <tr>
            <td class="col-1">
                <div class="col-1-text">Operating address:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                        class="link-location">Some location strt. 233</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Legal address</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                        location
                    </a>
                </div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">VAT No:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                        liability</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Age:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">1 year&nbsp;3 months</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Founded:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">20/09/2019</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Capital:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">2000 USD</div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Phone:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">123456789</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">E-mail:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1"><div class="col-1-text">Website:</div></td>
            <td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Representatives:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <div class="box-message">
                        <p class="desc">To access information, please</p>
                        <p>
                            <a href="#" onclick="return loginClicked(this, '#');"
                                class="btn btn-small btn-purple link-login">Log in</a>
                        </p>
                    </div>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">
                    Main activity:
                    <span class="tip info" title=""
                        data-original-title="Activities are classified according to EMTAK 2008"></span>
                </div>
            </td>
            <td class="col-2">
                <div class="col-2-text" id="activity_top5ffe2eab23d13">
                    <ul>
                        <li>
                            Computer consultancy activities
                            <div class="main_activities_top_link_wrapper">
                                <a href="https://www.somesite.com/" target="_blank"
                                    onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                    class="btn btn-simple btn-open-graph">
                                    <span>Open TOP 20</span> </a>
                            </div>
                        </li>
                    </ul>
 
                </div>
            </td>
        </tr>
 
 
    </tbody>
</table>

Python code:


import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1

【Question comments】:

Tags: python web-scraping beautifulsoup


【Solution 1】:

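OK, you have a lot going on here. Normally you should limit an SO post to one question, but I'll address each one. It would be easier if you provided at least a few rows of data.csv.

1.

The code does not get the required information every time, because some HTML elements are sometimes missing or different, which raises an error in Python:

Use if logic to check whether the element exists. If it doesn't, set the variable to an empty string, null, NaN, or whatever you want.

You can also use try/except. I like using it, although I have been told many times that, technically, it shouldn't be used this way.

I used both in the code so you can see each.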
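
The two approaches can be sketched in isolation as follows. This is only a minimal illustration: the HTML snippet is a made-up sample with an e-mail row and no website row, and `html.parser` is used instead of `lxml` just to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a company that has an e-mail row but no website row.
HTML = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Approach 1: "if" check. select_one returns None when nothing matches.
mail_link = soup.select_one('.table-info .col-2 a[href^="mailto:"]')
mail = mail_link['href'].split(':', 1)[1] if mail_link is not None else 'N/A'

# Approach 2: try/except around the whole lookup chain.
# find() returns None when the label is absent, so .find_next raises AttributeError.
try:
    website = soup.find(string='Website:').find_next('a')['href']
except (AttributeError, TypeError, KeyError):
    website = 'N/A'

print(mail, '|', website)
```

Catching only the exceptions a missing element can raise, rather than using a bare `except`, keeps unrelated errors (a typo, a network failure) visible.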

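  2. You want to append rather than overwrite, so change 'w' to 'a'. Also, I assume you don't want to keep writing the column names, so you need to account for that somehow. There are several ways to do it.

  3. Have the script load extracted.csv (if it exists) and build a list of the values you need to check. Then have the script check that list to see whether what you are about to write is a duplicate. If it isn't in the list, write it to the file; if it is, don't.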
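
One possible way to combine both points, sketched with only the standard library. The helper names `load_seen` and `append_new` are made up for this illustration, and it assumes the register code is the first column and is what identifies a duplicate.

```python
import csv
import os

FIELDS = ["Regcode", "Email", "Website", "Timestamp"]

def load_seen(path):
    """Return the set of Regcode values already stored in the file."""
    if not os.path.isfile(path):
        return set()
    with open(path, newline='', encoding='utf8') as f:
        return {row["Regcode"] for row in csv.DictReader(f, delimiter=';')}

def append_new(path, rows):
    """Append rows whose Regcode is not in the file yet; return how many were written."""
    seen = load_seen(path)
    # Write the header only when the file is new or empty.
    write_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, 'a', newline='', encoding='utf8') as f:
        writer = csv.writer(f, delimiter=';')
        if write_header:
            writer.writerow(FIELDS)
        for row in rows:
            if row[0] not in seen:
                writer.writerow(row)
                seen.add(row[0])
                written += 1
    return written
```

Reading the codes with the `csv` module keeps them as strings, so they compare equal to the strings coming out of data.csv. To skip the HTTP request as well, `load_seen` could be called once before the loop and each code checked before fetching.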

Full code:

import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import csv
import datetime
import time
import os.path




with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)

    count = 0

    for row in reader:

        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'

        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")

        table_info = soup.select_one('.table-info')

        # Look the e-mail link up once; select_one returns None when it is missing.
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        if mail is None:
            mail_clean = 'N/A'
        else:
            mail_clean = mail.get('href').split(':', 1)[1]


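        try:
            website = soup.find(text='Website:').find_next('a')['href']
        except (AttributeError, TypeError, KeyError):
            # The label, the link, or its href is missing for this company.
            website = 'N/A'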


        collected_data = row[1], mail_clean, website, timestamp

        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]


        if os.path.isfile('extracted.csv'):
            header_exist = True
            # Read Regcode as text so it compares equal to row[1], which is a str.
            check_list = pd.read_csv('extracted.csv', delimiter=';', dtype=str)['Regcode'].tolist()
        else:
            header_exist = False
            check_list = []


        with open('extracted.csv', 'a', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            if header_exist == False:
                writer.writerows(data_list)
                print(row[1], "|", mail_clean,"|", website,"|", timestamp)
            else:
                if row[1] not in check_list:
                    writer.writerows([collected_data])
                    print(row[1], "|", mail_clean,"|", website,"|", timestamp)
                else:
                    print(" ** ALREADY EXIST ** ", row[1], "|", mail_clean,"|", website,"|", timestamp)


        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
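
The code above only collects the e-mail and the website. For the phone number and the main activity that the question also asks about, one possible approach is sketched below. It assumes the page matches the sample HTML in the question; `rows_by_label` and `own_text` are hypothetical helpers, and `html.parser` stands in for `lxml` only to keep the sketch dependency-light.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two rows from the sample table in the question.
HTML = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td class="col-1"><div class="col-1-text">Main activity:
      <span class="tip info"></span></div></td>
    <td class="col-2"><div class="col-2-text"><ul><li>
      Computer consultancy activities
      <div class="main_activities_top_link_wrapper"><a href="#"><span>Open TOP 20</span></a></div>
    </li></ul></div></td></tr>
</tbody></table>
"""

def rows_by_label(soup):
    """Map each row label (without the trailing colon) to its value cell."""
    cells = {}
    for tr in soup.select('.table-info tr'):
        label = tr.select_one('.col-1-text')
        value = tr.select_one('.col-2-text')
        if label is not None and value is not None:
            cells[label.get_text(strip=True).rstrip(':')] = value
    return cells

def own_text(tag):
    """Text that sits directly inside tag, ignoring nested elements."""
    parts = tag.find_all(string=True, recursive=False)
    return ' '.join(s.strip() for s in parts if s.strip())

soup = BeautifulSoup(HTML, 'html.parser')
cells = rows_by_label(soup)

phone = cells['Phone'].get_text(strip=True) if 'Phone' in cells else 'N/A'
li = cells['Main activity'].find('li') if 'Main activity' in cells else None
activity = own_text(li) if li is not None else 'N/A'

print(phone, '|', activity)
```

Looking rows up by their label rather than by position means a missing row simply yields 'N/A' instead of shifting the other fields.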

【Comments】:

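  • Thanks for the comment. I tried the code, but I ran into a problem with the csv. It looks like this: prnt.sc/wmvmve
  • data.csv looks like this: prnt.sc/wmvshg
  • OK. I'll take a look tomorrow and fix it.
  • It seems to me it has something to do with how the e-mail/website request is made. The first row of extracted.csv looks fine, but it breaks on the next row/request, where the company has a website in the table. First html: prnt.sc/wmzo7j, second: prnt.sc/wmzsuy
  • First, I forgot to make collected_data a list. I fixed that. The other issue you'll have to work out yourself. Like I said, without the data.csv file (which I could use with the url here, content = get(url, headers=headers, cookies=cookies).content, to get the html), there is only so much I can do. Also, in the second screenshot the website is labeled "Homepage".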