【Posted】: 2021-01-13 14:50:30
【Question】:
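Using Python, I want to extract the following information from a website:
- Phone
- E-mail
- Website
- Main activity (the text of the li element, excluding its nested div): "Computer consultancy activities".
Problems:
- The code does not get the required information every time, because some HTML elements are sometimes missing or different, which raises an error in Python:
- Sometimes the company website is missing from the response HTML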
<tr>
<td class="col-1"><div class="col-1-text">Website:</div></td>
<td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
</tr>
- Sometimes the e-mail is missing from the response HTML
<tr>
<td class="col-1"><div class="col-1-text">E-mail:</div></td>
<td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td>
</tr>
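- I am not sure how to write a new row to the CSV without overwriting the same row in extracted.csv over and over. I have messed up the loop and do not know how to fix it.
- Third, in the future it needs to scrape only new entries (every week or so), so I think it also has to check extracted.csv (the timestamps) each time, to avoid duplicate content, before it adds a new row to extracted.csv.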
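For the last point, one possible approach is to read the register codes already stored in extracted.csv and skip them on the next run. The sketch below is only an illustration: the helper names `load_seen_codes` and `new_codes` are hypothetical, and it assumes the output file keeps the `Regcode` header and the `;` delimiter used by the code in this question.

```python
import csv
from pathlib import Path


def load_seen_codes(path, column="Regcode", delimiter=";"):
    """Return the set of register codes already present in the output CSV."""
    file = Path(path)
    if not file.exists():
        # First run: nothing has been scraped yet.
        return set()
    with file.open(newline="", encoding="utf8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        # Rows without a value in the column are ignored.
        return {row[column] for row in reader if row.get(column)}


def new_codes(candidates, seen):
    """Keep codes that were not scraped before, in input order, without repeats."""
    known = set(seen)  # copy, so the caller's set is not modified
    fresh = []
    for code in candidates:
        if code not in known:
            known.add(code)
            fresh.append(code)
    return fresh
```

Filtering the codes read from data.csv through `new_codes(codes, load_seen_codes('extracted.csv'))` before the request loop would leave only the entries that still need scraping. Comparing register codes rather than timestamps is a design choice of this sketch, not something the site requires.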
HTML code structure (example):
<table class="table-info">
<tbody>
<tr>
<td class="col-1">
<div class="col-1-text">Business name</div>
</td>
<td class="col-2">
<div class="col-2-text">Company XYZ</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Register code:</div>
</td>
<td class="col-2">
<div class="col-2-text">112233558</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Operating address:</div>
</td>
<td class="col-2">
<div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
class="link-location">Some location strt. 233</a></div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Legal address</div>
</td>
<td class="col-2">
<div class="col-2-text">
<a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
location
</a>
</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">VAT No:</div>
</td>
<td class="col-2">
<div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
liability</a></div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Age:</div>
</td>
<td class="col-2">
<div class="col-2-text">1 year 3 months</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Founded:</div>
</td>
<td class="col-2">
<div class="col-2-text">20/09/2019</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Capital:</div>
</td>
<td class="col-2">
<div class="col-2-text">2000 USD</div>
</td>
</tr>
<tr>
<td colspan="2" class="sep"></td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Phone:</div>
</td>
<td class="col-2">
<div class="col-2-text">123456789</div>
</td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">E-mail:</div>
</td>
<td class="col-2">
<div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
</td>
</tr>
<tr>
<td class="col-1"><div class="col-1-text">Website:</div></td>
<td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
</tr>
<tr>
<td colspan="2" class="sep"></td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">Representatives:</div>
</td>
<td class="col-2">
<div class="col-2-text">
<div class="box-message">
<p class="desc">To access information, please</p>
<p>
<a href="#" onclick="return loginClicked(this, '#');"
class="btn btn-small btn-purple link-login">Log in</a>
</p>
</div>
</div>
</td>
</tr>
<tr>
<td colspan="2" class="sep"></td>
</tr>
<tr>
<td class="col-1">
<div class="col-1-text">
Main activity:
<span class="tip info" title=""
data-original-title="Activities are classified according to EMTAK 2008"></span>
</div>
</td>
<td class="col-2">
<div class="col-2-text" id="activity_top5ffe2eab23d13">
<ul>
<li>
Computer consultancy activities
<div class="main_activities_top_link_wrapper">
<a href="https://www.somesite.com/" target="_blank"
onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
class="btn btn-simple btn-open-graph">
<span>Open TOP 20</span> </a>
</div>
</li>
</ul>
</div>
</td>
</tr>
</tbody>
</table>
Python code:
import csv
import requests
import datetime
import time
from requests import get
from bs4 import BeautifulSoup

with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
    count = 0
    for row in reader:
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers, cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
        table_info = soup.select_one('.table-info')
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href')
        collected_data = row[1], mail_clean, website, timestamp
        data_list = [["Regcode", "Email", "Website", "Timestamp"], collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
        print(row[1], "|", mail_clean, "|", website, "|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
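The errors on missing rows come from calling `.get('href')` on `None`, which is what `select_one` returns when nothing matches. A possible way to tolerate that, sketched under assumptions, is to look each value up by its row label and fall back to an empty string. The function names `field_by_label`, `own_text`, and `parse_company` are hypothetical, and `html.parser` stands in for `lxml` only to keep the example free of extra dependencies. Matching by label also avoids a side effect of `a[target^=_blank]`: when the Website row is absent, that selector can match the "Open TOP 20" link in the Main activity row instead.

```python
from bs4 import BeautifulSoup, NavigableString


def field_by_label(table, label):
    """Return the .col-2-text element of the row whose label starts with `label`, or None."""
    if table is None:
        return None
    for row in table.select("tr"):
        name = row.select_one(".col-1-text")
        value = row.select_one(".col-2-text")
        if name is None or value is None:
            continue  # separator rows have neither cell
        if name.get_text(strip=True).startswith(label):
            return value
    return None


def own_text(tag):
    """Return only the text directly inside `tag`, ignoring nested elements."""
    if tag is None:
        return ""
    return "".join(c for c in tag.children if isinstance(c, NavigableString)).strip()


def parse_company(html):
    """Extract phone, e-mail, website and main activity; missing fields become ''."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(".table-info")

    phone = field_by_label(table, "Phone")
    email = field_by_label(table, "E-mail")
    website = field_by_label(table, "Website")
    activity = field_by_label(table, "Main activity")

    email_link = email.select_one("a[href^='mailto:']") if email is not None else None
    website_link = website.select_one("a[href]") if website is not None else None
    activity_li = activity.select_one("li") if activity is not None else None

    return {
        "phone": phone.get_text(strip=True) if phone is not None else "",
        "email": email_link["href"].split(":", 1)[1] if email_link is not None else "",
        "website": website_link["href"] if website_link is not None else "",
        "activity": own_text(activity_li),
    }
```

In the loop above, `parse_company(content)` could replace the chain of `select_one` and `.get('href')` calls, so a company without an e-mail or website yields empty columns rather than an `AttributeError`.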
【Comments】:
- Can you share a few lines of data.csv?
- Also, if you want to append to the file instead of overwriting it, change 'w' to 'a'. See if this works: with open('extracted.csv', 'a', newline='') as file:
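Note that switching to 'a' alone would also append the header on every iteration, because the header is part of `data_list` in the question's code. A minimal sketch of one way around that, assuming the same column names and `;` delimiter (the helper name `append_row` is hypothetical):

```python
import csv
import os

HEADER = ["Regcode", "Email", "Website", "Timestamp"]


def append_row(path, row, header=HEADER, delimiter=";"):
    """Append one row; write the header first only when the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf8") as f:
        writer = csv.writer(f, delimiter=delimiter)
        if needs_header:
            writer.writerow(header)
        writer.writerow(row)
```

Inside the loop this would be called once per company, for example `append_row('extracted.csv', collected_data)`, so each result lands on its own line and earlier rows are kept.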
Tags: python web-scraping beautifulsoup