【Posted】: 2018-04-15 16:13:03
【Problem description】:
I am trying to build a web crawler in Python with the BeautifulSoup library. I want to collect information from every page of a Bitcoin forum topic. I am using the code below to extract the username, status, posting date and time, post text, activity, and merit from the thread https://bitcointalk.org/index.php?topic=2056041.0
from bs4 import BeautifulSoup
import requests
import re

url = 'https://bitcointalk.org/index.php?topic=2056041.0'

def get_html(url):
    r = requests.get(url)
    return r.text

html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
results = soup.findAll("td", {"valign": "top"})

# Usernames: look for "View the profile of <name>" links.
usernames = []
for i in results:
    x = i.findAll('b')
    try:
        s = str(x[0])
        if 'View the profile of' in s:
            try:
                found = re.search('of (.+?)">', s).group(1)
                if not found.isdigit():
                    usernames.append(found)
            except Exception as e:
                print(e)
    except Exception:
        pass
print(len(usernames))

# Status (rank) line from each poster-info block.
status = []
for i in results:
    x = i.findAll("div", {"class": "smalltext"})
    s = str(x)
    try:
        found = re.search(' (.+?)<br/>', s).group(1)
        if len(found) < 25:
            status.append(found)
    except Exception:
        pass
print(len(status))

# Activity counts.
activity = []
for i in results:
    x = i.findAll("div", {"class": "smalltext"})
    s = str(x)
    try:
        x = s.split('Activity: ')[1]
        x = x.split('<br/>')[0]
        activity.append(x)
    except Exception:
        pass
print(activity)
print(len(activity))

# Post bodies.
posts = []
for i in results:
    x = i.findAll("div", {"class": "post"})
    s = str(x)
    try:
        x = s.split('="post">')[1]
        x = x.split('</div>]')[0]
        if not x.isdigit():
            posts.append(x)
    except Exception:
        pass
print(len(posts))
This feels like a very ugly and fragile solution: running regexes over str(tag) and wrapping everything in try/except blocks. Is there a more direct and elegant way to accomplish this task?
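One tidier direction is to drop the regex-over-`str(tag)` pattern and navigate the parsed tree directly with CSS selectors and `.get_text()`. The sketch below does this on a small embedded HTML fragment that mimics the thread's markup; the class names (`poster_info`, `smalltext`, `post`) and the fragment itself are assumptions for illustration, so verify them against the live page before relying on this.

```python
from bs4 import BeautifulSoup

# Minimal HTML mimicking one post's markup (an assumed structure, for
# illustration only; the real page should be checked in the browser).
html = """
<td class="poster_info" valign="top">
  <b><a href="...">satoshi</a></b>
  <div class="smalltext">
    Legendary<br/>
    Activity: 364<br/>
    Merit: 10
  </div>
</td>
<td valign="top">
  <div class="post">Hello world</div>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# Usernames: the profile link inside the poster-info cell.
usernames = [a.get_text() for a in soup.select("td.poster_info > b > a")]

# Post bodies: just the text of each post div, whitespace stripped.
posts = [div.get_text(strip=True) for div in soup.select("div.post")]

# Activity: scan the smalltext block line by line instead of regexing
# over its string representation.
activity = []
for small in soup.select("td.poster_info div.smalltext"):
    for line in small.get_text("\n").splitlines():
        line = line.strip()
        if line.startswith("Activity:"):
            activity.append(line.split("Activity:")[1].strip())

print(usernames)  # ['satoshi']
print(posts)      # ['Hello world']
print(activity)   # ['364']
```

With real selectors in hand, the four separate loops over `results` collapse into one pass per post, and the try/except blocks disappear because missing elements simply yield empty lists.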
【Discussion】:
Tags: python python-3.x web-scraping beautifulsoup web-crawler