【Title】: The right approach to use BeautifulSoup in Python 3
【Posted】: 2018-04-15 16:13:03
【Question】:

I am trying to build a web crawler in Python with the BeautifulSoup library. I want to collect information from every page of a Bitcoin forum topic. I am using the following code to extract the username, status, post date and time, post text, activity, and merit from the forum thread https://bitcointalk.org/index.php?topic=2056041.0:

url='https://bitcointalk.org/index.php?topic=2056041.0'
from bs4 import BeautifulSoup
import requests
import re

def get_html(url):
    r = requests.get(url)
    return r.text


html=get_html(url)
soup=BeautifulSoup(html, 'lxml')

results= soup.findAll("td", {"valign" : "top"})
usernames=[]
for i in results:
    x=i.findAll('b')
    try:
        s=str(x[0])
        if 'View the profile of' in s :
            try:
              found = re.search('of (.+?)">', s).group(1)
              if found.isdigit()==False:
                usernames.append(found)
            except Exception as e :print(e)

    except Exception as e :pass#print(e)
print(len(usernames))
status=[]


for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
       found = re.search(' (.+?)<br/>', s).group(1)
       if len(found)<25:
          status.append(found)
    except:pass
print(len(status))


activity=[]
for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
        x=s.split('Activity: ')[1]
        x=x.split('<br/>')[0]
        activity.append(x)

    except Exception as e :pass   
print(activity)
print(len(activity))
posts=[]
for i in results:
    x=i.findAll("div", {"class": "post"})
    s=str(x)
    try:
        x=s.split('="post">')[1]
        x=x.split('</div>]')[0]
        if x.isdigit()!=True:
            posts.append(x)

    except Exception as e :pass


print(len(posts))

This feels like a very ugly and incorrect solution, with re and all these try/except blocks and so on. Is there a more direct and elegant way to accomplish this task?

【Question discussion】:

    Tags: python python-3.x web-scraping beautifulsoup web-crawler


    【Solution 1】:

    You are right. It is ugly.

    You say you are scraping with BeautifulSoup, but you never actually use the parsed soup object anywhere. If you are going to convert the soup object to a string and parse it with regular expressions, you might as well skip the BeautifulSoup import and run your regular expressions directly on r.text.

    Parsing HTML with regular expressions is a bad idea. Here is why:

    RegEx match open tags except XHTML self-contained tags

    It seems you have only just discovered that BeautifulSoup can parse HTML, without bothering to read the documentation:

    BeautifulSoup Documentation

    Learn how to navigate the HTML tree. The official documentation is more than enough for a simple task like this:

    usernames = []
    statuses = []
    activities = []
    posts = []
    
    for i in soup.find_all('td', {'class': 'poster_info'}):
        j = i.find('div', {'class': 'smalltext'}).find(text=re.compile('Activity'))
        if j:
            usernames.append(i.b.a.text)
            statuses.append(i.find('div', {'class': 'smalltext'}).contents[0].strip())
            activities.append(j.split(':')[1].strip())
            posts.append(i.find_next('td').find('div', {'class': 'post'}).text.strip())
    

    Here is the result of printing their lengths:

    >>> len(usernames), len(statuses), len(activities), len(posts)
    (20, 20, 20, 20)
    

    And here are the actual contents:

    for i, j, k, l in zip(usernames, statuses, activities, posts):
        print('{} - {} - {}:\n{}\n'.format(i, j, k, l))
    

    Result:

    hous26 - Full Member - 280:
    Just curious.  Not counting anything less than a dollar in total worth.  I own 9 coin types:
    
    satoshforever - Member - 84:
    I own three but plan to add three more soon. But is this really a useful question without the size of the holdings?
    
    .
    .
    .
    
    papajamba - Full Member - 134:
    7 coins as of the moment. Thinking of adding xrp again though too. had good profit when it was only 800-900 sats
    
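    Since the question asks for information from all pages of the topic, the same parsing can be wrapped in a loop over per-page URLs. The sketch below assumes bitcointalk addresses pages by post offset (`.0`, `.20`, `.40`, …, 20 posts per page), which matches the URL format in the question but is not confirmed by the source; the parsing logic mirrors the answer's code, using the stdlib `html.parser` instead of lxml:

```python
import re

from bs4 import BeautifulSoup

BASE = 'https://bitcointalk.org/index.php?topic=2056041'
POSTS_PER_PAGE = 20  # assumption: the forum shows 20 posts per page


def page_urls(num_pages):
    # bitcointalk appends the post offset to the topic id: .0, .20, .40, ...
    return ['{}.{}'.format(BASE, n * POSTS_PER_PAGE) for n in range(num_pages)]


def parse_page(html):
    """Extract (username, status, activity, post text) tuples from one page."""
    soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
    rows = []
    for td in soup.find_all('td', {'class': 'poster_info'}):
        j = td.find('div', {'class': 'smalltext'}).find(text=re.compile('Activity'))
        if j:
            rows.append((
                td.b.a.text,
                td.find('div', {'class': 'smalltext'}).contents[0].strip(),
                j.split(':')[1].strip(),
                td.find_next('td').find('div', {'class': 'post'}).text.strip(),
            ))
    return rows


# Fetching all pages would then look like this (requires network access):
# import requests
# all_rows = [row for url in page_urls(3)
#             for row in parse_page(requests.get(url).text)]
```

    Keeping the URL building and the parsing in separate functions also makes the parser testable against a saved HTML snippet without hitting the site.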

    【Discussion】:

    • Thank you very much. I think I need something like this: for i in soup.find_all('td', {'class': 'td_headerandpost'}): jj = i.find('div', {'class': 'smalltext'}); if jj: go.append(jj); print(jj.text)
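    The commenter's one-liner can be written out as a small helper. This is only a sketch: the td_headerandpost class comes from the comment itself, and the assumption that the post date/time sits in that cell's smalltext div is not confirmed by the source:

```python
from bs4 import BeautifulSoup


def post_dates(html):
    """Collect the date/time line from each post header cell."""
    soup = BeautifulSoup(html, 'html.parser')
    dates = []
    for td in soup.find_all('td', {'class': 'td_headerandpost'}):
        small = td.find('div', {'class': 'smalltext'})
        if small:  # guard against header cells without a date line
            dates.append(small.text.strip())
    return dates
```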