【Question title】: Web Scraping through Python BeautifulSoup
【Posted】: 2018-08-03 03:03:39
【Question】:

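I'm a beginner in Python.

I'm trying to scrape data from a website and have managed to write the code below.

However, I'm not sure how to proceed, because I can't get the href attributes that would let me open each listing and collect its data. I also don't know HTML tags very well, so I suspect I'm not identifying the right tag.

Here is my code: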

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org/?p={0}&category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'sabai-directory-title'})
    hrefs = [link['href'] for link in links]

The code above produces hrefs as an empty list. Any help would be much appreciated!

Thanks!

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup


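    【Solution 1】:

    The code is fine; the class you are looking for does not exist on those pages. For example, after inspecting https://directory.singaporefintech.org/hello-world/?category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random I replaced the sabai-directory-title class with the comment reply link class and got results once I added a print statement.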

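    【Comments】:

    • I'm sorry, but I'm not very good at identifying tags. I did inspect the element and found that the href I need to click to open a particular listing sits under the div with the class sabai-directory-title. Below is the HTML tag; please suggest a solution:
    • directory.singaporefintech.org/directory/listing/amaas" title="AMaaS" class=" sabai-entity-permalink sabai-entity-id-43 sabai-entity-type-content sabai-entity-bundle-name-directory-listing sabai-entity-bundle-type-directory-listing">AMaaS
    • What is the URL of the page containing the a tag you quoted?
    • directory.singaporefintech.org is the site I'm trying to scrape data from. It has 27 pages, and each listing's information becomes visible when I click on it.
    • Oh, I see. What you need to do is first get the div with that class, then get the a tag inside it and extract the href. The "sabai-directory-title" class is not on the "a" element but on a div element. It also seems that the base URL in the "pages" variable you append to urls may be incorrect, since it differs from the link you gave in your comment (directory.singaporefintech.org does have div elements with the "sabai-directory-title" class).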
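The point made in the last comment can be checked offline. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup modelled on the snippet quoted in the comments, and the real page may be structured differently. It shows why filtering a tags by the class returns nothing, and how going through the div first recovers the href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet quoted in the comments;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/amaas"
     title="AMaaS" class="sabai-entity-permalink">AMaaS</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The class is on the <div>, so filtering <a> tags by it matches nothing.
links_on_a = soup.find_all("a", class_="sabai-directory-title")

# Quick diagnostic: which tag names actually carry the class?
tags_with_class = sorted(
    {tag.name for tag in soup.find_all(class_="sabai-directory-title")}
)

# Find the <div> first, then the <a> tags inside it that have an href.
hrefs = [
    a["href"]
    for div in soup.find_all("div", class_="sabai-directory-title")
    for a in div.find_all("a", href=True)
]

print(links_on_a)       # []
print(tags_with_class)  # ['div']
print(hrefs)
```

Running the same diagnostic against the downloaded page is a quick way to confirm which element carries a class before writing the final selector.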
    【Solution 2】:

    You can use a CSS selector to scrape the links. The selector div.sabai-directory-title a finds every <a> tag inside a <div> tag with the class sabai-directory-title (I updated the URL; yours gave me an error page):

    from bs4 import BeautifulSoup
    import requests
    from pprint import pprint
    
    r = requests.get('https://directory.singaporefintech.org/')
    soup = BeautifulSoup(r.text, 'lxml')
    
    hrefs = [a['href'] for a in soup.select('div.sabai-directory-title a')]
    
    pprint(hrefs)
    

    This prints:

    ['https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/incomlend',
     'https://directory.singaporefintech.org/directory/listing/bizgrow',
     'https://directory.singaporefintech.org/directory/listing/makerscut',
     'https://directory.singaporefintech.org/directory/listing/soho-fintech',
     'https://directory.singaporefintech.org/directory/listing/dxmarkets',
     'https://directory.singaporefintech.org/directory/listing/fundrevo',
     'https://directory.singaporefintech.org/directory/listing/money4money',
     'https://directory.singaporefintech.org/directory/listing/onelyst',
     'https://directory.singaporefintech.org/directory/listing/hearti-lab',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/arcadier',
     'https://directory.singaporefintech.org/directory/listing/plmp-fintech-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/cash-in-asia',
     'https://directory.singaporefintech.org/directory/listing/grc-systems',
     'https://directory.singaporefintech.org/directory/listing/sendexpense',
     'https://directory.singaporefintech.org/directory/listing/jinjerjade',
     'https://directory.singaporefintech.org/directory/listing/hatcher',
     'https://directory.singaporefintech.org/directory/listing/fintech-consortium']
    

    【Comments】:

      【Solution 3】:

      Hi, I made a few changes to the code:

      import requests
      from bs4 import BeautifulSoup
      from pprint import pprint
      
      # Note: the same base URL is added four times, so the page is fetched four times.
      urls = []
      for i in range(1,5):
          pages = "https://directory.singaporefintech.org"
          urls.append(pages)
      
      Data = []
      for info in urls:
          page = requests.get(info)
          soup = BeautifulSoup(page.content, 'html.parser')
          # The class is on the <div>, so find the divs first...
          links = soup.find_all('div', attrs={'class': 'sabai-directory-title'})
          for link in links:
              # ...then collect the href of every non-empty <a> inside each div.
              Data.extend([a['href'] for a in link.find_all('a', href=True) if a.text])
      pprint(Data)
      

      Output:

           ['https://directory.singaporefintech.org/directory/listing/silent-eight',
           'https://directory.singaporefintech.org/directory/listing/moolahsense',
           'https://directory.singaporefintech.org/directory/listing/myfinb',
           'https://directory.singaporefintech.org/directory/listing/wefinance',
           'https://directory.singaporefintech.org/directory/listing/quber',
           'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/ceo-1',
           'https://directory.singaporefintech.org/directory/listing/acekards',
           'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
           'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/fundmylife',
           'https://directory.singaporefintech.org/directory/listing/mooments',
           'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/junotele_',
           'https://directory.singaporefintech.org/directory/listing/mobilecover',
           'https://directory.singaporefintech.org/directory/listing/cherrypay',
           'https://directory.singaporefintech.org/directory/listing/toast',
           'https://directory.singaporefintech.org/directory/listing/cashdab',
           'https://directory.singaporefintech.org/directory/listing/silent-eight',
           'https://directory.singaporefintech.org/directory/listing/moolahsense',
           'https://directory.singaporefintech.org/directory/listing/myfinb',
           'https://directory.singaporefintech.org/directory/listing/wefinance',
           'https://directory.singaporefintech.org/directory/listing/quber',
           'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/ceo-1',
           'https://directory.singaporefintech.org/directory/listing/acekards',
           'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
           'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/fundmylife',
           'https://directory.singaporefintech.org/directory/listing/mooments',
           'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/junotele_',
           'https://directory.singaporefintech.org/directory/listing/mobilecover',
           'https://directory.singaporefintech.org/directory/listing/cherrypay',
           'https://directory.singaporefintech.org/directory/listing/toast',
           'https://directory.singaporefintech.org/directory/listing/cashdab',
           'https://directory.singaporefintech.org/directory/listing/silent-eight',
           'https://directory.singaporefintech.org/directory/listing/moolahsense',
           'https://directory.singaporefintech.org/directory/listing/myfinb',
           'https://directory.singaporefintech.org/directory/listing/wefinance',
           'https://directory.singaporefintech.org/directory/listing/quber',
           'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/ceo-1',
           'https://directory.singaporefintech.org/directory/listing/acekards',
           'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
           'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/fundmylife',
           'https://directory.singaporefintech.org/directory/listing/mooments',
           'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/junotele_',
           'https://directory.singaporefintech.org/directory/listing/mobilecover',
           'https://directory.singaporefintech.org/directory/listing/cherrypay',
           'https://directory.singaporefintech.org/directory/listing/toast',
           'https://directory.singaporefintech.org/directory/listing/cashdab',
           'https://directory.singaporefintech.org/directory/listing/silent-eight',
           'https://directory.singaporefintech.org/directory/listing/moolahsense',
           'https://directory.singaporefintech.org/directory/listing/myfinb',
           'https://directory.singaporefintech.org/directory/listing/wefinance',
           'https://directory.singaporefintech.org/directory/listing/quber',
           'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/ceo-1',
           'https://directory.singaporefintech.org/directory/listing/acekards',
           'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
           'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/fundmylife',
           'https://directory.singaporefintech.org/directory/listing/mooments',
           'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
           'https://directory.singaporefintech.org/directory/listing/junotele_',
           'https://directory.singaporefintech.org/directory/listing/mobilecover',
           'https://directory.singaporefintech.org/directory/listing/cherrypay',
           'https://directory.singaporefintech.org/directory/listing/toast',
           'https://directory.singaporefintech.org/directory/listing/cashdab']
      

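      Is this the output you expected?

      Hope this helps!

      【Comments】:

      • Yes, it gives the output you mentioned, but how do I loop over it to get the information from the various listings on each page? Sorry, I'm stuck on this.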
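One possible way to approach the follow-up question is sketched below. It is not from any of the answers: the helper names (extract_listing_links, fetch_html, collect_listing_links, scrape_listing) are made up for illustration, and it assumes that the "p" query parameter from the question's URL selects the result page, which should be verified in a browser since that URL reportedly returned an error page. The fields available on a listing page are unknown here, so the sketch only reads the page title.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://directory.singaporefintech.org/"


def extract_listing_links(html):
    """Return the listing hrefs found inside div.sabai-directory-title."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("div.sabai-directory-title a[href]")]


def fetch_html(url, params=None):
    """Download a page, raising on HTTP errors rather than parsing an error page."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.text


def collect_listing_links(page_numbers, fetch=fetch_html):
    """Gather listing links across result pages, dropping duplicates in order."""
    seen = {}
    for number in page_numbers:
        # Assumption: the "p" query parameter selects the result page.
        html = fetch(BASE_URL, {"p": number})
        for href in extract_listing_links(html):
            seen.setdefault(href, None)
    return list(seen)


def scrape_listing(url, fetch=fetch_html):
    """Fetch one listing page and return a generic field from it."""
    soup = BeautifulSoup(fetch(url), "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {"url": url, "title": title}


# Example usage (performs real network requests):
# links = collect_listing_links(range(1, 28))  # 27 pages, per the comments
# details = [scrape_listing(link) for link in links]
```

Passing the fetch function as a parameter keeps the parsing logic testable without hitting the site; scrape_listing would need to be extended with selectors for whatever fields the listing pages actually contain.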