[Question title]: How to select a specific row from a table using BeautifulSoup?
[Posted]: 2020-05-14 15:50:12
[Question description]:

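I have a question related to my previous one, but I realized I need to go one level further to get the 11-digit NDC codes rather than the 10-digit ones. Rather than converting them later, I figured I could grab them up front. This is the link to the previous question: Is there a way to parse data from multiple pages from a parent webpage? What I want to do is follow the link here (this is the second level, by the way)

and then scrape the 11-digit NDC code generated on the next page.

I was able to write code that reaches that page, but I'm not sure how to select the value. The number is inside one tag and then another tag, but I only want a specific row of the table, so I thought I could index into it like this. Instead, I get None and td for the whole list. Here is my code: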

import requests
from bs4 import BeautifulSoup    
url ='https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)

[Question comments]:

    Tags: python-3.x parsing web-scraping beautifulsoup


    [Solution 1]:

    A small modification of your code:

    import requests
    from bs4 import BeautifulSoup
    url ='https://ndclist.com/?s=Trospium'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    all_data = []
    for a in soup.select('[data-title="NDC"] a[href]'):
        link_url = a['href']
        print('Processing link {}...'.format(link_url))
    
        soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
        for b in soup2.select('#product-packages a'):
            link_url2 = b['href']
            print('\tProcessing link {}... '.format(link_url2))
            soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
            ndc_billing_format = soup3.select_one('td:contains("11-Digit NDC Billing Format") + td').contents[0].strip()
            print('\t\t{}'.format(ndc_billing_format))
            all_data.append(ndc_billing_format)
    
    print('Trospium')
    print(all_data)
    

    Prints:

    Processing link https://ndclist.com/ndc/0574-0118...
        Processing link https://ndclist.com/ndc/0574-0118/package/0574-0118-30... 
            00574011830
    Processing link https://ndclist.com/ndc/0574-0145...
        Processing link https://ndclist.com/ndc/0574-0145/package/0574-0145-60... 
            00574014560
    Processing link https://ndclist.com/ndc/0591-3636...
        Processing link https://ndclist.com/ndc/0591-3636/package/0591-3636-05... 
            00591363605
        Processing link https://ndclist.com/ndc/0591-3636/package/0591-3636-30... 
            00591363630
        Processing link https://ndclist.com/ndc/0591-3636/package/0591-3636-60... 
            00591363660
    Processing link https://ndclist.com/ndc/23155-530...
        Processing link https://ndclist.com/ndc/23155-530/package/23155-530-02... 
            23155053002
        Processing link https://ndclist.com/ndc/23155-530/package/23155-530-05... 
            23155053005
        Processing link https://ndclist.com/ndc/23155-530/package/23155-530-06... 
            23155053006
    Processing link https://ndclist.com/ndc/42291-846...
        Processing link https://ndclist.com/ndc/42291-846/package/42291-846-60... 
            42291084660
    Processing link https://ndclist.com/ndc/60429-098...
        Processing link https://ndclist.com/ndc/60429-098/package/60429-098-30... 
            60429009830
    Processing link https://ndclist.com/ndc/60505-3454...
        Processing link https://ndclist.com/ndc/60505-3454/package/60505-3454-5... 
            60505345405
        Processing link https://ndclist.com/ndc/60505-3454/package/60505-3454-6... 
            60505345406
        Processing link https://ndclist.com/ndc/60505-3454/package/60505-3454-8... 
            60505345408
    Processing link https://ndclist.com/ndc/68001-228...
        Processing link https://ndclist.com/ndc/68001-228/package/68001-228-04... 
            68001022804
    Processing link https://ndclist.com/ndc/68462-461...
        Processing link https://ndclist.com/ndc/68462-461/package/68462-461-05... 
            68462046105
        Processing link https://ndclist.com/ndc/68462-461/package/68462-461-30... 
            68462046130
        Processing link https://ndclist.com/ndc/68462-461/package/68462-461-60... 
            68462046160
    Processing link https://ndclist.com/ndc/69097-912...
        Processing link https://ndclist.com/ndc/69097-912/package/69097-912-02... 
            69097091202
        Processing link https://ndclist.com/ndc/69097-912/package/69097-912-03... 
            69097091203
        Processing link https://ndclist.com/ndc/69097-912/package/69097-912-15... 
            69097091215
    Processing link https://ndclist.com/ndc/69150-258...
        Processing link https://ndclist.com/ndc/69150-258/package/69150-258-06... 
            69150025806
    Processing link https://ndclist.com/ndc/76282-336...
        Processing link https://ndclist.com/ndc/76282-336/package/76282-336-60... 
            76282033660
    Trospium
    ['00574011830', '00574014560', '00591363605', '00591363630', '00591363660', '23155053002', '23155053005', '23155053006', '42291084660', '60429009830', '60505345405', '60505345406', '60505345408', '68001022804', '68462046105', '68462046130', '68462046160', '69097091202', '69097091203', '69097091215', '69150025806', '76282033660']
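
To see why the loop in the question printed only `td` and `None`, and why the sibling selector picks the right cell, here is a minimal offline sketch. The markup in `SAMPLE_HTML` is a made-up stand-in for a package page (the real ndclist.com layout may differ), and `extract_billing_ndc` is a hypothetical helper name. It uses `:-soup-contains`, the newer spelling of `:contains` in Soup Sieve 2.1 and later; on older installs the `:contains` form from the answer above is the one to use.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a package page; the real layout may differ.
SAMPLE_HTML = """
<table>
<tr>
<td>NDC Package Code</td>
<td>0574-0118-30</td>
</tr>
<tr>
<td>11-Digit NDC Billing Format</td>
<td> 00574011830 <small>5-4-2</small></td>
</tr>
</table>
"""

LABEL = "11-Digit NDC Billing Format"


def extract_billing_ndc(html):
    """Return the leading text of the cell right after the label cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # "A + B" matches B only when it is the element immediately following A.
    cell = soup.select_one('td:-soup-contains("{}") + td'.format(LABEL))
    if cell is None or not cell.contents:
        return None
    # Assumption: the first child node is the text we want (here, the text
    # before <small>), which mirrors the .contents[0] call in the answer.
    return cell.contents[0].strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
second_row = soup.find_all("tr", limit=7)[1]

# Iterating over a Tag yields its direct children: the <td> tags plus the
# whitespace strings between them, which have no tag name (reported as None).
child_names = [getattr(child, "name", None) for child in second_row]
print(child_names)

billing_ndc = extract_billing_ndc(SAMPLE_HTML)
print(billing_ndc)
```

Indexing `[1]` on the row list returns a single row, so the original `for` loop walked that row's children rather than a list of rows. That is why every printed name was either `td` or `None`.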
    

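    [Comments]:

    • Thanks again! I thought I could index into it. At first I tried it this way, but with .select("11-Digit NDC Billing Format"), which returned nothing. Now I see why: it needs 'td:contains() + td'.
    • I get this error on your select_one line: "NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type." I'll look into it and see what the problem is.
    • @Alex You're using an old version of BeautifulSoup; I'm using beautifulsoup4==4.9.0. You need to upgrade it.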
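
If upgrading is not an option, the same cell can be reached without any CSS pseudo-class. The sketch below is one possible fallback, not the answerer's method: `find_value_cell` and the sample markup are hypothetical, and it assumes the value cell is the next `<td>` sibling of the label cell. It also prints `bs4.__version__` so the installed release can be checked.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page may be laid out differently.
SAMPLE_HTML = """
<table>
<tr><td>NDC Package Code</td><td>0574-0118-30</td></tr>
<tr><td>11-Digit NDC Billing Format</td><td> 00574011830 </td></tr>
</table>
"""


def find_value_cell(soup, label):
    """Return the stripped text of the <td> following the first <td> whose
    text contains `label`, or None when no such pair exists."""
    for cell in soup.find_all("td"):
        if label in cell.get_text():
            # find_next_sibling skips the text nodes between elements.
            value_cell = cell.find_next_sibling("td")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


# Shows which Beautiful Soup release is installed.
print(bs4.__version__)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
billing_ndc = find_value_cell(soup, "11-Digit NDC Billing Format")
missing = find_value_cell(soup, "No such label")
print(billing_ndc)
```

Because it relies only on `find_all` and `find_next_sibling`, this version does not depend on which pseudo-classes the installed selector engine supports.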