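【Question Title】: Web scraping under div tags using Beautiful Soup
【Posted】: 2019-08-26 18:12:19
【Question】:

I am trying to scrape a website where the details are spread across various div tags. Every element sits inside a div, and there are span tags nested under those divs, so I have not been able to extract anything. An earlier version of my code returned empty strings.

Here is my code: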

    unspsc_link = "https://order.besse.com/Orders/Search/ProductSearch?query=34431"    
    link = requests.get(unspsc_link).text
    soup = BeautifulSoup(link, 'lxml')
    
    prdItemNumbers = []
    prdTitles = []
    prdSubTitles = []
    prdNDCs = []
    prdUOM = []
    prdForm = []
    
    
    for row in soup.select('.row'):
        prdItemNumbers = row.select_one('.font-xs bg-teal')
        if prdItemNumbers is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumbers.text.strip().replace('\u200b',''))
    
        prdTitles = row.select_one('.header1')
        if prdTitles is None:
            prdTitles.append('N/A')
        else:
            prdTitles.append(prdTitles.text.strip())
    
        prdSubTitles = row.select_one('.header2')
        if prdSubTitles is None:
            prdSubTitles.append('N/A')
        else:
            prdSubTitles.append(prdSubTitles.text.strip())    
    
        prdNDCs = row.select_one('.col-sm-5')
        if prdNDCs is None:
            prdNDCs.append('N/A')
        else:
            prdNDCs.append(prdNDCs.text.strip())
    
        prdUOM = row.select_one('.col-sm-3')
        if prdUOM is None:
            prdUOM.append('N/A')
        else:
            prdUOM.append(prdUOM.text.strip())
    
        prdForm = row.select_one('.col-sm-4')
        if prdForm is None:
            prdForm.append('N/A')
        else:
            prdForm.append(prdForm.text.strip())

Error:

    prdItemNumbers.append('N/A')

    AttributeError: 'NoneType' object has no attribute 'append'

【Discussion】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    This

    for row in soup.select('.row'):
        prdItemNumbers = row.select_one('.font-xs bg-teal')
        if prdItemNumbers is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumbers.text.strip().replace('\u200b',''))
    

    should be

    for row in soup.select('.list-group-item'):
        prdItemNumber = row.select_one('.font-xs bg-teal')
        if prdItemNumber is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumber.text.strip().replace('\u200b',''))
    

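    The None test should be made on prdItemNumber, which is the element the current iteration tried to match, not on the list you are appending to. In the original code, assigning the result of select_one to prdItemNumbers replaces the list with None whenever nothing matches, which is why append fails. The same principle applies to the other fields, and all the list variable names should be plural. Also, the parent class to loop over should be list-group-item.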
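
As a minimal sketch of the naming issue described above, the snippet below reproduces the error without any scraping and then shows the corrected pattern. The names `values`, `lookups`, and `item_numbers` and the sample strings are hypothetical stand-ins for what `select_one` might return; they are not taken from the real page.

```python
# Reproduce the bug: the list name is rebound to the lookup result.
values = []
values = None  # stand-in for `prdItemNumbers = row.select_one(...)` with no match
try:
    values.append("N/A")
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Corrected pattern: plural name for the list, singular name for each result.
item_numbers = []
lookups = [" 34431\u200b", None, "90210 "]  # hypothetical per-row results

for found in lookups:
    if found is None:
        item_numbers.append("N/A")
    else:
        # strip() does not remove the zero-width space, so replace it explicitly
        item_numbers.append(found.strip().replace("\u200b", ""))

print(item_numbers)
```

Keeping the list and the per-row result under different names means the list is never overwritten, so `append` is always called on a list.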

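    The content also appears to be loaded dynamically by an XHR POST request, so requests alone will not see it. You can load the page with selenium and then carry on as before using page_source: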

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
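    # Path to a local chromedriver binary (adjust for your machine).
    # Note: newer Selenium 4 releases expect the path via a Service object instead.
    d = webdriver.Chrome(r'C:\Users\HarrisQ\Documents\chromedriver.exe')
    d.get('https://order.besse.com/Orders/Search/ProductSearch?query=34431')
    # Wait up to 10 seconds for the dynamically loaded result rows to be present
    WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-group-item")))
    # Parse the rendered HTML rather than the raw response
    soup = BeautifulSoup(d.page_source, 'lxml')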
    prdItemNumbers = []
    prdTitles = []
    prdSubTitles = []
    prdNDCs = []
    prdUOMs = []
    prdForms = []
    
    for row in soup.select('.list-group-item'):
    
        prdItemNumber = row.select_one('.font-xs bg-teal')
        if prdItemNumber is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumber.text.strip().replace('\u200b',''))
    
        prdTitle = row.select_one('.header1')
        if prdTitle is None:
            prdTitles.append('N/A')
        else:
            prdTitles.append(prdTitle.text.strip())
    
        prdSubTitle = row.select_one('.header2')
        if prdSubTitle is None:
            prdSubTitles.append('N/A')
        else:
            prdSubTitles.append(prdSubTitle.text.strip())    
    
        prdNDC = row.select_one('.col-sm-5')
        if prdNDC is None:
            prdNDCs.append('N/A')
        else:
            prdNDCs.append(prdNDC.text.strip())
    
        prdUOM = row.select_one('.col-sm-3')
        if prdUOM is None:
            prdUOMs.append('N/A')
        else:
            prdUOMs.append(prdUOM.text.strip())
    
        prdForm = row.select_one('.col-sm-4')
        if prdForm is None:
            prdForms.append('N/A')
        else:
            prdForms.append(prdForm.text.strip())
    d.quit()
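
The repeated if/else blocks above could be collapsed into one loop over a mapping of field names to selectors. The following is only a sketch of that idea, not part of the original answer: `SELECTORS`, `extract_rows`, and the sample markup are hypothetical, and the selectors are copied from the code above without any check against the real page. It uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Field name -> CSS selector, copied from the answer's code (not verified).
SELECTORS = {
    "title": ".header1",
    "subtitle": ".header2",
    "ndc": ".col-sm-5",
    "uom": ".col-sm-3",
    "form": ".col-sm-4",
}

def extract_rows(html, row_selector=".list-group-item"):
    """Return one dict per row, using 'N/A' for any field that is missing."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(row_selector):
        record = {}
        for field, selector in SELECTORS.items():
            element = row.select_one(selector)
            record[field] = "N/A" if element is None else element.text.strip()
        rows.append(record)
    return rows

# Hypothetical sample markup, only to exercise the function.
sample = """
<div class="list-group-item">
  <div class="header1">Product A</div>
  <div class="col-sm-5">NDC 123</div>
</div>
"""
rows = extract_rows(sample)
print(rows)
```

With Selenium, the same function could be called as `extract_rows(d.page_source)` before `d.quit()`. One dict per row also keeps the fields of a product together, which separate parallel lists do not guarantee.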
    

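    【Comments】:

    • Hey @QHarr, what is the difference between the two?
    • prdItemNumbers is the list you append to, whereas prdItemNumber is the element you are trying to match, which can be either a Tag or None.
    • Hey QHarr, it only retrieves N/A and useless values. I think I am not using the right classes this time; it does not throw an error.
    • Hi, quite possibly. As shown, I was fixing your error.
    • This page also seems to depend on JavaScript to some extent.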
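
One possible cause of the N/A values mentioned in the comments is the selector `'.font-xs bg-teal'`. In CSS, a space is a descendant combinator, so that selector looks for a `<bg-teal>` tag inside an element with class `font-xs`. If the target is instead a single element carrying both classes, the selector would be `'.font-xs.bg-teal'`. The sketch below illustrates the difference on hypothetical markup; the real page's HTML is not shown in this thread, so whether this applies there is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span carrying both classes.
html = '<div class="list-group-item"><span class="font-xs bg-teal"> 34431 </span></div>'
soup = BeautifulSoup(html, "html.parser")
row = soup.select_one(".list-group-item")

# Space = descendant combinator: a <bg-teal> tag inside .font-xs (no match here)
descendant_match = row.select_one(".font-xs bg-teal")

# No space = compound selector: one element with both classes
compound_match = row.select_one(".font-xs.bg-teal")

print(descendant_match)             # None
print(compound_match.text.strip())  # 34431
```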