【Question Title】: Extracting an integer from a paragraph
【Posted】: 2020-12-24 14:38:01
【Question】:

I am trying to extract only the fee amounts from a paragraph, but I am running into problems. There are two fees, and I want both of them. The page is http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx and this is my code:

fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
if fees_div:
    fees_list = fees_div.find_all('\d+','p')
    course_data['Fees'] = fees_list
    print('fees : ', fees_list)
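Note: `find_all('\d+', 'p')` is not valid usage. `find_all` expects a tag name as its first argument, and a plain string filter is matched literally rather than as a regular expression. One fix is to find the `<p>` tags first and then apply a regex to their text. A minimal sketch, using hypothetical markup that mirrors the page's fees div:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure of the page's fees div
html = """
<div class="Fees hiddenContent pad-around-large tabcontent">
  <p>Fees</p>
  <p>New UK/Republic of Ireland students: £9,250* per year</p>
  <p>New international students: £17,320 per year</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
fees_list = []
if fees_div:
    # Run the regex over each paragraph's text instead of passing it to find_all
    for p in fees_div.find_all('p'):
        fees_list += re.findall(r'£[\d,]+', p.get_text())
    print('fees : ', fees_list)
```

Against this sample the list comes out as ['£9,250', '£17,320'], with the trailing "*" already excluded by the character class.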

【Question Comments】:

    Tags: python web-scraping beautifulsoup python-re web-scraping-language


    【Solution 1】:

    Please try this:

    In [10]: import requests
    In [11]: from bs4 import BeautifulSoup
    In [12]: page = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
    In [13]: soup = BeautifulSoup(page.content, 'html.parser')
    In [14]: [x for x in soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent').text.split() if u"\xA3" in x]
    Out[14]: ['£9,250*', '£17,320']
    

    【Comments】:

    • This works for me, but the first fee has a "*" after it, which I don't think is desirable.
    • [x.replace('*','') for x in soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent').text.split() if u"\xA3" in x] should do the job.
    • Thank you very much. It works when I run it like this, but I am extracting it from another Python file using a different link, so I think that is causing the problem.
    【Solution 2】:

    Give it a try:

    import re
    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
    soup = BeautifulSoup(r.text,'html.parser')
    item = soup.find(id='Panel5').text
    fees = re.findall(r"students:[^£]+(.*?)[*\s]",item)
    print(fees)
    

    Output:

    ['£9,250', '£17,320']
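To see how the pattern works, here it is applied to a hypothetical sample sentence of the kind found in the panel: `[^£]+` skips from "students:" up to the first £ sign, and the lazy group then captures everything up to the next "*" or whitespace:

```python
import re

# Hypothetical text resembling the fees panel's wording
item = ("New UK/Republic of Ireland students: The fee is £9,250* per year. "
        "New international students: The fee is £17,320 per year.")

fees = re.findall(r"students:[^£]+(.*?)[*\s]", item)
print(fees)
```

On this sample the result is ['£9,250', '£17,320'], matching the output shown above.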
    

    【Comments】:

    • Thank you very much; it works on its own, but it does not work in my code, and I am trying to adapt it.
    【Solution 3】:
    import requests
    from bs4 import BeautifulSoup
    import re
    
    r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
    soup = BeautifulSoup(r.text,  'html.parser')
    fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
    m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
    fee1 = m[0]
    m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
    fee2 = m[0]
    print(fee1, fee2)
    

    Prints:

    £9,250 £17,320
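Because the live page may change or be unreachable, the same `p:nth-of-type(...)` technique can be tried as a self-contained sketch against hypothetical markup mirroring the fees div (the second and third `<p>` elements hold the two fees):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample with the fees in the second and third paragraphs
html = """
<div class="Fees hiddenContent pad-around-large tabcontent">
  <p>Fees</p>
  <p>New UK/Republic of Ireland students: £9,250* per year</p>
  <p>New international students: £17,320 per year</p>
</div>
"""

fees_div = BeautifulSoup(html, 'html.parser').find(
    'div', class_='Fees hiddenContent pad-around-large tabcontent')
# nth-of-type counts <p> elements among their siblings, starting at 1
fee1 = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())[0]
fee2 = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())[0]
print(fee1, fee2)
```

Note that `select` relies on the soupsieve package, which is installed by default with current BeautifulSoup releases.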
    

    Update

    You could also scrape the page with Selenium, although it offers no advantage in this case. For example (using Chrome):

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import re
    
    
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(options=options)
    
    driver.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
    soup = BeautifulSoup(driver.page_source,  'html.parser')
    fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
    m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
    fee1 = m[0]
    m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
    fee2 = m[0]
    print(fee1, fee2)
    driver.quit()
    

    Update

    Alternatively, consider just the following: scan the entire HTML source for the fees with a simple regex findall, without using BeautifulSoup at all:

    import requests
    import re
    
    r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
    print(re.findall(r'£[\d,]+', r.text))
    

    Prints:

    ['£9,250', '£17,320']
    

    【Comments】:

    • Traceback (most recent call last): File "/Users/Downloads/pythonProject/Undergraduate/Undergraduate_script.py", line 107, in m = re.search(r'£[\d,]+', fee_div.select('p:nth-of-type(2)')[0].get_text()) AttributeError: 'NoneType' object has no attribute 'select'
    • Neither my code nor the code @idar provided seems to run for you. Are you running the latest version of BeautifulSoup? What version of Python are you using?
    • Maybe the way I am getting the link is affecting it.
    • I would like to show the code but I cannot, because it changes the formatting. I have another Python file that does the extraction, so I have to use that Python file for the link.
    • You need to post a minimal reproducible example. See How to create a Minimal, Reproducible Example.