[Question title]: Scrape text from svg using BeautifulSoup
[Posted]: 2019-05-27 05:35:27
[Question description]:

I am a beginner in Python, and I am trying to get the actual annual spend price using BeautifulSoup. I am having a hard time figuring out what I should use to extract the text from the svg.

The code I have written so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

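Note that the chart on this page is drawn client-side by JavaScript (Highcharts), so the HTML returned by `urlopen` contains no svg text nodes for BeautifulSoup to find. If the svg were present in the static markup, its labels could be read with a CSS selector; a minimal sketch, using made-up inline markup:

```python
from bs4 import BeautifulSoup

# Hypothetical static markup resembling what Highcharts renders in the browser.
# BeautifulSoup only sees such nodes when they are in the downloaded HTML; on
# the question's page the chart is drawn after load, so the raw response has
# no tspan elements at all.
html = '<svg><g class="highcharts-title"><text><tspan>Annual: $1,234.56</tspan></text></g></svg>'

static_soup = BeautifulSoup(html, 'html.parser')
labels = [t.get_text() for t in static_soup.select('tspan')]
print(labels)  # ['Annual: $1,234.56']
```

This is why the answers below reach for selenium (which runs the JavaScript) or for the raw script tags (where the data is embedded before rendering).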
[Question discussion]:

    Tags: python html web-scraping


    [Solution 1]:

    Monthly data:

    With selenium, you can get the monthly info by moving over each line:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.action_chains import ActionChains
    
    url = 'http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810'
    d = webdriver.Chrome()
    actions = ActionChains(d)
    d.get(url)
    paths = WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".highcharts-plot-lines-0 path")))
    results = []
    for path in paths:
        actions.move_to_element(path).perform()
        actions.click_and_hold(path).perform()
        items = d.find_elements_by_css_selector('#priceChart path + text tspan')
        result = [item.text for item in items]
        if result:
            results.append(result)
    
    print(results)
    


    Annual data:

    It's a little ugly, but you can regex the information out of one of the script tags. This is for the annual, not the monthly, data.

    import requests
    from bs4 import BeautifulSoup as bs
    import re
    import locale
    
    res = requests.get('http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810')
    soup = bs(res.content, 'lxml')
    script = soup.select('script')[19]
    items = str(script).split('series:')
    item = items[2].split('exporting')[0][:-15]
    p1 = re.compile(r'name:(.*)]')
    p2 = re.compile(r'(\d+\.\d+)+')
    it = re.finditer(p1, item)
    names = [match.group(1).split(',')[0].strip().replace("'",'') for match in it]
    it2 = re.finditer(p2, item)
    allNumbers = [float(match.group(1)) for match in it2]
    actualAnnuals = allNumbers[0::2]
    abacusAnnuals = allNumbers[1::2]
    actuals = list(zip(names,actualAnnuals))
    abacus = list(zip(names,abacusAnnuals))
    
    #Examples:
    print(actuals,abacus)
    
    locale.setlocale(locale.LC_ALL, 'English')  # Windows locale name; use e.g. 'en_US.UTF-8' on Linux/macOS
    print(locale.format_string('%.2f', sum(actualAnnuals), grouping=True))
    
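    The splitting and regex steps above can be illustrated in isolation. The snippet below runs the same name/number extraction against a small, invented excerpt of a Highcharts `series:` config (the drug names and figures are made up, and the patterns are simplified variants of `p1`/`p2`; the real script tag is much longer, so the exact patterns may need tuning):

```python
import re

# Hypothetical excerpt of the 'series:' config embedded in the page's script
# tag. The names and numbers are invented for illustration; each series is
# assumed to carry its actual and abacus annual figures in alternation.
item = "[{name: 'Drug A', data: [12.50, 11.75]}, {name: 'Drug B', data: [8.25, 9.10]}]"

p1 = re.compile(r"name:(.*?)\}")  # capture each series entry up to its closing brace
p2 = re.compile(r"(\d+\.\d+)")    # capture every decimal number

names = [m.group(1).split(',')[0].strip().replace("'", '') for m in re.finditer(p1, item)]
allNumbers = [float(m.group(1)) for m in re.finditer(p2, item)]

# Actual annuals sit at even indices, abacus annuals at odd indices.
actuals = list(zip(names, allNumbers[0::2]))
abacus = list(zip(names, allNumbers[1::2]))
print(actuals)  # [('Drug A', 12.5), ('Drug B', 8.25)]
print(abacus)   # [('Drug A', 11.75), ('Drug B', 9.1)]
```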

    With selenium, you can easily get the headline annual figure with a CSS type selector:

    from selenium import webdriver
    
    d = webdriver.Chrome()
    d.get('http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810')
    print(d.find_element_by_css_selector('tspan').text)
    

    Annual abacus figure, price sheet and scenario:

    print(d.find_elements_by_css_selector('tspan')[3].text, d.find_element_by_css_selector('#Options_price_sheet_id [selected]').text, d.find_element_by_css_selector('#Options_scenario_id [selected]').text ) 
    

    [Discussion]:

    • Yes, that works, thank you. For the annual figures: how would I extract the abacus annual amount?
    • I would also like to scrape the "price sheet" and "scenario"
    • print(d.find_elements_by_css_selector('tspan')[3].text, d.find_element_by_css_selector('#Options_price_sheet_id [selected]').text, d.find_element_by_css_selector('#Options_scenario_id [selected]').text)
    • Thanks for your help
    • Sorry to bother you again: rather than the totals for the annual data, I'd like the totals for an individual drug that appear when you hover over the chart. How would I do that?