【Question Title】: Get content of table in website with Python Selenium
【Posted】: 2019-02-23 11:34:46
【Question Description】:

I am trying to get the content of a table on a website using Selenium. The site seems to be built in a fairly complex way, and I cannot find any element, class, or id to use with the find_element_by_... functions.

If anyone knows how to get the content of the second table, the one whose header starts with StaffelNr.Datum...ErgebnisBem., that would be a great help. I have tried a lot (starting with urllib2, ...). The following script mostly works: it loads the site and loops over the top-level containers. But I am not sure how to get at the contents of the table mentioned above.

from selenium import webdriver
from selenium.webdriver.common.by import By

the_url = 'https://www.hvw-online.org/spielbetrieb/ergebnissetabellen/#/league?ogId=3&lId=37133&allGames=1'

driver = webdriver.Chrome()
driver.get(the_url)

elem_high = driver.find_elements(By.CLASS_NAME, 'container')
for e in elem_high:
    print(e)

# what class or element to search for second table
elem_deep = driver.find_elements(By.CLASS_NAME, 'row.game')

driver.close()

Any ideas or comments are welcome. Thanks.

【Question Discussion】:

    Tags: python selenium web-scraping


    【Solution 1】:

    To get the rows you have to wait for the page to load using WebDriverWait; you can find the details here.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    the_url = 'https://www.hvw-online.org/spielbetrieb/ergebnissetabellen/#/league?ogId=3&lId=37133&allGames=1'
    
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    
    driver.get(the_url)
    
    elem_deep = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.schedule tbody > tr")))
    for e in elem_deep:
        print(e.text)
        # Link in last column
        href = e.find_element(By.CSS_SELECTOR, "a[ng-if='row.game.sGID']").get_attribute("href")
        print(href)
    

    But a better solution is to use the requests package to get all the information from the site. The code below is an example of how to scrape it faster and more easily:

    import requests
    
    url = 'https://spo.handball4all.de/service/if_g_json.php?ca=1&cl=37133&cmd=ps&og=3'
    response = requests.get(url).json()
    
    futureGames = response[0]["content"]["futureGames"]["games"]
    for game in futureGames:
        print(game["gHomeTeam"])
        print(game["gGuestTeam"])
        # Link in last column
        print("http://spo.handball4all.de/misc/sboPublicReports.php?sGID=%s" % game["sGID"])
    
        # You can use example of data below to get all you need
        # {
        #     'gID': '2799428',
        #     'sGID': '671616',
        #     'gNo': '61330',
        #     'live': False,
        #     'gToken': '',
        #     'gAppid': '',
        #     'gDate': '30.09.18',
        #     'gWDay': 'So',
        #     'gTime': '14:00',
        #     'gGymnasiumID': '303',
        #     'gGymnasiumNo': '6037',
        #     'gGymnasiumName': 'Sporthalle beim Sportzentrum',
        #     'gGymnasiumPostal': '71229',
        #     'gGymnasiumTown': 'Leonberg',
        #     'gGymnasiumStreet': 'Steinstraße 18',
        #     'gHomeTeam': 'SV Leonb/Elt',
        #     'gGuestTeam': 'JSG Echaz-Erms 2',
        #     'gHomeGoals': '33',
        #     'gGuestGoals': '20',
        #     'gHomeGoals_1': '19',
        #     'gGuestGoals_1': '7',
        #     'gHomePoints': '2',
        #     'gGuestPoints': '0',
        #     'gComment': ' ',
        #     'gGroupsortTxt': ' ',
        #     'gReferee': ' '
        # }
    
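    If a tabular view is handy, the JSON records can be dropped straight into a pandas DataFrame. A minimal sketch using one record copied from the sample data above (using pandas here is my assumption, not part of the original answer):

```python
import pandas as pd

# One record copied from the sample JSON shown above
games = [{
    'sGID': '671616',
    'gDate': '30.09.18',
    'gTime': '14:00',
    'gHomeTeam': 'SV Leonb/Elt',
    'gGuestTeam': 'JSG Echaz-Erms 2',
    'gHomeGoals': '33',
    'gGuestGoals': '20',
}]

df = pd.DataFrame(games)
# Build the report link for each row, as in the loop above
df['report'] = ('http://spo.handball4all.de/misc/sboPublicReports.php?sGID='
                + df['sGID'])
print(df[['gDate', 'gHomeTeam', 'gGuestTeam', 'report']])
```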

    【Discussion】:

    • Works great, thanks. One more thing: the last column contains a link. Why doesn't it show up in e.text, and what do I need to change to get it?
    • How did you find this URL? I can't find it anywhere.
    • In the devtools network tab. You can find the https://spo.handball4all.de/service/if_g_json.php URL there, and the remaining parts from the headers.
    【Solution 2】:

    You can use the CSS class selector

    .schedule
    

    That is:

    table = driver.find_element(By.CSS_SELECTOR, ".schedule")
    

    You will probably need a wait.

    Then loop over the rows:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait 
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    
    driver = webdriver.Chrome()
    url ='https://www.hvw-online.org/spielbetrieb/ergebnissetabellen/#/league?ogId=3&lId=37133&allGames=1'
    driver.get(url)
    
    table = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.schedule')))
    headers = [elem.text for elem in driver.find_elements(By.CSS_SELECTOR, '.schedule th')]
    results = []
    for row in table.find_elements(By.CSS_SELECTOR, 'tr')[1:]:  # skip the header row
        results.append([td.text for td in row.find_elements(By.CSS_SELECTOR, 'td')])
    df = pd.DataFrame(results, columns = headers)
    print(df)
    driver.quit()
    
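    As an aside, pandas can also parse a rendered table in one call with read_html; with Selenium you would feed it driver.page_source once the table is present. A minimal sketch on a stand-in HTML snippet (the markup below is illustrative, not the real page):

```python
from io import StringIO

import pandas as pd

# Stand-in for driver.page_source after the schedule table has rendered
html = """
<table class="schedule">
  <tr><th>Datum</th><th>Heim</th><th>Gast</th><th>Ergebnis</th></tr>
  <tr><td>30.09.18</td><td>SV Leonb/Elt</td><td>JSG Echaz-Erms 2</td><td>33:20</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found in the markup
df = pd.read_html(StringIO(html))[0]
print(df)
```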

    【Discussion】:
