【Question title】: Python Selenium Web Scraping Hidden Div
【Posted】: 2020-04-06 11:06:01
【Question】:

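As the title says, I am trying to scrape some data from a website (example) using Selenium, but I cannot get the data hidden inside each row of the Pro Results table. That data is only shown when you click the "show details" (+) button.

Here is my code: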

from bs4 import BeautifulSoup

from selenium import webdriver

# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Webdriver
# chromedriver is looked up on PATH; Selenium 4 no longer accepts the driver path as a positional argument
wd = webdriver.Chrome(options=options)

# URL
url = 'https://www.tapology.com/fightcenter/fighters/30449-sultan-aliev'

# Load URL
wd.get(url)

# Get HTML
soup = BeautifulSoup(wd.page_source, 'html.parser')

# All rows of the Pro Record table 
rows = soup.findAll('div', {'class': 'result'})

print(len(rows)) 

# [Out] 18

# Try to find all hidden data
hidden = soup.findAll('div', {'class': 'detail tall'})

print(hidden)

# [Out] []

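As you can see, I can easily get the rows of the table, but when I try to get the hidden data I cannot find a way to reach it.

I am also not very familiar with Selenium, so any guidance is welcome.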

【Comments】:

    Tags: python selenium web-scraping


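    【Solution 1】:

    The page uses JavaScript requests to fetch JSON containing the information you need from the Tapology API.
    To retrieve that information, install selenium-wire and use: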

    from seleniumwire import webdriver
    import requests
    # ...
    driver = webdriver.Firefox()
    # scopes are regular expressions; capture api.tapology.com requests only
    driver.scopes = [r'.*api\.tapology\.com.*']
    driver.get('https://www.tapology.com/fightcenter/fighters/30449-sultan-aliev')

    for request in driver.requests:
        # current selenium-wire releases expose the full URL as request.url
        # (request.path is only the path component)
        print(request.url)
        r = requests.get(request.url, headers=dict(request.headers))
        print(r.json())  # the info you need is here
    

    https://api.tapology.com/v1/internal_ranking_items/47211352261  # ranking data
    https://api.tapology.com/v1/internal_fighters/472130449  # fighter data
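selenium-wire also keeps the response of each captured request, so a second round trip with `requests` may not be necessary. The sketch below is one way to read that stored response. It assumes each captured request offers `url`, `response.body` (bytes) and `response.headers`, and that the body is either uncompressed or gzip-compressed. The helper names `decode_json_body` and `captured_json` are made up for this illustration, and other encodings such as Brotli are deliberately rejected rather than guessed at.

```python
import gzip
import json


def decode_json_body(body, content_encoding="identity"):
    """Decode a raw HTTP response body (bytes) into a Python object.

    Only 'identity' and 'gzip' are handled here; anything else raises
    ValueError so that an unsupported encoding is not silently misread.
    """
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding != "identity":
        raise ValueError("unsupported Content-Encoding: " + encoding)
    return json.loads(body.decode("utf-8"))


def captured_json(captured_requests):
    """Yield (url, parsed_json) for every captured request that has a response.

    Assumption: each item exposes .url, .response.body and .response.headers,
    as the request objects in driver.requests are expected to.
    """
    for request in captured_requests:
        if request.response is None:
            continue  # the browser never received a response for this request
        encoding = request.response.headers.get("Content-Encoding", "identity")
        yield request.url, decode_json_body(request.response.body, encoding)
```

With the driver from the answer above, `for url, data in captured_json(driver.requests): print(url, data)` would print each captured API payload.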


    【Comments】:

    • Thanks! This solution is very simple and returns more data than I expected.
    【Solution 2】:

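    You may not need to extract the data from the HTML at all. A quick check in Chrome's developer tools shows that the site queries its data from an API, but you need to send exactly the same request headers.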

    internal_fighters in JSON format

    internal_ranking_items in JSON format
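A minimal sketch of calling such an endpoint directly, using only the standard library. The URL is the `internal_fighters` address quoted in Solution 1. The header name and value are placeholders: the answer does not say which headers the API checks, so the real ones have to be copied from the request shown in the Network tab of the developer tools. `build_api_request` and `fetch_json` are hypothetical helper names.

```python
import json
import urllib.request

# URL as quoted in Solution 1.
FIGHTER_URL = "https://api.tapology.com/v1/internal_fighters/472130449"

# Placeholder only: replace with the headers copied from the browser request.
COPIED_HEADERS = {"Authorization": "<value copied from DevTools>"}


def build_api_request(url, headers):
    """Return a GET request that carries the copied headers unchanged."""
    return urllib.request.Request(url, headers=dict(headers), method="GET")


def fetch_json(request, timeout=10):
    """Send the request and parse the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_api_request(FIGHTER_URL, COPIED_HEADERS))` should return the fighter data only if the copied headers match what the site itself sends.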

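    Another way to solve this is to simulate a click on the button.

    The problem with your "hidden" div is that the div tag is only added dynamically when the user clicks the (+) button.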

    from selenium.webdriver.common.by import By

    # click the expand (+) button of the first row
    expand_button = wd.find_element(By.XPATH, '//*[@id="fighterRecord"]/section[1]/ul/li[1]/div/div[4]/i')
    expand_button.click()
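The XPath above targets only the first row (`li[1]`). A rough sketch of extending the same idea to every row follows. It assumes that dropping the `[1]` index from the `li` step matches the expand icon of each row, which holds only if all rows share the structure of the first one. `expand_all_rows` is a hypothetical helper, and the click goes through JavaScript so that an overlapping element cannot intercept it.

```python
# Assumption: the first-row XPath with the [1] index on li removed,
# so that it matches the expand icon in every row.
ALL_EXPAND_BUTTONS_XPATH = '//*[@id="fighterRecord"]/section[1]/ul/li/div/div[4]/i'


def expand_all_rows(driver, xpath=ALL_EXPAND_BUTTONS_XPATH):
    """Click every element matched by xpath and return how many were clicked."""
    # "xpath" is the string value of selenium's By.XPATH constant.
    buttons = driver.find_elements("xpath", xpath)
    for button in buttons:
        # A JavaScript click works even when the icon is covered or off screen.
        driver.execute_script("arguments[0].click();", button)
    return len(buttons)
```

After `expand_all_rows(wd)`, reading `wd.page_source` again and re-parsing it should include the newly added divs, possibly after a short wait, since the details may load asynchronously.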
    

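    【Comments】:

    • Silly question: how did you find the XPath? I tried this and it works, but only the first row shows its hidden div. How can I do the same for the remaining rows?
    • In Chrome's developer tools, right-click the element > Copy > Copy XPath
    【Solution 3】: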

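    If you want to use only Selenium, try the code below. You need to click each row's expand button to load that row's detail table, then read it with element.get_attribute("textContent").

    Code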

    import time

    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    driver.get("https://www.tapology.com/fightcenter/fighters/30449-sultan-aliev")
    # dismiss the popup that covers the page
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.closebutton_closeButton--3abym"))).click()
    tablerecords = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.result")))
    print(len(tablerecords))
    for row in range(len(tablerecords)):
        # re-locate the rows on every pass, since the DOM changes after each click
        tablerecords = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.result")))
        try:
            expand_btn = tablerecords[row].find_element(By.XPATH, ".//div[@class='more']/i")
            driver.execute_script("arguments[0].click();", expand_btn)
            time.sleep(2)  # give the detail block time to load
            hiddenelements = tablerecords[row].find_element(By.XPATH, "./following-sibling::div[1]").get_attribute('textContent')
            print(hiddenelements)
        except Exception:
            # rows without an expand button or detail block are skipped
            continue
    

    Output

    25
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Welterweight · 170 lbs (77.1 kg) · Weigh-In 170.0 lbs (77.1 kgs)Odds:-120 · Near EvenReferee:Leon Roberts
    UFC on ESPN+ 7
    UFC 230: Cormier vs. Lewis· Aliev Injury
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Welterweight · 170 lbs (77.1 kg) · Weigh-In 171.0 lbs (77.6 kgs)Odds:+250 · Moderate UnderdogReferee:Osiris Maia
    UFC on FOX 26· Aliev Injury
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Welterweight · 170 lbs (77.1 kg) · Weigh-In 171.0 lbs (77.6 kgs)Odds:+135 · Slight UnderdogReferee:Ed CollantesDisclosed Pay:$20,000 ($10K Base, $10K Bonus)
    UFC 202: Diaz vs. McGregor 2· Aliev Injury
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Welterweight · 170 lbs (77.1 kg) · Weigh-In 170.0 lbs (77.1 kgs)Odds:-180 · Slight FavoriteReferee:Bobby RehmanUFC on FOX 14 Performance of the Night
    Billing:Main CardDuration:3 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    Billing:Main CardDuration:3 x 5 Minute RoundsWeight:Middleweight · 185 lbs (84.0 kg) · Weigh-In 185.4 lbs (84.1 kgs)Referee:Valentin Tarasov
    Billing:Main CardDuration:3 x 5 Minute RoundsWeight:Middleweight · 185 lbs (83.9 kg) · Weigh-In 185.8 lbs (84.3 kgs)Odds:-350 · Moderate FavoriteReferee:Herb Dean
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Middleweight · 185 lbs (83.9 kg) · Weigh-In 185.5 lbs (84.1 kgs)Odds:+145 · Slight UnderdogReferee:Joseph Hawes
    Billing:Main CardDuration:3 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    Billing:Main CardDuration:2 x 5 Minute Rounds
    Billing:Main CardDuration:3 x 5 Minute Rounds
    Title Bout:Tournament ChampionshipBilling:Main CardDuration:3 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    ProFC 39: Global Grand Prix (Stage 6)· Omari Akhmedov injury
    Title Bout:Tournament ChampionshipBilling:Main CardDuration:2 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    Billing:Preliminary CardDuration:3 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    Billing:Main CardDuration:2 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
    Title Bout:Tournament ChampionshipBilling:Main EventDuration:2 x 5 Minute RoundsWeight:Light Heavyweight · 205 lbs (93.0 kg)
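Because `textContent` concatenates the label and value cells without separators, each detail line in the output above arrives as one run of text. One possible way to split such a line back into fields is sketched below. The label list is taken only from the sample output shown here, so it is an assumption that may be incomplete for other fighters, and `parse_detail` is a hypothetical helper name.

```python
import re

# Labels observed in the sample output above; other pages may use more.
LABELS = ("Title Bout", "Billing", "Duration", "Weight", "Odds", "Referee", "Disclosed Pay")

# Matches any known label followed by a colon and captures the label.
_LABEL_RE = re.compile("(" + "|".join(re.escape(label) for label in LABELS) + "):")


def parse_detail(text):
    """Split a concatenated 'Label:valueLabel:value' string into a dict.

    Lines that contain no known label produce an empty dict.
    """
    # re.split with a capturing group returns
    # [prefix, label1, value1, label2, value2, ...]
    parts = _LABEL_RE.split(text)
    return dict(zip(parts[1::2], (value.strip() for value in parts[2::2])))
```

For example, `parse_detail("Billing:Main CardDuration:2 x 5 Minute Rounds")` gives `{'Billing': 'Main Card', 'Duration': '2 x 5 Minute Rounds'}`. Trailing text that has no label of its own, such as an award name after the referee, stays attached to the preceding value.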
    
