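【Question title】: Get all the values with the same class name in Selenium
【Posted】: 2020-06-19 02:21:25
【Question】:

I want to get the article title and URL of every article that shares the same class name. The problem is that the code prints the same single article over and over instead of all the articles.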

from selenium import webdriver
driver = webdriver.Chrome(r'C:\Users\muhammad.usman\Downloads\chromedriver_win32\chromedriver.exe')
driver.get('https://www.aljazeera.com/news/')
# to get the current location ...
driver.current_url
button = driver.find_element_by_id('btn_showmore_b1_418')
driver.execute_script("arguments[0].click();", button)
content = driver.find_element_by_class_name('topics-sec-block')
print(content)
container = content.find_elements_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]')
print(container)
i=0
for i in range(0, 12):
    title = []
    url = []
    heading=container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = container[i].find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    title.append(heading)
    url.append(link.get_attribute('href'))
    print(title)
    print(url)
    i += 1
names = driver.find_elements_by_css_selector('div.topics-sec-item-cont')
for name in names:

    heading=name.find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a/h2').text
    link = name.find_element_by_xpath('//div[@class="col-sm-7 topics-sec-item-cont"]/a')
    print(heading)
    print(link.get_attribute('href'))

【Comments】:

    Tags: python selenium selenium-webdriver web-scraping web-crawler


    【Solution 1】:

    Using Selenium and BeautifulSoup

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup

    # Selenium 4 style: the chromedriver path is passed through a Service object
    driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
    driver.get('https://www.aljazeera.com/news/')

    # Click "show more" through JavaScript to load additional articles
    button = driver.find_element(By.ID, 'btn_showmore_b1_418')
    driver.execute_script("arguments[0].click();", button)

    # Parse the rendered page and select every article container
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    container = soup.select('div.topics-sec-item-cont')

    titleList = []
    urlList = []
    for item in container:
        # item.find() searches only inside the current container
        heading = item.find('h2').text
        link = item.find('a')['href']  # root-relative href as written in the page source
        titleList.append(heading)
        urlList.append(link)
        print('HEADLINE: %s\nUrl: https://www.aljazeera.com%s\n' % (heading, link) + '-'*70 + '\n')

    driver.quit()
    

    Output:

    HEADLINE: Trump's Remain in Mexico policy endangers migrants headed to US
    Url: https://www.aljazeera.com/news/2020/03/trumps-remain-mexico-policy-endangers-migrants-headed-200306102155930.html
    ----------------------------------------------------------------------
    
    HEADLINE: India, South Korea report new coronavirus cases: Live updates
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------
    
    HEADLINE: Clashes between Greek police, migrants reported on Turkish border
    Url: https://www.aljazeera.com/topics/subjects/refugees.html
    ----------------------------------------------------------------------
    
    HEADLINE: Congo protests against unpaid pensions as gov't debt balloons
    Url: https://www.aljazeera.com/topics/regions/africa.html
    ----------------------------------------------------------------------
    
    HEADLINE: Is India prepared for coronavirus outbreak?
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------
    
    HEADLINE: India protest violence leaves thousands displaced
    Url: https://www.aljazeera.com/topics/regions/asia.html
    ----------------------------------------------------------------------
    
    HEADLINE: Guinea protests: One dead in anti-government demonstration
    Url: https://www.aljazeera.com/topics/regions/africa.html
    ----------------------------------------------------------------------
    
    HEADLINE: Brazil recalls diplomats, officials from Venezuela
    Url: https://www.aljazeera.com/topics/country/brazil.html
    ----------------------------------------------------------------------
    
    HEADLINE: US coronavirus: rise in cases in New York state
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------
    
    HEADLINE: Australia urged to take action amid rising violence against women
    Url: https://www.aljazeera.com/topics/country/australia.html
    ----------------------------------------------------------------------
    
    HEADLINE: Turkey, Russia announce ceasefire in Syria's Idlib
    Url: https://www.aljazeera.com/topics/regions/middleeast.html
    ----------------------------------------------------------------------
    
    HEADLINE: 'Good morning, Codogno!': A coronavirus radio station in Italy
    Url: https://www.aljazeera.com/topics/country/italy.html
    ----------------------------------------------------------------------
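
The `print` call in the code above builds each URL by prepending the site root to the href, which only works while the href is root-relative. As a minimal sketch, `urllib.parse.urljoin` from the standard library resolves relative hrefs and leaves already-absolute ones intact. `BASE_URL` and `absolute_url` are illustrative names, not part of the original answer.

```python
from urllib.parse import urljoin

# Site root assumed from the answer's print statement
BASE_URL = "https://www.aljazeera.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; an href that is already absolute keeps its own host and path."""
    return urljoin(base, href)
```

In the loop, `absolute_url(link)` could then replace the hard-coded prefix in the format string.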
    

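    【Discussion】:

    • I would like to solve this with Selenium only, if anyone can help.
    • @Usmankhan I have added a solution that uses Selenium only. Please accept that solution.
    【Solution 2】:

    Using Selenium only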

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Selenium 4 style: the chromedriver path is passed through a Service object
    driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
    driver.get('https://www.aljazeera.com/news/')

    # Click "show more" through JavaScript to load additional articles
    button = driver.find_element(By.ID, 'btn_showmore_b1_418')
    driver.execute_script("arguments[0].click();", button)

    # find_elements (plural) returns every matching container
    div_nodes = driver.find_elements(By.CSS_SELECTOR, 'div.topics-sec-item-cont')

    titleList = []
    urlList = []
    for div in div_nodes:
        # Searching by tag name from `div` stays inside that container
        heading = div.find_element(By.TAG_NAME, 'h2').text
        # get_attribute('href') already returns an absolute URL, so no site prefix is added
        link = div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        titleList.append(heading)
        urlList.append(link)
        print('HEADLINE: %s\nUrl: %s\n' % (heading, link) + '-'*70 + '\n')

    driver.quit()

    Output:

    HEADLINE: Georgia priests bless Tbilisi city in bid to contain COVID-19
    Url: https://www.aljazeera.com/topics/country/georgia.html
    ----------------------------------------------------------------------

    HEADLINE: India's banking crisis: Government rescues fourth-largest bank
    Url: https://www.aljazeera.com/ajimpact
    ----------------------------------------------------------------------

    HEADLINE: Art world's 'cold case': Heist of the century still intrigues
    Url: https://www.aljazeera.com/topics/subjects/art.html
    ----------------------------------------------------------------------

    HEADLINE: Italy's coronavirus death toll surges past 2,500 - Live updates
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: Coronavirus: All you need to know in 500 words
    Url: https://www.aljazeera.com/topics/categories/health.html
    ----------------------------------------------------------------------

    HEADLINE: Timeline: How the new coronavirus spread
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: How long does coronavirus last on surfaces and in air?
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: India's poor testing rate may have masked coronavirus cases
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: Turkey announces first coronavirus death amid jump in cases
    Url: https://www.aljazeera.com/topics/country/turkey.html
    ----------------------------------------------------------------------

    HEADLINE: Footballer Obi Mikel quits Turkish club over coronavirus fears
    Url: https://www.aljazeera.com/topics/categories/sport.html
    ----------------------------------------------------------------------

    HEADLINE: Pakistan PM: 'Cannot afford' to shut down cities over coronavirus
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: Tension, fear as South Africa steps up coronavirus fight
    Url: https://www.aljazeera.com/topics/categories/health.html
    ----------------------------------------------------------------------

    HEADLINE: China to expel more US journalists in escalating row over media
    Url: https://www.aljazeera.com/topics/country/china.html
    ----------------------------------------------------------------------

    HEADLINE: High treatment costs stop Americans from testing for coronavirus
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------

    HEADLINE: Saudi Arabia urges G20 virtual talk on coronavirus, shuts mosques
    Url: https://www.aljazeera.com/topics/events/coronavirus-outbreak.html
    ----------------------------------------------------------------------
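
For reference, the loop in the question repeats the first article because every XPath it passes to `find_element` starts with `//`. XPath evaluates such a path from the document root even when the call is made on a container element, so each iteration returns the page's first match. Prefixing the path with `.` makes it relative to the container. The sketch below assumes the `div/a/h2` layout implied by the question's own XPath. `extract_articles` and `XPATH` are illustrative names, and the string `"xpath"` is the value of Selenium's `By.XPATH`.

```python
# Same value as selenium.webdriver.common.by.By.XPATH, spelled out here so
# the helper itself does not need Selenium imported.
XPATH = "xpath"


def extract_articles(containers):
    """Return a list of (title, url) pairs, one per container element.

    `containers` is assumed to be the list returned by
    driver.find_elements(By.CSS_SELECTOR, "div.topics-sec-item-cont").
    """
    articles = []
    for container in containers:
        # "./a" is resolved relative to this container. A path starting
        # with "//" would restart at the document root instead.
        link = container.find_element(XPATH, "./a")
        heading = container.find_element(XPATH, "./a/h2")
        articles.append((heading.text, link.get_attribute("href")))
    return articles
```

With a live driver, this would be called as `extract_articles(driver.find_elements(By.CSS_SELECTOR, 'div.topics-sec-item-cont'))`.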
    

    【Discussion】:
