【Question title】: Selenium printing same information repeatedly
【Posted】: 2020-01-15 12:25:49
【Question】:

Hello, I am trying to scrape some data from a website that keeps its data in 'dl' tags. This is what the site structure looks like:

<div class="ThatsThem-record-overview col-md-5">
<h2><span itemprop="name">Donald Duck</span></h2>
<dl class="row">
</dd>
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">myemail.com</dd>
</dl>
<div class="ThatsThem-record-overview col-md-5">
<h2><span itemprop="name">Mickey mouse</span></h2>
<dl class="row">
</dd>
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">youremail.com</dd>
</dl>
... data goes on but value differs 

To scrape this I am using Selenium.

My scraping code:

for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):  # here I'm scraping the name
   #print(Style.RESET_ALL)
   print(Fore.RED + element.text + Style.RESET_ALL)
   #print(Style.RESET_ALL)
   time.sleep(1)
   dl = driver.find_element_by_tag_name('dl')  # scraping data under the dl tag
   print(dl.text)
   print('-----------------------')  # separator

What happens is that whenever I run the program, it prints the same dl content for every name, like this:

donald duck
Email
myemail.com
-------------
mickey mouse
Email
myemail.com

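I have already tried iterating over the dl elements in a for loop, the same way I print the names, but that also prints other things I don't want.

What can I do?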

【Comments】:

  • What is the extra data you are getting that you don't want to print?

Tags: python python-3.x selenium


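【Solution 1】:

driver.find_element_by_tag_name('dl') searches the whole page, so it will always return the first matching element. You need to search from element instead, to locate each record's own <dl>: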

for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):
    dl = element.find_element_by_tag_name('dl') # scraping data under dl tag 
    print(dl.text)

Or locate those elements directly:

for element in driver.find_elements_by_css_selector('.ThatsThem-record-overview dl'):
    print(element.text)
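
The difference between a page-wide lookup and a lookup scoped to one record can be reproduced without a browser. The sketch below is not Selenium: it uses Python's standard xml.etree.ElementTree as a stand-in, and PAGE is a simplified, well-formed version of the markup from the question (an assumption, since the real page is not shown in full).

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page in the question (assumed structure).
PAGE = """
<body>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Donald Duck</span></h2>
    <dl class="row"><dt>Email</dt><dd>myemail.com</dd></dl>
  </div>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Mickey mouse</span></h2>
    <dl class="row"><dt>Email</dt><dd>youremail.com</dd></dl>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)
records = root.findall("./div")

# Pattern from the question: search from the document root inside the loop.
# find() returns the first match in the whole tree on every iteration.
global_hits = [root.find(".//dl/dd").text for _ in records]

# Pattern from the answer: search from the current record only.
scoped_hits = [record.find(".//dl/dd").text for record in records]

print(global_hits)  # the first record's email, repeated
print(scoped_hits)  # one email per record
```

In Selenium the scoped form is the element.find_element_by_tag_name('dl') call shown above. On Selenium 4.3 and later the find_element_by_* helpers no longer exist, and the equivalent call is element.find_element(By.TAG_NAME, 'dl').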

【Comments】:

    【Solution 2】:

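    You seem to be close. Using the class record-overview should already have fetched all the required data. However, it would be better to target the individual name and email by traversing through the child tags. Moreover, inducing WebDriverWait instead of a fixed time.sleep() lets the script continue as soon as the elements are visible.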

    So ideally you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

    • Using CSS_SELECTOR:

      names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.record-overview>h2>span")))]
      emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.record-overview dl.row dd")))]
      for name, email in zip(names, emails):
          print("{} Email is {}".format(name, email))
      
    • Using XPATH:

      names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]/h2/span")))]
      emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]//dl[@class='row']//dd")))]
      for name, email in zip(names, emails):
          print("{} Email is {}".format(name, email))
      
    • Note: You have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
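
One thing to keep in mind with the zip(names, emails) approach: zip pairs items purely by position, so it lines up only when every record contributes exactly one dd. The sketch below illustrates this with the standard xml.etree.ElementTree module as a stand-in for the browser (not Selenium). The page, the extra "Phone" row, and its value are hypothetical and exist only to show the misalignment.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the first record has an extra "Phone" row.
PAGE = """
<body>
  <div class="record-overview">
    <h2><span>Donald Duck</span></h2>
    <dl class="row">
      <dt>Phone</dt><dd>555-0100</dd>
      <dt>Email</dt><dd>myemail.com</dd>
    </dl>
  </div>
  <div class="record-overview">
    <h2><span>Mickey mouse</span></h2>
    <dl class="row"><dt>Email</dt><dd>youremail.com</dd></dl>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)

# Two flat, page-wide lists zipped by position: the extra row shifts every
# later value, so Mickey ends up paired with Donald's email.
names = [span.text for span in root.findall("./div/h2/span")]
flat_values = [dd.text for dd in root.findall("./div/dl/dd")]
paired_by_position = list(zip(names, flat_values))


def fields(record):
    """Map each dt label to its dd value inside a single record."""
    dl = record.find("dl")
    return {dt.text: dd.text for dt, dd in zip(dl.findall("dt"), dl.findall("dd"))}


# Per-record extraction keeps each name with its own fields.
per_record = {rec.find("h2/span").text: fields(rec) for rec in root.findall("./div")}

print(paired_by_position)
print(per_record)
```

Reading the label from each dt also makes it possible to pick the "Email" row by name rather than by position.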
      

    【Comments】:
