【Question title】: Testing consistency with Selenium scraping
【Posted】: 2016-06-08 20:27:31
【Question】:

This script scrapes data about upcoming sports matches from a website (www.oddsportal.com) into a list of dictionaries. It takes under 2.5 minutes to run.

from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import datetime
import time

upcoming = ['http://www.oddsportal.com/basketball/usa/wnba/']
nextgames = []

def rescrape(urls, cs):

    driver = webdriver.PhantomJS(executable_path=r'C:/phantomjs.exe') 
    driver.get('http://www.oddsportal.com/set-timezone/15/')
    # The above link sets the timezone. I believe problem lies here, explicit wait?    
    driver.implicitly_wait(3)

    for url in urls:        
        for i in range(2): 
            # Run the scrape twice within the function; it scrapes the same way both times
            wait = WebDriverWait(driver, 5)            
            driver.get(url)
            # this is to ensure the table with games has appeared            
            try:     
                wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#tournamentTable tr.odd")))
            except TimeoutException:
                continue
            # below is the script to get details from each game            
            for match in driver.find_element_by_css_selector("table#tournamentTable").find_elements_by_tag_name('tr')[3:]:
                try:
                    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
                except:
                    continue

                date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
                kickoff = match.find_element_by_class_name("table-time").text
                # following deals with exceptions to a recognized date format
                if "oday" in date:
                    date = datetime.date.today().strftime("%d %b %Y")
                    event = "Not specified"

                elif "omorrow" in date:
                    date = datetime.date.today() + datetime.timedelta(days=1)
                    date = date.strftime("%d %b %Y")                

                elif "esterday" in date:
                    date = datetime.date.today() + datetime.timedelta(days=-1)
                    date = date.strftime("%d %b %Y")                            
                elif " - " in date:
                    date, event = date.split(" - ", 1)                    


                nextgames.append({
                    "current time": time.ctime(),                
                    "home": home.strip(),
                    "away": away.strip(),
                    "date": date,
                    "time": kickoff.strip()})

                time.sleep(3)
                print len(nextgames)

        print len(nextgames)
    driver.close()
    df = pd.DataFrame(nextgames)
    df.to_csv(cs, encoding='utf-8')
    return df

for i in range(3):
    rescrape(upcoming, 'trial' + str(i) + '.csv')
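
The relative-date handling in the script above could be pulled into a small helper, which makes it easier to test on its own. This is only a sketch: the function name `normalize_date`, the `today` parameter, and the example label formats are assumptions for illustration, not something the site or the original script defines. Unlike the inline version, it always returns an `event` value.

```python
import datetime


def normalize_date(label, today=None):
    """Return (date_string, event) for a date header label from the games table.

    `today` is a hypothetical parameter added so the result is reproducible in tests.
    """
    today = today or datetime.date.today()
    event = "Not specified"
    # Same substring checks as the original script ("oday", "omorrow", "esterday").
    offsets = {"oday": 0, "omorrow": 1, "esterday": -1}
    for marker, days in offsets.items():
        if marker in label:
            day = today + datetime.timedelta(days=days)
            return day.strftime("%d %b %Y"), event
    # Otherwise a label may carry an event name after " - ".
    if " - " in label:
        date, event = label.split(" - ", 1)
        return date.strip(), event.strip()
    return label.strip(), event


# Example with assumed label text and a fixed reference day.
print(normalize_date("Tomorrow, 09 Jun", today=datetime.date(2016, 6, 8)))
```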

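The problem is that setting the timezone with driver.get('http://www.oddsportal.com/set-timezone/15/') doesn't always work. About 20% of the time it reverts to the default GMT timezone. The output below shows wrong dates and times in the third round, after the first two rounds were correct. Note how the last range(2) loop got the times wrong on both passes, but only the second pass has the wrong date, which means the timezone can change within either pass: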

pd.set_option('display.max_colwidth', 10)

    Unnamed: 0       away current time       date       home   time
0           0   Washin...  Wed Ju...     8-Jun-16  Dallas...  20:30
1           1   San An...  Wed Ju...     9-Jun-16  Phoeni...  22:00
2           2   Washin...  Wed Ju...     8-Jun-16  Dallas...  20:30
3           3   San An...  Wed Ju...     9-Jun-16  Phoeni...  22:00
4           4   Washin...  Wed Ju...     8-Jun-16  Dallas...  20:30
5           5   San An...  Wed Ju...     9-Jun-16  Phoeni...  22:00
6           6   Washin...  Wed Ju...     8-Jun-16  Dallas...  20:30
7           7   San An...  Wed Ju...     9-Jun-16  Phoeni...  22:00
8           8   Phoeni...  Wed Ju...     8-Jun-16  Minnes...   0:00
9           9   New Yo...  Wed Ju...     8-Jun-16  Los An...   2:00
10         10   Washin...  Wed Ju...     9-Jun-16  Dallas...   0:30
11         11   San An...  Wed Ju...    10-Jun-16  Phoeni...   2:00

So how can I make sure the timezone .get works every time? Currently I have an implicit wait, and I have tried explicit waits to no avail.
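
One way to detect the drift automatically, rather than by reading the CSV, is to group the scraped rows by fixture and flag any fixture that appears with more than one date/time pair. The sketch below assumes rows shaped like the dictionaries appended to `nextgames`; the function name `find_inconsistent_games` is hypothetical, and the sample rows reuse the truncated team labels from the output above.

```python
from collections import defaultdict


def find_inconsistent_games(rows):
    """Return fixtures that were scraped with more than one (date, time) pair."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["home"], row["away"])].add((row["date"], row["time"]))
    # Keep only fixtures whose kickoff differs between scrapes.
    return {game: sorted(kickoffs) for game, kickoffs in seen.items() if len(kickoffs) > 1}


rows = [
    {"home": "Dallas...", "away": "Washin...", "date": "8-Jun-16", "time": "20:30"},
    {"home": "Dallas...", "away": "Washin...", "date": "9-Jun-16", "time": "0:30"},
    {"home": "Phoeni...", "away": "San An...", "date": "9-Jun-16", "time": "22:00"},
]
print(find_inconsistent_games(rows))
```

Two genuinely different games between the same teams would also be flagged, so this is a check for a short scraping window rather than a general rule.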

【Question comments】:

    Tags: python datetime selenium selenium-webdriver


    【Solution 1】:

    I noticed that the website creates a cookie for the user's timezone. You can take advantage of that by adding the cookie yourself:

    driver.add_cookie({'name': 'op_user_time_zone', 'value': '-4'})
    

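    This should fix the problem.

    If it doesn't work, edit your question to show the code you tried, and check the docs to make sure you are implementing it correctly.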
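
A minimal sketch of how the cookie approach might be wired in. Selenium only accepts a cookie for the domain of the page currently loaded, so the helper visits the site before calling `add_cookie`, then reads the cookie back with `get_cookie` to confirm it was stored. The helper name `set_timezone_cookie` and its parameters are hypothetical; the cookie name and value come from the answer above, and whether the site actually honors that value is not verified here.

```python
def set_timezone_cookie(driver, base_url, offset="-4", name="op_user_time_zone"):
    """Load the site, add the timezone cookie, and report whether it was stored."""
    # The browser must be on the target domain before a cookie can be added for it.
    driver.get(base_url)
    driver.add_cookie({"name": name, "value": offset})
    # get_cookie returns a dict for the named cookie, or None if it is absent.
    cookie = driver.get_cookie(name)
    return cookie is not None and cookie.get("value") == offset
```

It could be called once after creating the driver, for example `set_timezone_cookie(driver, 'http://www.oddsportal.com/')`, and again before each `driver.get(url)` if the cookie turns out to be reset between page loads.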

    【Comments】:
