【Question title】: How to solve Scrapy and Selenium Uncaught ReferenceError?
【Posted】: 2023-03-27 21:25:01
【Question】:

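I am trying to scrape a website with Selenium from inside Scrapy. I use Selenium to find the URL that Scrapy should start from, but the problem appears when I try to return start_urls with the following code: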

Spider.py (start_urls):

    @property
    def start_urls(self):
        url = 'https://www.adana.bel.tr/home/hal_listesi' #The URL that script will scrape
        opts = Options() #Set options for headless and user-agent etc.
        #opts.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36")
        opts.add_argument('headless')#It helps to work without opening browser
        driver = webdriver.Chrome(options=opts,executable_path="chromedriver.exe")
        driver.get(url) #URL starts

        self.day = int(driver.find_element_by_xpath("/html/body/div/div[3]/div/div[2]/main/div/div/div/div/div/div/div[1]/div[1]/span[1]").text)
        self.month = months[driver.find_element_by_xpath("/html/body/div/div[3]/div/div[2]/main/div/div/div/div/div/div/div[1]/div[1]/span[2]").text]
        self.year = int(driver.find_element_by_xpath("/html/body/div/div[3]/div/div[2]/main/div/div/div/div/div/div/div[1]/div[1]/span[3]").text)

        product_list =driver.find_element_by_xpath("/html/body/div/div[3]/div/div[2]/main/div/div/div/div/div/div/div[1]/div[3]/a/img")
        product_list.click()
        new_url = driver.current_url
        driver.quit()  # call the method; a bare driver.quit never closes the browser
        return [new_url]

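I store the date on self (self.day, self.month, self.year) because I have to read it from the first page.

The property runs 3 times and prints the output below 3 times. It takes far too long, and I don't understand why it keeps reporting the error.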

Output:

DevTools listening on ws://127.0.0.1:50639/devtools/browser/cd5830e4-5a11-4f28-a12d-cb605e96075d
[1103/153027.438:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/home/hal_listesi' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/home/hal_listesi (54)
[1103/153032.344:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/hal-detay/396' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/hal-detay/396 (54)
[1103/153032.391:INFO:CONSOLE(1520)] "Uncaught ReferenceError: $ is not defined", source: https://www.adana.bel.tr/hal-detay/396 (1520)

DevTools listening on ws://127.0.0.1:50673/devtools/browser/e664e7e2-1c13-4128-bb20-a3df6437d2c7
[1103/153035.939:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/home/hal_listesi' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/home/hal_listesi (54)
[1103/153038.668:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/hal-detay/396' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/hal-detay/396 (54)
[1103/153038.710:INFO:CONSOLE(1520)] "Uncaught ReferenceError: $ is not defined", source: https://www.adana.bel.tr/hal-detay/396 (1520)

DevTools listening on ws://127.0.0.1:50707/devtools/browser/5fcb91e4-a076-4aa2-9173-7fd3565f741f
[1103/153042.020:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/home/hal_listesi' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/home/hal_listesi (54)
[1103/153045.407:INFO:CONSOLE(54)] "Mixed Content: The page at 'https://www.adana.bel.tr/hal-detay/396' was loaded over HTTPS, but requested an insecure stylesheet 'http://netdna.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css'. This request has been blocked; the content must be served over HTTPS.", source: https://www.adana.bel.tr/hal-detay/396 (54)
[1103/153045.459:INFO:CONSOLE(1520)] "Uncaught ReferenceError: $ is not defined", source: https://www.adana.bel.tr/hal-detay/396 (1520)

Setting.py:

BOT_NAME = 'first_bot'

SPIDER_MODULES = ['first_bot.spiders']
NEWSPIDER_MODULE = 'first_bot.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 3

ITEM_PIPELINES = {
    'first_bot.pipelines.FirstBotPipeline': 300,
}

So how can I solve this problem? The whole thing takes about 30 seconds, which is too long for a single URL.
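
A note on the repeated output: the body of a Python @property runs every time the attribute is read. If start_urls is read more than once (the three "DevTools listening" blocks suggest three reads, though that is an inference from the log, not something confirmed here), each read starts a new browser. A minimal sketch of that behaviour, using a hypothetical FakeSpider class that involves neither Scrapy nor Selenium:

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; not a real Scrapy class."""

    def __init__(self):
        self.launches = 0  # counts how often the expensive work runs

    @property
    def start_urls(self):
        # Stands in for starting Chrome and clicking through the page.
        self.launches += 1
        return ["https://www.adana.bel.tr/hal-detay/396"]


spider = FakeSpider()
for _ in range(3):
    spider.start_urls  # each read re-runs the property body

print(spider.launches)  # 3
```

Under that assumption, computing the value once (for example with functools.cached_property, or by doing the work in the spider's start_requests method) would avoid the repeated browser launches.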

【Question comments】:

    Tags: python selenium web-scraping scrapy


    【Solution 1】:

    I changed the start URL from inside the parse callback by yielding a new request:

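        # Assumes at module level: import scrapy; from datetime import datetime, timedelta
        def parse(self, response):
            # The listing shows one entry per day; look for yesterday's entry.
            now_date = datetime.today() - timedelta(days=1)
            self.day = response.xpath("//*[@class='day']/text()").extract()
            self.month = response.xpath("//*[@class='month']/text()").extract()
            self.year = response.xpath("//*[@class='year']/text()").extract()
            links = response.xpath("//*[@class='indir']/a/@href").extract()

            url = None  # stays None when no entry matches yesterday's day
            for count, check in enumerate(self.day):
                if now_date.day == int(check):
                    url = links[count]
                    self.curt_day = self.day[count]
                    self.curt_month = self.month[count]
                    self.curt_year = self.year[count]

            if url is None:
                return  # nothing to follow; avoids using an unassigned url

            # Follow the detail page with Scrapy itself; no browser is started here.
            yield scrapy.Request(
                response.urljoin(url), callback=self.parse_contractors)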
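
The matching step in the callback above can be isolated into a small pure function, which makes it easy to check without running a crawl. This is only a sketch: the name pick_link_for_day and the sample lists are hypothetical, and unlike the loop in the answer it returns the first match instead of the last.

```python
from typing import Optional, Sequence


def pick_link_for_day(
    days: Sequence[str], links: Sequence[str], target_day: int
) -> Optional[str]:
    """Return the link whose day label equals target_day, or None if absent."""
    for label, link in zip(days, links):
        if int(label) == target_day:
            return link
    return None


# Hypothetical sample data standing in for the extracted XPath results.
days = ["01", "02", "03"]
links = ["/hal-detay/1", "/hal-detay/2", "/hal-detay/3"]

print(pick_link_for_day(days, links, 2))  # /hal-detay/2
print(pick_link_for_day(days, links, 9))  # None
```

Returning None when no day matches lets the caller decide whether to skip the request, which is the case the original loop did not handle.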
    

    【Comments】:
