[Posted]: 2019-09-22 07:10:47
[Question]:
I have read various posts addressing my problem, but it still does not work. Basically, I am using Scrapy together with Selenium to scrape a website. The URLs for that website are currently stored in a text file. The file has a single column, with one URL per line.
I keep getting this error: selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string
This is my current code:
class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    def start_requests(self):
        with open("urls.txt", "rt") as f:
            start_urls = [l.strip() for l in f.readlines()]
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        self.driver.get(start_urls)  # start_urls is a list here, not a string
        self.driver.find_element_by_id('currentTab').click()
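The exception comes from driver.get(), which accepts a single URL string, while start_urls is a Python list. A minimal, self-contained sketch of the file-reading step that shows why the loop is needed (the two URLs below are placeholders for illustration only):

```python
# Write a small urls.txt for demonstration (placeholder URLs, not real pages).
with open("urls.txt", "w") as f:
    f.write("https://blogabet.com/user1\nhttps://blogabet.com/user2\n")

# Read it back the same way the spider does: one URL string per line.
with open("urls.txt", "rt") as f:
    start_urls = [l.strip() for l in f if l.strip()]

# driver.get(start_urls) would raise InvalidArgumentException, because
# start_urls is a list and get() expects a single string. Visit each URL
# one at a time instead:
for url in start_urls:
    assert isinstance(url, str)
    # self.driver.get(url)  # pass one URL string per call
```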
[UPDATE]
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import re
import csv


class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [l.strip() for l in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            yield Request(self.driver.current_url, callback=self.crawltips)

    def crawltips(self, response):
        sel = Selector(text=self.driver.page_source)
        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
        for post in allposts:
            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
            yield {'Username': username,
                   'Publish date': publish_date}
[Discussion]:
标签: python selenium selenium-webdriver scrapy web-crawler