Personally, I prefer to use Scrapy together with Selenium, dockerizing each in its own container. That way both are trivial to install, and you can crawl pretty much any modern website that relies on JavaScript in one form or another. Here is an example:
Create your project with scrapy startproject and write your spider; the skeleton can be as simple as this:
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])

    def parse(self, response):
        # do stuff with results, scrape items etc. (see the parse sketch just below)
        # for now we're just checking that everything worked
        print(response.body)
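Once the middleware shown further down hands back the rendered page, parse receives an ordinary HtmlResponse, so the usual Scrapy selectors work on the JavaScript-generated markup. As a rough sketch of what the "do stuff with results" part could look like (the CSS selectors and item fields are placeholders, not from the original project):

    def parse(self, response):
        # hypothetical selectors; adjust them to whatever the target site renders
        for row in response.css('div.result'):
            yield {
                'title': row.css('h2::text').get(),
                'link': row.css('a::attr(href)').get(),
            }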
The real magic happens in middlewares.py. Override two methods of the downloader middleware, __init__ and process_request, like this:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()

        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):
        self.driver.get(request.url)

        # sleep a bit so the page has time to load, or monitor elements on the page
        # to continue as soon as it is ready (see the explicit-wait sketch after this block)
        sleep(4)

        # if you need to manipulate the page content, e.g. clicking or scrolling, do it here
        # self.driver.find_element_by_css_selector('.my-class').click()

        # you only need the now fully rendered html of the page to get your results
        body = deepcopy(self.driver.page_source)

        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
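If a fixed sleep(4) is too blunt, Selenium's explicit waits let process_request continue as soon as the page is actually ready. A minimal sketch of what could replace the sleep call ('.results-loaded' is just a placeholder for whatever element signals that your page has finished rendering):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# inside process_request, instead of sleep(4):
# wait up to 10 seconds for an element that only appears once rendering is done
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.results-loaded'))
)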
Don't forget to enable this middleware by uncommenting the following lines in your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
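The middleware above imports signals but never connects them; if you also want the Remote driver to shut down cleanly when the crawl ends, a common Scrapy pattern (a sketch, not part of the original code) is to hook spider_closed in from_crawler:

    # additional methods on SampleProjectDownloaderMiddleware
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        # quit the remote browser session when the spider finishes
        self.driver.quit()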
Next comes dockerization. Create your Dockerfile from a lightweight image (I'm using python Alpine here), copy your project directory into it, and install the requirements:
# Use an official Python runtime as a parent image
FROM python:3.6-alpine
# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev
WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /my_scraper/
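The Dockerfile installs everything from requirements.txt; that file isn't shown above, but for this setup it only needs the two libraries the project actually imports (whether and how to pin versions is up to you):

scrapy
selenium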
And finally, the docker-compose.yaml:
version: '2'

services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null
Run docker-compose up -d. If you are doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and build your scraper image.
Once it's done, you can check with docker ps that your containers are running, and check that the name of the selenium container matches the environment variable we passed to the scraper container (here, SELENIUM_LOCATION=samplecrawler_selenium_1).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.
The whole thing is on my github page and you can get it from here.