Python Scrapy - populate start_urls from MySQL

【Question title】: Python Scrapy - populate start_urls from MySQL
【Posted】: 2013-12-05 19:00:48
【Question description】:

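I'm trying to populate start_urls in spider.py with a SELECT from a MySQL table. When I run "scrapy runspider spider.py" I get no output; it just finishes without errors.

I have tested the SELECT query in a standalone Python script, and there start_urls is populated with the entries from the MySQL table.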

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb


class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def parse(self, response):
        print self.start_urls

    def populate_start_urls(self, url):
        conn = MySQLdb.connect(
                user='user',
                passwd='password',
                db='scrapy',
                host='localhost',
                charset="utf8",
                use_unicode=True
                )
        cursor = conn.cursor()
        cursor.execute(
            'SELECT url FROM links;'
            )
        rows = cursor.fetchall()

        for row in rows:
            start_urls.append(row[0])
        conn.close()

【Question discussion】:

Tags: python mysql scrapy web-crawler

【Solution 1】:

Do the population in __init__:

def __init__(self):
    super(ProductsSpider,self).__init__()
    self.start_urls = get_start_urls()

This assumes get_start_urls() returns the URLs.
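
The answer leaves get_start_urls() undefined, so here is a minimal sketch of what such a helper might look like. It is an assumption, not the answerer's code: this version takes a DB-API connection as an argument (the answer's call takes none), and it uses sqlite3 with an in-memory table only so the example runs without a MySQL server. The links table and url column come from the question; make_demo_connection and the sample URLs are hypothetical.

```python
import sqlite3  # stand-in for MySQLdb so the sketch runs without a MySQL server


def get_start_urls(conn):
    """Return every URL stored in the links table as a list of strings."""
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT url FROM links")
        # Each row is a 1-tuple, so keep only the first column.
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()


def make_demo_connection():
    """Hypothetical helper: in-memory table shaped like the question's."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT)")
    conn.executemany(
        "INSERT INTO links (url) VALUES (?)",
        [("http://test.com/a",), ("http://test.com/b",)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_connection()
    print(get_start_urls(demo))
    demo.close()
```

In the spider, the same function could presumably be given the connection returned by MySQLdb.connect(...) with the parameters from the question, and its result assigned to self.start_urls inside __init__.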

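【Discussion】:

【Solution 2】:

A better approach is to override the start_requests method.

It can query your database, much like populate_start_urls does, and return a sequence of Request objects.

You only need to rename your populate_start_urls method to start_requests and change the following lines: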

for row in rows:
    yield self.make_requests_from_url(row[0])

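【Discussion】:

Thanks for the reply. It worked; I only had to change def populate_start_urls(self, url): to def start_requests(self):. I've marked this as accepted because it is closest to the code I posted.
If you have 22 million sites to broad-crawl, how would you do this? I suppose you would have to iterate over them 1000 at a time. Could you show how to iterate over them with start_requests?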
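
One possible way to approach the batching question, as a sketch rather than a tested answer: read the rows with the DB-API fetchmany() call inside a generator, so only one batch of rows is held at a time. The function name iter_urls_in_batches and the batch_size default are assumptions made for illustration, and sqlite3 again stands in for MySQLdb so the example runs on its own.

```python
import sqlite3  # stand-in for MySQLdb so the sketch runs without a MySQL server


def iter_urls_in_batches(conn, batch_size=1000):
    """Yield URLs one by one, fetching at most batch_size rows per call."""
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT url FROM links")
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break  # result set exhausted
            for row in rows:
                yield row[0]
    finally:
        cursor.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT)")
    conn.executemany(
        "INSERT INTO links (url) VALUES (?)",
        [("http://test.com/%d" % i,) for i in range(25)],
    )
    for url in iter_urls_in_batches(conn, batch_size=10):
        print(url)
    conn.close()
```

Inside the spider, start_requests could then loop over this generator and yield one Request per URL, in the same way as the for loop in Solution 2. Two caveats: Scrapy consumes start_requests lazily, so the generator is only advanced as the crawl needs more requests, and with MySQLdb the default cursor still buffers the whole result set on the client, so a server-side cursor such as MySQLdb.cursors.SSCursor may be needed for a table of that size.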