[Question]: How to save scrapy crawl data to 2 related tables in postgresql django?
[Posted]: 2022-01-11 15:32:22
[Question description]:

I am using SQLAlchemy to store data scraped by my spider into a PostgreSQL Django database, but I have a problem with the Image field.

Because Image and Product live in two different tables, I don't know how to handle the pipelines.

My code works fine without the Image field.

My code is as follows:

Django

models.py

# models.py - django
from django.db import models

class Product(models.Model):
    title = models.CharField(max_length=200)
    slug = models.SlugField(max_length=200, unique=True)

class ProductImage(models.Model):
    product = models.ForeignKey(Product, on_delete=models.CASCADE, related_name="product_image")
    image = models.ImageField(upload_to="product_images/", default="/no_image.jpg")

Scrapy:

items.py

# items.py - scrapy
import scrapy

class ShoppProduct(scrapy.Item):
    slug = scrapy.Field()
    title = scrapy.Field()
    image = scrapy.Field()

pipelines.py

# pipelines.py
from sqlalchemy.orm import sessionmaker
from .models import Shop_Product, create_items_table, db_connect

class ShoppSpinderPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_items_table(engine)
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def process_item(self, item, spider):
        item_exists = self.session.query(Shop_Product).filter_by(slug=item['slug']).first()

        if item_exists:
            item_exists.title = item['title']
            print('SP {} updated.'.format(item['title']))
        else:
            new_item = Shop_Product(**item)
            self.session.add(new_item)
            print('Item {} created.'.format(item['title']))
        return item

    def close_spider(self, spider):
        try:
            self.session.commit()
        except:
            self.session.rollback()
            raise
        finally:
            self.session.close()

models.py - Sqlalchemy

# models.py - Sqlalchemy
from sqlalchemy import Column, Integer, String, create_engine, Boolean, DateTime, ForeignKey, Table, event
from sqlalchemy.orm import declarative_base, relationship  # needed for DeclarativeBase below
# ...
from . import settings

DeclarativeBase = declarative_base()

# ...

class Shop_Product(DeclarativeBase):
    __tablename__ = "shop_product"
    id = Column(Integer, primary_key=True)
    slug = Column("slug", String, unique=True)
    title = Column("title", String)
    
class Shop_ProductImage(DeclarativeBase):
    __tablename__ = "shop_productimage"
    id = Column(Integer, primary_key=True)
    product_id = Column("product_id", Integer) # Id of product
    image = Column("image", String)
    

What relationship do I need to add in the models.py file (SQLAlchemy)?

[Question discussion]:

Tags: postgresql django-models sqlalchemy scrapy scrapy-pipeline


[Solution 1]:

A preliminary solution to the above problem consists of:

  • Adding a relationship in models.py
  • Modifying pipelines.py as follows:

models.py - Sqlalchemy

# models.py - Sqlalchemy
class Shop_Product(DeclarativeBase):
    __tablename__ = "shop_product"
    id = Column(Integer, primary_key=True)
    slug = Column("slug", String, unique=True)
    title = Column("title", String)
    
class Shop_ProductImage(DeclarativeBase):
    __tablename__ = "shop_productimage"
    id = Column(Integer, primary_key=True)
    product_id = Column("product_id", Integer, ForeignKey("shop_product.id")) # FK to product; required for the relationship below
    image = Column("image", String)
    # relationship
    products = relationship('Shop_Product', backref='image1') # add new. image1 will be used in pipelines.py
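
To see what this `relationship`/`backref` actually does, here is a minimal, self-contained sketch with the same table names, using an in-memory SQLite database instead of PostgreSQL (the ORM behavior is the same):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Shop_Product(Base):
    __tablename__ = "shop_product"
    id = Column(Integer, primary_key=True)
    slug = Column(String, unique=True)
    title = Column(String)

class Shop_ProductImage(Base):
    __tablename__ = "shop_productimage"
    id = Column(Integer, primary_key=True)
    # the ForeignKey is what lets SQLAlchemy join the two tables
    product_id = Column(Integer, ForeignKey("shop_product.id"))
    image = Column(String)
    # many-to-one from image to product; the backref adds a list
    # attribute "image1" on Shop_Product holding that product's images
    products = relationship("Shop_Product", backref="image1")

engine = create_engine("sqlite://")  # in-memory database for the demo
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

pro = Shop_Product(slug="demo", title="Demo product")
pro.image1.append(Shop_ProductImage(image="product_images/demo.jpg"))
session.add(pro)  # adding the product cascades to the appended image
session.commit()

print([i.image for i in session.query(Shop_Product).one().image1])
# -> ['product_images/demo.jpg']
```

Because the ForeignKey sits on `shop_productimage`, the `backref` side (`image1`) is a list: one product can hold many images, and appending to it sets `product_id` automatically on commit.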

pipelines.py

# pipelines.py
from sqlalchemy.orm import sessionmaker
from .models import Shop_Product, Shop_ProductImage, create_items_table, db_connect

class ShoppSpinderPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_items_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        pro = Shop_Product()
        img = Shop_ProductImage()
        pro.slug = item['slug']
        pro.title = item['title']
        pro.regular_price = item['regular_price']    # extra fields from the
        pro.discount_price = item['discount_price']  # full project, not in the items.py above
        img.image = item['image']

        exist_img = session.query(Shop_ProductImage).filter_by(image=img.image).first()
        if exist_img is not None:
            pro.image1.append(exist_img)  # image1 is a list collection, so append rather than assign
        else:
            pro.image1.append(img)

        try:
            session.add(pro)
            session.commit()

        except:
            session.rollback()
            raise

        finally:
            session.close()

        return item

This answer is incomplete because it does not check for duplicate product URLs. If the product already exists in the database, running the spider again raises an error.

I will update the answer when a more complete solution is available.

I asked a follow-up question: https://stackoverflow.com/questions/70376528/how-to-update-price-only-when-product-already-exists-in-pipelines-scrapy
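
One possible way to add the missing duplicate check is to query by slug first and update the existing row instead of inserting a new one. This is a hypothetical sketch (not the accepted fix), again demonstrated against in-memory SQLite, with `process_item` reduced to a plain function:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Shop_Product(Base):
    __tablename__ = "shop_product"
    id = Column(Integer, primary_key=True)
    slug = Column(String, unique=True)
    title = Column(String)

class Shop_ProductImage(Base):
    __tablename__ = "shop_productimage"
    id = Column(Integer, primary_key=True)
    product_id = Column(Integer, ForeignKey("shop_product.id"))
    image = Column(String)
    products = relationship("Shop_Product", backref="image1")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def process_item(item):
    """Insert the product, or update it if the slug already exists."""
    session = Session()
    try:
        pro = session.query(Shop_Product).filter_by(slug=item["slug"]).first()
        if pro is None:
            pro = Shop_Product(slug=item["slug"])  # new product
        pro.title = item["title"]                  # refresh fields either way
        img = session.query(Shop_ProductImage).filter_by(image=item["image"]).first()
        if img is None:
            pro.image1.append(Shop_ProductImage(image=item["image"]))
        session.add(pro)
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
    return item

process_item({"slug": "p1", "title": "First", "image": "a.jpg"})
process_item({"slug": "p1", "title": "Updated", "image": "a.jpg"})  # no error on rerun
s = Session()
print(s.query(Shop_Product).count(), s.query(Shop_Product).one().title)
# -> 1 Updated
```

Running the same item twice no longer violates the unique slug constraint: the second pass updates the title and skips the already-stored image.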

[Discussion]:
