【Question Title】: How to Serialize Scrapy Fields that are Lists of Items in XML Exporter
【Posted】: 2019-11-28 07:51:54
【Question】:

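I have built complex items in which a field may be a list of other item types. When I export them with the default XmlItemExporter, each sub-list item is wrapped in a <value> tag. I am looking for an example of how to assign the sub-item's identifier to these value tags instead.

The Item Exporters page of the documentation explains this with the sentence:

Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.

The documentation page also gives simple examples of declaring a serializer in the field and of overriding the serialize_field() method, but both apply to single-valued fields, with no suggestion of how to customize them for multi-valued fields.

I have searched the web for an example of how this can be done, but have not found any.

Here is the sample item tree I am using for testing: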

import scrapy

class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field()

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field()

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()

course = Course()
course['title'] = 'Greatness'
course['lessons'] = []

lesson = Lesson()
lesson['session'] = 'Week 1'
lesson['topic'] = 'Think Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 1'
reading['pages'] = '1-20'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 2'
lesson['topic'] = 'Act Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 2'
reading['pages'] = '21-40'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 3'
lesson['topic'] = 'Look Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 3'
reading['pages'] = '41-60'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 4'
lesson['topic'] = 'Be Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 4'
reading['pages'] = '61-80'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

Output:

>>> course
{'lessons': [{'assignment': [{'pages': '1-20', 'textBook': 'Great Book 1'}],
              'session': 'Week 1',
              'topic': 'Think Great'},
             {'assignment': [{'pages': '21-40', 'textBook': 'Great Book 2'}],
              'session': 'Week 2',
              'topic': 'Act Great'},
             {'assignment': [{'pages': '41-60', 'textBook': 'Great Book 3'}],
              'session': 'Week 3',
              'topic': 'Look Great'},
             {'assignment': [{'pages': '61-80', 'textBook': 'Great Book 4'}],
              'session': 'Week 4',
              'topic': 'Be Great'}],
 'title': 'Greatness'}

When I run it through XmlItemExporter, I get:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <value>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </value>
        </assignment>
      </value>
    </lessons>
  </course>
</items>

What I want to do is change those <value> tags to the name of the item that was appended to the list. Like this:

<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <lesson>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </reading>
        </assignment>
      </lesson>
    </lessons>
  </course>
</items>

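【Comments】:

  • Sorry, at first I missed that Scrapy's exporters have a fatal flaw when it comes to exporting nested items (they do not recurse into the fields of nested objects to honor their serializers!). I have updated my answer to properly take this into account.

Tags: python xml serialization scrapy exporter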


【Solution 1】:

This is indeed not well documented; we have to resort to reading the XmlItemExporter source code, where it turns out that the <value> tag name is hardcoded in the XmlItemExporter._export_xml_field() method:

elif is_listlike(serialized_value):
    self._beautify_newline()
    for value in serialized_value:
        self._export_xml_field('value', value, depth=depth+1)
    self._beautify_indent(depth=depth)

Fortunately, there is a way out, in the lines just before that:

if hasattr(serialized_value, 'items'):
    self._beautify_newline()
    for subname, value in serialized_value.items():
        self._export_xml_field(subname, value, depth=depth+1)
    self._beautify_indent(depth=depth)

This is meant to handle dictionaries, but it will actually accept anything that has an .items() method returning tuples of a string and a value!
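
To see why that branch helps, here is a minimal stdlib-only sketch of the same dispatch. It is not Scrapy's code: export_field(), to_xml() and RepeatedPairs are hypothetical names made up for illustration, and only the branching on .items() versus list-like values mirrors the excerpts above.

```python
import io
from xml.sax.saxutils import XMLGenerator


class RepeatedPairs:
    """Hypothetical helper: not a dict, but it does have .items()."""

    def __init__(self, name, values):
        self._name = name
        self._values = values

    def items(self):
        # The same element name is yielded once per value.
        for value in self._values:
            yield self._name, value


def export_field(xg, name, value):
    """Simplified stand-in for the dispatch in _export_xml_field()."""
    xg.startElement(name, {})
    if hasattr(value, "items"):
        # Mapping-like branch: each pair becomes <subname>...</subname>.
        for subname, subvalue in value.items():
            export_field(xg, subname, subvalue)
    elif isinstance(value, (list, tuple)):
        # List-like branch: the element name is hardcoded to "value".
        for subvalue in value:
            export_field(xg, "value", subvalue)
    else:
        xg.characters(str(value))
    xg.endElement(name)


def to_xml(name, value):
    buffer = io.StringIO()
    export_field(XMLGenerator(buffer, encoding="utf-8"), name, value)
    return buffer.getvalue()


plain = to_xml("lessons", ["Week 1", "Week 2"])
named = to_xml("lessons", RepeatedPairs("lesson", ["Week 1", "Week 2"]))
print(plain)  # <lessons><value>Week 1</value><value>Week 2</value></lessons>
print(named)  # <lessons><lesson>Week 1</lesson><lesson>Week 2</lesson></lessons>
```

A real dict could not produce the second output, because its keys are unique; an object that only imitates .items() can repeat the same name as often as needed.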

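However, one important step is missing from the exporters: recursion. You can essentially only set the serializer flag on the fields of the top-level item; the current Scrapy implementation completely ignores any Field() elements on Item subclasses below the top-level item. Each exporter has its own peculiarities in how it drives the internal BaseItemExporter._get_serialized_fields() method, so we cannot handle recursion up front, because each specific exporter (JSON, XML, etc.) differs in how it needs fields to be serialized. We can work around this with a subclass of the XmlItemExporter class, more on that below.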

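So the first trick here is to create a dedicated object that has an .items() method and gives you the <container> tags. Note that you have to handle recursion in the serialization yourself! The Scrapy serializers do not themselves recurse into nested structures: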

class CustomXMLValuesSerializer:
    @classmethod
    def serialize_as(cls, name):
        def serializer(items, serialize):
            return cls(name, items, serialize)
        return serializer

    def __init__(self, name, items, serialize=None):
        self._name = name
        self._items = items
        self._serialize = serialize if serialize is not None else lambda x: x

    def items(self):
        for item in self._items:
            yield (self._name, self._serialize(item))
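
As a quick sanity check of what this object hands to the exporter, the sketch below calls it outside of Scrapy. The class is a condensed copy of the one above so that the snippet runs on its own; the plain dicts and the upper-casing callback are made-up stand-ins for Lesson items and for the exporter's recursive serialization.

```python
class CustomXMLValuesSerializer:
    # Condensed copy of the class above so the snippet is self-contained.
    @classmethod
    def serialize_as(cls, name):
        def serializer(items, serialize):
            return cls(name, items, serialize)
        return serializer

    def __init__(self, name, items, serialize=None):
        self._name = name
        self._items = items
        self._serialize = serialize if serialize is not None else lambda x: x

    def items(self):
        for item in self._items:
            yield (self._name, self._serialize(item))


# Plain dicts stand in for Lesson items (assumption for illustration).
lessons = [{"topic": "Think Great"}, {"topic": "Act Great"}]

# The factory returns a two-argument serializer: (values, callback).
serializer = CustomXMLValuesSerializer.serialize_as("lesson")
wrapped = serializer(lessons, lambda d: {k: v.upper() for k, v in d.items()})

pairs = list(wrapped.items())
print(pairs)
# [('lesson', {'topic': 'THINK GREAT'}), ('lesson', {'topic': 'ACT GREAT'})]
```

Every pair carries the same name, and the callback has already been applied to each element by the time the exporter sees it.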

Then use the CustomXMLValuesSerializer.serialize_as() class method to create custom serializers for your list fields:

class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("lesson")
    )

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("reading")
    )

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()

Finally, we need a slightly customized exporter that actually lets us recurse into nested items:

import scrapy
from scrapy.exporters import XmlItemExporter

class RecursingXmlItemExporter(XmlItemExporter):
    def _recursive_serialized_fields(self, item):
        if isinstance(item, scrapy.Item):
            return dict(self._get_serialized_fields(item, default_value=''))
        return item

    def serialize_field(self, field, name, value):
        serializer = field.get('serializer', lambda x: x)
        try:
            return serializer(value, self._recursive_serialized_fields)
        except TypeError:
            return serializer(value)

Note that this passes in default_value='', because that's what the base XmlItemExporter.export_item() implementation uses.

Make sure to use this custom exporter, as it passes in the context needed to serialize nested items:

exporter = RecursingXmlItemExporter(some_file, indent=2, item_element='course')
exporter.start_exporting()
exporter.export_item(course)
exporter.finish_exporting()

Now the containers are actually exported using the name string as the container element:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <lesson>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </reading>
        </assignment>
      </lesson>
    </lessons>
  </course>
</items>

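I have filed issue #3888 with the Scrapy project to see whether there is interest in better supporting nested Item structures.

An alternative would be to export the nested items with separate calls to the XmlItemExporter.export_item() method, but this requires either that the exporter is accessible as a global in the same namespace as the serializer, or that you subclass the exporter to pass the exporter into the serializer. You would then also have to contend with the fact that XmlItemExporter.export_item() hardcodes the indentation.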

【Comments】:

  • In the CustomXMLValuesSerializer class, shouldn't the lambda function be in parentheses? self._serialize = serialize or (lambda x: x)
  • @jox58 Ah, yes, I had already fixed this locally with a conditional expression.
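
For readers wondering what this exchange is about: a bare lambda cannot be the right-hand operand of `or`, so the unparenthesized form does not parse at all. A small sketch that checks the three spellings with compile(); the helper name parses() is made up for this example.

```python
def parses(source):
    """Return True if the expression compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


bare = parses("serialize or lambda x: x")
parenthesized = parses("serialize or (lambda x: x)")
conditional = parses("serialize if serialize is not None else lambda x: x")

print(bare, parenthesized, conditional)  # False True True
```

The two working forms are not quite equivalent: `or` substitutes the default for any falsy value, whereas the conditional expression only does so for None.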