【Title】: Python Scrapy code prints out the file I am reading from
【Posted】: 2017-05-12 05:23:10
【Question】:

I have written some Python code with Scrapy to extract addresses from a website.

The first part of the code assembles the start_urls by reading latitude and longitude coordinates from a separate file, googlecoords.txt, which then form part of each start URL. (I prepared the googlecoords.txt file beforehand; it converts UK postcodes into Google Maps coordinates.)

So, for example, the first item in the start_urls list is "https://www.howdens.com/process/searchLocationsNear.php?lat=53.674434&lon=-1.4908923&distance=1000&units=MILES", where "lat=53.674434&lon=-1.4908923" comes from the googlecoords.txt file.

However, when I run the code it works fine, except that it first prints out the googlecoords.txt file, which I don't want.

How do I stop this printing from happening? (Although I can live with it.)

import scrapy
import sys

from scrapy.http import FormRequest, Request
from Howdens.items import HowdensItem

class howdensSpider(scrapy.Spider):
    name = "howdens"
    allowed_domains = ["www.howdens.com"]

    # read the file that has a list of google coordinates that are converted from postcodes
    with open("googlecoords.txt") as f:
        googlecoords = [x.strip('\n') for x in f.readlines()]

    # from the google coordinates build the start URLs
    start_urls = []
    for coords in googlecoords:
        start_urls.append("https://www.howdens.com/process/searchLocationsNear.php?{}&distance=1000&units=MILES".format(coords))

    # cycle through 6 of the first relevant items returned in the text
    def parse(self, response):
        for sel in response.xpath('/html/body'):
            for i in range(0,6):
                try:
                    item = HowdensItem()
                    item['name'] = sel.xpath('.//text()').re(r'(?<="name":")(.*?)(?=","street")')[i]
                    item['street'] = sel.xpath('.//text()').re(r'(?<="street":")(.*?)(?=","town")')[i]
                    item['town'] = sel.xpath('.//text()').re(r'(?<="town":")(.*?)(?=","pc")')[i]
                    item['pc'] = sel.xpath('.//text()').re(r'(?<="pc":")(.*?)(?=","state")')[i]
                    yield item
                except IndexError:
                    pass

【Comments】:

  • The data is JSON... parse it with a JSON parser instead...
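As the comment suggests, the response body is JSON, so the regex-over-text approach in the question can be replaced by `json.loads`. A minimal standalone sketch, using a sample body whose structure is an assumption pieced together from the question's regexes ("name", "street", "town", "pc", "state" keys) and the answer below ("response"/"depots" nesting), not confirmed from the live API:

```python
import json

# Sample body mimicking what the question's regexes imply the API
# returns: depot objects with "name", "street", "town", "pc" keys.
# The exact structure is an assumption for illustration only.
body = ('{"response": {"depots": [{"name": "Howdens - Wakefield", '
        '"street": "1 Example Road", "town": "Wakefield", '
        '"pc": "WF1 1AA", "state": ""}]}}')

data = json.loads(body)
for depot in data["response"]["depots"]:
    # each field is a plain dict lookup, no lookbehind regexes needed
    print(depot["name"], depot["town"], depot["pc"])
```

With real responses, `json.loads(response.text)` inside the spider's callback gives the same dictionaries without any XPath or regular expressions.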

Tags: python web-scraping scrapy


【Solution 1】:

As someone pointed out in the comments, the response is JSON, so you should load it with the json module in your parse() method:

import scrapy
import json

class MySpider(scrapy.Spider):
    start_urls = ['https://www.howdens.com/process/searchLocationsNear.php?lat=53.674434&lon=-1.4908923&distance=1000&units=MILES']

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        items = data['response']['depots'] 
        for item in items:
            url_template = "https://www.howdens.com/process/searchLocationsNear.php?lat={}&lon={}&distance=1000&units=MILES"
            url = url_template.format(item['lat'], item['lon'])  # fill in both coordinates of your location here
            yield scrapy.Request(url, self.parse_item)

    def parse_item(self, response): 
        print(response.url)
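The follow-up URL needs both coordinates in its query string, matching the URLs in the question. The construction can be sketched outside of Scrapy with plain json; the depot keys 'lat' and 'lon' in the sample payload are assumed from the question's query string, not confirmed from the live API:

```python
import json

# Payload mimicking the assumed shape of the searchLocationsNear.php
# response; the depot keys are an assumption for illustration.
payload = '{"response": {"depots": [{"lat": "53.674434", "lon": "-1.4908923"}]}}'

url_template = ("https://www.howdens.com/process/searchLocationsNear.php?"
                "lat={lat}&lon={lon}&distance=1000&units=MILES")

data = json.loads(payload)
urls = [url_template.format(lat=d["lat"], lon=d["lon"])
        for d in data["response"]["depots"]]
print(urls[0])
```

Inside the spider, each such URL would then be handed to `scrapy.Request(url, self.parse_item)` exactly as in the answer's parse() method.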

【Discussion】:
