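【Posted】: 2021-02-12 19:17:20
【Question】:
I am building a web scraper to extract product information from a product link.
The URL is: https://scrapingclub.com/exercise/detail_header/
Using Chrome DevTools, I found the URL of the HTTP request that loads the product details.
Here is my code: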
class quoteSpider(scrapy.Spider):
    name = 'Practice'
    start_urls = ['https://scrapingclub.com/exercise/detail_header/']

    def parse(self, response):
        yield scrapy.Request(
            'https://scrapingclub.com/exercise/ajaxdetail_header/',
            callback=self.parse_detail,
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'es-ES,es;q=0.9,pt;q=0.8',
                'Connection': 'keep-alive',
                'Cookie': '__cfduid=da54d7e9c59cf35860825eabc96d7f1c41612805624; _ga=GA1.2.1229230175.1612805628; _gid=GA1.2.205529574.1613135874',
                'Host': 'scrapingclub.com',
                'Referer': 'https://scrapingclub.com/exercise/detail_header/',
                'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
                'sec-ch-ua-mobile': '?0',
                'Sec-Fetch-Dest': 'empty',
                'Sec-Fetch-Mode': 'cors',
                'Sec-Fetch-Site': 'same-origin',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
                'X-Requested-With': 'XMLHttpRequest',
            },
        )

    def parse_detail(self, response):
        product = ProductClass()
        data = response
        # im still debugging so im not putting it into an item yet
        # data = json.loads(response.text)
        # product['product_name'] = data['title']
        # product['detail'] = data['description']
        # product['price'] = data['price']
        yield {
            'value': data
        }
When I run
scrapy crawl ProductSpider -O test.json
this is my output file:
[
{"value": "<TextResponse 200 https://scrapingclub.com/exercise/ajaxdetail_header/>"}
]
Why isn't the JSON content returned?
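【Comments】:
- You are returning only the headers, not the body:
data = response.headers
- My bad, I was trying to extract the body. Anyway, using response.body does not extract the JSON either.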
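For what the comments are getting at, here is a minimal sketch of decoding the body into JSON instead of yielding the response object itself. The output in the question looks like a summary of the response object rather than its payload, which is consistent with `data = response` being yielded as-is. `SAMPLE_BODY` and `extract_product` are hypothetical names used only for illustration, the sample values are made up, and the keys `title`, `description` and `price` are assumed from the commented-out lines in the question, not confirmed against the real endpoint.

```python
import json

# Hypothetical bytes standing in for response.body; the real payload
# is not shown in the question, so these values are placeholders.
SAMPLE_BODY = b'{"title": "Example title", "description": "Example description", "price": "$0.00"}'


def extract_product(body: bytes, encoding: str = "utf-8") -> dict:
    """Decode a JSON response body and map it onto the question's fields."""
    # In a Scrapy callback, response.text is already the decoded string,
    # so json.loads(response.text) covers the next two steps in one call.
    text = body.decode(encoding)
    data = json.loads(text)
    # Key names are assumed from the commented-out code in the question.
    return {
        "product_name": data["title"],
        "detail": data["description"],
        "price": data["price"],
    }


if __name__ == "__main__":
    print(extract_product(SAMPLE_BODY))
```

Inside `parse_detail`, the equivalent would be something like `yield extract_product(response.body)`, assuming the endpoint really returns those three keys.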
Tags: python web-scraping scrapy