[Posted]: 2021-06-15 18:06:59
[Question]:
I am learning web scraping on a sample website that embeds its data as JSON. Take the following page as an example: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. Its source can be viewed at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I want to get the information on lines 388-396 of that source:
<script>
var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
var p_currency = '€';
var conversion_value = '1';
var merch_items = [];
var gallery_items = [];
var inside_gala = false;
</script>
and save each quoted key (i.e. "id", "item_number", "type", ...) in a variable of the same name.
So far I have managed to run the following:
import json
import re

import requests
from bs4 import BeautifulSoup

url = "http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini"
hdr = {"User-Agent": "My Agent"}

# Download the page, sending the custom User-Agent header
response = requests.get(url, headers=hdr)
htmlSource = response.text

# Parse the HTML with the parser from the standard library
soup = BeautifulSoup(htmlSource, "html.parser")
title = soup.find_all("span", {"itemprop": "name"})  # get the title
script_soup = soup.find_all("script")  # every <script> tag on the page
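For some reason, script_soup contains a lot of information I don't need. I believe the part I need is in script_soup[9], but I don't know how to access it efficiently. Any help would be much appreciated.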
[Discussion]:
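One possible approach, sketched here under assumptions rather than as a definitive answer: instead of relying on a fixed index such as script_soup[9], search the page source for the `var js_data = ` assignment and let the `json` module parse the object that follows it. The helper name `extract_js_data` and the shortened `SAMPLE_HTML` string are made up for illustration. The sample only mimics the structure of the snippet in the question, and the live page may differ.

```python
import json
import re

# Shortened, hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","price":1000,"currency":"EUR"},"bid":{"id":323,"first_name":"Fabio","amount":1000}};
var merch_items = [];
</script>
</body></html>
"""


def extract_js_data(html, var_name="js_data"):
    """Return the JSON value assigned to `var <var_name> = ...`, or None if absent."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores what follows it (the trailing ';' and the remaining JS).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


js_data = extract_js_data(SAMPLE_HTML)
product = js_data["product"]
bid = js_data["bid"]

print(product["id"], product["item_number"], product["type"])  # 55 P55 PRODUCT
print(bid["first_name"], bid["amount"])                        # Fabio 1000
```

With the real page, the text of the downloaded source (htmlSource in the question's code) would be passed in place of SAMPLE_HTML. Keeping the values in the `product` and `bid` dictionaries is usually easier to maintain than creating one variable per key, since each field stays reachable by name, for example `product["item_number"]`. Note that `raw_decode` raises `json.JSONDecodeError` if the assigned value is not valid JSON, so this only works while the site keeps emitting a plain JSON literal.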