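【Posted】:2020-02-25 12:34:13
【Question】:
I am trying to pull specific data out of a deeply nested <script> tag on a site.
I tried import json, hoping it would make things easier, but that only produced the well-known Expecting value: line 1 column 1 (char 0) error. I then tried another approach (method 1), with zero success.
In short, the relatively simple steps of connecting to the site and grabbing the specific <script> tag are not the problem. Getting the data I need out of that tag is.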
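For context, here is a minimal sketch of why json rejects this kind of content, assuming the full text of the <script> tag is passed to json.loads. The script_text string and the try_parse helper are hypothetical, shortened stand-ins and not part of the original post. The text starts with JavaScript rather than a JSON value, and even the object literal on its own contains function(){...} bodies, which JSON does not allow.

```python
import json

# Shortened, hypothetical stand-in for the text inside the <script> tag.
script_text = (
    '$(document).ready(function () {createJsonChart({"series":[{"name":"BNames",'
    '"events":{"click":function(){return false;}}}]});});'
)


def try_parse(text):
    """Return the parsed value, or the JSONDecodeError message if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)


# The text begins with JavaScript, so parsing fails on the very first character.
whole = try_parse(script_text)

# Cutting out only the {...} argument does not help: function literals are not JSON.
start = script_text.index('{"series"')
end = script_text.rindex('}]}') + 3
literal_only = try_parse(script_text[start:end])

print(whole)
print(literal_only)
```

Both calls end in an "Expecting value" error, which is why the rest of the post falls back to searching the raw text.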
Assume the following element:
script_tag = '''
<script id="startup" type="text/javascript">
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de","legendIndex":0,
"stack":null,
"data":[{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""},
{"name":"BNames","color":"#0043de","y":114.6,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","color":"#0043de","y":108.5,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,
"fillColor":null,"symbol":null,"radius":4},
"dashStyle":"Solid","lineWidth":2,
"step":"center","zIndex":"2","name":"Mandatory","color":"#f20808",
"legendIndex":0,"stack":1,
"data":[{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,"fillColor":null,
"symbol":null,"radius":4},"dashStyle":"Solid","lineWidth":2,
"step":"center", "zIndex":"2","name":"Preferred","color":"#38d615",
"legendIndex":0,"stack":2,
"data":[{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
</script>
'''
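In reality this <script> tag is larger. It contains 3 data sets, named here BNames, Mandatory and Preferred. I need the data from BNames, specifically the last entry. The expected result therefore comes from the "tooltip":"BNames: 108,50 % <br/> Month: september 2019"} part, with BNames: 108,50 % in one variable and Month: september 2019 in another.
Answer using regular expressions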
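One possible way to get those two variables, sketched under the assumption that the tooltip text always uses <br/> as the separator and that the wanted entry is the last non-empty BNames tooltip. The script_text below is a shortened, hypothetical excerpt of the tag shown above, and the names tooltips, value and month are illustrative only.

```python
import re

# Shortened, hypothetical excerpt of the script text from the question.
script_text = (
    '{"name":"BNames","y":0.0,"tooltip":""},'
    '{"name":"BNames","y":114.6,"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '{"name":"BNames","y":108.5,"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '{"name":"BNames","y":0.0,"tooltip":""}]},'
    '{"name":"Mandatory","y":104.1,"tooltip":"Mandatory: 104,10 %"}'
)

# Capture every non-empty tooltip that starts with "BNames:".
tooltips = re.findall(r'"tooltip":"(BNames:[^"]+)"', script_text)

# Take the last one and split it at the <br/> separator.
value, month = (part.strip() for part in tooltips[-1].split("<br/>"))

print(value)
print(month)
```

Empty tooltips are skipped because the pattern requires at least one character after BNames:, so the trailing placeholder entry does not interfere.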
import re

# soup is assumed to be the BeautifulSoup object built from the fetched page
url_part = soup.find("script", attrs={'id': 'startup'}).text
info = re.findall(r'\s\w*\s\d*', url_part)[-1]
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1][1]
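First, define the HTML tag to work on. Second, find every occurrence of a whitespace character (\s), any number of word characters (\w*), another whitespace character (\s) and any number of digits (\d*). This matches anything like september 2019 or august 2019. Finally, find the occurrences that match BNames: followed by a number in this layout: digits, a comma, digits, a space and a percent sign, hence (\d+[,]\d+\s[%]). This matches everything from 80,6 % to 120,05 %.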
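To see what the two patterns actually return, here is a small sketch that runs them on a shortened, hypothetical stand-in for url_part (no network access or BeautifulSoup needed). The names month, full and number are illustrative additions.

```python
import re

# Shortened, hypothetical stand-in for soup.find("script", attrs={"id": "startup"}).text
url_part = (
    '{"name":"BNames","y":114.6,"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '{"name":"BNames","y":108.5,"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '{"name":"Mandatory","y":104.1,"tooltip":"Mandatory: 104,10 %"}'
)

# Same patterns as in the answer above.
info = re.findall(r'\s\w*\s\d*', url_part)[-1]
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1]

# The first pattern starts with \s, so the match carries a leading space.
month = info.strip()

# With two capture groups, findall returns tuples: (outer group, inner group).
full, number = result

print(month)
print(full)
print(number)
```

Indexing with [-1][1], as the answer does, keeps only the inner group (the number with its percent sign); [-1][0] would keep the BNames: prefix. The first pattern is quite loose, so on the full page it may be worth checking that the last match really is the month and not some other "word number" text further down the script.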
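【Comments】:
- No need to go deep: just search the text inside the script tag with a regular expression. I don't like using scraping functions on javascript tags, and a regular expression is faster. I have already answered this here: javascript-scrape
Tags: python python-3.x web-scraping beautifulsoup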