【Question title】: Scraping Javascript variables into Python
【Posted】: 2019-11-11 18:21:09
【Question】:

I want to scrape the following data from http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/:

  var hoodFeatures = {
            type: "FeatureCollection",
            features: [{
                type: "Feature",
                properties: {
                    name: "Koreatown",
                    slug: "koreatown",
                    url: "/neighborhoods/neighborhood/koreatown/",
                    has_statistics: true,
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                    stratum: "high"
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
            },

From the HTML above, I want to extract each of the following:

name
population per sqmi
population
geometry

and convert them into a DataFrame indexed by name.

Here is what I have tried so far:

import requests
import json
from bs4 import BeautifulSoup

response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')

This object contains the script information, but I don't understand how to use the json module as suggested in this thread: Parsing variable data out of a javascript tag using python

json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value

I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
      1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
      3 value = json.loads(json_text)
      4 value
      5 #import pprint

TypeError: 'NoneType' object is not callable

Any suggestions? Thanks.

【Comments】:

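  • soup is not a string, so soup.partition is treated as a lookup for a <partition> tag. No such tag exists, so you get None, and calling it raises the TypeError. You need to call partition on a string such as soup.text. You can also find the <script> tag so that you only work with text that may contain JavaScript code: code = soup.find('script').text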
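To illustrate the comment's point that the slicing has to happen on a string, here is a minimal sketch that works on plain text only. The helper name `extract_object_literal` is hypothetical, and the input is assumed to be the script text (for example what `soup.find('script').text` or `response.text` would return). Instead of cutting at the last `}` in the whole page, it counts braces while skipping over quoted strings, so it stops at the end of the one object literal.

```python
def extract_object_literal(source: str, var_name: str) -> str:
    """Return the `{...}` literal assigned to `var <var_name>` in JavaScript text.

    Hypothetical helper: it assumes the assignment looks like
    `var name = { ... }` and that braces inside quoted strings should be ignored.
    """
    _, found, rest = source.partition(f"var {var_name}")
    if not found:
        raise ValueError(f"variable {var_name!r} not found")

    start = rest.index("{")
    depth = 0
    quote = None  # the quote character we are currently inside, if any
    i = start
    while i < len(rest):
        ch = rest[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return rest[start:i + 1]
        i += 1
    raise ValueError("unbalanced braces")


page = '<script>var hoodFeatures = {a: "x}", b: {c: 1}}; var other = {};</script>'
literal = extract_object_literal(page, "hoodFeatures")
```

The result is still a JavaScript object literal, not JSON, so it would need further cleanup before `json.loads` can read it.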

Tags: javascript python beautifulsoup


【Solution 1】:

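You can't really use json.loads here, because the hoodFeatures object is not valid JSON. In valid JSON, every key must be wrapped in double quotes ("), and strings cannot use single quotes.

You could try adding quotes around the keys yourself (using a regular expression).
Another option is to use Selenium to execute the JS and take its JSON.stringify output.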
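As a small illustration of why `json.loads` rejects the object literal, the sketch below feeds it a JavaScript-style string and an equivalent JSON string. The helper `try_parse` and both sample strings are made up for this example.

```python
import json

# JavaScript-style literal: unquoted keys and a single-quoted string.
js_style = "{name: 'Koreatown', has_statistics: true}"
# The same data written as strict JSON.
json_style = '{"name": "Koreatown", "has_statistics": true}'


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


js_result = try_parse(js_style)
json_result = try_parse(json_style)
```

Only the second string parses, which is why the keys and quotes have to be normalized first.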

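Answer using manual cleanup:

This cleans up the JS code and converts it into JSON that can be parsed. Keep in mind, though, that it is by no means robust and may break on unexpected input.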

import json
import re

js = '''
 var hoodFeatures = {
            type: "FeatureCollection",
            features: [
            {
                type: "Feature",
                properties: {
                    name: "Beverlywood",
                    slug: "beverlywood",
                    url: "/neighborhoods/neighborhood/beverlywood/",
                    has_statistics: true,
                    label: 'Rank: 131<br>Population per Sqmi: 7,654',
                    population: "6,080",
                    stratum: "middle"
                },
                geometry: {  }
            }]
        }
'''

if __name__ == '__main__':
    unprefixed = js.split('{', maxsplit=1)[1]
    unsuffixed = unprefixed.rsplit('}', maxsplit=1)[0]
    quotes_replaced = unsuffixed.replace('\'', '"')
    rebraced = f'{{{quotes_replaced}}}'
    keys_quoted = []
    for line in rebraced.splitlines():
        # Quote the key at the start of each line (raw strings avoid invalid-escape warnings)
        line = re.sub(r'^\s+([^:]+):', r'"\1":', line)
        keys_quoted.append(line)
    json_raw = '\n'.join(keys_quoted)
    # print(json_raw)
    parsed = json.loads(json_raw)
    for feat in parsed['features']:
        props = feat['properties']
        name, pop = props['name'], int(props['population'].replace(',', ''))
        geo = feat['geometry']
        pop_per_sqm = re.findall(r'per Sqmi: ([\d,]+)', props['label'])[0].replace(',', '')
        pop_per_sqm = int(pop_per_sqm)

        print(name, pop, pop_per_sqm, geo)
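The question asks for a DataFrame indexed by name, which the code above stops short of. As a rough sketch, the parsed dictionary could first be flattened into one record per neighborhood. The function name `features_to_rows`, the column names, and the sample data are assumptions made for this example.

```python
import re


def features_to_rows(parsed):
    """Flatten a parsed FeatureCollection into one dict per feature."""
    rows = []
    for feat in parsed["features"]:
        props = feat["properties"]
        # The density only appears inside the label text.
        density = re.search(r"per Sqmi: ([\d,]+)", props["label"])
        rows.append({
            "name": props["name"],
            "population": int(props["population"].replace(",", "")),
            "population_per_sqmi": int(density.group(1).replace(",", "")) if density else None,
            "geometry": feat["geometry"],
        })
    return rows


# Sample shaped like the output of json.loads in the answer above.
sample = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {
            "name": "Beverlywood",
            "label": "Rank: 131<br>Population per Sqmi: 7,654",
            "population": "6,080",
        },
        "geometry": {},
    }],
}
rows = features_to_rows(sample)
```

If pandas is installed, `pandas.DataFrame(rows).set_index("name")` should then give a frame keyed by neighborhood name.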

【Comments】:

    【Solution 2】:

    I'm not quite sure how to do this with Beautiful Soup, but another option might be to design an expression and extract the values we want:

    (?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})
    

    Test

    import re
    
    regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"
    
    test_str = ("var hoodFeatures = {\n"
        "            type: \"FeatureCollection\",\n"
        "            features: [{\n"
        "                type: \"Feature\",\n"
        "                properties: {\n"
        "                    name: \"Koreatown\",\n"
        "                    slug: \"koreatown\",\n"
        "                    url: \"/neighborhoods/neighborhood/koreatown/\",\n"
        "                    has_statistics: true,\n"
        "                    label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
        "                    population: \"115,070\",\n"
        "                    stratum: \"high\"\n"
        "                },\n"
        "                geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
        "            },\n")
    
    matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
    
    for matchNum, match in enumerate(matches, start=1):
    
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
    
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
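The test script above only prints the matches, and the original pattern does not record which key produced which value. One possible variation, shown here as a sketch rather than as part of the answer, captures the key name as well so the matches can be collected into a dictionary. The name `extract_fields` and the sample text are assumptions, and the sketch handles a single feature only.

```python
import json
import re

# Variation of the pattern above: the key names are captured too (groups 1 and 3).
PATTERN = re.compile(
    r"(name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']"
    r"|(geometry)\s*:\s*({.*})",
    re.IGNORECASE,
)


def extract_fields(js_text):
    """Collect the matched values of one feature into a dict keyed by field name."""
    record = {}
    for match in PATTERN.finditer(js_text):
        if match.group(3):
            # The geometry value is already valid JSON in the sample data.
            record["geometry"] = json.loads(match.group(4))
        else:
            record[match.group(1).lower()] = match.group(2)
    return record


sample = '''
name: "Koreatown",
label: 'Rank: 1<br>Population per Sqmi: 42,611',
population: "115,070",
geometry: { "type": "Point", "coordinates": [ -118.3, 34.06 ] }
'''
record = extract_fields(sample)
```

Like the original expression, this depends on each value sitting on a single line and would mix values together if the text contained several features.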
    

    【Comments】:
