【Question title】: Scraping Javascript variables into Python
【Posted】: 2019-11-11 18:21:09
【Question】:

I want to scrape the following data from http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/ :

  var hoodFeatures = {
            type: "FeatureCollection",
            features: [{
                type: "Feature",
                properties: {
                    name: "Koreatown",
                    slug: "koreatown",
                    url: "/neighborhoods/neighborhood/koreatown/",
                    has_statistics: true,
                    label: 'Rank: 1<br>Population per Sqmi: 42,611',
                    population: "115,070",
                    stratum: "high"
                },
                geometry: { "type": "MultiPolygon", "coordinates": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }
            },

From the HTML above, I want to extract each of the following:

name
population per sqmi
population
geometry

and convert them into a dataframe keyed by name.

This is what I have tried so far:

import requests
import json
from bs4 import BeautifulSoup

response_obj = requests.get('http://maps.latimes.com/neighborhoods/population/density/neighborhood/list/').text
soup = BeautifulSoup(response_obj,'lxml')

This object contains the script information, but I don't understand how to use the json module as suggested in this thread: Parsing variable data out of a javascript tag using python

json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
value = json.loads(json_text)
value

I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-37c4c0188ed0> in <module>
      1 #Splits the text on the first bracket and last bracket of the javascript into JSON format
----> 2 json_text = '{%s}' % (soup.partition('{')[2].rpartition('}')[0],)
      3 value = json.loads(json_text)
      4 value
      5 #import pprint

TypeError: 'NoneType' object is not callable

Any suggestions? Thanks.

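【Comments】:

soup is not a string. It probably treats partition as a tag name, and because no <partition> tag exists you get None. You have to use soup.text, which is a string. You can also find the <script> tag, so that you only work with text that may contain javascript code - code = soup.find('script').text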
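
The idea in the comment above (take the text of the right <script> element first, then slice the string) can be sketched as follows. This is a minimal sketch under assumptions: it uses the standard-library HTMLParser instead of BeautifulSoup so that it runs without extra installs, the ScriptCollector class and the html sample are hypothetical, and the real page may be laid out differently.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Made-up stand-in for the downloaded page.
html = """<html><head><script src="x.js"></script></head>
<body><script>
var hoodFeatures = { type: "FeatureCollection", features: [] };
</script></body></html>"""

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Pick the script that defines the variable, then slice between the outer braces.
code = next(s for s in collector.scripts if "var hoodFeatures" in s)
object_text = "{" + code.partition("{")[2].rpartition("}")[0] + "}"
print(object_text)
```

With BeautifulSoup, the same selection could be done by looping over soup.find_all('script') and checking each element's text. The resulting object_text is still a JavaScript object literal, not JSON.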

Tags: javascript python beautifulsoup

【Solution 1】:

You can't really use json.loads, because the hoodFeatures object is not actually JSON. In valid JSON, every key is wrapped in double quotes ".
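
A small illustration of that point, using a made-up one-key object rather than the real page data: json.loads rejects the unquoted key and accepts the quoted one.

```python
import json

js_like = '{name: "Koreatown"}'       # JavaScript object literal, key not quoted
valid_json = '{"name": "Koreatown"}'  # the same data written as strict JSON

try:
    json.loads(js_like)
    parsed_js_like = True
except json.JSONDecodeError as err:
    # The decoder expects a double-quoted property name here.
    parsed_js_like = False
    print("not JSON:", err)

print(json.loads(valid_json))
```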

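You could try adding quotes around the keys manually (using a regular expression).
Another option is to use Selenium to execute that JS and get its JSON.stringify output.

Answer using manual cleanup:

This cleans up the JS code and converts it into JSON that can be parsed properly. Keep in mind, though, that it is by no means robust and may break on unexpected input.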

import json
import re

js = '''
 var hoodFeatures = {
            type: "FeatureCollection",
            features: [
            {
                type: "Feature",
                properties: {
                    name: "Beverlywood",
                    slug: "beverlywood",
                    url: "/neighborhoods/neighborhood/beverlywood/",
                    has_statistics: true,
                    label: 'Rank: 131<br>Population per Sqmi: 7,654',
                    population: "6,080",
                    stratum: "middle"
                },
                geometry: {  }
            }]
        }
'''

if __name__ == '__main__':
    unprefixed = js.split('{', maxsplit=1)[1]
    unsuffixed = unprefixed.rsplit('}', maxsplit=1)[0]
    quotes_replaced = unsuffixed.replace('\'', '"')
    rebraced = f'{{{quotes_replaced}}}'
    keys_quoted = []
    for line in rebraced.splitlines():
        line = re.sub(r'^\s+([^:]+):', r'"\1":', line)
        keys_quoted.append(line)
    json_raw = '\n'.join(keys_quoted)
    # print(json_raw)
    parsed = json.loads(json_raw)
    for feat in parsed['features']:
        props = feat['properties']
        name, pop = props['name'], int(props['population'].replace(',', ''))
        geo = feat['geometry']
        pop_per_sqm = re.findall(r'per Sqmi: ([\d,]+)', props['label'])[0].replace(',', '')
        pop_per_sqm = int(pop_per_sqm)

        print(name, pop, pop_per_sqm, geo)
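
To get from the parsed dictionary to the table the question asks for, the loop above could collect rows instead of printing them. The following is only a sketch: feature_rows and the column names are hypothetical, and the parsed value is a hand-written stand-in for the output of json.loads.

```python
import re


def feature_rows(parsed):
    """Hypothetical helper: flatten each feature into one dict per neighborhood."""
    rows = []
    for feat in parsed["features"]:
        props = feat["properties"]
        # The density only appears inside the label text, so pull it out with a regex.
        density = re.search(r"per Sqmi: ([\d,]+)", props["label"])
        rows.append({
            "name": props["name"],
            "population": int(props["population"].replace(",", "")),
            "population_per_sqmi": int(density.group(1).replace(",", "")) if density else None,
            "geometry": feat["geometry"],
        })
    return rows


# Stand-in for the dictionary produced by json.loads in the answer above.
parsed = {
    "features": [{
        "properties": {
            "name": "Beverlywood",
            "label": "Rank: 131<br>Population per Sqmi: 7,654",
            "population": "6,080",
        },
        "geometry": {},
    }]
}

rows = feature_rows(parsed)
print(rows)
```

If pandas is installed, pandas.DataFrame(rows).set_index("name") should turn these rows into a dataframe indexed by neighborhood name.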

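【Comments】:

【Solution 2】:

I'm not quite sure how to do this with Beautiful Soup, but another option might be to design an expression and extract the values we want: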

(?:name|population per sqmi|population)\s*:\s*"?(.*?)\s*["']|(?:geometry)\s*:\s*({.*})

Test

import re

regex = r"(?:name|population per sqmi|population)\s*:\s*\"?(.*?)\s*[\"']|(?:geometry)\s*:\s*({.*})"

test_str = ("var hoodFeatures = {\n"
    "            type: \"FeatureCollection\",\n"
    "            features: [{\n"
    "                type: \"Feature\",\n"
    "                properties: {\n"
    "                    name: \"Koreatown\",\n"
    "                    slug: \"koreatown\",\n"
    "                    url: \"/neighborhoods/neighborhood/koreatown/\",\n"
    "                    has_statistics: true,\n"
    "                    label: 'Rank: 1<br>Population per Sqmi: 42,611',\n"
    "                    population: \"115,070\",\n"
    "                    stratum: \"high\"\n"
    "                },\n"
    "                geometry: { \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ -118.286908, 34.076510 ], [ -118.289208, 34.052511 ], [ -118.315909, 34.052611 ], [ -118.323009, 34.054810 ], [ -118.319309, 34.061910 ], [ -118.314093, 34.062362 ], [ -118.313709, 34.076310 ], [ -118.286908, 34.076510 ] ] ] ] }\n"
    "            },\n")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
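
The test above only prints the matches. One possible way to group them into one record per neighborhood is sketched below. It is a variation on the expression, not the original: the named groups, the extract_records helper and the shortened sample are all assumptions, and it relies on each geometry object sitting on a single line and being the last field of its feature.

```python
import json
import re

# Same alternatives as the expression above, with named groups added so that
# each match says which field it belongs to.
PATTERN = re.compile(
    r"(?P<key>name|population per sqmi|population)\s*:\s*\"?(?P<value>.*?)\s*[\"']"
    r"|geometry\s*:\s*(?P<geometry>\{.*\})",
    re.IGNORECASE,
)


def extract_records(js_text):
    """Hypothetical helper: build one dict per feature from the regex matches."""
    records, current = [], {}
    for match in PATTERN.finditer(js_text):
        if match.group("geometry") is not None:
            # The geometry literal is already valid JSON; it closes the record.
            current["geometry"] = json.loads(match.group("geometry"))
            records.append(current)
            current = {}
        else:
            current[match.group("key").lower()] = match.group("value")
    return records


# Shortened, made-up sample in the same shape as the page source.
sample = """
features: [{
    properties: {
        name: "Koreatown",
        label: 'Rank: 1<br>Population per Sqmi: 42,611',
        population: "115,070"
    },
    geometry: { "type": "MultiPolygon", "coordinates": [] }
},
"""

records = extract_records(sample)
print(records)
```

Because the flags include re.IGNORECASE, the "population per sqmi" alternative picks up the density from inside the label string.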

【Comments】: