【Question title】: How to scrape dynamic web pages when bs4 and other Python libraries do not work?
【Posted】: 2021-10-16 04:17:20
【Question description】:

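I am scraping this site: https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats

I cannot extract the dynamic chart values with bs4 or Selenium. Both give me the HTML, but the data values are missing. Is there something I am missing, or a more powerful tool for working with dynamic web pages?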

【Comments】:

    Tags: python html css automation


    【Solution 1】:

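    Yes, this is an interesting problem, and one that trips up a lot of people when scraping data from the web. The charts are loaded by JavaScript once the document is ready: they render only after all the HTML, CSS, and JS have loaded, and the data is bound to a data-* attribute.

    I created a code sample that uses a NodeJS Express server to return the data from all the charts as JSON. Essentially, it hits the URL, targets the class the charts live in, and then looks for the data-* attribute that holds all of a chart's data. That way, when you run into JavaScript-rendered charts like these, you have working code to use and build on.

    GitHub repository with NodeJS and Python solutions: https://github.com/joehoeller/dynamic-chart-parser-for-webscraping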
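
The idea of locating a chart element by class and reading its data-* attribute can be sketched in Python with only the standard library. This is a minimal illustration, not the code from the repository: the class name `chart`, the attribute name `data-chart`, and the sample markup are all placeholders, and it assumes you already have HTML in which the attribute is populated (for example, page source captured after the charts have rendered).

```python
import json
from html.parser import HTMLParser


class ChartDataParser(HTMLParser):
    """Collect JSON stored in a data-* attribute on elements with a given class."""

    def __init__(self, css_class, attr_name):
        super().__init__()
        self.css_class = css_class
        self.attr_name = attr_name
        self.charts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        raw = attr_map.get(self.attr_name)
        # Only decode the attribute when the element carries the target class.
        if self.css_class in classes and raw:
            self.charts.append(json.loads(raw))


# Hypothetical markup: "chart" and "data-chart" are placeholder names.
html = """
<div class="chart wide" data-chart='{"rows": [[2019, 15136], [2020, 24987]]}'></div>
<div class="legend">not a chart</div>
"""

parser = ChartDataParser("chart", "data-chart")
parser.feed(html)
print(parser.charts)
```

On a real page you would first inspect the rendered DOM to find the actual class and attribute names, then pass those to the parser.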

    【Comments】:

    • Love the JS solution, very useful. Yes, both solutions are correct; thanks for the input and the NodeJS solution.
    【Solution 2】:

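    Each of the six charts on the page is populated with data from its own API call; these calls can be found in the Network tab of the browser's developer tools. You can send requests to these endpoints yourself and parse the responses: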

    import json, urllib.parse
    import requests
    # Headers copied from the browser's request; the cookie values are specific to the original session.
    headers = {'authority': 'www.eafo.eu', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 'accept': 'application/json, text/javascript, */*; q=0.01', 'x-requested-with': 'XMLHttpRequest', 'sec-ch-ua-mobile': '?0', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'cors', 'sec-fetch-dest': 'empty', 'referer': 'https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats', 'accept-language': 'en-US,en;q=0.9', 'cookie': 'yearFilter=2020; activeSubMenu=electricity; subMenuActiveItem=charging_infra_stats; fuelFilter=Electricity; _ga=GA1.2.1782486955.1628797896; _gid=GA1.2.47726291.1628797896; _gat_gtag_UA_129775638_1=1'}
    # Kept from the original request; note that the URLs below already include compare=false.
    params = (('compare', 'false'),)
    urls = ['https://www.eafo.eu/normal-and-fast-charge-points/-1/-1/-1/false/false/nvt?compare=false', 'https://www.eafo.eu/charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false', 'https://www.eafo.eu/normal-power-charging-positions/-1/-1/-1/false/false/nvt?compare=false', 'https://www.eafo.eu/fillingstations-electricity-top-5/-1/-1/-1/false/false/nvt?compare=false', 'https://www.eafo.eu/fast-charging/-1/-1/-1/false/false/nvt?compare=false', 'https://www.eafo.eu/top-5-countries-charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false']
    # Pair each chart name (the first path segment of its URL) with the decoded JSON response.
    data = [[urllib.parse.urlparse(url).path.split('/')[1], requests.get(url, headers=headers, params=params).json()] for url in urls]
    # Each row keeps its cells under 'c'; take the first cell as the label and the second as the value.
    result = {a: [[i['c'][0]['v'], i['c'][1]['v']] for i in b['data']['rows']] for a, b in data}
    

    Output:

    {'normal-and-fast-charge-points': [[2008, 0], [2009, 0], [2010, 0], [2011, 13], [2012, 257], [2013, 751], [2014, 1474], [2015, 3396], [2016, 5190], [2017, 8723], [2018, 11138], [2019, 15136], [2020, 24987]], 'charging-positions-per-10-evs': [['2008', 0], ['2009', 0], ['2010', '14'], ['2011', '6'], ['2012', '3'], ['2013', '4'], ['2014', '5'], ['2015', '5'], ['2016', '5'], ['2017', '5'], ['2018', '6'], ['2019', '7'], ['2020', '9']], 'normal-power-charging-positions': [['2008', 0], ['2009', 0], ['2010', 400], ['2011', 2379], ['2012', 10250], ['2013', 17093], ['2014', 24917], ['2015', 44786], ['2016', 70012], ['2017', 97287], ['2018', 107446], ['2019', 148880], ['2020', 199250]], 'fillingstations-electricity-top-5': [['Netherlands', 66461], ['France', 45413], ['Germany', 43633], ['Sweden', 13564], ['Italy', 13214]], 'fast-charging': [['2008', 0], ['2009', 0], ['2010', 0], ['2011', 13], ['2012', 257], ['2013', 751], ['2014', 1474], ['2015', 3396], ['2016', 5190], ['2017', 8723], ['2018', 11138], ['2019', 15136], ['2020', 24987]], 'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15'], ['Slovakia', '4.34'], ['Croatia', '5.14'], ['Estonia', '5.31'], ['Netherlands', '5.71']]}
    

    In a cleaner JSON format:

    t = {' '.join(map(str.capitalize, a.split('-'))):b for a, b in result.items()}
    print(json.dumps(t, indent=4))
    

    Output:

    {
        "Normal And Fast Charge Points": [
            [
                2008,
                0
            ],
            [
                2009,
                0
            ],
            [
                2010,
                0
            ],
            [
                2011,
                13
            ],
            [
                2012,
                257
            ],
            [
                2013,
                751
            ],
            [
                2014,
                1474
            ],
            [
                2015,
                3396
            ],
            [
                2016,
                5190
            ],
            [
                2017,
                8723
            ],
            [
                2018,
                11138
            ],
            [
                2019,
                15136
            ],
            [
                2020,
                24987
            ]
        ],
        "Charging Positions Per 10 Evs": [
            [
                "2008",
                0
            ],
            [
                "2009",
                0
            ],
            [
                "2010",
                "14"
            ],
            [
                "2011",
                "6"
            ],
            [
                "2012",
                "3"
            ],
            [
                "2013",
                "4"
            ],
            [
                "2014",
                "5"
            ],
            [
                "2015",
                "5"
            ],
            [
                "2016",
                "5"
            ],
            [
                "2017",
                "5"
            ],
            [
                "2018",
                "6"
            ],
            [
                "2019",
                "7"
            ],
            [
                "2020",
                "9"
            ]
        ],
        "Normal Power Charging Positions": [
            [
                "2008",
                0
            ],
            [
                "2009",
                0
            ],
            [
                "2010",
                400
            ],
            [
                "2011",
                2379
            ],
            [
                "2012",
                10250
            ],
            [
                "2013",
                17093
            ],
            [
                "2014",
                24917
            ],
            [
                "2015",
                44786
            ],
            [
                "2016",
                70012
            ],
            [
                "2017",
                97287
            ],
            [
                "2018",
                107446
            ],
            [
                "2019",
                148880
            ],
            [
                "2020",
                199250
            ]
        ],
        "Fillingstations Electricity Top 5": [
            [
                "Netherlands",
                66461
            ],
            [
                "France",
                45413
            ],
            [
                "Germany",
                43633
            ],
            [
                "Sweden",
                13564
            ],
            [
                "Italy",
                13214
            ]
        ],
        "Fast Charging": [
            [
                "2008",
                0
            ],
            [
                "2009",
                0
            ],
            [
                "2010",
                0
            ],
            [
                "2011",
                13
            ],
            [
                "2012",
                257
            ],
            [
                "2013",
                751
            ],
            [
                "2014",
                1474
            ],
            [
                "2015",
                3396
            ],
            [
                "2016",
                5190
            ],
            [
                "2017",
                8723
            ],
            [
                "2018",
                11138
            ],
            [
                "2019",
                15136
            ],
            [
                "2020",
                24987
            ]
        ],
        "Top 5 Countries Charging Positions Per 10 Evs": [
            [
                "Latvia",
                "3.15"
            ],
            [
                "Slovakia",
                "4.34"
            ],
            [
                "Croatia",
                "5.14"
            ],
            [
                "Estonia",
                "5.31"
            ],
            [
                "Netherlands",
                "5.71"
            ]
        ]
    }
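
The output above mixes types: some years and values come back as strings ('2010', '14', '3.15') and others as integers. If you want uniform records, one option is to pair each row with the column labels and coerce numeric strings. The sketch below is an assumption-laden illustration: the `cols` list, its `label` and `id` keys, and the sample payload are hypothetical, modelled on the `rows` / `c` / `v` structure that the code above reads.

```python
def coerce(value):
    """Turn numeric strings such as '14' or '3.15' into int/float; leave the rest alone."""
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
    return value


def table_to_records(table):
    """Pair each row's cell values with the column labels."""
    labels = [col.get("label") or col.get("id") for col in table["cols"]]
    return [
        {label: coerce(cell["v"]) for label, cell in zip(labels, row["c"])}
        for row in table["rows"]
    ]


# Hypothetical payload shaped like the 'data' object in the responses above.
sample = {
    "cols": [{"id": "year", "label": "Year"}, {"id": "count", "label": "Count"}],
    "rows": [
        {"c": [{"v": "2010"}, {"v": "14"}]},
        {"c": [{"v": "2011"}, {"v": "6"}]},
    ],
}

records = table_to_records(sample)
print(records)
```

Non-numeric labels such as country names pass through `coerce` unchanged, so the same helper should work for both the per-year and the top-5 charts, provided every cell carries a `v` key.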
    

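    【Comments】:

    • Yes, I looked at the Network tab first but did not see anything. Good catch!
    • @joehoeller It should also be noted that the original headers and params must be supplied to get a full, complete response from each endpoint. They can be obtained by right-clicking the request in the Network tab and choosing Copy > Copy as cURL.
    • Since I did not see them, I assumed it was a document-ready issue, somewhere between page load and chart render. That may still hold if you go after the data-* attribute, but this feels more lightweight. My solution works as well, though the JSON it returns needs an extra step that maps cols as keys to rows as values.
    • Thanks for posting this alternative solution! I now know to look at the header information.
    • @joehoeller Glad you found it useful!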
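
For the Copy as cURL step mentioned in the comments, the copied command can be turned into a headers dictionary rather than retyped by hand. The helper below is a rough sketch, not part of either answer: `headers_from_curl` is a made-up name, the sample command is abbreviated, and it only handles `-H` / `--header` options in the bash-style output.

```python
import shlex


def headers_from_curl(command):
    """Extract -H/--header values from a 'Copy as cURL' (bash) command string."""
    # Join backslash-continued lines so shlex sees one logical command.
    tokens = iter(shlex.split(command.replace("\\\n", " ")))
    headers = {}
    for token in tokens:
        if token in ("-H", "--header"):
            # Split on the first colon only, so URLs in values stay intact.
            name, _, value = next(tokens).partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers


# Abbreviated example of what the browser copies to the clipboard.
sample = (
    "curl 'https://www.eafo.eu/fast-charging/-1/-1/-1/false/false/nvt?compare=false' \\\n"
    "  -H 'accept: application/json, text/javascript, */*; q=0.01' \\\n"
    "  -H 'x-requested-with: XMLHttpRequest' \\\n"
    "  -H 'referer: https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats' \\\n"
    "  --compressed"
)

headers = headers_from_curl(sample)
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`, in place of the hand-written dictionary in Solution 2.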