Challenging... I guess it's impossible to scrape this webpage (answers)

【Question title】: Challenging... I guess it's impossible to scrape this webpage
【Posted】: 2019-09-18 23:44:35
【Question】:

I'm trying to scrape this page: https://icd.who.int/browse10/2016/en

The problem is that the content I want to scrape is not in the page source.

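For example, I'm trying to scrape the navigation menu on the left side of the page, under "ICD-10 Version:2016" > "I Certain infectious and parasitic diseases" > "A00-A09 Intestinal infectious diseases":

A00 Cholera

A01 Typhoid and paratyphoid fevers

A02 Other salmonella infections

...

For some reason none of these entries appear in the page source, so when I scrape the page I don't get this data at all.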

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

url = 'https://icd.who.int/browse10/2016/en'
# Adjacent string literals are concatenated, so the User-Agent stays one string.
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 6.1; WOW64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/56.0.2924.76 Safari/537.36'
    )
}
result = requests.get(url, headers=headers)

c = result.content
soup = BeautifulSoup(c, 'html5lib')

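【Comments】:

The data is probably loaded asynchronously after the page loads. You need something that can execute the JavaScript that loads the data. I think Selenium can do that.

Tags: pandas web-scraping beautifulsoup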

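【Solution 1】:

It's not impossible, just harder, because the data is loaded asynchronously (as @Carcigenicate said in the comments). The page makes requests to the server to load the data, and you can see them in the Network tab of the developer tools in Google Chrome. The approach I used takes a while (because of all the requests), but it works.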

import requests

# Top level of the tree: one JSON object per ICD-10 chapter.
chapter_url = 'https://icd.who.int/browse10/2016/en/JsonGetRootConcepts?useHtml=false'
chapters = requests.get(chapter_url).json()
chapter_IDS = [chapter['ID'] for chapter in chapters]

inner_section_names = []
# The same endpoint returns the children of any concept ID.
section_url = 'https://icd.who.int/browse10/2016/en/JsonGetChildrenConcepts?ConceptId={}&useHtml=false'
for ID in chapter_IDS:
    # Second level: the sections (e.g. A00-A09) inside each chapter.
    sections = requests.get(section_url.format(ID)).json()
    section_IDS = [section['ID'] for section in sections]

    for inner_ID in section_IDS:
        # Third level: the categories (e.g. A00 Cholera) inside each section.
        inner_sections = requests.get(section_url.format(inner_ID)).json()
        temp = [inner_section['label'] for inner_section in inner_sections]
        inner_section_names.extend(temp)

print(inner_section_names)
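
The same two endpoints can also be walked recursively instead of with hand-written nested loops. The following is only a minimal sketch, not part of the original answer: `walk` and `fetch_children` are hypothetical names, the `ID` and `label` fields are the ones used in the code above, and `urllib` from the standard library stands in for `requests` purely to keep the sketch dependency-free.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = 'https://icd.who.int/browse10/2016/en'


def fetch_children(concept_id=None):
    """Return the child concepts of concept_id (the chapters if it is None)."""
    if concept_id is None:
        url = BASE + '/JsonGetRootConcepts?useHtml=false'
    else:
        url = (BASE + '/JsonGetChildrenConcepts?ConceptId={}&useHtml=false'
               .format(quote(concept_id)))
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def walk(fetch, concept_id=None, depth=0, max_depth=2):
    """Yield (depth, ID, label) for every concept down to max_depth.

    fetch is any callable that maps a concept ID (or None for the root)
    to a list of dicts with 'ID' and 'label' keys.
    """
    for concept in fetch(concept_id):
        yield depth, concept['ID'], concept['label']
        if depth < max_depth:
            yield from walk(fetch, concept['ID'], depth + 1, max_depth)
```

Iterating over `walk(fetch_children)` and printing `'  ' * depth + label` for each item would give an indented tree of chapters, sections, and categories. With the default `max_depth=2` it issues the same requests as the loops above, so it is just as slow. Passing the fetcher as an argument makes it possible to try the traversal against canned data first.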

【Comments】:

【Solution 2】:

Please consider using the official API:

https://icd.who.int/icdapi/docs/APIDoc-Version2.html

More details on the endpoints are available in the Swagger (OpenAPI) documentation:

https://id.who.int/swagger/index.html

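The endpoints for retrieving ICD-10 category information are listed directly at the link above. For example:

GET /icd/release/10

Lists the available ICD-10 releases.

GET /icd/release/10/{releaseId}

Returns basic information about a published ICD-10 release and the chapters in it.

GET /icd/release/10/{code}

Lists the ICD-10 releases in which the requested category is available.

GET /icd/release/10/{releaseId}/{code}

Returns information about the category and its child categories.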

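Note that the API covers both ICD-10 and ICD-11. The endpoints can be called with GET requests, for example using requests. Access requires registration.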
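
As a rough illustration of calling `GET /icd/release/10/{releaseId}/{code}`, here is a hedged sketch. Only the endpoint path and the `id.who.int` host come from the links above. The token URL, the `icdapi_access` scope, and the `API-Version` header are assumptions recalled from the WHO API documentation and should be checked against it; the function names are hypothetical. It uses `urllib` from the standard library to stay dependency-free, and `requests` would work the same way.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: OAuth2 token endpoint as described in the WHO API docs.
TOKEN_URL = 'https://icdaccessmanagement.who.int/connect/token'
API_BASE = 'https://id.who.int/icd/release/10'


def category_url(release_id, code):
    """Build the URL for GET /icd/release/10/{releaseId}/{code}."""
    return '{}/{}/{}'.format(API_BASE, release_id, code)


def api_headers(token, language='en'):
    """Headers sent with each API call (header names are assumptions)."""
    return {
        'Authorization': 'Bearer ' + token,
        'Accept': 'application/json',
        'Accept-Language': language,
        'API-Version': 'v2',
    }


def get_token(client_id, client_secret):
    """Exchange the credentials obtained at registration for a token."""
    body = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'scope': 'icdapi_access',  # assumption, see the API docs
        'grant_type': 'client_credentials',
    }).encode()
    # Passing data makes urllib send a POST request.
    with urlopen(Request(TOKEN_URL, data=body), timeout=30) as response:
        return json.load(response)['access_token']


def get_category(token, release_id, code):
    """Fetch one category and return the decoded JSON response."""
    request = Request(category_url(release_id, code),
                      headers=api_headers(token))
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

With valid credentials, `get_category(get_token(client_id, client_secret), '2016', 'A00')` would request the 2016 entry for A00. The shape of the returned JSON is not shown here because it is not given in this answer.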

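Note some of the benefits listed:

Trying out the API. From the Swagger URL you can try the API by making requests and viewing the responses.

Automatic client code generation. Several free and open-source tools can generate client code in various programming languages from OpenAPI documentation. This makes it easier to use the API from the programming language of your choice.

【Comments】: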