Web scraping a hidden table using Python

【Question title】: Web scraping a hidden table using Python
【Posted】: 2020-09-02 06:05:43
【Question】:

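I am trying to scrape the "Traits" table from this website: https://www.ebi.ac.uk/gwas/genes/SAMD12 (in practice the URL may change according to my needs, but the structure will be the same).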

The problem is that my knowledge of web scraping is very limited, and I could not get this table with the basic BeautifulSoup workflow I have seen here.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
# Parse the static HTML returned by the server
soup = BeautifulSoup(page.text, 'html.parser')

I am looking for the "efotrait-table":

efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())

<div class="row" id="efotrait-table-loading" style="margin-top:20px">
 <div class="panel panel-default" id="efotrait_panel">
  <div class="panel-heading background-color-primary-accent">
   <h3 class="panel-title">
    <span class="efotrait_label">
     Traits
    </span>
    <span class="efotrait_count badge available-data-btn-badge">
    </span>
   </h3>
   <span class="pull-right">
    <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
     <span class="glyphicon glyphicon-chevron-up">
     </span>
    </span>
   </span>
  </div>
  <div class="panel-body">
   <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
   </table>
  </div>
 </div>
</div>

Specifically, this:

soup.select('table#efotrait-table')[0]

<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>

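As you can see, the contents of the table do not show up. On the website there is an option to save the table as CSV. It would be great if I could somehow get that download link. But when I click the link to copy it, all I get is "javascript:void(0)". I have not learned JavaScript; should I?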

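The table is hidden, and even if it were not, I would need to interactively select more rows per page to get the whole table (and the URL does not change, so I cannot get the table that way either).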

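I would like to know a way to access this table programmatically, even as unstructured information; the minor details of organizing the table I can handle afterwards. Any clues on how to do this (or what I should study) would be greatly appreciated.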

Thanks in advance

【Comments】:

Why not try selenium?
@jaibalaji, I will definitely try it! Do you think there is no API and this is the only option? Or at least the go-to one?

Tags: python web-scraping beautifulsoup

【Solution 1】:

The required data is available through an API call.

import requests

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
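The payload above hardcodes SAMD12 in its "q" entry. Since the question mentions that the gene in the URL may vary, here is a minimal sketch of building that entry for another gene. The helper names gene_query and payload_for are made up for illustration, and it is an assumption (not verified against the API) that replacing the gene symbol in "q" is all the endpoint needs.

```python
def gene_query(gene):
    """Build the "q" string used in the payload above for a given gene symbol."""
    # Reject characters that would break out of the quoted term.
    if '"' in gene or "\\" in gene:
        raise ValueError(f"unexpected character in gene symbol: {gene!r}")
    return (f'ensemblMappedGenes: "{gene}" '
            f'OR association_ensemblMappedGenes: "{gene}"')


def payload_for(gene, template):
    """Return a copy of the template payload with "q" rewritten for the gene."""
    payload = dict(template)  # shallow copy, the template stays untouched
    payload["q"] = gene_query(gene)
    return payload


# Trimmed, hypothetical template; in practice pass the full `data` dict above.
template = {"q": gene_query("SAMD12"), "max": "99999"}
print(payload_for("BRCA2", template)["q"])
```

The result of payload_for(gene, data) can then be passed as data= to requests.post in place of the fixed dictionary.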

You can inspect r.keys() and load the data you need by indexing into the dictionary.
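When the dictionary is deeply nested, listing the key paths can be quicker than calling keys() level by level. The following is only a sketch: describe and summarize are hypothetical helper names, and the sample dictionary is invented to keep the snippet runnable offline; it is not the actual layout of the API response.

```python
def summarize(value):
    """Short, human-readable description of a JSON value."""
    if isinstance(value, dict):
        return f"dict with {len(value)} keys"
    if isinstance(value, list):
        return f"list of {len(value)} items"
    return type(value).__name__


def describe(obj, prefix="", max_depth=4):
    """Return one line per key path, looking only at the first item of each list."""
    lines = []
    if max_depth <= 0:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            lines.append(f"{path}: {summarize(value)}")
            lines.extend(describe(value, path, max_depth - 1))
    elif isinstance(obj, list) and obj:
        lines.extend(describe(obj[0], f"{prefix}[0]", max_depth - 1))
    return lines


# Invented sample; replace it with the parsed response, e.g. describe(r).
sample = {"outer": {"items": [
    {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
]}}
for line in describe(sample):
    print(line)
```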

But here is a quick way to load it (lazy code):

import requests
import re
import pandas as pd

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


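def main(url):
    r = requests.post(url, data=data)
    # Collect unique (mappedLabel, traitName_s) pairs straight from the raw
    # response text; the set removes duplicates but has no defined order.
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    # Turn the set into a list of rows before building the DataFrame.
    df = pd.DataFrame(list(match))
    print(df)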


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

Output:

0              heel bone mineral density                          Heel bone mineral density
1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment        Educational attainment (years of education)
3                        waist-hip ratio                                    Waist-hip ratio
4             eye morphology measurement                                     Eye morphology
5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
7    eosinophil percentage of leukocytes               Eosinophil percentage of white cells
8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
9                     multiple sclerosis                                 Multiple sclerosis
10                  mathematical ability                    Highest math class taken (MTAG)
11                 risk-taking behaviour                      General risk tolerance (MTAG)
12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment                      Educational attainment (MTAG)
14                          pancreatitis                                       Pancreatitis
15               hair colour measurement                                         Hair color
16                      breast carcinoma  Breast cancer specific mortality in breast cancer
17                      eosinophil count                                  Eosinophil counts
18                     self rated health                                  Self-rated health
19                          bone density                               Bone mineral density
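The regular expression works on the raw text, so it depends on the order in which the two fields happen to be serialized. As a hedged alternative, the same pairs can be pulled from the parsed JSON by searching for every record that carries both fields, wherever it sits in the response. This sketch assumes only what the regex already implies: traitName_s is a string and mappedLabel is a list whose first element is wanted. The names iter_records and trait_pairs, and the sample data, are made up for illustration.

```python
def iter_records(obj, required):
    """Yield every dict, at any nesting depth, that contains all required keys."""
    if isinstance(obj, dict):
        if all(key in obj for key in required):
            yield obj
        for value in obj.values():
            yield from iter_records(value, required)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_records(item, required)


def trait_pairs(payload):
    """Return unique (mappedLabel, traitName_s) pairs in first-seen order."""
    seen = set()
    pairs = []
    for record in iter_records(payload, ("mappedLabel", "traitName_s")):
        labels = record["mappedLabel"]
        if isinstance(labels, list):
            # Keep the first label, like the regex does; None if the list is empty.
            label = labels[0] if labels else None
        else:
            label = labels
        pair = (label, record["traitName_s"])
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Invented sample; in practice pass requests.post(url, data=data).json().
sample = {"docs": [
    {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
    {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
    {"traitName_s": "Hair color", "mappedLabel": ["hair colour measurement"]},
    {"title": "a record without trait fields"},
]}
print(trait_pairs(sample))
```

The returned list of tuples can be handed to pd.DataFrame(pairs, columns=[...]) with whatever column names suit the analysis.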

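【Discussion】:

Hi, thank you so much! Really! I am going through that dictionary now... In fact, I had already looked at the GWAS API (ebi.ac.uk/gwas/docs/api), but I did not find a way to use the gene name (or ID) as input. So I wonder, how did you find this out?
@CainãMaxCouto-Silva Check this and you will understand: stackoverflow.com/a/61515665/7658985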