【Question title】: Webscraping Single Wiki Table Using BeautifulSoup On Page with Multiple Tables
【Posted】: 2021-07-13 17:30:54
【Question】:

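I'm fairly new to this and hoping for some direction!

For a project, I want to scrape the data in the table below from this source into a DataFrame: https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States There are two tables on this page, and I'm interested in the second one, "ZCTAs ranked by per capita income".

Looking at the page's HTML, I can't find anything that specifically identifies the table (or I'm not sure what to look for). I don't know which tag or class to pass to soup.find_all() to pick out this table. The table's markup is: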

<table class="toccolours sortable jquery-tablesorter" align="center" cellpadding="4" cellspacing="3" style="border: 1px solid #707070;">

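Both tables on the page share the same class. The heading above the table I'm trying to scrape does have a distinct ID, "ZCTAs_ranked_by_per_capita_income". Directly above the table I want to scrape is the following markup: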

<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States&amp;action=edit&amp;section=3" title="Edit section: ZCTAs ranked by per capita income">edit</a><span class="mw-editsection-bracket">]</span></span></h2>

Any help would be greatly appreciated. Let me know if more information is needed!

【Question discussion】:

    Tags: python-3.x pandas dataframe web-scraping beautifulsoup


    【Solution 1】:

    You can use pandas' .read_html with the correct index:

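    import pandas as pd
    
    url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"
    
    # read_html returns a list with one DataFrame per <table> it can parse;
    # index 5 picked out the per-capita-income table when this answer was written.
    df = pd.read_html(url)[5]
    print(df)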
    

    Prints:

        Rank                       Designation   ZCTA  Population  Per CapitaIncome
    0      1           Montchanin, Delaware[2]  19710          68            654485
    1      2                    Houston, Texas  77010          76            283189
    2      3             Rockland, Delaware[3]  19732          77            279424
    3      4              Miami Beach, Florida  33109         467            236238
    4      5                 Pineland, Florida  33945          79            162075
    5      6                  Esopus, New York  12429          51            155540
    6      7                 Henderson, Nevada  89012         175            148899
    7      8              Atherton, California  94027        6857            114359
    8      9              Boca Grande, Florida  33921        1500            107297
    9     10        Deer Harbor, Washington[4]  98243         141            107173
    10    11       Rancho Santa Fe, California  92067        7601            104487
    11    12               Palm Beach, Florida  33480       11200            104294
    12    13             Indianapolis, Indiana  46290         189            103347
    13    14              Kenilworth, Illinois  60043        2617             99087
    14    15         Beverly Hills, California  90210       21396             97198
    15    16            Greenwich, Connecticut   6831       15167             97111
    16    17           Los Angeles, California  90077       10465             96584
    17    18        Portola Valley, California  94028        6595             96373
    18    19                New York, New York  10022       30642             95196
    19    20                Wyarno, Wyoming[5]  82845          49             94109
    20    21           Short Hills, New Jersey   7078       12849             92940
    21    22      Altamahaw, North Carolina[6]  27202          24             91666
    22    23          Santa Monica, California  90402       11492             91147
    23    24                New York, New York  10021      102078             91064
    24    25            Gladwyne, Pennsylvania  19035        4050             90940
    25    26                New York, New York  10069        1403             90113
    26    27              Point Clear, Alabama  36564         107             89571
    27    28             Boston, Massachusetts   2199        1005             88974
    28    29         San Francisco, California  94105        2058             88829
    29    30                 Glencoe, Illinois  60022        8490             88126
    30    31  Belvedere-Tiburon, California[7]  94920       13048             86992
    31    32                 Glencoe, Arkansas  72539         318             86724
    32    33           Los Angeles, California  90067        2524             86319
    33    34                  Atlanta, Georgia  30327       21003             85883
    34    35                New York, New York  10028       44987             85866
    35    36                    Houston, Texas  77046         471             85070
    36    37         Lake McDonald, Montana[8]  59921           2             85000
    37    38                New York, New York  10162        1726             84938
    38    39            Mullett Lake, Michigan  49761          31             84692
    39    40            Mc Afee, New Jersey[9]   7428         127             84595
    40    41                New York, New York  10280        6614             83639
    41    42             Yorklyn, Delaware[10]  19736          63             83524
    42    43                 Chicago, Illinois  60611       26522             82930
    43    44             Boston, Massachusetts   2110        1428             82736
    44    45             Boston, Massachusetts   2109        3428             82689
    45    46                New York, New York  10282        1574             82348
    46    47             Far Hills, New Jersey   7931        2766             82227
    47    48           New Canaan, Connecticut   6840       19402             81934
    48    49                Medina, Washington  98039        3050             81926
    49    50     Pacific Palisades, California  90272       22538             81609
    50    51             Los Altos, California  94022       18466             81257
    51    52         San Francisco, California  94123       22903             81044
    52    53             Longboat Key, Florida  34228        7603             80963
    53    54                 Davis, California  95618         643             80713
    54    55                Alpine, New Jersey   7620        1649             80621
    55    56                  Atlanta, Georgia  30326        1075             80161
    56    57                New York, New York  10023       62206             79736
    57    58                Winnetka, Illinois  60093       19528             79651
    58    59             Weston, Massachusetts   2493       11469             79640
    59    60              Bacova, Virginia[11]  24412          89             79439
    60    61                  Springboro, Ohio  45066       17409             78786
    61    62             Boston, Massachusetts   2108        3446             78771
    62    63               Chappaqua, New York  10514       12004             78647
    63    64               St. Louis, Missouri  63124        9819             78598
    64    65   Ardsley-on-Hudson, New York[13]  10503         115             78591
    65    66                New York, New York  10024       61414             77824
    66    67           Essex Fells, New Jersey   7021        2151             77787
    67    68                     Rye, New York  10580       16737             77721
    68    69             Glenbrook, Nevada[14]  89413         365             77639
    69    70               Darien, Connecticut   6820       19607             77519
    70    71                  Captiva, Florida  33924         339             77458
    71    72               Mill Neck, New York  11765         732             77420
    72    73               Rex, North Carolina  28378          49             77306
    73    74          Indian Wells, California  92210        3859             77302
    74    75         Newport Coast, California  92657        5586             76870
    75    76        Corona del Mar, California  92625       13407             76704
    76    77              Wilmington, Delaware  19807        7345             76651
    77    78                     Dallas, Texas  75225       20314             76203
    78    79                 Chicago, Illinois  60601        5591             76157
    79    80             Lake Forest, Illinois  60045       22248             75991
    80    81           Los Angeles, California  90049       33520             75965
    81    82               Vero Beach, Florida  32963       14077             75761
    82    83                 Bedford, New York  10506        5537             75723
    83    84         San Francisco, California  94111        3335             75344
    84    85               Weston, Connecticut   6883       10037             74817
    85    86          Paradise Valley, Arizona  85253       17560             74605
    86    87             Pound Ridge, New York  10576        4530             74127
    87    88             Westport, Connecticut   6880       25807             74064
    88    89                  Washington, D.C.  20004         901             73803
    89    90            Old Westbury, New York  11568        3992             72932
    90    91                New York, New York  10128       59856             72691
    91    92             Teterboro, New Jersey   7608          18             72613
    92    93    Old Greenwich, Connecticut[15]   6870        7092             72317
    93    94                     Austin, Texas  78730        4885             72110
    94    95        Bloomfield Hills, Michigan  48302       16409             71985
    95    96              Norwalk, Connecticut   6853        3466             71642
    96    97                Rumson, New Jersey   7760        9665             71585
    97    98           Corolla, North Carolina  27927         648             71301
    98    99                 Gates Mills, Ohio  44040        2883             71016
    99   100                 Chicago, Illinois  60606        1682             70878
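One detail in the output above: the ZCTA column was parsed as integers, so codes that begin with zero lose it (Greenwich, Connecticut shows as 6831 and Short Hills, New Jersey as 7078). A minimal sketch of restoring the five-digit form, using a small hand-built DataFrame with rows copied from the output rather than a live request:

```python
import pandas as pd

# Three rows copied from the printed output above; ZCTA is an integer column,
# which is how pandas parsed it.
df = pd.DataFrame(
    {
        "Designation": [
            "Greenwich, Connecticut",
            "Beverly Hills, California",
            "Short Hills, New Jersey",
        ],
        "ZCTA": [6831, 90210, 7078],
    }
)

# Convert to text and left-pad with zeros to five characters.
df["ZCTA"] = df["ZCTA"].astype(str).str.zfill(5)
print(df)
```

pandas' read_html also accepts a converters argument (for example converters={"ZCTA": str}), which should keep the column as text at parse time, so the padding step may not be needed if you use it.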
    
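The hard-coded index is fragile, since it shifts whenever tables are added to or removed from the page. One way to find the right index, or to avoid it altogether, is to list every parsed table and then select by column name. The sketch below is only an illustration: it runs on a tiny inline HTML sample instead of the live page, the first table's values are made-up placeholders, and the column label "Per capita income" is an assumption, so inspect the real headers before relying on it.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page. The first table holds made-up placeholder values;
# the second holds the first row of the output shown above.
html = """
<table>
  <tr><th>Rank</th><th>ZCTA</th><th>Median household income</th></tr>
  <tr><td>1</td><td>11111</td><td>100</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>ZCTA</th><th>Per capita income</th></tr>
  <tr><td>1</td><td>19710</td><td>654485</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))

# Summarize each table so the right index is easy to spot.
summary = [(i, t.shape, list(t.columns)) for i, t in enumerate(tables)]
for index, shape, columns in summary:
    print(index, shape, columns)

# Or skip the index and take the first table that has the column we care about.
wanted = next(t for t in tables if "Per capita income" in t.columns)
print(wanted)
```

Against the real page you would pass the URL to pd.read_html in place of the StringIO object.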

    If you want to be more specific, you can use bs4 and a CSS selector:

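    from io import StringIO
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://en.wikipedia.org/wiki/List_of_highest-income_ZIP_Code_Tabulation_Areas_in_the_United_States"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    # Select the <table> that immediately follows the <h2> containing the heading ID.
    tbl = soup.select_one("h2:has(#ZCTAs_ranked_by_per_capita_income) + table")
    # Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings.
    df = pd.read_html(StringIO(str(tbl)))[0]
    print(df)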
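If the CSS selector feels opaque, the same idea can be written with BeautifulSoup's find and find_next: locate the element carrying the heading ID, then take the first table that comes after it in the document. The sketch below runs on a trimmed inline copy of the markup quoted in the question, with table contents reduced to a couple of cells for illustration, so it is a demonstration of the technique rather than a tested scraper for the live page.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page: two tables sharing one class, each under its own heading.
# The "First_table" heading ID is a hypothetical placeholder.
html = """
<h2><span class="mw-headline" id="First_table">First table</span></h2>
<table class="toccolours sortable"><tr><th>A</th></tr><tr><td>1</td></tr></table>
<h2><span class="mw-headline" id="ZCTAs_ranked_by_per_capita_income">ZCTAs ranked by per capita income</span></h2>
<table class="toccolours sortable">
  <tr><th>Rank</th><th>ZCTA</th></tr>
  <tr><td>1</td><td>19710</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The ID is unique, so it anchors the search even though the table classes are identical.
heading = soup.find(id="ZCTAs_ranked_by_per_capita_income")

# find_next walks forward in document order and returns the first matching tag.
table = heading.find_next("table")

# Flatten the table into a list of rows of cell text.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Unlike the "+ table" selector, find_next does not require the table to be the heading's immediate sibling, which makes it a bit more tolerant of wrapper elements around the heading.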
    

    【Discussion】:
