【Question title】: How to scrape Wikipedia tables with Python
【Posted】: 2019-03-19 05:59:06
【Question】:

The table I want to extract is at https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia, but my code returns no data. How can I get it?

Code:

import requests
from bs4 import BeautifulSoup as bs
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')
ta=soup.find_all('table',class_="wikitable sortable jquery-tablesorter")
print(ta)

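【Question discussion】:

  • So what is the error/problem?
  • I am not getting the table.
  • What do you get instead?
  • Have you tried leaving out the jquery-tablesorter class, e.g. find_all('table', class_="wikitable sortable")?
  • I get an empty list.

Tags: python python-3.x url beautifulsoup python-requests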


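【Solution 1】:

If I'm pulling a table and see <table> tags, I always try pandas' .read_html() first. It handles iterating over the rows for you. Most of the time you get exactly what you need, or at worst you only need some minor manipulation of the DataFrame. In this case it gives you the full table nicely: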

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
table = pd.read_html(url)[1]

Output:

print (table.to_string())
                                   0                   1                                  2                  3        4                                                  5
0                               Name            Industry                             Sector       Headquarters  Founded                                              Notes
1                  Airfast Indonesia   Consumer services                           Airlines          Tangerang     1971                                    Private airline
2                       Angkasa Pura         Industrials            Transportation services            Jakarta     1962                               State-owned airports
3                Astra International       Conglomerates                                  -            Jakarta     1957    Automotive, financials, industrials, technology
4                  Bank Central Asia          Financials                              Banks            Jakarta     1957                                               Bank
5                       Bank Danamon          Financials                              Banks            Jakarta     1956                                               Bank
6                       Bank Mandiri          Financials                              Banks            Jakarta     1998                                               Bank
7              Bank Negara Indonesia          Financials                              Banks            Jakarta     1946                                               Bank
8              Bank Rakyat Indonesia          Financials                              Banks            Jakarta     1895                                 Micro-finance bank
9                     Bumi Resources     Basic materials                     General mining            Jakarta     1973                                             Mining
10                            Djarum      Consumer goods                            Tobacco  Kudus and Jakarta     1951                                            Tobacco
11   Dragon Computer & Communication          Technology                  Computer hardware            Jakarta     1980                                  Computer hardware
12             Elex Media Komputindo   Consumer services                         Publishing            Jakarta     1985                                          Publisher
13                            Femina   Consumer services                              Media            Jakarta     1972                                    Weekly magazine
14                  Garuda Indonesia   Consumer services                   Travel & leisure          Tangerang     1949                                State-owned airline
15                      Gudang Garam      Consumer goods                            Tobacco             Kediri     1958                                            Tobacco
16                      Gunung Agung   Consumer services                Specialty retailers            Jakarta     1953                                         Bookstores
17       Indocement Tunggal Prakarsa         Industrials      Building materials & fixtures            Jakarta     1985         Cement, part of HeidelbergCement (Germany)
18                          Indofood      Consumer goods                      Food products            Jakarta     1968                                    Food production
19              Indonesian Aerospace         Industrials                          Aerospace            Bandung     1976                        State-owned aircraft design
20    Indonesian Bureau of Logistics      Consumer goods                      Food products            Jakarta     1967                                  Food distribution
21                           Indosat  Telecommunications      Fixed line telecommunications            Jakarta     1967                         Telecommunications network
22               Infomedia Nusantara   Consumer services                         Publishing            Jakarta     1975                                Directory publisher
23      Jalur Nugraha Ekakurir (JNE)         Industrials                  Delivery services            Jakarta     1990                                  Express logistics
24                       Kalbe Farma         Health care                    Pharmaceuticals            Jakarta     1966                                    Pharmaceuticals
25              Kereta Api Indonesia         Industrials                          Railroads            Bandung     1945                                State-owned railway
26                       Kimia Farma         Health care                    Pharmaceuticals            Jakarta     1971                                 State-owned pharma
27             Kompas Gramedia Group   Consumer services                     Media agencies            Jakarta     1965                                      Media holding
28                    Krakatau Steel     Basic materials                       Iron & steel            Cilegon     1970                                  State-owned steel
29                          Lion Air   Consumer services                           Airlines            Jakarta     2000                                   Low-cost airline
30                       Lippo Group          Financials  Real estate holding & development            Jakarta     1950                                        Development
31                          Matahari   Consumer services                Broadline retailers          Tangerang     1982                                  Department stores
32                       MedcoEnergi           Oil & gas           Exploration & production            Jakarta     1980                                Energy, oil and gas
33             Media Nusantara Citra   Consumer services       Broadcasting & entertainment            Jakarta     1997                                              Media
34                   Panin Sekuritas          Financials                Investment services            Jakarta     1989                                             Broker
35                         Pegadaian          Financials                   Consumer finance            Jakarta     1901                     State-owned financial services
36                             Pelni         Industrials              Marine transportation            Jakarta     1952                                           Shipping
37                     Pos Indonesia         Industrials                  Delivery services            Bandung     1995                         State-owned postal service
38                         Pertamina           Oil & gas               Integrated oil & gas            Jakarta     1957                    State-owned oil and natural gas
39             Perusahaan Gas Negara           Oil & gas           Exploration & production            Jakarta     1965                                                Gas
40             Perusahaan Gas Negara           Utilities                   Gas distribution            Jakarta     1965             State-owned natural gas transportation
41         Perusahaan Listrik Negara           Utilities           Conventional electricity            Jakarta     1945                State-owned electrical distribution
42  Phillip Securities Indonesia, PT          Financials                Investment services            Jakarta     1989                                 Financial services
43                            Pindad         Industrials                            Defense            Bandung     1808                                State-owned defense
44                PT Lapindo Brantas           Oil & gas           Exploration & production            Jakarta     1996                                        Oil and gas
45   PT Metro Supermarket Realty Tbk   Consumer services       Food retailers & wholesalers            Jakarta     1955                                       Supermarkets
46                       Salim Group       Conglomerates                                  -            Jakarta     1972            Industrials, financials, consumer goods
47                         Sampoerna      Consumer goods                            Tobacco           Surabaya     1913                                            Tobacco
48                   Semen Indonesia         Industrials      Building materials & fixtures             Gresik     1957                                             Cement
49                          Susi Air   Consumer services                           Airlines        Pangandaran     2004                                    Charter airline
50                  Telkom Indonesia  Telecommunications      Fixed line telecommunications            Bandung     1856                         Telecommunication services
51                         Telkomsel  Telecommunications          Mobile telecommunications            Jakarta     1995           Mobile network, part of Telkom Indonesia
52                        Trans Corp       Conglomerates                                  -            Jakarta     2006  Media, consumer services, real estate, part of...
53                Unilever Indonesia      Consumer goods                  Personal products            Jakarta     1933  Personal care products, part of Unilever (Neth...
54                   United Tractors         Industrials       Commercial vehicles & trucks            Jakarta     1972                                    Heavy equipment
55                           Waskita         Industrials                 Heavy construction            Jakarta     1961                           State-owned construction
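
The output above still has the column names sitting in row 0 under integer column labels, which is the kind of minor DataFrame cleanup this answer mentions. A minimal sketch of promoting that row to the header, assuming a small hand-built frame (the name `raw` and the three sample rows are illustrative, copied from the output above) instead of the live page:

```python
import pandas as pd

# Stand-in for the frame returned by read_html: header text is in row 0
raw = pd.DataFrame([
    ["Name", "Industry", "Founded"],
    ["Airfast Indonesia", "Consumer services", "1971"],
    ["Angkasa Pura", "Industrials", "1962"],
])

header = raw.iloc[0].tolist()                # first row holds the column names
table = raw.iloc[1:].reset_index(drop=True)  # drop that row and renumber from 0
table.columns = header
table["Founded"] = pd.to_numeric(table["Founded"])  # strings -> integers

print(table)
```

Passing header=0 to read_html, as a later answer does, is another way to reach the same result.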

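【Discussion】:

【Solution 2】:

import requests
from bs4 import BeautifulSoup as bs

URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
# Use URL, the name defined above (lowercase `url` would raise a NameError here)
html = requests.get(URL).text
soup = bs(html, 'html.parser')
# Match on the single class "wikitable" via an attribute dict
ta = soup.find_all('table', {'class': 'wikitable'})
print(ta)

You can search for the table by class name the old way, with an attribute dict. It still seems to work.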
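
A likely reason the question's selector returns an empty list: the jquery-tablesorter class appears to be added by JavaScript in the browser, so it is not in the HTML that requests downloads (the raw output shown in another answer has only class="wikitable sortable"). BeautifulSoup treats a multi-word class_ string as an exact match on the whole class attribute, while a single class name matches any tag carrying it. A small sketch on an inline HTML snippet, which is made up for illustration and not fetched from Wikipedia:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the static HTML that requests would receive
html = """
<table class="wikitable sortable"><tr><th>Name</th></tr><tr><td>Pertamina</td></tr></table>
<table class="infobox"><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string that includes a class missing from the static HTML: no match
with_js_class = soup.find_all("table", class_="wikitable sortable jquery-tablesorter")
# A single class name matches any table that carries it
single_class = soup.find_all("table", class_="wikitable")
# A CSS selector requires both classes, in any order
css_select = soup.select("table.wikitable.sortable")

print(len(with_js_class), len(single_class), len(css_select))
```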

【Discussion】:

【Solution 3】:

Fixes

1. Use URL instead of url in your code (line 4)
2. Use the class wikitable
3. Tidied up your code a little

Result

import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia")
soup = BeautifulSoup(page.content, 'html.parser')
# "wikitable" alone matches any table that carries that class
ta = soup.find_all('table', class_="wikitable")

print(ta)

Output

[<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Image
</th>
<th>Name
</th>
<th>2016 Revenues (USD $M)
</th>
<th>Employees
</th>
<th>Notes
.
.
.

【Discussion】:

  • Hello, many thanks for this great answer and the tips. Awesome!
【Solution 4】:

This may not be exactly what you want, but you can try it.

import requests
from bs4 import BeautifulSoup as bs

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')

# Print the text of every link found in any cell of any wikitable
for data in soup.find_all('table', {"class": "wikitable"}):
    for td in data.find_all('td'):
        for link in td.find_all('a'):
            print(link.text)
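
The loop above prints only the link text, so cells without links are skipped and the row structure is lost. If whole rows are wanted, one possible approach is to walk the tr elements and collect the text of each cell. A minimal sketch on an inline snippet: the helper name `table_to_rows` is hypothetical, and the sample rows are copied from the table output in Solution 1.

```python
from bs4 import BeautifulSoup

# Made-up snippet shaped like the Wikipedia table, so no network access is needed
html = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Industry</th><th>Headquarters</th></tr>
<tr><td><a href="#">Airfast Indonesia</a></td><td>Consumer services</td><td>Tangerang</td></tr>
<tr><td><a href="#">Angkasa Pura</a></td><td>Industrials</td><td>Jakarta</td></tr>
</table>
"""

def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])  # header and data cells alike
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(html, "html.parser")
rows = table_to_rows(soup.find("table", class_="wikitable"))
for row in rows:
    print(row)
```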
      

【Discussion】:

【Solution 5】:

Try the following:

import requests
from bs4 import BeautifulSoup as bs

URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(URL).text
soup = bs(html, 'html.parser')
# find() returns only the first matching table
ta = soup.find("table", {"class": "wikitable sortable"})
print(ta)

To get all the tables:

ta = soup.find_all("table", {"class": "wikitable sortable"})
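
The two calls differ in what they return: find() gives the first matching tag, or None when nothing matches, while find_all() gives a list-like result that may be empty. A short sketch on made-up inline HTML (the class name no-such-class is deliberately absent) shows why it is worth checking for None before using the result of find():

```python
from bs4 import BeautifulSoup

# Two made-up tables carrying the same class attribute
html = (
    '<table class="wikitable sortable"><tr><td>A</td></tr></table>'
    '<table class="wikitable sortable"><tr><td>B</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("table", {"class": "wikitable sortable"})      # first match only
every = soup.find_all("table", {"class": "wikitable sortable"})  # all matches
missing = soup.find("table", {"class": "no-such-class"})         # no match

print(first.td.get_text())
print(len(every))
print(missing is None)
```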
        

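【Discussion】:

【Solution 6】:

If you want to parse tabular data, you can do it with pandas. It is also an efficient choice if you want to manipulate the data afterwards, because each table comes back as a pandas DataFrame.

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
# header=0 uses the first row of each table as the column names
table = pd.read_html(url, header=0)
print(table[1])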
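
Picking the table by position ([1]) depends on the page layout at the time of writing and may break if the article changes. One alternative worth considering is the match argument of read_html, which keeps only tables whose text matches a regular expression. A minimal sketch on an inline snippet, assuming an HTML parser such as lxml is installed; the snippet and the choice of "Headquarters" as the pattern are illustrative:

```python
from io import StringIO

import pandas as pd

# Made-up page fragment: an unrelated table followed by the one we want
html = """
<table class="infobox"><tr><th>Capital</th></tr><tr><td>Jakarta</td></tr></table>
<table class="wikitable sortable">
<tr><th>Name</th><th>Industry</th><th>Headquarters</th><th>Founded</th></tr>
<tr><td>Airfast Indonesia</td><td>Consumer services</td><td>Tangerang</td><td>1971</td></tr>
<tr><td>Angkasa Pura</td><td>Industrials</td><td>Jakarta</td><td>1962</td></tr>
</table>
"""

# Keep only tables containing the text "Headquarters"; row 0 becomes the header
tables = pd.read_html(StringIO(html), match="Headquarters", header=0)
companies = tables[0]

print(len(tables))
print(companies)
```

With the real page, the same call would take the URL from the answer in place of the StringIO object.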
          

【Discussion】:
