【Title】: Convert HTML that doesn't contain a table to a pandas DataFrame
【Posted】: 2021-01-27 11:47:31
【Question】:

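I have some HTML that I want to read with pandas. The problem is that the HTML is not a table, even though it looks like one on the website. This is what I have: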

table = '''
<div id="companyResults">
<div class="col-md-12 titles">
<div class="col-md-6"> </div>
<div class="col-md-4">LOCATION</div>
<div class="col-md-2 last">SALES REVENUE ($M)</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
                                        Shenzhen Zhaoji Optical Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
                                        Foxconn Industrial Internet Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $40,833.44M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.boe_technology_group_co_ltd.61b87aa6bc863b69d8d7689703a3ac52.html">
                                        BOE Technology Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Beijing,
                                Beijing,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $16,495.55M</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.futong_group_co_ltd.85c12cb0d89005d1280cd3c0c13879ff.html">
                                        Futong Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Hangzhou,
                                Zhejiang,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.ofilm_group_co_ltd.515f10b35d850547d16fb6d6875a57d9.html">
                                        OFILM Group Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
                                Shenzhen,
                                Guangdong,
                                <br/>
                                China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
                                $5,355.25M</div>
</div>
'''

I want output like this:

                                                            LOCATION  \
0      Shenzhen Zhaoji Optical Co., Ltd.  Shenzhen, Guangdong, China   
1  Foxconn Industrial Internet Co., Ltd.  Shenzhen, Guangdong, China   
2         BOE Technology Group Co., Ltd.     Beijing, Beijing, China   
3                 Futong Group Co., Ltd.   Hangzhou, Zhejiang, China   
4                  OFILM Group Co., Ltd.  Shenzhen, Guangdong, China   

  SALES REVENUE ($M)  
0                     
1        $40,833.44M  
2        $16,495.55M  
3                     
4         $5,355.25M  

I tried:

pd.read_html(str(table))

But I got this:

ValueError: No tables found

So what is the best way to achieve this? PS: adding more detail to each row (such as the href) would be welcome, but it is not required.

Update: url

【Comments on the question】:

    Tags: python html beautifulsoup


    【Solution 1】:

    You might want to try this:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from tabulate import tabulate
    
    url = "https://www.dnb.com/business-directory/company-information.semiconductorelectronic-component-manufacturing.cn.html?page=1"
    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0",
    }
    page = requests.get(url, headers=headers).text
    
    # The page has no <table>; each company is one <div class="col-md-12 data">.
    # The "html5lib" parser is a separate package and must be installed.
    rows = BeautifulSoup(page, "html5lib").find_all("div", class_="col-md-12 data")
    
    # Company name: the text of the link in each row.
    companies = [d.find("a").getText(strip=True) for d in rows]
    
    # Location: drop the "Country:" label, then rejoin the comma-separated parts.
    countries = [
        ", ".join(
            c.strip() for c in
            d.find(
                "div", class_="col-md-4"
            ).getText(strip=True).rsplit(":")[-1].split(",")
        ) for d in rows
    ]
    
    # Revenue: the text after the "Sales Revenue ($M):" label (empty if missing).
    revenue = [
        d.find("div", class_="col-md-2 last").getText(strip=True).rsplit(":")[-1]
        for d in rows
    ]
    
    columns = ["Company", "Location", "Revenue"]
    df = pd.DataFrame(
        list(zip(companies, countries, revenue)),
        columns=columns,
    )
    
    # Print the DataFrame as a bordered text table.
    print(tabulate(df, headers=columns, tablefmt="pretty"))
    
    

    Sample output for page 1:

    +----+------------------------------------------------------------+----------------------------+-------------+
    |    |                          Company                           |          Location          |   Revenue   |
    +----+------------------------------------------------------------+----------------------------+-------------+
    | 0  |             Shenzhen Zhaoji Optical Co., Ltd.              | Shenzhen, Guangdong, China |             |
    | 1  |           Foxconn Industrial Internet Co., Ltd.            | Shenzhen, Guangdong, China | $40,833.44M |
    | 2  |               BOE Technology Group Co., Ltd.               |  Beijing, Beijing, China   | $16,495.55M |
    | 3  |                   Futong Group Co., Ltd.                   | Hangzhou, Zhejiang, China  |             |
    | 4  |                   OFILM Group Co., Ltd.                    | Shenzhen, Guangdong, China | $5,355.25M  |
    | 5  |    Universal Scientific Industrial (Shanghai) Co., Ltd.    | Shanghai, Shanghai, China  | $5,287.83M  |
    | 6  |           Huizhou Jinyang Electronics Co., Ltd.            | Huizhou, Guangdong, China  |             |
    | 7  |                        Goertek Inc.                        |  Weifang, Shandong, China  | $5,018.67M  |
    | 8  |                    AUX Group Co., Ltd.                     |  Ningbo, Zhejiang, China   |             |
    | 9  |                    Jinko Solar Co., Ltd                    |  Shangrao, Jiangxi, China  |             |
    | 10 |              Samsung Display Dongguan Co.,Ltd              | Dongguan, Guangdong, China |             |
    | 11 |             Wuhan Zhongqiao Electric Co., Ltd.             |    Wuhan, Hubei, China     |             |
    | 12 |                   Trina Solar Co., Ltd.                    | Changzhou, Jiangsu, China  |             |
    | 13 |              Lingyi iTech (Guangdong) Company              | Jiangmen, Guangdong, China | $3,399.16M  |
    | 14 |                    Jcet Group Co., Ltd.                    |  Jiangyin, Jiangsu, China  | $3,343.79M  |
    | 15 |             TPV Electronics (Fujian) Co., Ltd.             |   Fuqing, Fujian, China    |             |
    | 16 |             Tianma Microelectronics Co., Ltd.              | Shenzhen, Guangdong, China | $3,277.77M  |
    | 17 |           Fortech Electronics (Suzhou) Co., Ltd.           |   Suzhou, Jiangsu, China   |             |
    | 18 |                   JingAo Solar Co., Ltd.                   |   Xingtai, Hebei, China    |             |
    | 19 |     Suzhou Dongshan Precision Manufacturing Co., Ltd.      |   Suzhou, Jiangsu, China   | $2,637.61M  |
    | 20 |                Holitech Technology Co.,Ltd.                |    Jian, Jiangxi, China    | $2,629.38M  |
    | 21 |     Ezhou Jianfeng Heavy Industry Machinery Co., Ltd.      |    Ezhou, Hubei, China     |             |
    | 22 |          Beijing BOE Display Technology Co., Ltd.          |  Beijing, Beijing, China   |             |
    | 23 |           Avary Holding (Shenzhen) Co., Limited            | Shenzhen, Guangdong, China | $2,523.89M  |
    | 24 |                 Bright Oceans Corporation                  |  Beijing, Beijing, China   | $2,499.47M  |
    | 25 |         Tunghsu Optoelectronic Technology Co.,Ltd.         |  Beijing, Beijing, China   | $2,491.36M  |
    | 26 |          Wuxi Taiji Industry Limited Corporation           |    Wuxi, Jiangsu, China    | $2,404.47M  |
    | 27 |         Tianjin Zhonghuan Semiconductor Co., Ltd.          |  Tianjin, Tianjin, China   | $2,400.15M  |
    | 28 |             Tpk Touch Solutions (Xiamen) Inc.              |   Xiamen, Fujian, China    |             |
    | 29 |       Mektec Manufacturing Corporation (Zhuhai) Ltd.       |  Zhuhai, Guangdong, China  |             |
    | 30 |                Truly Opto-Electronics Ltd.                 | Shanwei, Guangdong, China  |             |
    | 31 |         Guangdong HEC Technology Holding Co., Ltd.         | Dongguan, Guangdong, China | $2,098.86M  |
    | 32 |           Lingyi Technology (Shenzhen) Co., Ltd.           | Shenzhen, Guangdong, China |             |
    | 33 |   Zhejiang Longji Leye Photovoltaic Technology Co., Ltd.   |  Quzhou, Zhejiang, China   |             |
    | 34 |                Shengyi Technology Co., Ltd.                | Dongguan, Guangdong, China | $1,881.96M  |
    | 35 |            Shenzhen Kaifa Technology Co., Ltd.             | Shenzhen, Guangdong, China | $1,879.50M  |
    | 36 |              Shanghai Huahong(Group) Co.,Ltd               | Shanghai, Shanghai, China  |             |
    | 37 |         Wuhan P&S Information Technology Co.,Ltd.          |    Wuhan, Hubei, China     | $1,866.39M  |
    | 38 |              Longi Solar Technology Co.,Ltd.               |    Xian, Shaanxi, China    |             |
    | 39 |               Sungrow Power Supply Co., Ltd.               |    Hefei, Anhui, China     | $1,720.88M  |
    | 40 | Henan Shuangchen Electronic Science & Technology Co., Ltd. |   Zhoukou, Henan, China    |             |
    | 41 |             Fujian Furi Electronics Co., Ltd.              |   Fuzhou, Fujian, China    | $1,617.07M  |
    | 42 |                    Risen Energy Co.,Ltd                    |  Ningbo, Zhejiang, China   | $1,564.92M  |
    | 43 |           Dongguan Fuqiang Electronics Co.,Ltd.            | Dongguan, Guangdong, China |             |
    | 44 |     Hongfujin Precision Industry (Shenzhen) Co., Ltd.      | Shenzhen, Guangdong, China |             |
    | 45 |             Gcl-Poly (Su Zhou) Energy Limited              |   Suzhou, Jiangsu, China   |             |
    | 46 |                 Shennan Circuits Co., Ltd.                 | Shenzhen, Guangdong, China | $1,495.80M  |
    | 47 |           Futaihua Industry (Shenzhen) Co., Ltd.           | Shenzhen, Guangdong, China |             |
    | 48 |            Hefei JA Solar Technology Co., Ltd.             |    Hefei, Anhui, China     |             |
    | 49 |        Foxconn Kunshan Computer Connector Co., Ltd.        |  Kunshan, Jiangsu, China   |             |
    +----+------------------------------------------------------------+----------------------------+-------------+
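
    The question also asks, optionally, for the href of each row. The following is a minimal sketch of one way to collect it, run offline against a shortened copy of the markup from the question rather than the live page. The helper names `cell_text` and `parse_rows`, the `Href` column, and the choice of the built-in `html.parser` are assumptions made for illustration; they are not part of the answer above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (first two rows only).
html = '''
<div id="companyResults">
<div class="col-md-12 titles">
<div class="col-md-6"> </div>
<div class="col-md-4">LOCATION</div>
<div class="col-md-2 last">SALES REVENUE ($M)</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.shenzhen_zhaoji_optical_co_ltd.bcf9d7eb4856eb739ec66272a6d9a361.html">
    Shenzhen Zhaoji Optical Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
    Shenzhen,
    Guangdong,
    <br/>
    China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
</div>
</div>
<div class="col-md-12 data">
<div class="col-md-6">
<a href="/business-directory/company-profiles.foxconn_industrial_internet_co_ltd.0d4c40a311dbfb1169684a21caa8794c.html">
    Foxconn Industrial Internet Co., Ltd.</a>
</div>
<div class="col-md-4">
<div class="show-mobile">Country:</div>
    Shenzhen,
    Guangdong,
    <br/>
    China</div>
<div class="col-md-2 last">
<div class="show-mobile">Sales Revenue ($M):</div>
    $40,833.44M</div>
</div>
</div>
'''


def cell_text(cell):
    """Return a cell's text without its mobile-only label, whitespace collapsed."""
    label = cell.find("div", class_="show-mobile")
    if label is not None:
        label.extract()  # removes "Country:" / "Sales Revenue ($M):"
    return " ".join(cell.get_text(" ").split())


def parse_rows(markup):
    """Build one DataFrame row per <div class="col-md-12 data"> block."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for row in soup.select("div.col-md-12.data"):
        link = row.find("a")
        records.append({
            "Company": link.get_text(strip=True),
            "Href": link["href"],
            "Location": cell_text(row.select_one("div.col-md-4")),
            "Revenue": cell_text(row.select_one("div.col-md-2.last")),
        })
    return pd.DataFrame(records, columns=["Company", "Href", "Location", "Revenue"])


df = parse_rows(html)
print(df)
```

    Removing the `show-mobile` label before reading the text avoids splitting on ":", so a colon inside a value would not truncate it. The hrefs are site-relative paths, so they would still need to be joined with the site's base URL to be usable as links.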
    

    【Comments】:
