【Question Title】: Scrape Embedded Google Sheet from HTML in Python
【Posted】: 2020-02-12 18:41:06
【Question】:

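This one is a bit tricky for me. I'm trying to extract an embedded table from a Google Sheet in Python.

Here is the link.

I don't own the sheet, but it is public.

Here is my code so far. When I print the headers, it shows me "". Any help would be greatly appreciated. The end goal is to convert this table into a pandas DataFrame. Thanks, everyone.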

import requests  # needed for requests.get() below
import lxml.html as lh
import pandas as pd

url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'

page = requests.get(url)

doc = lh.fromstring(page.content)

tr_elements = doc.xpath('//tr')

col = []
i = 0

for t in tr_elements[0]:
    i +=1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[])) 

【Comments】:

    Tags: python google-sheets scrape


    【Solution 1】:

    If you want the data in a DataFrame, you can load it directly:

    df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727', 
                      header=1)[0]
    df.drop(columns='1', inplace=True)  # remove unnecessary index column called "1"
    

    This gives you:

                                   Target Ticker                   Acquirer  \
    0       Acacia Communications Inc Com   ACIA      Cisco Systems Inc Com   
    1  Advanced Disposal Services Inc Com   ADSW   Waste Management Inc Com   
    2                    Allergan Plc Com    AGN             Abbvie Inc Com   
    3           Ak Steel Holding Corp Com    AKS   Cleveland Cliffs Inc Com   
    4      Td Ameritrade Holding Corp Com   AMTD  Schwab (Charles) Corp Com   
    
      Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced  \
    0     CSCO        $68.79          $70.00      $1.21  1.76%       7/9/2019   
    1       WM        $32.93          $33.15      $0.22  0.67%      4/15/2019   
    2     ABBV       $197.05         $200.22      $3.17  1.61%      6/25/2019   
    3      CLF         $2.98           $3.02      $0.04  1.34%      12/3/2019   
    4     SCHW        $49.31          $51.27      $1.96  3.97%     11/25/2019   
    
      Deal Type  
    0      Cash  
    1      Cash  
    2       C&S  
    3     Stock  
    4     Stock  
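
The price, percentage, and date columns in the output above are still strings. Since the stated goal is a usable DataFrame, the following is a minimal sketch of one way to convert them. The sample rows are copied from the output above, and `to_number` is a hypothetical helper, not part of pandas; the month/day/year date format is an assumption based on the values shown.

```python
import pandas as pd

# Sample rows copied from the output above; every value arrives as a string.
df = pd.DataFrame({
    "Ticker": ["ACIA", "ADSW", "AGN"],
    "Current Price": ["$68.79", "$32.93", "$197.05"],
    "Take Over Price": ["$70.00", "$33.15", "$200.22"],
    "% Diff": ["1.76%", "0.67%", "1.61%"],
    "Date Announced": ["7/9/2019", "4/15/2019", "6/25/2019"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$', ',' and '%' and convert to float."""
    cleaned = series.astype(str).str.replace(r"[$,%]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


for col in ["Current Price", "Take Over Price", "% Diff"]:
    df[col] = to_number(df[col])

# Assumption: dates are month/day/year, as the sample values suggest.
df["Date Announced"] = pd.to_datetime(df["Date Announced"], format="%m/%d/%Y")

print(df.dtypes)
```

After this step, "% Diff" holds the percentage as a plain number (1.76 rather than 0.0176); divide by 100 if a fraction is preferred.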
    

    Note that read_html returns a list of DataFrames. In this case there is only one DataFrame, so we select the first (and only) element with [0].
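
To see the list behaviour and the effect of `header=1` without hitting the network, here is a small sketch that runs `pd.read_html` on an in-memory HTML string. The table layout is an assumption about how the published sheet is structured (a first row of column letters and a first column of row numbers); it is not copied from the actual page. Like the original call, it needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the published sheet's HTML (assumed layout):
# row 0 holds column letters, the first column holds row numbers.
html = """
<table>
  <tr><td></td><td>A</td><td>B</td></tr>
  <tr><td>1</td><td>Target</td><td>Ticker</td></tr>
  <tr><td>2</td><td>Acacia Communications Inc Com</td><td>ACIA</td></tr>
  <tr><td>3</td><td>Allergan Plc Com</td><td>AGN</td></tr>
</table>
"""

# read_html always returns a list, one DataFrame per <table> it finds.
# header=1 uses the second row as column names and discards the row above it.
tables = pd.read_html(StringIO(html), header=1)
print(type(tables), len(tables))

# The row-number column is named after its header cell, the string "1".
df = tables[0].drop(columns="1")
print(df)
```

If a page contains several tables, inspecting `len(tables)` and the first rows of each entry is a quick way to find the right index before hard-coding `[0]`.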
