【Title】: Extract table from web
【Posted】: 2019-05-20 04:31:23
【Question】:

I need to extract the data behind the "Data" link from the URL below. Any clue how to extract this table into a DataFrame?

from bs4 import BeautifulSoup
import requests

url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'

r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, features='html.parser')

#print(soup.prettify())
print(soup.title)

【Comments】:

    Tags: python pandas web-scraping beautifulsoup web-crawler


    【Solution 1】:

    It may be easier to start with a multidimensional list and then port it into a DataFrame, so that we make no assumptions about the table's size. The "Data" hyperlink refers to the div with id=0, so we select all elements inside it, then parse each column of each row into a list (which I call `elements` here) that gets appended to a full list (which I call `elementsfull`) and reset for each new row.

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests
    
    url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'
    
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, features='html.parser')
    
    #print(soup.prettify())
    print(soup.title.text)
    datadiv=soup.find("div", {"id": "0"})
    elementsfull =[]
    row=0
    for tr in datadiv.findAll("tr"):
        elements=[]
        column=0
        for td in tr.findAll("td"):
            if(td.text!=''):
                elements.append(td.text)
                column+=1
                #print('column: ', column)   
    
        elementsfull.append(elements)        
        #print('row: ', row)        
        row+=1
    
    mydf = pd.DataFrame(data=elementsfull)
    print(mydf)
    

    I tested this code and checked it against the table, so I can vouch that it works.
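A small follow-up (a sketch with hypothetical data, assuming the first scraped row holds the column names): the DataFrame built above keeps the sheet's header row as an ordinary data row, which you can promote to the column index like so:

```python
import pandas as pd

# Stand-in for the scraped list of lists (hypothetical values);
# the first inner list plays the role of the sheet's header row.
elementsfull = [['Name', 'Score'], ['Alice', '10'], ['Bob', '20']]

mydf = pd.DataFrame(data=elementsfull)

# Promote the first row to the header and drop it from the body.
mydf.columns = mydf.iloc[0]
mydf = mydf.iloc[1:].reset_index(drop=True)
print(mydf)
```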

    【Discussion】:

      【Solution 2】:
      import bs4 as bs
      import requests
      import pandas as pd
      
      url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'
      
      r = requests.get(url)
      html_doc = r.text
      soup = bs.BeautifulSoup(html_doc, features='html.parser')
      
      table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
      # fall back to every row on the page if that table class is not present
      table_rows = table.find_all('tr') if table else soup.find_all('tr')
      
      list1 = []
      for tr in table_rows:
          cells = tr.find_all('td')
          row = [cell.text for cell in cells]
          list1.append(row)
      
      df=pd.DataFrame(list1)    
      df.columns =  df.iloc[1]
      #starting from this point,it's just how you want to clean and slice the data
      df = df.iloc[3:263]  #check the data to see if you want to only read these
      df.dropna(axis='columns', how='all', inplace=True)
      

      【Discussion】:

        【Solution 3】:

        You can read the table with `read_html` and then process the DataFrame as needed.

        import pandas as pd
        results = pd.read_html('https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#')
        result = results[0].dropna(how='all')
        del result[0]
        result.dropna(axis='columns', how='all', inplace=True)
        result.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf_8_sig',index = False, header=None)
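As a self-contained variant (a sketch using a tiny hypothetical table in place of the published sheet), `read_html` can also parse an HTML string directly, which makes this approach easy to test offline:

```python
import pandas as pd
from io import StringIO

# A minimal HTML table standing in for the published sheet (hypothetical data).
html = """
<table>
  <tr><td>Name</td><td>Score</td></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>20</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the document.
df = pd.read_html(StringIO(html))[0]

# Promote the first row to the header, as the answers above do with iloc.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
print(df)
```

Note that `read_html` needs an HTML parser backend (lxml, or bs4 with html5lib) installed.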
        

        【Discussion】:
