【Question Title】: Get content of table in BeautifulSoup
【Posted】: 2015-12-02 18:08:56
【Question】:

I have the following table on a website that I am extracting with BeautifulSoup. This is the URL (I have also attached a picture).

Ideally I would like each company on its own row in the CSV, but instead I am getting the values spread across different rows. Please see the attached picture.

I want the data laid out as in field "D", but I am getting it in A1, A2, A3, ...

Here is the code I am using to extract it:

import csv

import requests
from bs4 import BeautifulSoup


def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")
        for item in text:
            spamwriter.writerow([item])

read_list = []
initial_list = []

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# every data cell in the table shares this class
gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})

for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""

_writeInCSV(initial_list)

Can someone help?

【Comments】:

  • It would be even better if I could replicate the entire table in the CSV, but I am struggling with how to do that.

Tags: python csv web-scraping beautifulsoup html-parsing


【Solution 1】:

Here is an idea:

  • read the header cells from the table
  • read all of the other rows from the table
  • zip every data row's cells with the headers, producing a list of dictionaries
  • dump to CSV with csv.DictWriter()
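The zip step is the heart of this approach, so here is a minimal sketch of it on toy data (the header names and row values below are made up for illustration):

```python
# Toy headers and rows standing in for the parsed <td> texts
headers = ["Company", "Dividend", "Bonus"]
rows = [
    ["Nigerian Breweries Plc", "N3.50", "Nil"],
    ["Forte Oil Plc", "N2.50", "1 for 5"],
]

# zip() pairs each cell with its column header; dict() turns the
# pairs into one record per company
data = [dict(zip(headers, row)) for row in rows]

print(data[0]["Company"])  # -> Nigerian Breweries Plc
```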

Implementation:

import csv
from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = soup.select("table.ms-rteTable-default tr")
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")]

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
        for row in rows[1:]]

# see what the data looks like at this point
pprint(data)

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")

    for row in data:
        spamwriter.writerow(row)
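The writer above targets Python 2 (binary file mode, byte-string cells). On Python 3 the same dump would look roughly like this; the sample rows are made up, and `newline=''` in text mode replaces the `'wb'` mode:

```python
import csv

# Made-up rows in the same shape the scraper produces
headers = ["Company", "Dividend"]
data = [
    {"Company": "Forte Oil Plc", "Dividend": "N2.50"},
    {"Company": "Nestle Nigeria", "Dividend": "N17.50"},
]

# Python 3: open in text mode with newline='' so the csv module
# controls line endings itself
with open("sara.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, headers, delimiter="\t")
    writer.writeheader()  # optional: emit the header row first
    writer.writerows(data)
```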

【Discussion】:

    【Solution 2】:

    Since @alecxe has already provided an amazing answer, here is another approach using the pandas library.

    import pandas as pd
    
    url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
    tables = pd.read_html(url)
    
    tb1 = tables[0] # Get the first table.
    tb1.columns = tb1.iloc[0] # Assign the first row as header.
    tb1 = tb1.iloc[1:] # Drop the first row.
    tb1.reset_index(drop=True, inplace=True) # Reset the index.
    
    print tb1.head() # Print first 5 rows.
    # tb1.to_csv("table1.csv") # Export to CSV file.
    

    Result:

    In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2')
    0                 Company       Dividend    Bonus     Closure of Register  \
    0  Nigerian Breweries Plc          N3.50      Nil   5th - 11th March 2015   
    1           Forte Oil Plc          N2.50  1 for 5    1st – 7th April 2015   
    2          Nestle Nigeria         N17.50      Nil         27th April 2015   
    3       Greif Nigeria Plc        60 kobo      Nil  25th - 27th March 2015   
    4       Guaranty Bank Plc  N1.50 (final)      Nil         17th March 2015   
    
    0          AGM Date     Payment Date  
    0     13th May 2015    14th May 2015  
    1   15th April 2015  22nd April 2015  
    2     11th May 2015    12th May 2015  
    3   28th April 2015     5th May 2015  
    4   31st March 2015  31st March 2015  
    
    In [6]: 
    

    【Discussion】:

    • I get the error: C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py Traceback (most recent call last): File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in <module> tables = pd.read_html(url) AttributeError: 'module' object has no attribute 'read_html'
    • Most likely you do not have an up-to-date pandas, or you are missing the html5lib module. Be forewarned: pandas simplifies table scraping, as you can see above, but setting it up can be quite a hassle unless you use a distribution like Anaconda (which is what I used above).
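Since `pd.read_html` needs an HTML parser backend to be installed, a quick stdlib check of what is importable can save some debugging. The candidate module names below are an assumption about the usual backends:

```python
import importlib.util

# Modules commonly used as read_html parser backends (assumed list)
candidates = ["lxml", "html5lib", "bs4"]

# find_spec() returns None when a top-level module is not installed
available = [name for name in candidates
             if importlib.util.find_spec(name) is not None]

print("available parser backends:", available)
```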