Answers: Error while grabbing the table data from a website

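【Question title】: Error while grabbing the table data from a website
【Posted】: 2018-01-11 02:56:27
【Question description】:

I am trying to grab some stock-related data from the web for my project, and I have run into a couple of problems.
Problem 1:
I tried to grab the table from this website: http://sharesansar.com/c/today-share-price.html
It works, but the columns are not grabbed in order. For example, the column "Company Name" ends up with the values of "Open price". How can I fix this?
Problem 2:
I also tried to grab company-specific data from http://merolagani.com/CompanyDetail.aspx?symbol=ADBL under the "Price History" tab.
This time I got an error while grabbing the table data. The error is: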

self.data[key].append(cols[index].get_text())

IndexError: list index out of range

Here is the code:

import logging
import requests
from bs4 import BeautifulSoup
import pandas


module_logger = logging.getLogger('mainApp.dataGrabber')


class DataGrabberTable:
    ''' Grabs the table data from a certain url. '''

    def __init__(self, url, csvfilename, columnName=[], tableclass=None):
        module_logger.info("Inside 'DataGrabberTable' constructor.")
        self.pgurl = url
        self.tableclass = tableclass
        self.csvfile = csvfilename
        self.columnName = columnName

        self.tableattrs = {'class':tableclass} #to be passed in find()

        module_logger.info("Done.")


    def run(self):
        '''Call this to run the datagrabber. Returns 1 if error occurs.'''

        module_logger.info("Inside 'DataGrabberTable.run()'.")

        try:
            self.rawpgdata = (requests.get(self.pgurl, timeout=5)).text
        except Exception as e:
            module_logger.warning('Error occured: {0}'.format(e))
            return 1

        #module_logger.info('Headers from the server:\n {0}'.format(self.rawpgdata.headers))

        soup = BeautifulSoup(self.rawpgdata, 'lxml')

        module_logger.info('Connected and parsed the data.')

        table = soup.find('table',attrs = self.tableattrs)
        rows = table.find_all('tr')[1:]

        #initializing a dict in a format below
        # data = {'col1' : [...], 'col2' : [...], }
        #col1 and col2 are from columnName list
        self.data = {}
        self.data = dict(zip(self.columnName, [list() for i in range(len(self.columnName))]))

        module_logger.info('Inside for loop.')
        for row in rows:
            cols = row.find_all('td')
            index = 0
            for key in self.data:
                if index > len(cols): break
                self.data[key].append(cols[index].get_text())
                index += 1
        module_logger.info('Completed the for loop.')

        self.dataframe = pandas.DataFrame(self.data)    #make pandas dataframe

        module_logger.info('writing to file {0}'.format(self.csvfile))
        self.dataframe.to_csv(self.csvfile)
        module_logger.info('written to file {0}'.format(self.csvfile))

        module_logger.info("Done.")
        return 0

    def getData(self):
        """"Returns 'data' dictionary."""
        return self.data




# Usage example

def main():
    url = "http://sharesansar.com/c/today-share-price.html"
    classname = "table"
    fname = "data/sharesansardata.csv"
    cols = [str(i) for i in range(18)] #make a list of columns

    '''cols = [
      'S.No', 'Company Name', 'Symbol', 'Open price', 'Max price',
     'Min price','Closing price', 'Volume', 'Previous closing',
     'Turnover','Difference',
     'Diff percent', 'Range', 'Range percent', '90 days', '180 days',
     '360 days', '52 weeks high', '52 weeks low']'''

    d = DataGrabberTable(url, fname, cols, classname)
    if d.run() is 1:
        print('Data grabbing failed!')
    else:
        print('Data grabbing done.')


if __name__ == '__main__':
    main()

Any suggestions would be helpful. Thank you!

【Question comments】:

Tags: python pandas beautifulsoup

【Solution 1】:

Your cols list is one element short. The table has 19 columns, not 18:

>>> len([str(i) for i in range(18)])
18

Besides that, you seem to be overcomplicating things. This should do it:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the price page and pick the table with class "table".
price_response = requests.get('http://sharesansar.com/c/today-share-price.html')
price_table = BeautifulSoup(price_response.text, 'lxml').find('table', {'class': 'table'})
# One list of cell texts per <tr>; the first row holds the <th> header cells.
price_rows = [[cell.text for cell in row.find_all(['th', 'td'])] for row in price_table.find_all('tr')]
# Use the header row as column names so values stay aligned with their columns.
price_df = pd.DataFrame(price_rows[1:], columns=price_rows[0])

# Fetch the company detail page for every symbol in the price table.
com_df = None
for symbol in price_df['Symbol']:
    comp_response = requests.get('http://merolagani.com/CompanyDetail.aspx?symbol=%s' % symbol)
    comp_table = BeautifulSoup(comp_response.text, 'lxml').find('table', {'class': 'table'})
    com_header, com_value = list(), list()
    # Take the first row of each <tbody>: its <th> is the label, its <td> the value.
    for tbody in comp_table.find_all('tbody'):
        comp_row = tbody.find('tr')
        com_header.append(comp_row.find('th').text.strip().replace('\n', ' ').replace('\r', ' '))
        com_value.append(comp_row.find('td').text.strip().replace('\n', ' ').replace('\r', ' '))
    # One single-row frame per company, appended to the running result.
    df = pd.DataFrame([com_value], columns=com_header)
    com_df = df if com_df is None else pd.concat([com_df, df])

print(price_df)
print(com_df)
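
If you would rather keep the dict-of-lists structure from the question, the sketch below shows one way to fill it safely. It is only an illustration under assumptions: fill_columns, the missing parameter and the sample rows are hypothetical names and data, not part of the original code. Two things in the question's loop can go wrong. First, it walks the dict's keys while indexing the cells by position, and on Python versions where dicts are unordered that can pair a column name with the wrong cell. Second, the guard "if index > len(cols): break" still lets index == len(cols) through, so cols[index] raises IndexError on any row with fewer cells than column names. Iterating over the ordered list of names with zip avoids both.

```python
def fill_columns(column_names, rows, missing=""):
    """Build {column name: [values, ...]} from rows of cell strings.

    column_names is a list, so its order is fixed. Rows shorter than the
    header are padded with `missing`; extra cells are ignored; rows with
    no cells at all (for example a header row made only of <th>) are skipped.
    """
    data = {name: [] for name in column_names}
    for cells in rows:
        if not cells:
            continue
        # Pad short rows so every column receives exactly one value per row.
        padded = list(cells) + [missing] * (len(column_names) - len(cells))
        # zip stops at the shorter sequence, so surplus cells are dropped.
        for name, value in zip(column_names, padded):
            data[name].append(value)
    return data


# Hypothetical sample rows standing in for the text of each row's <td> cells.
names = ["S.No", "Company Name", "Symbol"]
rows = [
    ["1", "Company A", "ADBL", "surplus cell"],
    ["2", "Company B"],  # short row: cols[index] would raise IndexError here
    [],                  # row without any <td> cells
]
table = fill_columns(names, rows)
print(table["Symbol"])
```

Whether padding short rows is the right choice depends on the page; skipping them entirely may be safer if they are separators or sub-headers.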

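【Discussion】:

I am still getting the column mismatch (Problem 1).
It worked! Thank you very much. What about Problem 2? Did you find any errors?
@Kishor It is fetching every symbol on the first page. Add .head(3) to price_df['Symbol'], so that the loop reads for symbol in price_df['Symbol'].head(3): and terminates early. Then you only fetch the first 3 company info pages.
Regarding Problem 2: on that site, when you scroll down, you see a row of tabs (About, Announcements, News, Price History, and so on). Under the "Price History" tab there is a table, and that is the one I want to grab. I tried but could not do it. Can you help me?
@Kishor Get the div with the ID divHistory: .find_all('div', {'id': 'divHistory'}). From there grab the table, then the rows and so on, as before. Give it a try, and if you get stuck, show us your code along with how and where it fails. I strongly recommend using Chrome's Inspect tool from the right-click context menu. Right-click whatever you are looking for and choose Inspect. It will show you exactly what you need to grab.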
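
The last suggestion can be sketched roughly as follows. This is a minimal illustration, not a tested scraper for the real site: SAMPLE_HTML, its dates and prices, and the helper name history_rows are all made up for the example, and the markup of the actual page may differ. In particular, if the tab's content is only loaded after a click, the div may be empty in the HTML that requests receives, which is worth checking with the inspector first.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div id="divAbout"><table><tr><td>not this one</td></tr></table></div>
<div id="divHistory">
  <table>
    <tr><th>Date</th><th>Close</th></tr>
    <tr><td>2018-01-10</td><td>100</td></tr>
    <tr><td>2018-01-09</td><td>98</td></tr>
  </table>
</div>
"""


def history_rows(html, div_id="divHistory"):
    """Return the table inside the div with the given id as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": div_id})
    if container is None:   # the div is missing from the served HTML
        return []
    table = container.find("table")
    if table is None:       # the div exists but holds no table
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


rows = history_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

Returning an empty list when the div or table is absent keeps the caller from hitting an AttributeError on None, which is what find returns when nothing matches.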