【Question title】: Python - Extracting data from the table
【Posted】: 2019-08-01 02:34:02
【Question】:

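I'm trying to extract data from a table that I access with the Beautiful Soup library. I get the table as HTML, but I'm struggling to extract the data in a consumable form, because the table has two columns: the first holds the headings and the second holds the values.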

Here is my code:

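from bs4 import BeautifulSoup as bs

# `browser` is the already-open browser session from earlier in the script (not shown)
html = browser.html
soup = bs(html, "html.parser")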

table = soup.find("table", {"id":"productDetails_techSpec_section_1"})
table

Printing the table gives:

"<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
<tbody><tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                    Part Number 
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Weight
                </th>
<td class="a-size-base">
              1.83 pounds
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Product Dimensions
                </th>
<td class="a-size-base">
              9 x 6 x 3.5 inches
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item model number
                </th>
<td class="a-size-base">
              3885SD
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Item Package Quantity
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Number of Handles
                </th>
<td class="a-size-base">
              1
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Included?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
                  Batteries Required?
                </th>
<td class="a-size-base">
              No
            </td>
</tr>
</tbody></table>"

I tried this line of code to access each heading and data point:

headings = [table.get_text() for th in table.find("tr").find_all("th")]
print(headings)

This is the output I got:

['\n\n\n                  \tPart Number\t\n                \n\n              3885SD\n            \n\n\n\n                  Item Weight\n                \n\n              1.83 pounds\n            \n\n\n\n                  Product Dimensions\n                \n\n              9 x 6 x 3.5 inches\n            \n\n\n\n                  Item model number\n                \n\n              3885SD\n            \n\n\n\n                  Item Package Quantity\n                \n\n              1\n            \n\n\n\n                  Number of Handles\n                \n\n              1\n            \n\n\n\n                  Batteries Included?\n                \n\n              No\n            \n\n\n\n                  Batteries Required?\n                \n\n              No\n            \n\n']

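I've been looking at different ways to get this data into a pandas DataFrame, and this is as far as I've got. My question is: how do I put this data into a DataFrame with my headings as the column names and the values in the row beneath them?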

【Comments】:

  • This should be headings = [th.get_text() for th in table.find("tr").find_all("th")], right? table.get_text() inside the list comprehension looks wrong.
  • @JustinEzequiel - Hi, thanks for the reply. I tried that too, and the result was: ['\n \tPart Number\t\n ']
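
One possible reading of the exchange above: `table.get_text()` ignores the loop variable and returns the text of the whole table, while `table.find("tr")` returns only the first row, which would explain why the corrected line yields a single heading. The sketch below illustrates this on a shortened two-row stand-in for the table; the `html` string is an assumption made for the example, not the full page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (two rows only).
html = """
<table id="productDetails_techSpec_section_1">
<tbody><tr>
<th>
                    Part Number
                </th>
<td>
              3885SD
            </td>
</tr>
<tr>
<th>
                  Item Weight
                </th>
<td>
              1.83 pounds
            </td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "productDetails_techSpec_section_1"})

# find("tr") returns only the first row, so only one heading comes back.
first_row_only = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]

# find_all() on the table itself visits the cells of every row.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
values = [td.get_text(strip=True) for td in table.find_all("td")]

print(first_row_only)  # ['Part Number']
print(headings)        # ['Part Number', 'Item Weight']
print(values)          # ['3885SD', '1.83 pounds']
```

Passing `strip=True` to `get_text()` also removes the newlines and tabs that surround each cell's text in the raw output.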

Tags: python pandas dataframe beautifulsoup


【Solution 1】:

For example:

import pandas as pd

html = """<table class="a-keyvalue prodDetTable" id="productDetails_techSpec_section_1" role="presentation">
 <tbody><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Part Number </th>
 <td class="a-size-base">3885SD</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">
 Item Weight</th><td class="a-size-base">1.83 pounds</td></tr>
 <tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Product Dimensions</th>
 <td class="a-size-base">9 x 6 x 3.5 inches</td>
 </tr><tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item model number</th>
 <td class="a-size-base">3885SD</td></tr>
 <tr><th class="a-color-secondary a-size-base prodDetSectionEntry">Item Package Quantity
 </th><td class="a-size-base">1</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">Number of Handles
 </th><td class="a-size-base">1</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">Batteries Included?
 </th><td class="a-size-base">No</td></tr><tr>
 <th class="a-color-secondary a-size-base prodDetSectionEntry">
  Batteries Required?</th><td class="a-size-base">No</td></tr></tbody></table>"""

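# Read the table; read_html returns a list of DataFrames, one per <table> found.
# Note: pandas 2.1+ deprecates passing a literal HTML string; wrap it as io.StringIO(html) there.
df = pd.read_html(html)[0]
cols = df[0]  # first column: the headings
vals = df[1]  # second column: the values

# Transpose the values into a single row
table = pd.DataFrame(vals).T
# Use the headings as the column names
table.columns = cols
print(table)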

Output:

0 Part Number  Item Weight  Product Dimensions Item model number Item Package Quantity Number of Handles Batteries Included? Batteries Required?
1      3885SD  1.83 pounds  9 x 6 x 3.5 inches            3885SD                     1                 1                  No                  No
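
The transpose step can also be looked at in isolation, without `read_html`. This is a minimal sketch, assuming the table has already been reduced to (heading, value) pairs; the `pairs` list and the column labels `heading` and `value` are hypothetical names chosen for the example. It uses `set_index` followed by `.T`, which is a variation on the code above rather than the same method.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table: one (heading, value) pair per row.
pairs = [
    ("Part Number", "3885SD"),
    ("Item Weight", "1.83 pounds"),
    ("Product Dimensions", "9 x 6 x 3.5 inches"),
]

# Two-column frame, shaped like the table on the page.
tall = pd.DataFrame(pairs, columns=["heading", "value"])

# Make the headings the index, then transpose so they become the columns.
wide = tall.set_index("heading").T

print(wide)
```

The result has one row, labelled `value`, and one column per heading.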

【Comments】:

    【Solution 2】:

    You can use zip() to transpose the values in the table:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(data, 'html.parser') # data is your table from question
    
    rows = []
    for tr in soup.select('tr'):
        rows.append([td.get_text(strip=True) for td in tr.select('th, td')])
    
    rows = [*zip(*rows)]    # transpose values
    
    for row in rows:
        print(''.join(r'{: <25}'.format(d) for d in row))
    

    Prints:

    Part Number              Item Weight              Product Dimensions       Item model number        Item Package Quantity    Number of Handles        Batteries Included?      Batteries Required?      
    3885SD                   1.83 pounds              9 x 6 x 3.5 inches       3885SD                   1                        1                        No                       No                       
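
To see what the `zip(*rows)` transpose does on its own, here is a small sketch in plain Python. The `rows` list is a hypothetical stand-in for what the parsing loop above collects, shortened to three entries.

```python
# Hypothetical rows as they come out of the parsing loop: [heading, value].
rows = [
    ["Part Number", "3885SD"],
    ["Item Weight", "1.83 pounds"],
    ["Batteries Included?", "No"],
]

# zip(*rows) groups the first items of every row, then the second items.
headings, values = zip(*rows)

# The same pairs also convert directly into a heading -> value lookup.
record = dict(rows)

print(headings)  # ('Part Number', 'Item Weight', 'Batteries Included?')
print(values)    # ('3885SD', '1.83 pounds', 'No')
print(record["Item Weight"])  # 1.83 pounds
```

Unpacking into `headings, values` only works because every row has exactly two cells; a table with a differently shaped row would need extra handling.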
    

    【Comments】:

      【Solution 3】:

      Solution: create a function that parses the table:

      def parse_table(table):
          """Return the text of every cell in the table, one list per row."""
          return [
              [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
              for row in table.find_all('tr')
          ]
      

      Then use the function to build a new table and convert it to a pandas DataFrame:

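      import pandas as pd

      new_table = parse_table(table)  # list of [heading, value] rows
      df = pd.DataFrame(new_table)
      df = df.T                       # headings become the first row
      df.columns = df.iloc[0]         # promote that row to the column names
      df = df[1:]                     # drop the now-redundant heading row
      df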
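
A variation on the same idea, sketched end to end: collect the heading/value pairs into a dict and hand pandas a list containing that dict, so no transpose or header-row cleanup is needed. The helper name `table_to_record` and the shortened `html` string are assumptions made for this example, not part of the answer above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened stand-in for the table in the question.
html = """<table id="productDetails_techSpec_section_1"><tbody>
<tr><th> Part Number </th><td> 3885SD </td></tr>
<tr><th> Item Weight </th><td> 1.83 pounds </td></tr>
<tr><th> Batteries Included? </th><td> No </td></tr>
</tbody></table>"""


def table_to_record(table):
    """Map each row's <th> text to its <td> text (hypothetical helper)."""
    record = {}
    for row in table.find_all("tr"):
        heading, value = row.find("th"), row.find("td")
        if heading is None or value is None:
            continue  # skip rows that are not heading/value pairs
        record[heading.get_text(strip=True)] = value.get_text(strip=True)
    return record


soup = BeautifulSoup(html, "html.parser")
record = table_to_record(soup.find("table"))

# One dict per product; a list of such dicts gives one DataFrame row each.
df = pd.DataFrame([record])
print(df)
```

If several product pages are scraped, appending one record per page to the list produces one row per product, with missing headings filled in as NaN.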
      

      【Comments】:
