【Question title】: Extracting tables using BeautifulSoup
【Posted】: 2018-07-17 12:46:34
【Question】:

I want to use BeautifulSoup to extract all the tables from the HTML file shown below and write them to a CSV file.

The HTML looks like this:

        <h4>Site Name : Aria</h4>   
            <table style="width: 100%">
                <tbody><tr>
                    <th style="width: 25%"><strong>Dn Name:</strong></th>
                    <td style="width: 25%"><strong>Aria</strong></td>

                        <th style="width: 25%"><strong>WL:</strong></th>
                        <td style="width: 25%"><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>River Name:</strong></th>
                    <td><strong>Ben</strong></td>

                        <th><strong>DL:</strong></th>
                        <td><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>Basin Name:</strong></th>
                    <td><strong>GAN<strong></strong></strong></td>

                    <th><strong>HFL:</strong></th>
                    <td><strong>49.4 Meters (m)<strong></strong></strong></td>

                </tr>
                <tr>
                    <th><strong>Div Name:</strong></th>

                    <td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, &#39;width=1000, height=600, toolbar=no, resizable=no&#39;); return false;">LGD-I</a></td>

                    <th><strong>HFL date:</strong></th>
                    <td>14-08-2017</td>

                </tr>
            </tbody></table>
            <p>&nbsp;</p>
            <table>
                <tbody><tr>
                    <th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
                </tr>

                <tr>                            

                    <td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
                    <td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
                    <td class="" style="width:33%;">Trend: Steady</td>
                </tr>
                <tr>
                    <th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
                </tr>
                <tr>

                        <td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
                        <td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
                        <td style="width:33%;"></td>

                </tr>
            </tbody></table>                            
                <p>&nbsp;</p>                       



                            <table style="width: 100%">
                                <tbody><tr>
                                    <th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
                                </tr>
                            </tbody></table>




</div>

I can extract the text from all three tables, but I am unable to write it out in the required format.

My code:

now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
os.chdir(r'D:\shared')


soup = BeautifulSoup(response.text,"html5lib")

tables = soup.find_all("tr")
test =[]
for table in tables:
    test.append(table.get_text())

filename = 'Water'+'-'+str(date)+'.csv'
out = open(filename, mode='ab')
writer = csv.writer(out)
writer.writerow(data)
out.close()

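In the output CSV, the first table is written to the first column, the second table to the second column, and the third table to the third column.

I want the data in the following format: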

Site Name:  Aria
Dn Name:    Aria    
WL:         Meters (m)
River Name: Ben 
DL:         Meters (m)
Basin Name: GAN
HFL:        49.4 Meters (m)
Div Name:   LGD-I
HFL date:   14-08-2017

PRESENT WL
Date:       17-07-2018 12:00    
Value:      45.43 Meters (m)    
Trend:      Steady
CUMULATIVE 
DAILY RF
Date:       17-07-2018 08:30    
Value:      0.0 Milimiters (mm) 
NO FORECAST

【Comments】:

  • What is the structure of data, the variable you are writing with the csv writer?
  • The data is the HTML mentioned above...

Tags: python beautifulsoup


【Solution 1】:

My attempt at this problem:

data = """
        <h4>Site Name : Aria</h4>
            <table style="width: 100%">
                <tbody><tr>
                    <th style="width: 25%"><strong>Dn Name:</strong></th>
                    <td style="width: 25%"><strong>Aria</strong></td>

                        <th style="width: 25%"><strong>WL:</strong></th>
                        <td style="width: 25%"><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>River Name:</strong></th>
                    <td><strong>Ben</strong></td>

                        <th><strong>DL:</strong></th>
                        <td><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>Basin Name:</strong></th>
                    <td><strong>GAN<strong></strong></strong></td>

                    <th><strong>HFL:</strong></th>
                    <td><strong>49.4 Meters (m)<strong></strong></strong></td>

                </tr>
                <tr>
                    <th><strong>Div Name:</strong></th>

                    <td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, &#39;width=1000, height=600, toolbar=no, resizable=no&#39;); return false;">LGD-I</a></td>

                    <th><strong>HFL date:</strong></th>
                    <td>14-08-2017</td>

                </tr>
            </tbody></table>
            <p>&nbsp;</p>
            <table>
                <tbody><tr>
                    <th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
                </tr>

                <tr>

                    <td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
                    <td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
                    <td class="" style="width:33%;">Trend: Steady</td>
                </tr>
                <tr>
                    <th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
                </tr>
                <tr>

                        <td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
                        <td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
                        <td style="width:33%;"></td>

                </tr>
            </tbody></table>
                <p>&nbsp;</p>



                            <table style="width: 100%">
                                <tbody><tr>
                                    <th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
                                </tr>
                            </tbody></table>
</div>"""

import os
import datetime
from bs4 import BeautifulSoup
from pprint import pprint
# For Python 2.7 the next line should be "from itertools import izip_longest"
from itertools import zip_longest
import csv

now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
# os.chdir(r'D:\shared')

soup = BeautifulSoup(data, "lxml")

tables = []
for table in soup.find_all('table'):
    current_table = []
    tables.append(current_table)
    for row in table.find_all("tr"):
        for (th, td) in zip_longest(row.find_all('th'), row.find_all('td')):
            s = ("%s %s" % (th.text.strip() if th else '', td.text.strip() if td else '')).strip()
            if s:
                current_table.append(s)

tables[0].insert(0, ': '.join(w.strip() for w in soup.find('h4').text.split(':')))

for table in tables:
    for i in table:
        print(i)

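filename = 'CWC-Water-' + date + '.csv'
# newline='' keeps the csv module from emitting blank rows on Windows (Python 3).
# On Python 2.7, open the file with mode='wb' and drop the newline argument.
with open(filename, mode='w', newline='') as out:
    writer = csv.writer(out)
    # zip_longest(*tables) transposes the tables, so each table becomes one CSV column
    for row in zip_longest(*tables):
        writer.writerow(row)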

This prints:

Site Name: Aria
Dn Name: Aria
WL: Meters (m)
River Name: Ben
DL: Meters (m)
Basin Name: GAN
HFL: 49.4 Meters (m)
Div Name: LGD-I
HFL date: 14-08-2017
PRESENT WL
Date: 17-07-2018 12:00
Value: 45.43 Meters (m)
Trend: Steady
CUMULATIVE DAILY RF
Date: 17-07-2018 08:30
Value: 0.0 Milimiters (mm)
NO FORECAST

It also writes a .csv file with three columns, one per table:

[Screenshot: the .csv file opened in LibreOffice]

Edit: corrected the picture.
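
The three-column layout comes from zip_longest(*tables), which transposes the list of tables: each CSV row takes the i-th entry of every table, and shorter tables are padded. A minimal sketch of that behaviour, using shortened stand-in lists rather than the full scraped data (the fillvalue="" argument is an addition here; the answer's code pads with None, which csv.writer writes as an empty cell):

```python
from itertools import zip_longest

# Shortened, illustrative stand-ins for the three extracted tables.
tables = [
    ["Site Name: Aria", "Dn Name: Aria", "WL: Meters (m)"],
    ["PRESENT WL", "Date: 17-07-2018 12:00"],
    ["NO FORECAST"],
]

# Transpose: row i holds the i-th entry of every table; gaps are padded with "".
rows = list(zip_longest(*tables, fillvalue=""))

for row in rows:
    print(row)
# ('Site Name: Aria', 'PRESENT WL', 'NO FORECAST')
# ('Dn Name: Aria', 'Date: 17-07-2018 12:00', '')
# ('WL: Meters (m)', '', '')
```

Because every row draws one cell from each table, the tables end up side by side rather than stacked.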

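【Comments】:

  • Thanks for your help... I am using Python 2.7 and get the error: s = f"{th.text.strip() if th else ''} {td.text.strip() if td else ''}".strip ^ SyntaxError: invalid syntax. Also, can we write the second table below the first and the third below the second, as I described in the question?
  • @gi.rajan I edited my answer; it should now work on Python 2.7.
  • OK, it is working now; on Python 2.7 I had to use from itertools import izip_longest. Your code prints the data in the correct format. Can we save the tables in the same format? I want all the data in one column.
  • I modified your code to get the desired output: for table in tables: writer.writerows(izip(table)), followed by out.close()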
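
For Python 3, the single-column layout asked for in the comments could be sketched roughly as below. The write_single_column helper, the blank separator row between tables, and the in-memory buffer are illustrative assumptions and not part of the original answer; tables stands for the list of lists built in the answer, shortened here.

```python
import csv
import io


def write_single_column(tables, out):
    """Write every entry of every table as its own one-cell CSV row."""
    writer = csv.writer(out)
    for index, table in enumerate(tables):
        if index:
            writer.writerow([])  # blank row between tables (an assumption)
        for entry in table:
            writer.writerow([entry])


# Shortened stand-in for the tables extracted in the answer.
tables = [["Site Name: Aria", "Dn Name: Aria"], ["PRESENT WL"], ["NO FORECAST"]]

buffer = io.StringIO()  # in-memory file, used here so the sketch touches no disk
write_single_column(tables, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(filename, 'w', newline='') in place of the buffer.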