[Posted]: 2020-12-18 18:53:01
[Question]:
I am trying to scrape a table:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table class="table ajax">
<thead>
<tr>
<th scope="col">
<span>NO.</span>
</th>
<th scope="col" data-index="1">
<span>Year of initiation</span>
</th>
<th scope="col" data-index="2">
<span>Short case name</span>
</th>
<th scope="col" data-index="3" style="display: none;">
<span>Full case name</span>
</th>
<th scope="col" data-index="4">
<span>Applicable IIA</span>
</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">1</th>
<td data-index="1">
2019
</td>
<td data-index="2">
Alcosa v. Kuwait
</td>
<td data-index="3" style="display: none;">
Alcosa v. The State of Kuwait
</td>
<td data-index="4">
Kuwait - Spain BIT(2005)
</td>
<td data-index="5">
UNCITRAL
</td>
</tr>
</tbody>
</table>
</body>
</html>
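using the following code:
import csv
from bs4 import BeautifulSoup

# driver is presumably a Selenium WebDriver that has already loaded the page
html = driver.page_source
bs = BeautifulSoup(html, "lxml")
table = bs.find('table', {'class': 'ajax'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
with open('son.csv', "wt+", newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # Only the <td> cells of each body row are collected here
        cols = row.find_all('td')
        cols = [x.get_text(strip=True, separator='|') for x in cols]
        writer.writerow(cols)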
I can get the table rows, but I cannot get the table headers.
This is the output I would like to get:
NO. Year of initiation Short case name Applicable IIA
1 2019 Alcosa v. Kuwait Kuwait - Spain BIT(2005) UNCITRAL
How can I do this? Thanks.
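[Comments]:
- Why not handle it the same way as tbody? table_header = table.find('thead')
- Can you edit your question and include the HTML sample and the expected output?
- @Ryan I want to get the headers plus the content. table_header = table.find('thead') gives me the headers, but how do I attach them to the table rows?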
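One possible way to combine the header with the rows, following the table.find('thead') suggestion in the comments, is sketched below. It is only an illustration under a few assumptions: SAMPLE_HTML is a cleaned-up stand-in for driver.page_source, the helper names (is_hidden, row_texts, table_to_rows) are made up for this example, the built-in "html.parser" is used in place of "lxml" so no extra parser is needed, and cells with an inline display: none style are skipped because the expected output omits the "Full case name" column.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for driver.page_source (assumption: same layout as the question's table).
SAMPLE_HTML = """
<table class="table ajax">
  <thead>
    <tr>
      <th scope="col"><span>NO.</span></th>
      <th scope="col" data-index="1"><span>Year of initiation</span></th>
      <th scope="col" data-index="2"><span>Short case name</span></th>
      <th scope="col" data-index="3" style="display: none;"><span>Full case name</span></th>
      <th scope="col" data-index="4"><span>Applicable IIA</span></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">1</th>
      <td data-index="1">2019</td>
      <td data-index="2">Alcosa v. Kuwait</td>
      <td data-index="3" style="display: none;">Alcosa v. The State of Kuwait</td>
      <td data-index="4">Kuwait - Spain BIT(2005)</td>
      <td data-index="5">UNCITRAL</td>
    </tr>
  </tbody>
</table>
"""


def is_hidden(cell):
    """Return True when the cell carries an inline display:none style."""
    style = cell.get("style") or ""
    return "display:none" in style.replace(" ", "").lower()


def row_texts(tr):
    """Collect the text of every visible <th>/<td> cell in one table row."""
    return [
        cell.get_text(strip=True, separator="|")
        for cell in tr.find_all(["th", "td"])  # th included: the row number is a <th>
        if not is_hidden(cell)
    ]


def table_to_rows(html):
    """Return the header row followed by the body rows of the 'ajax' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "ajax"})
    header = row_texts(table.find("thead").find("tr"))
    body = [row_texts(tr) for tr in table.find("tbody").find_all("tr")]
    return [header] + body


rows = table_to_rows(SAMPLE_HTML)

# Write to an in-memory buffer here; a real run would open 'son.csv' instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write the file from the question instead of an in-memory buffer, the same writerows(rows) call can be used inside with open('son.csv', 'w', newline='') as f. Searching for both th and td in each row is what picks up the leading row number, which the original td-only lookup leaves out.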
Tags: python web-scraping beautifulsoup