【Posted】:2022-01-30 02:14:58
【Question】:
I want to get the data from the table on this website: https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0. When I try to read the table's HTML content, I get an empty body, like this:
<thead>
<tr>
<th width="4%"> <div class="flex">#</div> </th>
<th width="15"> </th>
<th> <div class="flex">Building Name</div> </th>
<th width="15%"> <div class="flex">City</div> </th>
<th width="8%"> <div class="flex">Height m</div> </th>
<th width="8%"> <div class="flex">Floors</div> </th>
<th width="8%"> <div class="flex">Completion</div> </th>
<th width="10%"> <div class="flex">Material</div> </th>
<th width="15%"> <div class="flex">Use</div> </th>
</tr>
</thead>
<tbody>
</tbody>
</table>
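Inspect Element shows that the body contains data, but with my code I can only get information from the thead. find_all('tr') only gives me the row from the thead, and find_all('td') returns nothing. Here is my code: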
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', id='table-buildings')

# Collect the column titles from the header cells
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

mydata = pd.DataFrame(columns=headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr'):
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    if not row:
        continue  # the header row has no <td> cells
    length = len(mydata)
    mydata.loc[length] = row  # append the row at the next index

mydata
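One possible explanation, offered as an assumption rather than a confirmed diagnosis: everything after the # in the URL is a fragment, and HTTP clients such as requests do not send the fragment to the server. If that is what is happening here, the server returns the same static page regardless of the filters, and the rows seen in Inspect Element are presumably filled in later by JavaScript in the browser. A minimal stdlib sketch that shows which part of the URL is actually requested:

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building'
       '&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0')

parts = urlsplit(url)

# What an HTTP client requests: scheme, host and path, without the fragment.
requested = parts._replace(fragment='').geturl()

# The filters sit in the fragment, which only client-side code can read.
filters = parse_qs(parts.fragment, keep_blank_values=True)

print(requested)              # the URL the server actually sees
print(repr(parts.query))      # the real query string is empty
print(filters['page'], filters['status'])
```

If the assumption holds, page.text can never contain the td cells, no matter how the BeautifulSoup calls are written.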
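I found this similar post, but the link they used is broken, so I can't check it. Honestly, I don't really know how to adapt the answer to my own case, because I'm very new to scraping.
My other question is how to access the rows on the following pages. I want to scrape all 500 results, not just the first 50. Thanks in advance!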
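For the pagination part, here is a rough sketch of how the page URLs could be enumerated. It assumes, based only on the URL above, that the page=N value controls which 50 rows are shown; the helper name page_urls is hypothetical. Because that value sits after the #, these URLs would only have an effect in something that runs the page's JavaScript (a real or automated browser), not in a plain requests.get call.

```python
import math
import re

def page_urls(url, total_results, per_page):
    """Return one URL per results page by rewriting page=N in the URL (hypothetical helper)."""
    n_pages = math.ceil(total_results / per_page)
    return [
        re.sub(r'\bpage=\d+', f'page={n}', url, count=1)
        for n in range(1, n_pages + 1)
    ]

url = ('https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building'
       '&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0')

# 500 results at 50 per page, the numbers given in the question
urls = page_urls(url, total_results=500, per_page=50)
for u in urls:
    print(u)
```

Each URL differs only in its page= value, so the same table-parsing code could be reused on every page once the rendered HTML is available.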
【Discussion】:
Tags: python html web-scraping beautifulsoup