【发布时间】:2021-12-01 06:48:42
【问题描述】:
我有一个 HTML 列表,从这个列表中我只想要具有 class="" 的 <tr> 元素。我想稍后下载文件,所以我只需要第三个<td> 和这个里面的<a> 元素的href,我怎样才能将它们直接作为字符串读出?
我想要所有带有class = "" 的<tr> 元素。
例如:
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
在这个<tr> 元素内部有一个<td> 元素。我想在第三个<td> 元素中包含<a> 元素的href。所以我想要的是网址http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz(不仅是这个:D,我想要所有网址)
代码
import requests
from bs4 import BeautifulSoup
from datetime import datetime
DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
print(antwerp_table)
# antwerp_table is my html table
html 示例(更多信息请访问http://insideairbnb.com/get-the-data.html)
<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
Description
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
【问题讨论】:
标签: python web-scraping