Get the href of an <a> element inside an HTML table

【Question title】: Get href of an <a> element inside of an HTML table
【Posted】: 2021-12-01 06:48:42
【Question】:

HTML website

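I have an HTML list, and from this list I only want the <tr> elements that have class="". I want to download the files later, so I only need the third <td> and the href of the <a> element inside it. How can I read these out directly as strings?

I want all <tr> elements with class="".

For example: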

<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>

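Inside each of these <tr> elements there are several <td> elements. I want the href of the <a> element contained in the third <td>. So what I want is the URL http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz (and not only this one :D, I want all of the URLs).

Code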

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
        
print(antwerp_table)
# antwerp_table is my html table

HTML example (for more information, see http://insideairbnb.com/get-the-data.html)

<table class="table table-hover table-striped antwerp">
<thead>
<tr>
<th class="col-md-3" data-field="host_id">Date Compiled</th>
<th class="col-md-3" data-field="host_id">Country/City</th>
<th class="col-md-3" data-field="host_id">File Name</th>
<th class="col-md-3" data-align="right" data-field="count">
                        Description
                    </th>
</tr>
</thead>
<tbody>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz" onclick="var that=this;ga('send','event', 'download','listings',this.href);setTimeout(function(){location.href=that.href;},200);return false;">listings.csv.gz</a></td>
<td>Detailed Listings data for Antwerp</td>
</tr>
<tr class="">
<td>29 September, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>
...
<tr class="archived">
<td>17 August, 2021</td>
<td>Antwerp</td>
<td><a href="http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz" onclick="var that=this;ga('send','event', 'download','calendar',this.href);setTimeout(function(){location.href=that.href;},200);return false;">calendar.csv.gz</a></td>
<td>Detailed Calendar Data for listings in Antwerp</td>
</tr>

【Question comments】:

Tags: python web-scraping

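【Solution 1】:

First you have to isolate the table.
Note that find(), like .select_one(), returns only the first match.
I checked that exactly one table has that class, so we can use .select_one().
After that you have to select() the <a> elements.
Here is working code for what you want: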

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.select_one(f".{DATASET_CITY.lower()}")
for i in antwerp_table.select("a"):
    print(i.get("href"))

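【Comments】:

Take another look at your approach; in my opinion it does not give the expected result.
I tested it before posting here, and it works.
Yes, you're right, I didn't see that there are more <tr> elements with the archived class.
The code still runs, but I am talking about the expected result. The goal is to print only the links from <tr> elements with class="", yet your result also includes the links from rows with class="archived", which is not what was asked. By the way, I like that you used the f-string syntax.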
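
To make the point of these comments concrete, here is a minimal offline sketch. It assumes a shortened, hypothetical copy of the table (the example.com URLs are placeholders, not real download links) and compares selecting every <a> in the table with keeping only the links whose row is not marked "archived". Note that it excludes "archived" rows rather than requiring an empty class attribute, which is an assumption about what the asker needs.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the table; the URLs are placeholders.
SAMPLE_HTML = """
<table class="table antwerp">
<thead><tr><th>File Name</th></tr></thead>
<tbody>
<tr class=""><td><a href="http://example.com/current/listings.csv.gz">listings.csv.gz</a></td></tr>
<tr class=""><td><a href="http://example.com/current/calendar.csv.gz">calendar.csv.gz</a></td></tr>
<tr class="archived"><td><a href="http://example.com/old/calendar.csv.gz">calendar.csv.gz</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select_one(".antwerp")

# Every link in the table, archived rows included (the Solution 1 approach).
all_links = [a.get("href") for a in table.select("a")]

# Only links whose enclosing <tr> does not carry the "archived" class.
current_links = [
    a.get("href")
    for a in table.select("a")
    if "archived" not in (a.find_parent("tr").get("class") or [])
]

print(len(all_links))
print(current_links)
```

On this sample the first list holds three links and the second only the two from the class="" rows.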

【Solution 2】:

Iterate over the table rows to find the links

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
antwerp_table = soup.find(class_=DATASET_CITY.lower())
        
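# print(antwerp_table)
rows = antwerp_table.find_all('tr', class_='')
for tr in rows:
    cols = tr.find_all('td')
    # Rows without four <td> cells (for example a header row of <th> cells) are skipped.
    if len(cols) >= 4:
        # The download link sits in the third <td>.
        link = cols[2].find('a').get('href')
        print(link)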

【Comments】:

【Solution 3】:

First we get all <tr> elements with class="", then all <a> elements inside them, and finally all of the href values

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
c = requests.get(DATASET_URL).content
soup = BeautifulSoup(c, "html.parser")
trs = soup.find(class_=DATASET_CITY.lower()).find_all('tr', class_='')
# Flatten the <a> tags found in each matching row, then read their href values.
anchors = [a for tr in trs for a in tr.find_all('a')]
links = [a.get('href') for a in anchors]
print(links)

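【Comments】:

【Solution 4】:

There are different ways to get the non-archived hrefs. Given the structure of the table, I would recommend a bs4 CSS selector that selects every <a> inside a <tr> with an empty class attribute: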

soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')
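
As a rough illustration of what the attribute selector matches, the sketch below runs it against three hypothetical rows (empty class, class="archived", and no class attribute at all; the relative hrefs are made up). It also shows `tr:not(.archived)` as an alternative that is an assumption on my part and not part of this answer: unlike `tr[class=""]`, it also keeps rows that have no class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: empty class, non-empty class, and no class attribute at all.
SAMPLE_HTML = """
<table class="antwerp">
<tr class=""><td><a href="/current.csv">current</a></td></tr>
<tr class="archived"><td><a href="/archived.csv">archived</a></td></tr>
<tr><td><a href="/no-class.csv">no class</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches only rows whose class attribute is present and empty.
empty_class = [a["href"] for a in soup.select('.antwerp tr[class=""] a')]

# Matches every row that does not carry the "archived" class,
# including rows with no class attribute.
not_archived = [a["href"] for a in soup.select(".antwerp tr:not(.archived) a")]

print(empty_class)
print(not_archived)
```

Which of the two fits better depends on whether the page always writes class="" on current rows, as the HTML quoted in the question does.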

Example

import requests
from bs4 import BeautifulSoup
from datetime import datetime

DATASET_URL = "http://insideairbnb.com/get-the-data.html"
DATASET_CITY = "Antwerp"
r = requests.get(DATASET_URL)
content = r.content
soup = BeautifulSoup(content, "html.parser")
# Collect the href of every <a> inside a row with an empty class attribute.
antwerp_links = [url['href'] for url in soup.select(f'.{DATASET_CITY.lower()} tr[class=""] a')]
print(antwerp_links)

Output

['http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/calendar.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/reviews.csv.gz',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/reviews.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.csv',
 'http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/neighbourhoods.geojson']
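
Since the question mentions downloading the files later, here is one possible follow-up, using only the standard library. The helpers `local_name` and `download` are hypothetical names introduced for illustration. The local name is built from the last two path segments so that files under data/ and visualisations/ stay easy to tell apart. `download` is defined but not called, because it needs network access and the URLs may no longer be valid.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_name(url: str) -> str:
    """Build a local file name from the last two path segments of the URL."""
    parts = urlparse(url).path.strip("/").split("/")
    return "_".join(parts[-2:])


def download(url: str, dest_dir: str = ".") -> str:
    """Download one file into dest_dir and return the path written (needs network access)."""
    target = os.path.join(dest_dir, local_name(url))
    urlretrieve(url, target)
    return target


# Two of the URLs from the output above.
links = [
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/data/listings.csv.gz",
    "http://data.insideairbnb.com/belgium/vlg/antwerp/2021-09-29/visualisations/listings.csv",
]
names = [local_name(url) for url in links]
print(names)
```

Calling `download(url)` for each entry of the scraped list would then fetch the files one by one.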

【Comments】: