How to extract the text from the following HTML code? Answers

【Question Title】: How to extract the text from the following HTML code?
【Posted】: 2020-05-26 16:32:52
【Question Description】:

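I am doing web scraping for a DS project, and I am using BeautifulSoup for it. But I cannot extract the Duration from the "tbody" tag inside the table with class "table". Here is the HTML code: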

<div class="table-responsive">
    <table class="table">
        <thead>
            <tr>
                <th>Start Date</th>
                <th>Duration</th>
                <th>Stipend</th>
                <th>Posted On</th>
                <th>Apply By</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>
                    <div id="start-date-first">Immediately</div>
                </td>
                <td>1 Month</td>
                <td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
                1500 /month
                </td>
                <td>26 May'20</td>
                <td>23 Jun'20</td>
            </tr>
        </tbody>
    </table>
</div>

Note: to extract the "Immediately" text, I use the following code:

x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text

【Question Discussion】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

You can use the select() function to find tags with a CSS selector.

tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag

print(tds[1].text)

select() returns a list of all the HTML tags that match the selector. The one you want is the second, so use index 1 and then read its text.
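As a minimal, self-contained sketch of the approach above: the HTML from the question is pasted into a string, and `container` here is only a stand-in for whatever element the asker already has (an assumption, since the question does not show how it was built). The built-in `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
html = """
<div class="table-responsive">
    <table class="table">
        <thead>
            <tr>
                <th>Start Date</th><th>Duration</th><th>Stipend</th>
                <th>Posted On</th><th>Apply By</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td><div id="start-date-first">Immediately</div></td>
                <td>1 Month</td>
                <td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
                1500 /month
                </td>
                <td>26 May'20</td>
                <td>23 Jun'20</td>
            </tr>
        </tbody>
    </table>
</div>
"""

# Assumption: "container" stands in for the asker's parsed element.
container = BeautifulSoup(html, "html.parser")

# select() returns every matching tag, in document order.
tds = container.select("div.table-responsive > table > tbody > tr > td")

# The Duration cell is the second <td>, so index 1.
duration = tds[1].get_text(strip=True)
print(duration)

# Alternative: let the selector pick the second <td> directly.
second_td = container.select_one("tbody > tr > td:nth-of-type(2)")
print(second_td.get_text(strip=True))
```

Using get_text(strip=True) instead of .text trims the surrounding whitespace, which matters for cells such as the stipend one that span several lines.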

【Discussion】:

【Solution 2】:

Try this:

from bs4 import BeautifulSoup
import requests

url = "yourUrlHere"

pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw, 'lxml')
print(soup.table)

In my code I use the lxml library to parse the data. If you want to install it: pip install lxml... Or just change this part of the code to the parser you use:

soup = BeautifulSoup(pageRaw, 'lxml')

This code returns the first table, okay?
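Once you have the first table, one possible way to get at the Duration is to pair each header with its cell. This is only a sketch under assumptions: the table HTML from the question is pasted into a string instead of being downloaded, the built-in html.parser replaces lxml so nothing extra has to be installed, and the names `headers`, `cells` and `row` are illustrative.

```python
from bs4 import BeautifulSoup

# Table HTML copied from the question (normally this would come from requests).
html = """<table class="table">
<thead><tr>
<th>Start Date</th><th>Duration</th><th>Stipend</th><th>Posted On</th><th>Apply By</th>
</tr></thead>
<tbody><tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr></tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.table  # first <table> in the document

# Column names from the header row, cell texts from the first body row.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
cells = [td.get_text(" ", strip=True) for td in table.tbody.tr.find_all("td")]

# Map each header to the cell in the same column.
row = dict(zip(headers, cells))
print(row["Duration"])
```

Looking the value up by its header name keeps working if the columns are reordered, which indexing by position does not.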

Take care.

【Discussion】: