【Question title】: Python BeautifulSoup: loop over a tag (<td><b>) and get all its siblings (a href)
【Posted】: 2021-01-27 13:09:42
【Question】:

I have the following HTML file that I want to traverse with Python's BeautifulSoup:

<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b> 
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp    
<td><b>1940 (English)  Jan</b> 
<a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp     
<tr><td><b>1940 (Spanish)  Feb</b> 
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp 
 ...OMITTED...
<td><b>1940 (English)  Indices</b> 
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp 
</table>

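Some of the td tags in this HTML are closed and some are not, but I assume that doesn't matter. What I want is the text of each link (a href) together with the corresponding bold text, like this: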

1940 (Spanish)  Jan|2
1940 (Spanish)  Jan|4
1940 (English)  Jan|2
1940 (English)  Jan|4
   ...
1940 (English)  Indices|Jan to Mar

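I can already iterate over the bold tds with my code; what I am trying to figure out is how to iterate over the text of the a hrefs. My current Python code is: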

import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Start at the first <td> and collect every <b> that follows it
elements = soup.find("td").find_all_next("b")
for el in elements:
    print(el)

Thanks in advance!

【Comments】:

  • Thanks, Sushil! Sorry, I edited my question, because the solution seems to alternate the td/b with the a hrefs.

Tags: python beautifulsoup html-parsing


【Solution 1】:

This should help you:

from bs4 import BeautifulSoup

html = """
<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b> 
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp    
<td><b>1940 (English)  Jan</b> 
<a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp 
<a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp     
<tr><td><b>1940 (Spanish)  Feb</b> 
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp 
<td><b>1940 (English)  Indices</b> 
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp 
</table>
"""

soup = BeautifulSoup(html,'html5lib')

table = soup.find('table')

a_tags = table.find_all('a')

for a in a_tags:
    print(a.text)

Output:

2
4
2
4
1
Jan to Mar

Here is the full version (with the HTML fetched via requests and the output formatted as requested):

from bs4 import BeautifulSoup
import requests

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url).text

soup = BeautifulSoup(page,'html5lib')

table = soup.find('table')

a_tags = table.find_all('a')
elements = soup.find("td").find_all_next("b")

for x in range(len(elements)):
    print(f"{elements[x].text}|{a_tags[x].text}")

Output:

1902 (Spanish)  Sep|10
1902 (Spanish)  Oct|17
1902 (Spanish)  Nov|24
1902 (Spanish)  Dec|1
1902 (Spanish)  Indices|8
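
Note that the loop above pairs the i-th <b> with the i-th <a> purely by position, so the pairs only line up when every heading has exactly one link; in the sample HTML from the question, most headings have several. As a hedged sketch (not the answerer's code), one alternative is to walk the siblings that follow each <b> and collect <a> tags until a different kind of tag starts. The shortened sample string and the helper name `links_after_heading` are assumptions made for illustration, and `html.parser` is used only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the sample markup from the question (assumed representative).
sample_html = """
<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Jan</b>
<a href="./1940/jan/2/home.htm">2</a>&nbsp
</table>
"""


def links_after_heading(b_tag):
    """Collect the <a> siblings that follow a <b> until another tag type starts."""
    links = []
    for sibling in b_tag.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip plain text nodes (whitespace, &nbsp)
        if sibling.name != "a":
            break  # e.g. the next <td>, which html.parser nests here when unclosed
        links.append(sibling)
    return links


soup = BeautifulSoup(sample_html, "html.parser")

# One (heading, link text) pair per link, grouped under the right heading
rows = [
    (b.text, a.text)
    for b in soup.find_all("b")
    for a in links_after_heading(b)
]

for heading, label in rows:
    print(f"{heading}|{label}")
```

Because the walk stops at the first non-<a> tag, it does not depend on whether the parser closed the <td> elements or nested them.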

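【Comments】:

  • Note that f-strings are not available before Python 3.6. Also, reduce the space between the pipe and the number.
  • Adding space between the strings makes the output look nicer. That is why I added it.
【Solution 2】: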

Try this:

import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Every <b> after the first <td>, and every <a> inside the table
elements = soup.find("td").find_all_next("b")
links = soup.find("table").find_all("a")

# zip pairs the two lists by position
for el, li in zip(elements, links):
    print('{a}|{b}'.format(a=el.text, b=li.text))
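
Since zip pairs items by position and silently stops at the shorter list, headings with more than one link drift out of step with their links. A minimal sketch of a grouping alternative, assuming (as in the question's sample) that each link sits in the same <td> as its heading: climb from each <a> to its nearest enclosing <td> and read the first <b> in that cell. The shortened sample string and the name `by_heading` are illustrative, not part of the original answer.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question (assumed representative).
sample_html = """
<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Jan</b>
<a href="./1940/jan/2/home.htm">2</a>&nbsp
<tr><td><b>1940 (Spanish)  Feb</b>
<a href="./1940sp/feb/1/home.htm">1</a>&nbsp
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Map each heading to the list of link texts found in the same cell
by_heading = defaultdict(list)
for a in soup.find_all("a"):
    cell = a.find_parent("td")     # nearest enclosing <td>
    heading = cell.find("b").text  # first <b> inside that cell
    by_heading[heading].append(a.text)

for heading, labels in by_heading.items():
    print(heading, labels)
```

A dictionary of lists keeps all links of a heading together, which is convenient if the next step is to write one row per heading rather than one row per link.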

【Comments】:

    【Solution 3】:

    You can use .find_previous('b') to find the matching <b> tag:

    from bs4 import BeautifulSoup
    
    
    txt = '''<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish)  Jan</b>
    <a href="./1940sp/jan/2/home.htm" target="_parent">2</a>&nbsp
    <a href="./1940sp/jan/4/home.htm" target="_parent">4</a>&nbsp
    <td><b>1940 (English)  Jan</b>
    <a href="./1940/jan/2/home.htm" target="_parent">2</a>&nbsp
    <a href="./1940/jan/4/home.htm" target="_parent">4</a>&nbsp
    <tr><td><b>1940 (Spanish)  Feb</b>
    <a href="./1940sp/feb/1/home.htm" target="_parent">1</a>&nbsp
     ...OMITTED...
    <td><b>1940 (English)  Indices</b>
    <a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>&nbsp
    </table>'''
    
    soup = BeautifulSoup(txt, 'html.parser')
    
    for a in soup.select('a'):
        print(a.find_previous('b').text, a.text)
    

    Prints:

    1940 (Spanish)  Jan 2
    1940 (Spanish)  Jan 4
    1940 (English)  Jan 2
    1940 (English)  Jan 4
    1940 (Spanish)  Feb 1
    1940 (English)  Indices Jan to Mar
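
The question asked for pipe-separated output rather than the space-separated form printed above. A small variation on the same find_previous idea, offered as a sketch: `sample_html` is a shortened copy of the question's markup, and the `lines` list and the `href=True` filter are illustrative choices rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question (assumed representative).
sample_html = """
<table><tr><td><b>1940 (Spanish)  Jan</b>
<a href="./1940sp/jan/2/home.htm">2</a>&nbsp
<a href="./1940sp/jan/4/home.htm">4</a>&nbsp
<td><b>1940 (English)  Indices</b>
<a href="./1940/ndx1/home.htm">Jan to Mar</a>&nbsp
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

lines = []
for a in soup.find_all("a", href=True):  # only anchors that carry an href
    heading = a.find_previous("b")       # closest <b> earlier in the document
    if heading is None:
        continue                         # link appears before any heading
    lines.append(f"{heading.text}|{a.text}")

print("\n".join(lines))
```

The `None` check guards against links that appear before the first heading, which the sample does not contain but a full page might.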
    

    【Comments】:

    • Hi! This is exactly what I wanted. I never realized I could start from the "a" and work backwards with find_previous.