【Question title】:Extracting anchor text from span class with BeautifulSoup
【Posted】:2016-04-19 01:16:22
【Question】:

This is the HTML I want to scrape:

<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>, 
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>

I want to get the anchor text of each a href: cinematic, dissolve, epic, and so on.

Here is my code:

url = urllib2.urlopen("http://example.com")

content = url.read()
soup = BeautifulSoup(content)

links = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for link in links:
    print link.find_all('a')['href']

If I do this with link.find_all, I get the error: TypeError: list indices must be integers, not str.

But if I print link.find('a')['href'], I only get the first one.

How can I get all of them?

【Comments】:

    Tags: python beautifulsoup scrape


    【Solution 1】:

    You can do the following:

    from bs4 import BeautifulSoup
    
    content = '''
    <span class="meta-attributes__attr-tags">
    <a href="/tags/cinematic" title="cinematic">cinematic</a>, 
    <a href="/tags/dissolve" title="dissolve">dissolve</a>,
    <a href="/tags/epic" title="epic">epic</a>,
    <a href="/tags/fly" title="fly">fly</a>,
    </span>
    '''
    
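    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(content, "html.parser")
    spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
    for span in spans:
        # find_all() returns a list of Tags, so loop over it and index each Tag.
        links = span.find_all('a')
        for link in links:
            print(link['href'])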
    

    Output

    /tags/cinematic
    /tags/dissolve
    /tags/epic
    /tags/fly
    

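    【Comments】:

    【Solution 2】:

    link.find_all('a') returns a list of bs4 Tag objects, and a list cannot be indexed with a string such as 'href'. You probably want to index each link by href instead. So maybe this is closer to what you need: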

    span = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
    for links in span:
        for link in links.find_all('a'):
            print(link['href'])
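
To make the cause of the TypeError concrete, here is a minimal sketch that contrasts find() with find_all(). The shortened HTML sample and the variable names are illustrative assumptions, and "html.parser" is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative sample).
html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
</span>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="meta-attributes__attr-tags")

# find() returns a single Tag, which supports attribute lookup by name.
first_href = span.find("a")["href"]

# find_all() returns a list-like ResultSet, which only accepts integer indices.
anchors = span.find_all("a")
try:
    anchors["href"]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Indexing each Tag inside the list is what works.
hrefs = [a["href"] for a in anchors]
print(first_href, error_name, hrefs)
```

The same reasoning explains why link.find('a')['href'] in the question only ever yields the first link.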
    

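    【Comments】:

    • Yes, that works, but how do I get the actual text from the <a> tag? I remember it was something like .contents, but I'm not sure exactly.
    • If you want the text, use link.getText() instead of link['href']. Or, if the tag can contain child elements, link.contents gives a list of its contents.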
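
The difference between the two suggestions in the comment above shows up once a link has a child element. The sketch below uses a made-up anchor with a nested <b> tag, which is an assumption for illustration and not part of the page in the question. It calls get_text(), the snake_case spelling of getText().

```python
from bs4 import BeautifulSoup

# Hypothetical anchor with a nested element, for illustration only.
html = '<a href="/tags/epic" title="epic"><b>epic</b> shots</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# get_text() joins every string inside the tag, including nested ones.
text = link.get_text()

# .contents is a list of the tag's direct children (Tags and strings).
children = link.contents

print(repr(text))   # 'epic shots'
print(children)     # [<b>epic</b>, ' shots']
```

For the plain links in the question, which contain only text, get_text() is likely all that is needed.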
    【Solution 3】:
    from bs4 import BeautifulSoup
    
    html = """
    <span class="meta-attributes__attr-tags">
    <a href="/tags/cinematic" title="cinematic">cinematic</a>, 
    <a href="/tags/dissolve" title="dissolve">dissolve</a>,
    <a href="/tags/epic" title="epic">epic</a>,
    <a href="/tags/fly" title="fly">fly</a>,
    </span>
    """
    
    soup = BeautifulSoup(html, "lxml")
    spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
    
    for span in spans:
        for link in span.find_all('a'):
            print(link.text, link['href'])
    

    Another, more expensive way might be:

    from bs4 import BeautifulSoup
    
    html = """
    <span class="meta-attributes__attr-tags">
    <a href="/tags/cinematic" title="cinematic">cinematic</a>,
    <a href="/tags/dissolve" title="dissolve">dissolve</a>,
    <a href="/tags/epic" title="epic">epic</a>,
    <a href="/tags/fly" title="fly">fly</a>,
    </span>
    """
    
    soup = BeautifulSoup(html, "lxml")
    links = soup.find_all("a")
    
    for link in links:
        if 'meta-attributes__attr-tags' not in link.parent.get('class', []):
            continue
    
        print(link.text, link['href'])
    

    【Comments】:

      【Solution 4】:

      You can avoid the nested loop and any extra if checks by using a CSS selector:

      for link in soup.select(".meta-attributes__attr-tags a[href]"):
          print(link["href"], link.get_text())
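
For completeness, a self-contained sketch of the selector approach that collects the anchor text into a list, which is what the question asks for. It assumes a Beautiful Soup release that ships select() with CSS attribute selectors. The names texts and pairs are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector: every <a> with an href inside the tag container.
anchors = soup.select(".meta-attributes__attr-tags a[href]")

# Anchor text only, then (text, href) pairs.
texts = [a.get_text(strip=True) for a in anchors]
pairs = [(a.get_text(strip=True), a["href"]) for a in anchors]

print(texts)  # ['cinematic', 'dissolve', 'epic', 'fly']
```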
      

      【Comments】:
