【Question title】: BeautifulSoup: use a unique CSS selector
【Posted】: 2017-01-23 07:28:13
【Question】:

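From this page, I need to get the status of "Anbindung an das Telefonnetz".

I identified two ways to get it:

  1. Check whether the status contains the sentence "Das System arbeitet einwandfrei";
  2. Check whether the background color is green.

I chose the first option.

I use Python/BeautifulSoup to scrape the page. The problem is that there is no unique id/class or anything else to identify this element.
So I decided to use the CSS selector of this specific element, which is: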

div.system-item:nth-child(2) > div:nth-child(1) > p:nth-child(3)

and used it like this:

print(page.select("div.system-item:nth-child(2) > div:nth-child(1) > p:nth-child(3)"))

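However, the only thing I get back is an empty list ([]).

Is there anything else I can try to get this specific element?

EDIT
As some of you recommended, here is the incomplete HTML source of the page. To be practical, though, I suggest you look at the page yourself.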

<!doctype html>
<head>
    <meta charset="utf-8">

            <title>Aktueller Status | Placetel</title>

    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="msvalidate.01" content="756F6E40DD887A659CE83E5A92FFBB62">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <meta name="generator" content="Kirby 2.3.2">

    <meta name="description" content="Placetel Systemstatus: Erfahren Sie mehr &uuml;ber den aktuellen Status der Placetel Telefonanlage.">
    <meta name="keywords" content="">

        <meta name="robots" content="index,follow,noodp,noydir">

    <link rel="canonical" href="https://www.placetel.de/status">
    <link rel="publisher" href="https://plus.google.com/b/111027512373770716962/111027512373770716962/posts">

    <link rel="shortcut icon" href="/favicon.ico">
    <link rel="apple-touch-icon" href="/apple-touch-icon.png">
    <meta name="msapplication-TileColor" content="#0e70b9">
    <meta name="msapplication-TileImage" content="/ms-tile-icon.png">
    <meta name="theme-color" content="#0e70b9">

    <script src="//use.typekit.net/rnw8lad.js"></script>
    <script>try { Typekit.load({ async: true }); } catch (e) {}</script>

    <link rel="stylesheet" href="https://www.placetel.de/assets/dist/css/main.css">    <script src="https://www.placetel.de/assets/dist/js/modernizr.js"></script>
    <link rel="dns-prefetch" href="//app.marketizator.com"/>
    <script>
        var _mktz = _mktz || [];
        _mktz.cc_domain = 'placetel.de';
    </script>
    <script type="text/javascript" src="//d2tgfbvjf3q6hn.cloudfront.net/js/o17fe41.js"></script>
</head>
<body id="????" class="page page-template-page-sections page-uid-status">

<script>
    var gaProperty = 'UA-17631409-3';
    var disableStr = 'ga-disable-' + gaProperty;
    if (document.cookie.indexOf(disableStr + '=true') > -1) {
        window[disableStr] = true;
    }
    function gaOptout() {
        document.cookie = disableStr + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/';
        window[disableStr] = true;
    }
</script>

<!-- Google Tag Manager -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-KDNGCC"
                  height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
        new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
                                                  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
        '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-KDNGCC');</script>
<!-- End Google Tag Manager -->
<header class="header header-condensed" id="header">
    <div class="container-fluid">

<nav class="navigation navigation-top">
    <ul>
                    <li class=" ">
                <a title="Unternehmen" href="https://www.placetel.de/unternehmen">

                    <span>Unternehmen</span>
                </a>
            </li>
                    <li class=" ">
                <a title="Partner werden" href="https://www.placetel.de/partner">

                    <span>Partner werden</span>
                </a>
            </li>
                    <li class=" ">
                <a title="Support" href="https://www.placetel.de/support">

                    <span>Support</span>
                </a>
            </li>
                    <li class=" ">
                <a title="Suche" href="javascript:modal('search')">

                    <span>Suche</span>
                </a>
            </li>
                <li class="navigation-top-support">
            <a href="https://www.placetel.de/support">
                <svg class="svg-phone"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-phone"></use></svg>                <span>0221 29 191 999</span>
            </a>
        </li>
        <li class="navigation-top-login">
            <a href="https://app.placetel.de/account/login">
                <span>Login</span>
            </a>
        </li>
    </ul>
</nav>    </div>

    <div class="container-fluid">
        <a class="site-logo" href="https://www.placetel.de">
            <svg class="svg-placetel-logo"><title>Placetel</title> <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-placetel-logo"></use></svg>        </a>

<nav class="navigation navigation-main" id="navigation-main">
    <ul>

            <li class="has-sub-navigation">
                <a title="Telefonanlage" href="https://www.placetel.de/telefonanlage"
                   class="">
                    <span>Telefonanlage</span>

                                            <svg class="svg-arrow"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-arrow"></use></svg>                                    </a>

                                    <nav class="sub-navigation">
                        <ul>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage">
                                        Vorteile                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/preise">
                                        Preise                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/funktionen">
                                        Funktionen                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/unified-communication">
                                        Unified Communication                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/funktionsweise">
                                        Wie funktioniert es?                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/isdn-abschaltung">
                                        ISDN-Abschaltung                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/telefonanlage/faq">
                                        FAQ                                    </a>
                                </li>
                                                    </ul>
                    </nav>
                            </li>

            <li class="">
                <a title="Trunking" href="https://www.placetel.de/sip-trunking"
                   class="">
                    <span>Trunking</span>

                                    </a>

                            </li>

            <li class="">
                <a title="Mobilfunk" href="https://www.placetel.de/mobilfunk"
                   class="">
                    <span>Mobilfunk</span>

                                    </a>

                            </li>

            <li class="navigation-main-shop">
                <a title="Endger&auml;te-Shop" href="/shop/"
                   class="">
                    <span>Endger&auml;te-Shop</span>

                                    </a>

                            </li>

            <li class="visible-xs-block visible-sm-block">
                <a title="Support" href="https://www.placetel.de/support"
                   class="">
                    <span>Support</span>

                                    </a>

                            </li>

            <li class="visible-xs-block visible-sm-block">
                <a title="Partner" href="https://www.placetel.de/partner"
                   class="">
                    <span>Partner</span>

                                    </a>

                            </li>

            <li class="has-sub-navigation visible-xs-block visible-sm-block">
                <a title="Unternehmen" href="https://www.placetel.de/unternehmen"
                   class="">
                    <span>Unternehmen</span>

                                            <svg class="svg-arrow"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-arrow"></use></svg>                                    </a>

                                    <nav class="sub-navigation">
                        <ul>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen">
                                        &Uuml;ber uns                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen/technologie">
                                        Technologie                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen/jobs">
                                        Jobs                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen/events">
                                        Events                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen/presse">
                                        Presse                                    </a>
                                </li>
                                                            <li class="">
                                    <a href="https://www.placetel.de/unternehmen/kontakt">
                                        Kontakt                                    </a>
                                </li>
                                                    </ul>
                    </nav>
                            </li>

            <li class="navigation-main-register">
                <a title="Kostenlos testen!" href="javascript:modal('register')"
                   class="btn">
                    <span>Kostenlos testen!</span>

                                    </a>

                            </li>
            </ul>
</nav>        
        <a class="site-navigation-toggle" id="hotdog">
            <i>
                <span></span>
            </i> Menü
        </a>
    </div>
</header>


            <section class="section section-full section-full-section-einleitung-text section-full-normal">
    <div class="container-fluid typography typography-dark">
                    <h2 class="section-full-title">Der Placetel System Status</h2>

                    <h3 class="section-full-subtitle">Jeden Tag einen Grund zur Freude.</h3>

                    <p>Wir bei Placetel haben ein Lieblingswort: „läuft“. Der Grund: Ihre Placetel Telefonanlage funktioniert nämlich immer. Darüber freuen wir uns natürlich riesig. Da aber erst eine geteilte Freude eine richtige Freude ist, haben wir Ihnen diese Statusseite eingerichtet.  Diese Seite informiert Sie jeden Tag über den einwandfreien Status Ihrer Anlage.<br />
Und falls etwas mal nicht so perfekt funktionieren sollte wie gewohnt, können Sie uns den Fehler gern  melden.</p>        
            </div>

            <style>
            .section-full-section-einleitung-text {
                background-color: ;
            }
        </style>

    </section>    

            <section class="section section-system">
    <a class="btn btn-primary btn-transparent btn-with-icon" href="javascript:location.reload();">
        <svg class="svg-refresh"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-refresh"></use></svg>        Status aktualisieren
    </a>

    <div class="system flex-grid typography typography-light">
        <div class="system-item system-item-green flex-grid-item">
            <div class="system-item-inner">
                <h6>
                    System                </h6>

                <i>
                    <svg class="svg-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-included"></use></svg>                    <svg class="svg-dots"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-dots"></use></svg>                    <svg class="svg-not-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-not-included"></use></svg>                </i>

                <p>
                    Das System arbeitet einwandfrei<br>
                    11:10 Uhr
                </p>

                            </div>
        </div>

        <div class="system-item system-item-green flex-grid-item">
            <div class="system-item-inner">
                <h6>
                    Anbindung an das  Telefonnetz                </h6>

                <i>
                    <svg class="svg-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-included"></use></svg>                    <svg class="svg-dots"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-dots"></use></svg>                    <svg class="svg-not-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-not-included"></use></svg>                </i>

                <p>
                    Das System arbeitet einwandfrei<br>
                    11:10 Uhr
                </p>

                            </div>
        </div>

        <div class="system-item system-item-green flex-grid-item">
            <div class="system-item-inner">
                <h6>
                    Faxsystem                </h6>

                <i>
                    <svg class="svg-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-included"></use></svg>                    <svg class="svg-dots"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-dots"></use></svg>                    <svg class="svg-not-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-not-included"></use></svg>                </i>

                <p>
                    Das System arbeitet einwandfrei<br>
                    11:10 Uhr
                </p>

                            </div>
        </div>

        <div class="system-item system-item-green flex-grid-item">
            <div class="system-item-inner">
                <h6>
                    Konferenzsystem                </h6>

                <i>
                    <svg class="svg-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-included"></use></svg>                    <svg class="svg-dots"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-dots"></use></svg>                    <svg class="svg-not-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-not-included"></use></svg>                </i>

                <p>
                    Das System arbeitet einwandfrei<br>
                    11:10 Uhr
                </p>

                            </div>
        </div>

        <div class="system-item system-item-green flex-grid-item">
            <div class="system-item-inner">
                <h6>
                    Features und Optionen                </h6>

                <i>
                    <svg class="svg-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-included"></use></svg>                    <svg class="svg-dots"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-dots"></use></svg>                    <svg class="svg-not-included"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.placetel.de/assets/dist/sprites/svg/sprite.1471515912.svg#svg-not-included"></use></svg>                </i>

                <p>
                    Das System arbeitet einwandfrei<br>
                    11:10 Uhr
                </p>

                            </div>
        </div>
    </div>
</section>    

</body>
</html>

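【Comments】:

  • Why not just get div.system-item:nth-child(2) and check whether that element has the class system-item-green?
  • Even when I only use div.system-item:nth-child(2), I get an empty list ([]).
  • Questions seeking code help must include the shortest code necessary to reproduce the problem in the question itself, preferably in a Stack Snippet. See How to create a Minimal, Complete, and Verifiable example.
  • "Das System arbeitet einwandfrei" appears several times. Which one do you want?
  • The one in the "case" labelled Anbindung an das Telefonnetz.

Tags: python html css beautifulsoup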


【Solution 1】:

As far as I know, nth-child is still not implemented in BeautifulSoup4. Also, if you look into the site's CSS (i.e. the _system.scss file), you will find that there are 3 states:

  1. system-item-green
  2. system-item-yellow
  3. system-item-red

So you may want to change your code slightly, as follows:

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.placetel.de/status'
# Send a browser-like User-Agent with the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0'
}
source = requests.get(url, headers=headers)
soup = BS(source.text, 'html.parser')

# Second status box on the page; its class list encodes the status colour.
status = soup.select("div.system-item")[1].attrs['class']

if 'system-item-green' in status:
    print("It works!")
elif 'system-item-yellow' in status:
    print("Something's slightly wrong")
elif 'system-item-red' in status:
    print("Does not work")
else:
    print("Has someone changed page's markup?")
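
The class check above can also be combined with a lookup by heading text, so that the result does not depend on the position of the box in the list. The following is only a sketch under stated assumptions: `classify`, `status_of`, and the labels "ok", "degraded", and "down" are hypothetical names chosen for illustration, and `soup` is assumed to be the BeautifulSoup object built above.

```python
# Map the three status classes named in the answer to short labels.
# The labels ("ok", "degraded", "down") are illustrative, not from the site.
STATUS_BY_CLASS = {
    "system-item-green": "ok",
    "system-item-yellow": "degraded",
    "system-item-red": "down",
}


def classify(classes):
    """Return a label for the first known status class, or None if none matches."""
    for name in classes:
        if name in STATUS_BY_CLASS:
            return STATUS_BY_CLASS[name]
    return None


def status_of(soup, title):
    """Find the div.system-item whose <h6> text equals `title` and classify it.

    `soup` is assumed to be a BeautifulSoup object for the status page.
    Whitespace is collapsed before comparing, because the heading in the
    page source contains a double space and trailing padding.
    """
    wanted = " ".join(title.split())
    for item in soup.select("div.system-item"):
        heading = item.find("h6")
        if heading is None:
            continue
        if " ".join(heading.get_text().split()) == wanted:
            return classify(item.get("class", []))
    return None
```

With the `soup` from the snippet above, `status_of(soup, "Anbindung an das Telefonnetz")` would return one of the labels, or `None` if the heading is missing or the box carries none of the three classes, which makes a markup change visible instead of silently reading the wrong box.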

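【Comments】:

  • Thanks for your help! But since there are 4 different div.system-item elements, how can you be sure you picked "Anbindung an das Telefonnetz"?
  • @Mornor Start a Python shell and iterate over the elements of the list soup.select("div.system-item"). You will see that the second element of the list (i.e. soup.select("div.system-item")[1]) is the one you need.
  • Thanks for your solution!
【Solution 2】:

You can find the h6 for Anbindung an das Telefonnetz by its text and then get its p sibling: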

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.placetel.de/status").content
soup = BeautifulSoup(r, "lxml")

# The double space matches the heading as it appears in the page source.
h6 = soup.find("h6", string=re.compile(r"Anbindung an das  Telefonnetz", re.I))
if h6:
    print(h6.find_next_sibling("p"))

If you want full CSS3 selector support, you can use lxml's cssselect:

from lxml import html

# r is the raw page content fetched above; cssselect() needs the cssselect package.
tree = html.fromstring(r)
print(tree.cssselect("div.system-item:nth-child(2) > div:nth-child(1) > p:nth-child(3)"))
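
Newer Beautiful Soup releases (4.7.0 and later) delegate `select()` to the Soup Sieve package, which does implement `:nth-child`, so the selector from the question can be used directly there. The following is a minimal sketch, assuming such a version is installed; `SAMPLE_HTML` is a cut-down stand-in for the markup in the question, not the live page.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the markup from the question: two status boxes, each with
# an <h6> title, an <i> icon holder, and a <p> status line.
SAMPLE_HTML = """
<div class="system">
  <div class="system-item system-item-green">
    <div class="system-item-inner">
      <h6>System</h6><i></i>
      <p>Das System arbeitet einwandfrei<br>11:10 Uhr</p>
    </div>
  </div>
  <div class="system-item system-item-green">
    <div class="system-item-inner">
      <h6>Anbindung an das  Telefonnetz</h6><i></i>
      <p>Das System arbeitet einwandfrei<br>11:10 Uhr</p>
    </div>
  </div>
</div>
"""

# The selector from the question, unchanged.
SELECTOR = "div.system-item:nth-child(2) > div:nth-child(1) > p:nth-child(3)"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraph = soup.select_one(SELECTOR)

# The first text fragment of the paragraph is the status sentence;
# the second one (after the <br>) is the timestamp.
first_line = next(paragraph.stripped_strings) if paragraph is not None else None
print(first_line)
```

On older releases the same call is expected to fail rather than match, so the lxml route above remains the option when upgrading is not possible.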

You can also search by text alone, so it makes no difference if the h6 is changed to an h5 or any other tag:

# Match any text node with the heading, whatever tag it sits in.
match = soup.find(string=re.compile(r"Anbindung an das  Telefonnetz", re.I))

if match:
    print(match.parent.find_next_sibling("p").text)

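You can use the outer div to localize the text search; bs4 is very flexible. Simply selecting all div.system-item elements and indexing will break if the order changes, and you will not notice because no error is raised, so looking up the text may actually be the safer approach.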
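
The text lookups above hard-code the double space that happens to be in the page source, so they would stop matching if that spacing changed. One possible way to relax this is to build the pattern from the words of the title. This is a sketch only: `heading_pattern` and `status_text` are hypothetical helper names, and `soup` is assumed to be a BeautifulSoup object for the status page.

```python
import re


def heading_pattern(title):
    """Build a case-insensitive regex that matches `title` with any run of
    whitespace (spaces, newlines) between its words."""
    words = [re.escape(word) for word in title.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def status_text(soup, title):
    """Return the text of the <p> that follows the element containing `title`.

    `soup` is assumed to be a BeautifulSoup object; returns None when either
    the heading text or the following paragraph is missing.
    """
    match = soup.find(string=heading_pattern(title))
    if match is None:
        return None
    paragraph = match.parent.find_next_sibling("p")
    if paragraph is None:
        return None
    # Collapse the line break and indentation inside the paragraph.
    return " ".join(paragraph.get_text(" ").split())
```

Returning `None` for both failure cases keeps the caller's check simple; a caller that needs to tell a missing heading from a missing paragraph would have to split them.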

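【Comments】:

  • An even better solution. But can you explain why you used lxml instead of html.parser?
  • @Mornor, just habit more than anything; I wrote it without thinking. Parsing with the lxml parser is also faster, and many people have it installed when using bs4. You can use html.parser and it will work fine.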