【Question title】: Scraping Information for One Result on a Page of Multiple Results
【Posted】: 2019-03-29 23:08:36
【Question description】:

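I want to scrape/parse the data for one specific result from a page that contains multiple results.

For example, below is a snippet of the source HTML of a page that shows two business search results in a business directory. Both have business items such as Status. However, I only want the business items associated with the street address 311 South Swall Drive.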

</section><section itemscope itemtype="http://schema.org/Organization" class="org">
<div class="b-business-item">
<div class='b-business-item_header-wrap  '>
<div class='b-business-item_title-wrap'>
<h2 class="b-business-item_header uppercase"><a itemprop="url" href="/p/kash+apparel+lp-12645872"><font itemprop="name">Kash Apparel, Lp</font></a></h2>

<p class="b-business-item_sub-header"><span class="addr-cont" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">2615 Fruitland Ave</span>, <span><span itemprop="addressLocality">Los Angeles</span>, <span itemprop="addressRegion">CA</span> <span itemprop="postalCode">90058</span></span></span></p>
</div>
</div>
<p class="b-business-item_props"><span class="b-business-item_title">Status:</span><span class="b-business-item_value">Inactive</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Industry:</span><span class="b-business-item_value">Mfg Women's/Misses' Outerwear</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Members (3):</span><span class="b-business-item_value">Mel Salde <span class='gray-text'>(Accountant, inactive)</span><br/>Edir Haroni <span class='gray-text'>(Limited Partner, inactive)</span><br/>Stephanie Kleinjan <span class='gray-text'>(General Partner, inactive)</span></span></p>
</div>
</section><section itemscope itemtype="http://schema.org/Organization" class="org">
<div class="b-business-item">
<div class='b-business-item_header-wrap  '>
<div class='b-business-item_title-wrap'>
<h2 class="b-business-item_header uppercase"><a itemprop="url" href="/p/kash+inc-178509132"><font itemprop="name">KASH INC</font></a></h2>

<p class="b-business-item_sub-header"><span class="addr-cont" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">311 South Swall Drive</span>, <span><span itemprop="addressLocality">Los Angeles</span>, <span itemprop="addressRegion">CA</span> <span itemprop="postalCode">90048</span></span></span></p>
</div>
</div>
<p class="b-business-item_props"><span class="b-business-item_title">Status:</span><span class="b-business-item_value">Inactive</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Registration:</span><span class="b-business-item_value">Sep 26, 2006</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">State ID:</span><span class="b-business-item_value">C2904860</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Business type:</span><span class="b-business-item_value">Articles of Incorporation</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Member:</span><span class="b-business-item_value">Ashwant Venkatram <span class='gray-text'>(President, inactive)</span></span></p>

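I am trying to scrape the Status, Registration, State ID, Business type, and Member for 311 South Swall Drive, but not the similar fields of the other result. Unfortunately, the business directory offers no way to enter an address to narrow the search down to a single result.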

【Comments】:

    Tags: html web-scraping beautifulsoup html-parsing


    【Solution 1】:

    I think this is what you're looking for:

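    from bs4 import BeautifulSoup

    # Assumption: `html` holds the page source as a string.
    soup = BeautifulSoup(html, 'html.parser')
    for sect in soup.find_all('section'):
        for adrs in sect.select('span[itemprop="streetAddress"]'):
            # Print every <p> of the one result at the target address.
            if adrs.text == '311 South Swall Drive':
                for item in sect.select('p'):
                    print(item.text)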
    

    Output:

    311 South Swall Drive, Los Angeles, CA 90048
    Status:Inactive
    Registration:Sep 26, 2006
    State ID:C2904860
    Business type:Articles of Incorporation
    Member:Ashwant Venkatram (President, inactive)
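
As a possible follow-up to the loop above, the labelled fields can be collected into a dictionary rather than printed. This is only a sketch: `business_fields` and the trimmed `HTML` sample are hypothetical names made up for illustration, and it assumes every result sits in its own `section` and uses the `b-business-item_title` / `b-business-item_value` classes shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup (two results, one per <section>).
HTML = """
<section itemscope itemtype="http://schema.org/Organization" class="org">
<div class="b-business-item">
<p class="b-business-item_sub-header"><span itemprop="streetAddress">2615 Fruitland Ave</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Status:</span><span class="b-business-item_value">Inactive</span></p>
</div>
</section><section itemscope itemtype="http://schema.org/Organization" class="org">
<div class="b-business-item">
<p class="b-business-item_sub-header"><span itemprop="streetAddress">311 South Swall Drive</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Status:</span><span class="b-business-item_value">Inactive</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">State ID:</span><span class="b-business-item_value">C2904860</span></p>
<p class="b-business-item_props"><span class="b-business-item_title">Member:</span><span class="b-business-item_value">Ashwant Venkatram <span class='gray-text'>(President, inactive)</span></span></p>
</div>
</section>
"""


def business_fields(html, street_address):
    """Return {label: value} for the result at street_address, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for section in soup.find_all("section"):
        street = section.select_one('span[itemprop="streetAddress"]')
        # Skip results with no address or with a different address.
        if street is None or street.get_text(strip=True) != street_address:
            continue
        fields = {}
        # Only the property rows are read; the address line is left out.
        for row in section.select("p.b-business-item_props"):
            title = row.select_one("span.b-business-item_title")
            value = row.select_one("span.b-business-item_value")
            if title is not None and value is not None:
                label = title.get_text(strip=True).rstrip(":")
                # Join nested text pieces (e.g. the gray role text) with a space.
                fields[label] = value.get_text(" ", strip=True)
        return fields
    return None


if __name__ == "__main__":
    print(business_fields(HTML, "311 South Swall Drive"))
```

Selecting `p.b-business-item_props` instead of every `p` skips the address paragraph, and stripping the trailing colon from each label makes the keys easier to look up, for example `fields["State ID"]`.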
    

    【Comments】:

    • This is perfect! I'm relatively new to scraping from HTML, so I'm still getting used to how BeautifulSoup parses it. Thanks!