【Question title】: How to crawl a single block of a website
【Posted】: 2015-04-16 13:26:18
【Question description】:

The relevant part of the HTML looks like this:

<div id="block-hubs3d-hub-hub-specialties" class="block block-hubs3d-hub first odd">
        <h3 class="block-title">Specialties</h3>

<div class="field field-name-field-hub-specialties field-type-taxonomy-term-reference field-label-hidden">
    <div class="field-items">
          <div class="field-item item-1 even">ABS+PLA+Nylon+Flexible</div>
          <div class="field-item item-2 odd">Custom Finishing</div>
          <div class="field-item item-3 even">DLP - SLA Technology</div>
          <div class="field-item item-4 odd">Makerjuice G+</div>
      </div>
</div>

How can I extract it in a format like this:

specialties: ABS+PLA+Nylon+Flexible, Custom Finishing, DLP - SLA Technology, Makerjuice G+

So far I only know how to use bs4 to load the text of the whole page:

import bs4
import requests
response = requests.get('https://www.3dhubs.com/new-york/hubs/peerless')
soup = bs4.BeautifulSoup(response.text, 'html.parser')

【Question discussion】:

Tags: python web-crawler bs4


【Solution 1】:

Find the divs by their class:

import bs4

h = """
<div id="block-hubs3d-hub-hub-specialties" class="block block-hubs3d-hub first odd">
        <h3 class="block-title">Specialties</h3>

<div class="field field-name-field-hub-specialties field-type-taxonomy-term-reference field-label-hidden">
    <div class="field-items">
          <div class="field-item item-1 even">ABS+PLA+Nylon+Flexible</div>
          <div class="field-item item-2 odd">Custom Finishing</div>
          <div class="field-item item-3 even">DLP - SLA Technology</div>
          <div class="field-item item-4 odd">Makerjuice G+</div>
      </div>
</div>
"""

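# Name the parser explicitly so the result does not depend on which parsers are installed
b = bs4.BeautifulSoup(h, "html.parser")

# Class matching is per token, so "field-item" does not match the "field-items" wrapper
specialties = [div.text for div in b.find_all("div", {"class": "field-item"})]
print(", ".join(specialties))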

Output:

ABS+PLA+Nylon+Flexible, Custom Finishing, DLP - SLA Technology, Makerjuice G+
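
If other parts of the page also use the field-item class, it may help to restrict the search to the one block first and take the label from its heading, which gives the "specialties: ..." format asked for in the question. The following is only a sketch: format_block and SAMPLE_HTML are hypothetical names introduced here, the sample markup is a shortened copy of the snippet from the question, and it assumes the block keeps the id block-hubs3d-hub-hub-specialties.

```python
import bs4

# Shortened copy of the markup from the question, plus one unrelated
# field-item outside the block to show that scoping excludes it.
SAMPLE_HTML = """
<div id="block-hubs3d-hub-hub-specialties" class="block block-hubs3d-hub first odd">
  <h3 class="block-title">Specialties</h3>
  <div class="field-items">
    <div class="field-item item-1 even">ABS+PLA+Nylon+Flexible</div>
    <div class="field-item item-2 odd">Custom Finishing</div>
  </div>
</div>
<div class="field-item">Unrelated item outside the block</div>
"""


def format_block(html, block_id="block-hubs3d-hub-hub-specialties"):
    """Return 'title: item, item, ...' for one block, or None if it is absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Look up the enclosing block by id so that only its descendants are searched.
    block = soup.find("div", id=block_id)
    if block is None:
        return None

    # Use the block heading as the label; fall back to the id if there is none.
    title = block.find("h3", class_="block-title")
    label = title.get_text(strip=True).lower() if title else block_id

    # class_ matches whole class tokens, so the "field-items" wrapper is skipped.
    items = [d.get_text(strip=True) for d in block.find_all("div", class_="field-item")]
    return "{}: {}".format(label, ", ".join(items))


print(format_block(SAMPLE_HTML))
```

For the live page, the same function could be called with response.text from the requests.get call in the question. The site's markup may have changed since the question was posted, so the None return value is worth checking.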

【Discussion】:
