Find a particular word in ID or CLASS name using beautifulsoup

【Question title】: Find a particular word in ID or CLASS name using beautifulsoup
【Posted】: 2014-07-30 07:52:29
【Question】:

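I am using beautifulsoup to extract information from product pages on e-commerce sites. The product pages I want to identify are those where:

"The CLASS or ID attribute contains the word 'thumb'", for example class="product_thumbs", id="thumbimages", etc.

At the moment my program only looks for .html in the URL, but that works for just one e-commerce site. I want it to search the whole HTML instead and find ID and CLASS attributes that contain the word "thumb".

My current code is: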

    if ".html" in childurl:  # store details into product_details table if it's a product page
        print("Product Found.!")
        print(childurl)
        soup = BeautifulSoup(urllib2.urlopen(childurl).read())
        priceele = soup.find(itemprop='price').string.strip()
        brandname = soup.find(itemprop='brand').string.strip()
        nameele = soup.find(itemprop='name').string.strip()
        image = soup.find(itemprop='image').get('src')

Please help.

【Question comments】:

Tags: python beautifulsoup web-crawler

【Solution 1】:

Try a regular expression:

import re
import bs4

html = """<html><body><div class="foo_thumb"></div><p class="wrong"></p><a id="barthumb"></a></body></html>"""
soup = bs4.BeautifulSoup(html, "html.parser")

# BeautifulSoup applies a compiled regex with search(), so 'thumb'
# matches anywhere in the attribute value; no '.*' padding is needed.
predicates = [
    {'id': re.compile('thumb')},
    {'class': re.compile('thumb')},
]
for p in predicates:
    print(soup.find_all(**p))
# prints [<a id="barthumb"></a>], then [<div class="foo_thumb"></div>]
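
If you would rather check both attributes in a single pass, and get a yes/no answer for "is this a product page", a function filter is one option. This is only a sketch: the helper names `has_thumb_attr`, `find_thumb_elements` and `looks_like_product_page` are made up for illustration, the match is case-insensitive by my own choice, and treating "at least one match" as "product page" is an assumption based on the question's description.

```python
import re
from bs4 import BeautifulSoup

# Assumption: match 'thumb' anywhere in the value, ignoring case.
THUMB = re.compile(r"thumb", re.IGNORECASE)


def has_thumb_attr(tag):
    """Return True if the tag's id or any of its class names contains 'thumb'."""
    tag_id = tag.get("id") or ""       # id is a plain string (or missing)
    classes = tag.get("class") or []   # class is parsed into a list of names
    return bool(THUMB.search(tag_id)) or any(THUMB.search(c) for c in classes)


def find_thumb_elements(html):
    """Return all matching tags in document order (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(has_thumb_attr)


def looks_like_product_page(html):
    """Assumed heuristic: a page with at least one 'thumb' element is a product page."""
    return len(find_thumb_elements(html)) > 0


sample = """<html><body><div class="foo_thumb"></div><p class="wrong"></p><a id="barthumb"></a></body></html>"""
matches = find_thumb_elements(sample)
print(matches)
print(looks_like_product_page(sample))
```

In the crawler from the question, the `".html" in childurl` test could then be swapped for `looks_like_product_page(page_source)`, where `page_source` stands for the HTML you have already downloaded for `childurl`.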

【Comments】: