Using regular expressions in BeautifulSoup's find_all

【Question title】: Using regular expression in find_all of Beautifulsoup
【Posted】: 2017-06-30 22:21:55
【Question description】:

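I am trying to scrape a tumblr archive. The div class attributes look like the ones shown in the picture.

The classes start with "post post_micro". I tried using a regular expression, but it failed: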

soup.find_all(class_=re.compile('^post post_micro')

I then tried passing a function to find_all for the class:

def func(x):
    if str(x).startswith('post_tumblelog'):
        return True

and used it as:

soup.find_all(class_=func)

The above works fine and I get what I need. But I would like to know how to do this with a regular expression, and why, in func(x),

str(x).startswith('post_tumblelog')

evaluates to True when the class name starts with "post post_micro".

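【Question comments】:

【Solution 1】:

In BeautifulSoup 4, you can use the .select() method, since it accepts CSS attribute selectors. In your case, you would use the attribute selector [class^="post_tumblelog"], which selects elements whose class attribute starts with the string post_tumblelog.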

soup.select('[class^="post_tumblelog"]')
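As a minimal sketch of how this selector behaves: the HTML below is made up, since the asker's markup was only shown as an image, and the class names in it are assumptions. The `^=` operator is compared against the whole class attribute string, so it only matches when the attribute begins with the prefix, while `*=` matches the prefix anywhere in the attribute. This assumes a reasonably recent bs4 release.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real tumblr archive HTML was not posted.
html = """
<div class="post_tumblelog_abc post_micro">first</div>
<div class="post post_micro post_tumblelog_abc">second</div>
<div class="sidebar">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= : the attribute string must START with the prefix.
starts_with = [d.get_text() for d in soup.select('[class^="post_tumblelog"]')]

# *= : the prefix may appear ANYWHERE in the attribute string.
contains = [d.get_text() for d in soup.select('[class*="post_tumblelog"]')]

print(starts_with)
print(contains)
```

If the class attribute really begins with "post post_micro", the `^=` form with post_tumblelog would not be expected to match it, and `*=` may be the closer fit.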

Alternatively, you could also use:

soup.find_all(class_=lambda x: x and x.startswith('post_tumblelog'))

As a side note, it looks like you are missing a closing parenthesis. The following is valid:

soup.find_all(class_=re.compile('^post_tumblelog'))
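A small check of the two find_all variants, again on made-up HTML with assumed class names. It illustrates that bs4 treats class as a multi-valued attribute and tries the filter against each class on its own, which appears to be why a prefix such as post_tumblelog can match an element whose attribute starts with "post post_micro".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumptions.
html = """
<div class="post post_micro post_tumblelog_abc">a</div>
<div class="post_tumblelog_xyz">b</div>
<div class="sidebar">c</div>
<div>d</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The regex is tried against each class separately, so the third
# class of the first div ("post_tumblelog_abc") satisfies it.
regex_hits = [d.get_text() for d in
              soup.find_all(class_=re.compile(r"^post_tumblelog"))]

# The function may receive None for tags without a class attribute,
# hence the "x and ..." guard.
lambda_hits = [d.get_text() for d in
               soup.find_all(class_=lambda x: x and x.startswith("post_tumblelog"))]

print(regex_hits)
print(lambda_hits)
```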

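【Comments】:

Using .select gives an error: Unsupported or invalid CSS selector: "[class^="post", and the other two options work with 'post_tumblelog' but not with 'post post_micro'. I don't know why that is.
Yes, two of them work, the one with the lambda function and the one with the regular expression, but the argument passed needs to be 'post_tumblelog'.
@sandepp - I just tested it with both strings, post_tumblelog and post post_micro, and they both worked. Would you mind posting your HTML and the version of BeautifulSoup you are using?
The bs4 version is 4.3.2. I think the problem is the whitespace in the class value. I only picked up Beautifulsoup last night and know nothing about CSS and HTML, so I think I need to put in some more work.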
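
The whitespace guess seems to be on the right track. bs4 parses the class attribute into a list of separate classes, so a pattern containing a space can never match any single class. As far as I can tell, older releases such as the 4.3.2 mentioned above test a regular expression only against the individual classes, while newer releases also retry against the space-joined string, which would explain why the two commenters saw different results. A sketch that should not depend on that behavior is to filter on the tag itself and join the classes manually. The helper name `class_starts_with` and the HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumptions.
html = """
<div class="post post_micro post_tumblelog_abc">a</div>
<div class="post post_full">b</div>
<div class="sidebar post post_micro">c</div>
"""
soup = BeautifulSoup(html, "html.parser")

# With an HTML parser, class is exposed as a list of separate classes.
first_classes = soup.find("div")["class"]

def class_starts_with(prefix):
    """Build a tag filter matching on the space-joined class attribute."""
    def matcher(tag):
        # tag.get returns the list of classes, or [] when there is none.
        return " ".join(tag.get("class", [])).startswith(prefix)
    return matcher

# Passing a function as the first argument filters on whole tags.
hits = [d.get_text() for d in soup.find_all(class_starts_with("post post_micro"))]

print(first_classes)
print(hits)
```

Note that the joined string uses single spaces, so unusual whitespace in the original attribute is normalized before the prefix test.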