Scraping a webpage that contains ::before: answers

【Question title】: Scrape webpage containing ::before
【Posted】: 2017-11-29 20:19:29
【Question description】:

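My problem is that when I scrape HTML with bs4, I cannot get at content that is rendered through something like ::before.

I want to find out which Sustainable Development Goals a company has contributed to, as shown on this page: https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091 However, the check marks are not visible in the page source.

What should I do, or what can I use, to scrape this from the website?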

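【Question discussion】:

Tags: python css web-scraping beautifulsoup

【Solution 1】:

You don't need the ::before part at all here. Selected and unselected elements carry different classes: selected ones have selected_question, unselected ones have advanced_question.

You can parse it with something like this: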

from bs4 import BeautifulSoup
import requests


url = "https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091"
response = requests.get(url)

# The lxml parser is used on purpose; see the note about html.parser below.
soup = BeautifulSoup(response.content, "lxml")

# Each question heading is an <li class="question_group"> inside the questionnaire list.
questions = soup.select("ul.questionnaire > li.question_group")
for question in questions:
    question_text = question.get_text(strip=True)
    print(question_text)

    # The answers are the <li> siblings that follow the question heading.
    answers = question.find_next_siblings("li")
    for answer in answers:
        answer_text = answer.get_text(strip=True)
        # A checked answer carries the "selected_question" class.
        is_selected = "selected_question" in answer.get("class", [])

        print(answer_text, is_selected)
    print("-----")

This prints:

Which of the following Sustainable Development Goals (SDGs) do the activities described in your COP address? [Select all that apply]
SDG 1: End poverty in all its forms everywhere False
SDG 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture False
SDG 3: Ensure healthy lives and promote well-being for all at all ages True
SDG 4: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all False
...

Note the True printed for the selected answer.

I also noticed that this code does not work properly if html.parser is chosen as the parser.
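
To try the class-based check without requesting the live page, here is a minimal sketch on a hand-written fragment. The markup in SAMPLE_HTML and the helper name selected_answers are assumptions made for illustration; only the class names selected_question, advanced_question, question_group and questionnaire come from the answer above, and the real page may be structured differently. Because this small fragment is well-formed, the built-in html.parser is enough here, which says nothing about how it behaves on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the class names are taken from the answer;
# the actual page markup may differ.
SAMPLE_HTML = """
<ul class="questionnaire">
  <li class="question_group">Which SDGs do the activities address?</li>
  <li class="advanced_question">SDG 1</li>
  <li class="selected_question">SDG 3</li>
  <li class="advanced_question">SDG 4</li>
</ul>
"""


def selected_answers(html):
    """Return the text of every <li> that carries the selected_question class."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS class selector matches the checked items directly,
    # so the ::before pseudo-element never has to be read.
    return [li.get_text(strip=True) for li in soup.select("li.selected_question")]


print(selected_answers(SAMPLE_HTML))
```

For the real page, the same selector could be applied to the soup built from response.content with the lxml parser, as in the answer's code.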

【Discussion】: