【Question title】: Scrape webpage containing ::before
【Posted】: 2017-11-29 20:19:29
【Question】:

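My problem is that when I scrape HTML with bs4, I cannot get content that is rendered through something like ::before.

I want to find out which Sustainable Development Goals a company says it contributes to on this page: https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091 However, the check marks are not visible in the page source.

What should I do, or what can I use, to scrape this from the site?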

【Question discussion】:

    Tags: python css web-scraping beautifulsoup


    【Solution 1】:

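    You don't need the ::before part here at all. Selected and unselected elements have different classes: selected ones have selected_question, and unselected ones have advanced_question.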
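As a small offline illustration of that class check, here is a minimal sketch. The markup in SAMPLE_HTML and the helper name is_selected are made up for the example; only the two class names come from the answer, and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names selected_question and
# advanced_question are taken from the answer; everything else is assumed.
SAMPLE_HTML = """
<ul>
  <li class="advanced_question">SDG 1</li>
  <li class="selected_question">SDG 3</li>
</ul>
"""


def is_selected(tag):
    """Return True when the tag's class list contains selected_question."""
    # Tag.get("class") returns a list of class names, or the default if absent.
    return "selected_question" in tag.get("class", [])


# html.parser is enough for this small, well-formed sample.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
states = {li.get_text(strip=True): is_selected(li) for li in soup.find_all("li")}
print(states)
```

The check mark itself is drawn by CSS, so it never appears in the HTML; the class on the element is the part a scraper can read.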

    You can parse it with something like this:

    from bs4 import BeautifulSoup
    import requests


    url = "https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091"
    response = requests.get(url)

    # lxml is used on purpose; see the note about html.parser below
    soup = BeautifulSoup(response.content, "lxml")

    # each question header is an <li class="question_group"> inside the questionnaire list
    questions = soup.select("ul.questionnaire > li.question_group")
    for question in questions:
        question_text = question.get_text(strip=True)
        print(question_text)

        # the answers are the <li> siblings that follow the question header
        answers = question.find_next_siblings("li")
        for answer in answers:
            answer_text = answer.get_text(strip=True)
            # a checked answer carries the selected_question class
            is_selected = "selected_question" in answer.get("class", [])

            print(answer_text, is_selected)
        print("-----")


    This prints:

    Which of the following Sustainable Development Goals (SDGs) do the activities described in your COP address? [Select all that apply]
    SDG 1: End poverty in all its forms everywhere False
    SDG 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture False
    SDG 3: Ensure healthy lives and promote well-being for all at all ages True
    SDG 4: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all False
    ...
    

    Note the True printed for the selected answer.

    I also noticed that this code does not work correctly if html.parser is chosen as the parser.
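One thing to watch with the loop above: find_next_siblings("li") returns every following li sibling, not only those up to the next question header. If a page placed several question groups in one flat list (an assumption about the markup, not something verified against the live page), answers from later groups would be printed again under earlier questions. A minimal sketch of stopping at the next header follows; SAMPLE_HTML and answers_by_group are hypothetical names, and html.parser is used here only because the sample is well-formed and needs no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical flat list with two question groups; the class names follow
# the answer above, the rest of the structure is assumed.
SAMPLE_HTML = """
<ul class="questionnaire">
  <li class="question_group">Question A</li>
  <li class="advanced_question">A1</li>
  <li class="selected_question">A2</li>
  <li class="question_group">Question B</li>
  <li class="selected_question">B1</li>
</ul>
"""


def answers_by_group(soup):
    """Map each question text to a list of (answer text, is_selected) pairs."""
    groups = {}
    for header in soup.select("ul.questionnaire > li.question_group"):
        answers = []
        for sibling in header.find_next_siblings("li"):
            classes = sibling.get("class", [])
            # Stop when the next question header starts.
            if "question_group" in classes:
                break
            answers.append(
                (sibling.get_text(strip=True), "selected_question" in classes)
            )
        groups[header.get_text(strip=True)] = answers
    return groups


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = answers_by_group(soup)
for question_text, answers in result.items():
    print(question_text, answers)
```

For the live page the answer reports that lxml is the parser that works, so the parser argument would need to change accordingly.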

