【Question title】: Scrape webpage containing ::before
【Posted】: 2017-11-29 20:19:29
【Question description】:
【Discussion】:
Tags:
python
css
web-scraping
beautifulsoup
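【Solution 1】:
You don't need the ::before part here at all. Selected and unselected elements have different classes: selected ones have selected_question, and unselected ones have advanced_question.
You can parse the page with something like this: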
from bs4 import BeautifulSoup
import requests

url = "https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

# Each question is an <li class="question_group"> directly under the questionnaire list
questions = soup.select("ul.questionnaire > li.question_group")
for question in questions:
    question_text = question.get_text(strip=True)
    print(question_text)

    # The answers are the <li> elements that follow the question
    answers = question.find_next_siblings("li")
    for answer in answers:
        answer_text = answer.get_text(strip=True)
        # A selected answer carries the selected_question class
        is_selected = "selected_question" in answer.get("class", [])
        print(answer_text, is_selected)
    print("-----")
This prints:
Which of the following Sustainable Development Goals (SDGs) do the activities described in your COP address? [Select all that apply]
SDG 1: End poverty in all its forms everywhere False
SDG 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture False
SDG 3: Ensure healthy lives and promote well-being for all at all ages True
SDG 4: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all False
...
Note the True printed for the selected answer.
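The reason the class check works is that ::before content is generated by CSS and never appears in the HTML that requests downloads, so the class is the only signal available to the scraper. Here is a minimal sketch of the same idea on an inline snippet. The markup and the selected_answers helper are hypothetical, simplified stand-ins for illustration, not the real page; because the snippet is small and well formed, the built-in html.parser is enough.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's structure may differ.
HTML = """
<ul class="questionnaire">
  <li class="question_group">Which goals apply?</li>
  <li class="advanced_question">SDG 1</li>
  <li class="selected_question">SDG 3</li>
  <li class="advanced_question">SDG 4</li>
</ul>
"""


def selected_answers(html):
    """Return the text of every <li> that carries the selected_question class."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("li.selected_question")]


print(selected_answers(HTML))  # ['SDG 3']
```

Selecting on li.selected_question directly is an alternative to testing each answer's class list in a loop; which one fits better depends on whether you also need the unselected answers.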
I also noticed that this code does not work properly if you choose html.parser as the parser.
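If you suspect the parser is the cause, one way to check is to count how many elements the selector matches under each parser you have installed. The sketch below is only a diagnostic aid: count_matches and the sample markup are hypothetical names introduced here, and the result on the real page will depend on its actual HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches(html, selector, parsers=("lxml", "html.parser", "html5lib")):
    """Map each installed parser name to the number of elements the selector matches."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        counts[parser] = len(soup.select(selector))
    return counts


# Hypothetical sample markup; pass response.content from the real page instead.
sample = '<ul class="questionnaire"><li class="question_group">Q</li><li>A</li></ul>'
print(count_matches(sample, "ul.questionnaire > li.question_group"))
```

If the counts differ between parsers on the real page, the parsers are building different trees from the same markup, and the selector should be written against the tree produced by the parser you actually use.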