【Question title】: Revealing missing tags on a website using BeautifulSoup in Python
【Posted】: 2020-11-26 12:46:50
【Question】:

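I'm working on a project in which I'm trying to extract all the URLs from the front page of the CNN/Politics site. I went through the HTML source and found that the article links sit inside "li" tags.

I get everything under those tags by doing the following: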

import requests
from bs4 import BeautifulSoup

url = 'https://edition.cnn.com/politics'

r1 = requests.get(url)
coverpage = r1.content

soup = BeautifulSoup(coverpage, 'lxml')

links = soup.find_all('li')

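This gives me a list of objects similar to this one: "Site Map"

I didn't specify a class because the class changes from URL to URL.

However, when I run this code I don't get all of the 'li' objects. When I inspect the page, there are many more "li" objects with class names like "cd blabla", but BeautifulSoup doesn't seem to recognize them. I don't know whether they are somehow embedded in another tag, or why they aren't being extracted.

I want to extract the links to the articles that can be reached from the politics front page. How can I solve this? Is there a simpler way to find the links to other articles on the page?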

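【Comments】:

  • Are you sure all the <li> elements are available through a plain request, or might they be loaded via JS?
  • I don't know, they might be. I'm pretty new to using the library. If the tags are loaded through JS, is there a way to access them, or to check first whether that is the case?
  • Check my answer, I think I made it clear.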
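
One way to check whether the missing elements are loaded by JS is to look at the raw HTML that a plain request returns and list which <li> classes it actually contains. The following is a minimal sketch, not a definitive test: static_li_classes and sample_html are hypothetical names, and the sample string only stands in for the text of a real response (for example requests.get(url).text). It uses only the standard library.

```python
from html.parser import HTMLParser


class LiClassCollector(HTMLParser):
    """Collect the class attribute of every <li> start tag in a document."""

    def __init__(self):
        super().__init__()
        self.li_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # attrs is a list of (name, value) pairs; the class may be absent
            self.li_classes.append(dict(attrs).get("class") or "")


def static_li_classes(html):
    """Return the class strings of all <li> tags present in the raw HTML."""
    collector = LiClassCollector()
    collector.feed(html)
    return collector.li_classes


# Hypothetical stand-in for the HTML returned by a plain request
sample_html = """
<ul>
  <li class="nav-item">Site Map</li>
  <li>About</li>
</ul>
<script>window.render = function () { /* builds the article list later */ };</script>
"""

classes = static_li_classes(sample_html)
print(classes)
# True only if some <li> whose class starts with "cd " is in the static HTML
print(any(c.startswith("cd ") for c in classes))
```

If the classes seen in the browser's inspector never show up in this list, the elements are most likely added after the page loads, and a plain request will not contain them.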

Tags: python web-scraping beautifulsoup python-requests


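【Solution 1】:

This is a nice website. When you dig into how the site loads its data and look at the page source, all of the data is stored in a script tag in the form of a JavaScript object. It is not JSON. If you extract the data from that script, you get all the article links, images, and so on.

Because it is a JavaScript object, you need a third-party library to convert it to JSON. I used the demjson library for this: https://github.com/dmeranda/demjson

The following script saves the data to a JSON file. Once you have the JSON, getting all the links should not be hard.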

import json

import demjson
import requests
from bs4 import BeautifulSoup

res = requests.get("https://edition.cnn.com/politics")

soup = BeautifulSoup(res.text, "html.parser")

# Find the <script> tag that holds the page data (the last match wins)
script = None
for i in soup.find_all("script"):
    if "window.CNN" in i.text:
        script = i.get_text(strip=True)

if script is None:
    print("No data found")
else:
    # Keep only the text between "CNN.contentModel" and "FAVE.settings"
    data = script.partition("CNN.contentModel")[-1].partition("FAVE.settings")[0]
    # Decode from the first "{" onward, dropping the final character of the slice
    json_data = demjson.decode(data[data.index('{'):-1])

    with open("data.json", "w") as f:
        json.dump(json_data, f)

Output:

{
    "hasVideo": false,
    "layout": "no-rail",
    "vertical": "politics",
    "sectionName": "politics",
    "pageType": "section",
    "env": "prod",
    "type": "page",
    "analytics": {
        "pageTop": {},
        "headline": "",
        "author": "",
        "showName": "",
        "subSectionName": "",
        "isArticleVideoCollection": false,
        "publishDate": "2014-02-27T01:35:32Z",
        "lastUpdatedDate": "2020-08-06T09:31:15Z",
        "pageBranding": "10-minute-preview",
        "cep_topics": {
            "brsf": [],
            "buzz": [],
            "iabt": [],
            "sent": [
                "16B6"
            ],
            "tags": [],
            "shortSource": "se_politics",
            "source": "section_politics"
        },
        "chartbeat": {
            "sections": ""
        },
        "branding_content_page": "10-minute-preview",
        "branding_content_zone": [
            "default"
        ],
        "branding_content_container": [
            "default"
        ],
        "branding_content_card": [
            ""
        ]
    },
    "edition": "international",
    "sourceId": "section_politics",
    "title": "CNNPolitics - Political News, Analysis and Opinion",
    "siblings": {
        "articleList": [
            {
                "uri": "/2020/08/06/politics/donald-trump-mail-in-voting-election/index.html",
                "headline": "Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
                "duration": "",
                "description": "<a href=\"http://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump's</a> barrage of <a href=\"http://www.cnn.com/2020/08/05/politics/fact-check-trump-fox-friends-pandemic-biden-protests/index.html\" target=\"_blank\">challenges to the reputation, structures and traditions</a> of elections is conjuring up a contentious and potentially constitutionally critical three-month period for America's democracy.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/donald-trump-press-briefing-beirut-coronavirus-voting-fact-check/index.html",
                "headline": "Fact check: At briefing, Trump continues to mislead on coronavirus, mail-in voting and Beirut",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
                "duration": "",
                "description": "President Donald Trump ended his Wednesday much like he began it, by repeating falsehood after falsehood.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/state-department-russian-disinformation-report/index.html",
                "headline": "US accuses Russia of conducting sophisticated disinformation and propaganda campaign  ",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/170626163907-russia-dnc-hacking-ron-2-00000808-small-11.jpg",
                "duration": "",
                "description": "A <a href=\"https://content.govdelivery.com/attachments/USSTATEBPA/2020/08/05/file_attachments/1512230/Pillars%20of%20Russias%20Disinformation%20and%20Propaganda%20Ecosystem_08-04-20%20%281%29.pdf\" target=\"_blank\">new report</a> from the US State Department accuses Russia of conducting a sophisticated disinformation and propaganda campaign that uses a variety of approaches including Kremlin-aligned news sites to promote their agenda.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/fact-check-trump-ad-biden-basement-delaware-photos-iowa/index.html",
                "headline": "<strong>Fact check: </strong>Trump ad edits out microphone and trees from Biden photo to make him seem alone in basement",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200803235935-01-joe-biden-campaign-0720-small-11.jpg",
                "duration": "",
                "description": "A new <a href=\"https://www.youtube.com/watch?v=9PUfxZQa7WQ&feature=emb_title\" target=\"_blank\">ad</a> from President Donald Trump's campaign deceptively alters a photo of former Vice President Joe Biden campaigning outdoors in Iowa to make it seem as if Biden is \"hiding\" in his Delaware basement.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/mark-meadows-unemployment-benefits-extension-coronavirus-relief-cnntv/index.html",
                "headline": "White House chief of staff floats executive action on unemployment and evictions if Congress can't strike deal",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/191219132522-03-mark-meadows-lead-image-small-11.jpg",
                "duration": "",
                "description": "White House chief of staff Mark Meadows said Wednesday that <a href=\"https://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump</a> is prepared to take executive action on eviction protection and extending enhanced unemployment benefits if Congress isn't close to <a href=\"https://www.cnn.com/2020/08/05/politics/congress-stimulus-negotiations/index.html\" target=\"_blank\">a coronavirus recovery package</a> by Friday. ",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
                "headline": "Trump campaign calls for a fourth presidential debate, citing early voting",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200709094609-trump-biden-split-small-1-1.jpg",
                "duration": "",
                "description": "<a href=\"https://www.cnn.com/election/2020/candidate/trump\" target=\"_blank\">Donald Trump's</a> presidential campaign called for an additional presidential debate in a letter to the Commission on Presidential Debates on Wednesday. ",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/schlapp-mail-voting-expansion-nevada-fact-check/index.html",
                "headline": "<strong>Fact Check: </strong>With vote by mail expansion, can Nevada voters cast ballots after Election Day?",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200610082429-voting-north-las-vegas-small-11.jpg",
                "duration": "",
                "description": "President Donald Trump reversed his stance on voting by mail Tuesday when he <a href=\"https://www.cnn.com/2020/08/04/politics/donald-trump-mail-in-voting-florida/index.html\" target=\"_blank\">tweeted</a> that doing so in Florida is \"safe and secure.\" When asked about the reversal later Tuesday afternoon, Trump seemed to imply that Republican-run states with existing mail-in voting programs were up to par, but Democratic states establishing or expanding mail-in voting during the pandemic were not.",
                "layout": ""
            },

...
...
...
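
As a rough sketch of the "getting all the links" step, the article URLs can be read from siblings.articleList and resolved against the site root. This assumes the structure shown in the output above; the output is truncated, so other keys may hold further links. article_links, BASE_URL and sample are hypothetical names used only for illustration.

```python
import json
from urllib.parse import urljoin

# Assumed site root used to resolve the relative "uri" values
BASE_URL = "https://edition.cnn.com"


def article_links(content_model, base_url=BASE_URL):
    """Return absolute article URLs from siblings.articleList, in order."""
    articles = content_model.get("siblings", {}).get("articleList", [])
    return [urljoin(base_url, a["uri"]) for a in articles if a.get("uri")]


# Trimmed sample with the same shape as the output above
sample = json.loads("""
{
  "siblings": {
    "articleList": [
      {"uri": "/2020/08/06/politics/donald-trump-mail-in-voting-election/index.html",
       "headline": "Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election"},
      {"uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
       "headline": "Trump campaign calls for a fourth presidential debate, citing early voting"}
    ]
  }
}
""")

for link in article_links(sample):
    print(link)
```

With the file written by the script above, the same function could be called on json.load(f) after opening data.json.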

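【Comments】:

  • Really nice, thanks a lot! So far this looks like the simplest way to extract all the data I need!

【Solution 2】:

For pages whose elements are loaded by JS, try Selenium; it will probably work in most cases. You will need to read the documentation at https://selenium-python.readthedocs.io/index.html, for example on installation and drivers.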

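from selenium import webdriver
from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"
# Selenium 3 style call; recent Selenium 4 releases take the driver path
# through a Service object instead
driver = webdriver.Chrome(PATH)

url = "https://edition.cnn.com/politics"
driver.get(url)
req = driver.page_source  # HTML as rendered by the browser
driver.close()
soup = BeautifulSoup(req, "html.parser")

result = soup.find_all(class_="cd__headline-text")

for i in result:
    print(i.text)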

Output:

Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election
Fact check: At briefing, Trump continues to mislead on coronavirus, mail-in voting and Beirut
US accuses Russia of conducting sophisticated disinformation and propaganda campaign  
Fact check: Trump ad edits out microphone and trees from Biden photo to make him seem alone in basement
White House chief of staff floats executive action on unemployment and evictions if Congress can't strike deal
Trump campaign calls for a fourth presidential debate, citing early voting
Fact Check: With vote by mail expansion, can Nevada voters cast ballots after Election Day?
Trump bests Biden in July fundraising but money gap between the campaigns has essentially closed
New York Times: Prosecutors subpoenaed Trump's bank in criminal inquiry 
Analysis: But, seriously -- what is this country going to do with its kids this fall?
Analysis: This week's 'smooth' primaries almost felt normal. Here's why.
Brianna Keilar debunks Trump campaign official: You've got to shovel B.S.
Illinois Republican congressman tests positive for coronavirus
Former Army Delta Force officer, US ambassador sign secretive contract to develop Syrian oil fields
Supreme Court lifts lower court order that would have required more Covid-related safety measures in California jail
Ex-acting AG Sally Yates defends FBI investigation into Flynn, calls Barr move to drop charges 'highly irregular'
Esper says 'most believe' Beirut explosion 'was an accident' after Trump claimed it was an attack
Fact check: Trump makes at least 20 false claims in Fox & Friends interview
Trump trashes Obama's Lewis eulogy that pressed for voting rights
Trump still not grasping the severity of the pandemic, source tells CNN 
Republican senators grow anxious over direction of stimulus talks with no deal in sight
Joe Biden will no longer travel to Milwaukee to accept Democratic nomination
Analysis: Trump's interview debacle sends a warning for the fall campaign  
Fauci says US has suffered from pandemic 'as much or worse than anyone' 
Primary results: Key takeaways from Kansas
CNN holds elected officials and candidates accountable. View our Facts First database
Seven governors join deal in pursuit of first multistate coordinated testing strategy
Hogan overrules Maryland county order delaying in-person education at private schools, including Barron Trump's 
Birx defends herself as Pelosi accuses Trump administration of spreading disinformation on Covid-19
See latest Trump and Biden head-to-head polling
Top Senate Republican pushes back against Trump's unsubstantiated claims mail-in-voting leads to mass fraud
Republican operatives are helping Kanye West get on general election ballots
Progressive who unseated longtime Democratic congressman says 'people are looking for a fighter right now'
Trump said he may deliver convention speech from White House
Biden clarifies he has not taken cognitive test
Fact check: Biden says he hasn't taken a cognitive test. Is he flip-flopping?
WNBA players wear shirts supporting Sen. Kelly Loeffler's challenger -- including some from team she co-owns
Trump campaign sues Nevada over plan to mail ballots to all registered voters
Analysis: Trump may finally realize he's suppressing his own vote
Trump continues to lose ground in 2020 election as nation grapples with coronavirus 

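【Comments】:

  • Thanks! I'll try to see whether I can get the URL associated with each headline this way! I want to go through each URL and extract the text from the article. Thanks again!
  • You're welcome. Try learning Selenium, since it helps with scraping websites that rely on JS.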
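
For the URL behind each headline asked about in the comment above, one possible approach is to walk up from each headline element to its enclosing link. This is a minimal sketch that assumes each "cd__headline-text" element sits inside an <a href="..."> tag; that markup is an assumption and should be checked against the rendered page. headline_links, BASE_URL and sample_html are hypothetical names.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed site root used to resolve relative hrefs
BASE_URL = "https://edition.cnn.com"


def headline_links(html, base_url=BASE_URL):
    """Return (headline, absolute URL) pairs for headlines wrapped in a link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for node in soup.find_all(class_="cd__headline-text"):
        # Nearest ancestor <a> that carries an href, if there is one
        link = node.find_parent("a", href=True)
        if link is not None:
            pairs.append((node.get_text(strip=True), urljoin(base_url, link["href"])))
    return pairs


# Hypothetical stand-in for driver.page_source
sample_html = """
<h3 class="cd__headline">
  <a href="/2020/08/05/politics/trump-campaign-four-debates/index.html">
    <span class="cd__headline-text">Trump campaign calls for a fourth presidential debate, citing early voting</span>
  </a>
</h3>
<h3 class="cd__headline"><span class="cd__headline-text">Headline without a link</span></h3>
"""

for title, url in headline_links(sample_html):
    print(title, url)
```

Headlines that are not wrapped in a link are skipped rather than guessed at, so the result may be shorter than the list of headlines.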

【Solution 3】:

Your code works; I tried it. But check that you aren't missing any requirements, such as having lxml installed. Here is what I did:

from bs4 import BeautifulSoup
import requests

url = 'https://edition.cnn.com/politics'

r1 = requests.get(url)
soup = BeautifulSoup(r1.content, 'lxml')
li = soup.find_all('li')
print(li)

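Note that the find_all method returns a list-like result, so if you want the items one by one you can simply loop over it and print each single li, as shown below: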

for i in li:
    print(i.prettify())

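【Comments】:

  • Hey, thanks for testing. It runs fine for me too. The problem is that it runs, but some of the li tags described in the original post are missing. They may be embedded, or, as Cobalt mentioned earlier, they may be loaded by JS. I don't know how to access these "hidden" tags, which is where the article links seem to be.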