【Question Title】: How to extract only the paragraph part from a link, excluding other links, from a web page?
【Posted】: 2019-06-20 07:02:08
【Question】:

I am trying to extract sentences from a web page, but I cannot exclude the other links and sidebar icons that appear on that page.

I tried finding all occurrences of "p" (meaning paragraphs) on the page, but I also get other unwanted results.

My code:

    import re
    from nltk import word_tokenize, sent_tokenize, ngrams
    from collections import Counter
    from urllib import request
    from bs4 import BeautifulSoup

    url = "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/"
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, "lxml")

    partags = raw.find_all('p')  # to extract only paragraphs
    print(partags)

I get the following output (posted as an image because a copy-paste does not look as tidy):

[Screenshot of the output: https://i.stack.imgur.com/rGC1P.png]

But I want to extract only sentences like the following from the link. Is there any additional filter I can apply?

[Screenshot of the desired output: https://i.stack.imgur.com/MlPUV.png]

Code after Valery's feedback:

    partags = raw.get_text()
    print(partags)

The output I get (it still contains links and JSON-formatted metadata):

This is just a sample from the full output:

James Harden extends 30-point streak, makes key defensive stop
{
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "headline": "James Harden extends 30-point streak, makes key defensive stop to help Rockets edge Raptors",
    "description": "James Harden scored 35 points for his 22nd consecutive game with at least 30, and forced Kawhi Leonard into a missed 3 at buzzer for 121-119 win.",
    "url": "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/?utm_source=google&utm_medium=amp&utm_campaign=speakable",
    "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/"
    },
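The JSON-LD metadata above ends up in the output because `get_text()` on the whole soup also returns the text content of `<script>` tags. A minimal sketch (with inline HTML standing in for the downloaded page) of removing script and style elements before extracting text:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded article page; the JSON-LD
# block mimics the metadata that leaked into the question's output.
html = """
<html><body>
  <script type="application/ld+json">{"@type": "NewsArticle"}</script>
  <p>James Harden scored 35 points.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop <script> and <style> elements so their contents do not
# appear in the extracted text.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)
```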

【Comments】:

  • It is also common to import bs4 as `from bs4 import BeautifulSoup as bs`
  • @AmirhosImani Doesn't that give me the same thing, or am I missing something here?
  • It is the same... it is just more common to import it as bs, similar to importing pandas as pd

Tags: python python-3.x jupyter-notebook


【Solution 1】:

See the bs4 documentation on this: BeautifulSoup/bs4/doc/#get-text

    import requests
    from bs4 import BeautifulSoup as bs

    response = requests.get("https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/")
    html = response.text
    raw = bs(html, "html.parser")

    for partag in raw.find_all('p'):
        print(partag.get_text())

Here is the Link to results

So calling get_text() on the partags (paragraph tags) yields valid text without the noise.
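If stray link-only paragraphs (navigation, "read more" teasers) still slip through, one further filter is to skip any `<p>` whose entire text is a single anchor. A sketch on a self-contained snippet; the class names in the inline HTML are hypothetical and the real page structure should be inspected first:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the article page: one navigation
# paragraph that is only a link, plus two real article paragraphs.
html = """
<html><body>
  <div class="nav"><p><a href="/more">More stories</a></p></div>
  <div class="article-body">
    <p>Harden forced Leonard into a missed 3 at the buzzer.</p>
    <p>The Rockets won 121-119.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = []
for p in soup.find_all("p"):
    text = p.get_text(strip=True)
    link = p.find("a")
    # Skip paragraphs whose whole content is one anchor tag.
    if link is not None and link.get_text(strip=True) == text:
        continue
    paragraphs.append(text)

print(paragraphs)
```

An alternative with the same effect is to scope the search to the article container, e.g. `soup.select(".article-body p")`, when that container's class is known.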

【Comments】:

  • Thanks @Valery, I was looking at this link, but it does not seem to handle all cases; I am still getting links. `partags = raw.get_text()` `print(partags)`
  • That is interesting. Could you provide a pasteable HTML sample (not an image) to test with?
  • Thanks, I added the output I got from get_text() to the question above.
  • Added a full example of how to extract text from the p tags with get_text()
  • Thanks for your help and the solution, I got the expected output.