【Question title】: Delete unwanted elements of Python web scraping loop results
【Posted】: 2021-04-26 04:39:44
【Question】:

I am currently trying to extract the text and the tags (topics) from a web page with the following code:

import requests
from bs4 import BeautifulSoup

Texts = []
Topics = []

url = 'https://www.unep.org/news-and-stories/story/yes-climate-change-driving-wildfires'

response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
if response.ok:
    soup = BeautifulSoup(response.text,'lxml')
    txt = soup.findAll('div', {'class': 'para_content_text'})
    for div in txt:
        p = div.findAll('p')
        Texts.append(p)
    print(Texts)


    top = soup.find('div', {'class': 'article_tags_topics'})
    a = top.findAll('a')
    Topics.append(a)
    print(Topics)

The code runs without errors, but here is an excerpt of the output it produces:

    </p>, <p><strong>UNEP:</strong> And this is bad news?</p>, <p><strong>NH:</strong> This is bad news. This is bad for our health, for our wallet and for the fabric of society.</p>, <p><strong>UNEP:</strong> The world is heading towards a global average temperature that’s 3<strong>°</strong>C to 4<strong>°</strong>C higher than  it was before the industrial revolution. For many people, that might not seem like a lot. What do you say to them?</p>, <p><strong>NH:</strong> Just think about your own body. When your temperature goes up from 36.7°C (98°F) to 37.7°C (100°F), you’ll probably consider taking the day off. If it goes 1.5°C above normal, you’re staying home for sure. If you add 3°C, people who are older and have preexisting conditions –  they may die. The tolerances are just as tight for the planet.</p>]]

[[<a href="/explore-topics/forests">Forests</a>, <a href="/explore-topics/climate-change">Climate change</a>]]

To get "clean" text results, I tried adding the following line inside the loop so that only the text is kept:

p = p.text

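But I got:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I also noticed that the topics results include unwanted URLs. I would like to get only the topic names (Forests and Climate change, as in the output above), without commas between them.

Any idea what I could add to the code to get clean text and topics?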

【Comments】:

    Tags: python web-scraping beautifulsoup data-cleaning


    【Solution 1】:

    This happens because p is a ResultSet object (a list of tags), not a single tag. You can check this by running:

    print(type(Texts[0]))
    

    Output:

    <class 'bs4.element.ResultSet'>
    

    To get the actual text, iterate over every item in each ResultSet:

    for result in Texts:
        for item in result:
            print(item.text)
    

    Output:

    As wildfires sweep across the western United States, taking lives, destroying homes and blanketing the country in smoke, Niklas Hagelberg has a sobering message: this could be America’s new normal.
    ......
    

    Or use a list comprehension:

    full_text = '\n'.join([item.text for result in Texts for item in result])
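
The same idea also covers the topics part of the question. The following is a minimal sketch, not a definitive implementation: it parses a small made-up HTML snippet instead of the live page, so everything except the two class names (`para_content_text`, `article_tags_topics`) is an assumption, and it uses the built-in `html.parser` rather than `lxml` to avoid an extra dependency. It calls `get_text()` on each individual tag, which yields plain strings without the surrounding markup or the `href` values.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; only the two class
# names are taken from the question.
html = """
<div class="para_content_text">
  <p><strong>UNEP:</strong> And this is bad news?</p>
  <p><strong>NH:</strong> This is bad news.</p>
</div>
<div class="article_tags_topics">
  <a href="/explore-topics/forests">Forests</a>
  <a href="/explore-topics/climate-change">Climate change</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One plain string per <p>, taken from each Tag rather than the ResultSet.
texts = [
    p.get_text().strip()
    for div in soup.find_all("div", class_="para_content_text")
    for p in div.find_all("p")
]

# find() returns None when the div is missing, so guard before using it.
topics_div = soup.find("div", class_="article_tags_topics")
topics = (
    [a.get_text(strip=True) for a in topics_div.find_all("a")]
    if topics_div is not None
    else []
)

print("\n".join(texts))
print("\n".join(topics))
```

Because `texts` and `topics` are ordinary lists of strings, they can be joined with whatever separator is wanted, for example a newline or a space instead of a comma.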
    

    【Comments】:

      【Solution 2】:

      The AttributeError means you have a list of elements rather than a single element, because you used p = div.findAll('p').

      Try:

      p[0].text
      

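      Or change p = div.findAll('p') to p = div.find('p'), which returns only the first match. Note that both options give you the text of just the first paragraph in each div.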
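
To make the difference concrete, here is a small sketch on a made-up two-paragraph snippet (the HTML and variable names are illustrative assumptions, not part of the original page). It shows that `find_all()` returns a list-like ResultSet, that `find()` returns a single Tag or None, and that a loop or comprehension is needed when every paragraph is wanted.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only for illustration.
html = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
div = BeautifulSoup(html, "html.parser").div

all_p = div.find_all("p")  # ResultSet: behaves like a list of Tags
first_p = div.find("p")    # a single Tag, or None if nothing matches

first_text = all_p[0].text                               # first paragraph only
same_text = first_p.text if first_p is not None else ""  # same result via find()
every_text = [p.text for p in all_p]                     # text of all paragraphs

print(first_text)
print(every_text)
```

In newer Beautiful Soup code `find_all()` is the preferred spelling; `findAll()` is the older name for the same method.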

      【Comments】:
