【Question title】: Get neat text from a BeautifulSoup page
【Posted】: 2021-09-14 02:32:19
【Question】:

I'm building a web scraper that pulls job descriptions from URLs. This is my code so far:

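import requests
from bs4 import BeautifulSoup

def getJobDesc(url):
    try:
        req = requests.get(url)
        page = BeautifulSoup(req.text, 'html.parser')
        jd = page.find("div", {"data-automation": "jobDescription"})
        return jd
    except requests.RequestException:
        # Network or HTTP-level failure: fall back to an empty result
        return ""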

It does what it should, and the jd returned for a test URL looks like this:

<div class="vDEj0_0" data-automation="jobDescription"><span class="FYwKg _2Bz3E C6ZIU_0 _6ufcS_0 _2DNlq_0 _29m7__0"><div class="FYwKg"><p><strong>Job Responsibilities:</strong></p><ul><li><span style="color:black">Provide innovative solutions to complex business problems</span></li><li><span style="color:black">Plan, develop and implement large-scale projects from conception to completion</span></li><li><span style="color:black">Develop and architect lifecycle of projects working on different technologies and platforms</span></li><li><span style="color:black">Design, develop and implement new integration</span></li></ul><p><strong>Job Requirements:</strong></p><ul><li><span style="color:black">Proficient in Java and preferably in Python as well</span></li><li><span style="color:black">Basic understanding of database i.e MongoDB, MySQL databases is a plus</span></li><li><span style="color:black">Good understanding of </span><strong>Object-oriented</strong><span style="color:black"> programming</span></li><li><span style="color:black">Basic understanding in version control systems e.g. Git</span></li><li><span style="color:black">Basic understanding in Linux operating system</span></li><li><span style="color:black">Basic understanding of cloud services – Azure, AWS, etc</span></li><li><span style="color:black">Basic understanding of Devops</span></li><li><span style="color:black">A degree in Computer Science or equivalent industry experience</span></li><li><span style="color:black">Passionate with building elegant, scalable software that solves practical problems</span></li><li><span style="color:black">Team player and can do attitude</span></li><li><span style="color:black">Good problem solving skills and attention to detail</span></li></ul></div></span></div>

But when I change it to return jd.text, the result is:

'Job Responsibilities:Provide innovative solutions to complex business problemsPlan, develop and implement large-scale projects from conception to completionDevelop and architect lifecycle of projects working on different technologies and platformsDesign, develop and implement new integrationJob Requirements:Proficient in Java and preferably in Python as wellBasic understanding of database i.e MongoDB, MySQL databases is a plusGood understanding of\xa0Object-oriented\xa0programmingBasic understanding in version control systems e.g. GitBasic understanding in Linux operating systemBasic understanding of cloud services – Azure, AWS, etcBasic understanding of DevopsA degree in Computer Science or equivalent industry experiencePassionate with building elegant, scalable software that solves practical problemsTeam player and can do attitudeGood problem solving skills and attention to detail'

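So I have two problems here:

  1. The lists are not converted properly: the items run together with no separator.
  2. Formatted text is not parsed properly (in this example, the word Object-oriented, which comes out surrounded by \xa0).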

【Discussion】:

    Tags: python html xml beautifulsoup


    【Solution 1】:

    You can use the get_text() method and pass a space as the separator= argument to "un-nest" the text.

    So, instead of:

    return jd.text
    

    use:

    return jd.get_text(separator=" ")
    

    You can also use:

    jd.get_text(separator="\n")
    

    to put each piece of text on its own line.

    (Note: I couldn't reproduce your second problem, but see whether this solves it.)
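To make the difference between the separators concrete, here is a minimal sketch on a shortened version of the markup from the question. The trimmed HTML and the variable names (plain, spaced, lines, items) are made up for illustration. One thing it shows is that the separator is inserted between every text node, so inline tags such as <strong> also cause a break; collecting the text of each <li> separately is one possible way to get one line per list item.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample modelled on the markup in the question.
html = (
    '<div data-automation="jobDescription">'
    "<p><strong>Job Requirements:</strong></p>"
    "<ul>"
    "<li><span>Proficient in Java</span></li>"
    "<li><span>Good understanding of </span><strong>Object-oriented</strong>"
    "<span> programming</span></li>"
    "</ul></div>"
)

jd = BeautifulSoup(html, "html.parser").find(
    "div", {"data-automation": "jobDescription"}
)

# .text joins all text nodes with an empty string, so items run together.
plain = jd.text

# A space separator keeps neighbouring text nodes apart; strip=True also
# trims each node, which avoids doubled spaces around inline tags.
spaced = jd.get_text(separator=" ", strip=True)

# A newline separator splits at every text node, including inline tags.
lines = jd.get_text(separator="\n", strip=True)

# One entry per list item: join the text nodes inside each <li> with a space.
items = [li.get_text(" ", strip=True) for li in jd.find_all("li")]

print(plain)
print(spaced)
print(lines)
print(items)
```

If one line per bullet is what you want in the final output, "\n".join(items) gives that without splitting the bold word onto its own line.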

    【Comments】:

    • Thanks, the get_text() method solved it. As for the second problem, I found a solution elsewhere, using unicodedata.normalize()
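The comment does not say how unicodedata.normalize() was called. A minimal sketch, assuming the NFKC form was used (NFKD would behave the same way for this character), shows how the \xa0 non-breaking spaces from the question's output become ordinary spaces. The variable names raw, clean, and replaced are made up for illustration.

```python
import unicodedata

# Fragment of the jd.text output from the question, with non-breaking spaces.
raw = "Good understanding of\xa0Object-oriented\xa0programming"

# NFKC applies compatibility mappings, which turn U+00A0 into a plain space.
# The choice of "NFKC" is an assumption; the comment does not name the form.
clean = unicodedata.normalize("NFKC", raw)

# A narrower alternative that touches only the non-breaking space.
replaced = raw.replace("\xa0", " ")

print(clean)
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width letters, and so on), so the plain replace() may be preferable when only \xa0 needs fixing.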