【Question title】: BeautifulSoup4 - Concatenating multiple HTML elements between two different tags
【Posted】: 2020-02-16 19:22:13
【Description】:

I am scraping a page using Python and bs4.

The HTML source I get from bs4 looks like this (tidied up slightly for readability):

<p style="text-align:justify;font-size:12.0px;font-family:Arial, Helvetica, sans-serif">
<span style="font-size:14.0px"><span style="font-family:Arial, Helvetica, sans-serif">

<strong>COMPANY DESCRIPTION</strong><br>
Here goes the first para of company description</span></span></p>

<p style="text-align:justify;font-size:12.0px;font-family:Arial, Helvetica, sans-serif">
<span style="font-size:14.0px"><span style="font-family:Arial, Helvetica, sans-serif">
Here goes the second para of company description</span></span></p>

<p><strong>PURPOSE AND OBJECTIVES</strong></p>
<p>To address requirements in the area of Supply Chain Management Extended Warehouse Management solutions, Build competencies at Solution Delivery Center to deliver solutions<br>

<strong>EXPECTATIONS AND TASKS&nbsp;</strong></p>
<ul>
    <li>Independently handle large implementation projects with focus on Warehouse Management processes such as inbound, outbound and internal processes. RF Device functions and Barcode support experience is desirable</li>
    <li>Able to lead EWM discussions, assessments and detail requirement studies with customers</li>
</ul>

<strong>KEY PERFORMANCE INDICATORS</strong></p>
<ul>
    <li>Customer Feedback/customer satisfaction scores</li>
    <li>Productive days/utilization as defined by the organization for projects/assessments/etc.</li>
    <li>Knowledge Management and creation of effective reusable components</li>
</ul>

<strong>EXPERIENCE REQUIREMENTS</strong></p>
<ul>
    <li>Minimum of 4+ years industry experience and a minimum of 5 to 6 years of SAP EWM experience</li>
    <li>Domain knowledge in Supply Chain Management in the areas of Planning, Manufacturing &amp; warehousing processes is a must</li>
</ul>

<p><strong>EDUCATION AND QUALIFICATIONS/SKILLS AND COMPETENCIES</strong></p>
<ul>
    <li>Degree in Engineering or IT</li>
    <li>SAP Certification in Extended Warehouse Management (EWM) desirable</li>
</ul>

<p><span style="font-family:Arial,Helvetica,sans-serif"><span style="font-size:14.0px"><strong>WHAT YOU GET FROM US </strong></span></span></p>

Observation:

In the markup above, every section heading sits inside a <strong> </strong> tag. The headings themselves may differ from page to page.
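To make the observation concrete, here is a minimal sketch that lists the section headings of a page by collecting the text of every <strong> tag. The SAMPLE_HTML string is a shortened, hypothetical stand-in for the markup above, and list_headings is a helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the markup in the question.
SAMPLE_HTML = """
<p><span><strong>COMPANY DESCRIPTION</strong><br>First para</span></p>
<p><strong>PURPOSE AND OBJECTIVES</strong></p>
<p>Some text<br><strong>EXPECTATIONS AND TASKS&nbsp;</strong></p>
<ul><li>Item</li></ul>
<p><span><strong>WHAT YOU GET FROM US </strong></span></p>
"""


def list_headings(html):
    """Return the text of every <strong> tag, stripped of surrounding whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("strong")]


headings = list_headings(SAMPLE_HTML)
print(headings)
```

Running this against a few different job pages is a quick way to see which headings stay constant and which vary.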

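My requirements:

  • Concatenate all the HTML text and tags starting from the second <strong> tag, the one after COMPANY DESCRIPTION (i.e. starting at PURPOSE AND OBJECTIVES), and ending just before the tag that contains WHAT YOU GET FROM US.
  • I am not looking for a solution that uses Selenium, because it would be slower.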

The page I am scraping: Link I am scraping

Here is my Python code:

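import requests
from bs4 import BeautifulSoup

def scrape_url(url, method='bs4'):
    # Fetch the page and parse it with Python's built-in html.parser
    session = requests.session()
    page = session.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

url = 'https://jobs.sap.com/job/Mumbai-Senior-Account-Executive-Job-MH/539212101/'
soup = scrape_url(url)
# Narrow the soup down to the <div class="job"> element
job_page = soup.body.find('div', attrs={'class': 'job'})
print(job_page)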

【Comments】:

    Tags: python-3.x beautifulsoup


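    【Solution 1】:

    First use a regular expression to locate the tag that contains the starting text, then use find_next_siblings() to get all of its following siblings, and stop as soon as a sibling contains the text WHAT YOU GET FROM US.

    Code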

    import re
    import requests
    from bs4 import BeautifulSoup

    def scrape_url(url, method='bs4'):
        # Fetch the page and parse it with Python's built-in html.parser
        session = requests.session()
        page = session.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        return soup

    url = 'https://jobs.sap.com/job/Kuala-Lumpur-Business-Processes-Consultant-%28FICO%29-Job-14/541909901/'
    soup = scrape_url(url)
    # Locate the <p> whose text matches the starting heading
    # (string= is the current name of the older text= argument)
    findtag = soup.find('p', string=re.compile("PURPOSE AND OBJECTIVES"))
    print(findtag.text)
    # Walk the following siblings and stop at the closing heading
    for item in findtag.find_next_siblings():
        if 'WHAT YOU GET FROM US' in item.text:
            break
        print(item.text.strip())
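
The question asks for the HTML (text and tags) to be concatenated, while the loop above prints plain text. A minimal sketch of the same sibling walk that keeps the markup might look like the following. SAMPLE_HTML is a made-up, shortened page so the snippet runs without network access, and html_between is a hypothetical helper name, not part of bs4.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened page used instead of a live request.
SAMPLE_HTML = """
<div class="job">
<p><strong>COMPANY DESCRIPTION</strong><br>About the company</p>
<p><strong>PURPOSE AND OBJECTIVES</strong></p>
<p>To address requirements</p>
<ul><li>Degree in Engineering or IT</li></ul>
<p><span><strong>WHAT YOU GET FROM US </strong></span></p>
<p>Benefits</p>
</div>
"""


def html_between(soup, start_text, end_text):
    """Join the markup from the <p> matching start_text up to (not including)
    the first following sibling whose text contains end_text."""
    start = soup.find("p", string=re.compile(re.escape(start_text)))
    if start is None:
        return ""  # starting heading not found on this page
    parts = [str(start)]
    for sibling in start.find_next_siblings():
        if end_text in sibling.get_text():
            break
        parts.append(str(sibling))  # str(tag) keeps the tags, unlike .text
    return "\n".join(parts)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = html_between(soup, "PURPOSE AND OBJECTIVES", "WHAT YOU GET FROM US")
print(result)
```

Like the answer's code, this assumes the starting heading is the only content of its <p> and that the sections are siblings of that <p>.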
    

    Output (on the console):

    PURPOSE AND OBJECTIVES
    
    To address requirements in the area of Supply Chain Management Extended Warehouse Management solutions, Build competencies at Solution Delivery Center to deliver solutions especially in areas relating to SAP EWM
    
    EXPECTATIONS AND TASKS
    
    Independently handle large implementation projects with focus on Warehouse Management processes such as inbound, outbound and internal processes. RF Device functions and Barcode support experience is desirable
    Able to lead EWM discussions, assessments and detail requirement studies with customers
    Leading the team that are assigned to, in functional capacity, adding value to the project and to the final deliverables
    Be actively involved in the preparation, conception, realization and Go Live of customer implementation projects
    Demonstrate the ability to plan, run, and manage blueprint workshops / meetings with internal and external clients
    Responsible for defining the scope of a project / opportunities, estimating efforts and project timelines
    Participating in RFP discussions and estimating under guidance from a Bid Manager
    Providing a creative source of ideas/solutions to address problems
    Delivering billable components that meets a customer’s needs
    KEY PERFORMANCE INDICATORS
    
    Customer Feedback/customer satisfaction scores
    Productive days/utilization as defined by the organization for projects/assessments/etc.
    Knowledge Management and creation of effective reusable components
    EXPERIENCE REQUIREMENTS
    
    Minimum of 4+ years industry experience and a minimum of 5 to 6 years of SAP EWM experience
    Domain knowledge in Supply Chain Management in the areas of Planning, Manufacturing & warehousing processes is a must
    Must have strong ERP implementation experience
    Experience in SAP Material Flow Systems (MFS) or any other third party automation tools will be desirable
    Experience in EWM technical knowledge will be an added advantage
    Knowledge on SAP S/4HANA Public Cloud solution and SAP IOT/Leonardo portfolio will be preferred but not mandatory
    Good understanding of S/4HANA Order to Cash and Procure to Pay business processes
    Good understanding of SAP ACTIVATE implementation methodology
    Use of Solution Manager as a part of implementation life cycle is desirable
    Good Communication skill in English.
    
    EDUCATION AND QUALIFICATIONS/SKILLS AND COMPETENCIES
    
    Degree in Engineering or IT
    SAP Certification in Extended Warehouse Management (EWM) desirable
    Minimum 4 to 5 full life cycle SAP EWM implementations
    Strong knowledge in SAP SCM Extended Warehouse Management Solutions and S/4HANA Embedded EWM Solution
    Good integration knowledge with other components with SAP S/4HANA (WM, SD, MM, PP) and other SAP or Non-SAP legacy applications
    Knowledge of SCOR, APICS certification preferable
    Strong client-facing experience and well-developed customer focus
    Solid oral and written communication skills, with the demonstrated ability to communicate complex technical topics to management and non-technical audiences
    Mobility is must – candidate must be ready to travel to project locations (short term and long term)
    

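    【Discussion】:

    • Hi Kundu, great solution. It gets me the required output for this particular URL. For batch processing, however, the starting text "PURPOSE AND OBJECTIVES" will change from page to page. The only constant starting point on a page is "COMPANY DESCRIPTION". Can we start from "COMPANY DESCRIPTION" and somehow get the next siblings? I tried with 'p', 'strong' and "COMPANY DESCRIPTION" but could not get it to work.
    • Here are 2 links that show the difference: link1 | link2. As you can see, link2 has a different text, "Key Areas of Responsibility and Tasks:".
    • Hi mate, if this solved your original requirement, please accept and upvote the answer. For multiple URLs, may I ask you to post a new question explaining your expected output for both URLs? Thanks.
    • Done. Newly updated question: link
    • Sorry mate, I was away. Will be back soon.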
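
The follow-up raised in the comments (start after COMPANY DESCRIPTION without knowing the next heading's wording) was moved to a new question, so what follows is only a sketch under stated assumptions, not the accepted approach. It assumes the first <strong> in the job container is always COMPANY DESCRIPTION, that the end marker WHAT YOU GET FROM US is constant, and that each heading can be mapped to a direct child of the container. PAGE_A, PAGE_B, top_level_block and section_html are hypothetical names and data made up for illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical pages whose second heading differs.
PAGE_A = """
<div class="job">
<p><span><strong>COMPANY DESCRIPTION</strong><br>About us</span></p>
<p><strong>PURPOSE AND OBJECTIVES</strong></p>
<ul><li>Lead EWM discussions</li></ul>
<p><span><strong>WHAT YOU GET FROM US </strong></span></p>
</div>
"""
PAGE_B = """
<div class="job">
<p><strong>COMPANY DESCRIPTION</strong></p>
<p>About us</p>
<p><strong>Key Areas of Responsibility and Tasks:</strong></p>
<ul><li>Manage accounts</li></ul>
<p><strong>WHAT YOU GET FROM US</strong></p>
</div>
"""


def top_level_block(tag, container):
    """Climb from tag to its ancestor that is a direct child of container."""
    node = tag
    while node.parent is not None and node.parent is not container:
        node = node.parent
    return node


def section_html(html, end_text="WHAT YOU GET FROM US", skip_headings=1):
    """Join the markup from the block holding the second <strong> heading
    up to (not including) the block that contains end_text."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "job"})
    if container is None:
        return ""
    headings = container.find_all("strong")
    if len(headings) <= skip_headings:
        return ""  # no heading after COMPANY DESCRIPTION
    start = top_level_block(headings[skip_headings], container)
    parts = []
    for node in [start, *start.find_next_siblings()]:
        if end_text in node.get_text():
            break
        parts.append(str(node))
    return "\n".join(parts)


html_a = section_html(PAGE_A)
html_b = section_html(PAGE_B)
print(html_a)
print(html_b)
```

One known limitation of this sketch: if the first and second headings share the same top-level block, the company description is included in the result, so pages with that layout would need extra handling.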