【Posted】: 2020-02-16 19:22:13
【Question】:
I am scraping a page using Python and bs4.
The HTML source I get from bs4 looks like this (tidied up slightly for readability):
<p style="text-align:justify;font-size:12.0px;font-family:Arial, Helvetica, sans-serif">
<span style="font-size:14.0px"><span style="font-family:Arial, Helvetica, sans-serif">
<strong>COMPANY DESCRIPTION</strong><br>
Here goes the first para of company description</span></span></p>
<p style="text-align:justify;font-size:12.0px;font-family:Arial, Helvetica, sans-serif">
<span style="font-size:14.0px"><span style="font-family:Arial, Helvetica, sans-serif">
Here goes the second para of company description</span></span></p>
<p><strong>PURPOSE AND OBJECTIVES</strong></p>
<p>To address requirements in the area of Supply Chain Management Extended Warehouse Management solutions, Build competencies at Solution Delivery Center to deliver solutions<br>
<strong>EXPECTATIONS AND TASKS </strong></p>
<ul>
<li>Independently handle large implementation projects with focus on Warehouse Management processes such as inbound, outbound and internal processes. RF Device functions and Barcode support experience is desirable</li>
<li>Able to lead EWM discussions, assessments and detail requirement studies with customers</li>
</ul>
<strong>KEY PERFORMANCE INDICATORS</strong></p>
<ul>
<li>Customer Feedback/customer satisfaction scores</li>
<li>Productive days/utilization as defined by the organization for projects/assessments/etc.</li>
<li>Knowledge Management and creation of effective reusable components</li>
</ul>
<strong>EXPERIENCE REQUIREMENTS</strong></p>
<ul>
<li>Minimum of 4+ years industry experience and a minimum of 5 to 6 years of SAP EWM experience</li>
<li>Domain knowledge in Supply Chain Management in the areas of Planning, Manufacturing & warehousing processes is a must</li>
</ul>
<p><strong>EDUCATION AND QUALIFICATIONS/SKILLS AND COMPETENCIES</strong></p>
<ul>
<li>Degree in Engineering or IT</li>
<li>SAP Certification in Extended Warehouse Management (EWM) desirable</li>
</ul>
<p><span style="font-family:Arial,Helvetica,sans-serif"><span style="font-size:14.0px"><strong>WHAT YOU GET FROM US </strong></span></span></p>
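Observations:
In the HTML above, every section heading sits between <strong> </strong> tags. The headings may differ from page to page.
My requirements:
- Collect all HTML text and tags starting from the second <strong> tag, the one after COMPANY DESCRIPTION (i.e. starting at PURPOSE AND OBJECTIVES), and ending just before the tag that contains WHAT YOU GET FROM US.
- I am not looking for a solution that uses Selenium, because it would be slower.
The page I am scraping is: Link I am scraping
Here is my Python code: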
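import requests
from bs4 import BeautifulSoup

def scrape_url(url, method='bs4'):
    # Fetch the page and parse it with Python's built-in HTML parser.
    session = requests.session()
    page = session.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

url = 'https://jobs.sap.com/job/Mumbai-Senior-Account-Executive-Job-MH/539212101/'
soup = scrape_url(url)
# The job posting is inside the <div class="job"> element.
job_page = soup.body.find('div', attrs={'class': 'job'})
print(job_page)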
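
For reference, the requirement above could be sketched roughly as follows, without Selenium. This is only a minimal sketch under some assumptions: the helper name `extract_between_headings` and its parameters `start_index` and `end_text` are hypothetical (they are not part of bs4), and it assumes the headings of interest are `<strong>` tags somewhere inside the job container, as in the HTML shown earlier. It climbs from each heading to the container's direct child that holds it, then joins the sibling elements from the start heading up to, but not including, the element holding the end heading.

```python
from bs4 import BeautifulSoup


def extract_between_headings(container, start_index=1,
                             end_text="WHAT YOU GET FROM US"):
    """Return the HTML between two <strong> headings as one string.

    container   -- a bs4 Tag, e.g. the <div class="job"> element
    start_index -- zero-based index of the <strong> heading to start at
                   (1 means the second heading; hypothetical parameter)
    end_text    -- text of the heading to stop before (hypothetical parameter)
    """
    headings = container.find_all("strong")
    if len(headings) <= start_index:
        return ""

    def top_level(node):
        # Climb until the node is a direct child of the container.
        while node.parent is not container:
            node = node.parent
        return node

    start = top_level(headings[start_index])
    # If no heading contains end_text, end stays None and everything
    # up to the end of the container is collected.
    end = next(
        (top_level(h) for h in headings if end_text in h.get_text()),
        None,
    )

    parts = []
    node = start
    while node is not None and node is not end:
        parts.append(str(node))  # str() keeps both tags and text
        node = node.next_sibling
    return "".join(parts)
```

With the code above it might be called as `print(extract_between_headings(job_page))`. Because the headings vary from page to page, `start_index` and `end_text` would probably need adjusting per page.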
【Comments】: