【Question title】: Finding href from 'a' tag is not finding the first 'a' tag, how do I fix it?
【Posted】: 2021-09-27 08:37:33
【Question】:

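I'm new to Python and I'm trying to scrape Indeed. For some reason, each job posting is wrapped in an 'a' tag rather than a div, and that 'a' tag also carries the href. Here is the output of print(item):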

<a class="tapItem fs-unmask result job_e0fb3e5f520856c0 resultWithShelf sponTapItem tapItem-noPadding desktop" data-hide-spinner="true" data-jk="e0fb3e5f520856c0" data-mobtk="1favs1gn0t5v1800" href="/company/Acentury/jobs/New-Graduate-Software-Developer-e0fb3e5f520856c0?fccid=5c6453896b020232&amp;vjs=3" id="job_e0fb3e5f520856c0" rel="nofollow" target="_blank"><div class="slider_container"><div class="slider_list"><div class="slider_item"><div class="job_seen_beacon"><table cellpadding="0" cellspacing="0" class="jobCard_mainContent" role="presentation"><tbody><tr><td class="resultContent"><div class="heading4 color-text-primary singleLineTitle tapItem-gutter"><h2 class="jobTitle jobTitle-color-purple jobTitle-newJob"><div class="new topLeft holisticNewBlue desktop"><span class="label">new</span></div><span title="New Graduate Software Developer">New Graduate Software Developer</span></h2></div><div class="heading6 company_location tapItem-gutter"><pre><span class="companyName">Acentury</span><div class="companyLocation">Richmond Hill, ON<span class="remote-bullet">•</span><span>Temporarily Remote</span></div></pre></div><div class="heading6 tapItem-gutter metadataContainer"><div class="metadata salary-snippet-container"><span class="salary-snippet">$44,182 - $126,699 a year</span></div></div><div class="heading6 error-text tapItem-gutter"></div></td></tr></tbody></table><table class="jobCardShelfContainer" role="presentation"><tbody><tr class="jobCardShelf"><td class="shelfItem indeedApply"><span class="iaIcon"></span><span class="ialbl iaTextBlack">Easily apply</span></td></tr><tr class="underShelfFooter"><td><div class="heading6 tapItem-gutter result-footer"><div class="job-snippet"><ul style="list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;">
<li>Work with senior <b>developers</b> to develop front-end features on our current platform through entire R&amp;D cycle from design to implementation and official release.</li>
</ul></div><span class="date">Today</span><span class="result-link-bar-separator">·</span><button aria-expanded="false" class="sl resultLink more_links_button" type="button">More...</button></div><div class="tab-container"><div class="more-links-container result-tab" role="presentation"><div class="more_links"><button class="close-button" title="Close" type="button"></button><ul><li><span class="mat">View all <a href="/Acentury-jobs">Acentury jobs</a> - <a href="/jobs-in-Richmond-Hill,-ON">Richmond Hill jobs</a></span></li><li><span class="mat">Salary Search: <a href="/career/software-engineer/salaries/Richmond-Hill--ON?campaignid=serp-more&amp;fromjk=e0fb3e5f520856c0&amp;from=serp-more">New Graduate Software Developer salaries in Richmond Hill, ON</a></span></li></ul></div></div></div></td></tr></tbody></table><div aria-live="polite"></div></div></div><div class="slider_sub_item"></div></div></div><div class="kebabMenu"><button aria-expanded="false" aria-haspopup="true" aria-label="Job actions" class="kebabMenu-button"><svg fill="none" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg"><path d="M12 7C13.1 7 14 6.1 14 5C14 3.9 13.1 3 12 3C10.9 3 10 3.9 10 5C10 6.1 10.9 7 12 7ZM12 10C10.9 10 10 10.9 10 12C10 13.1 10.9 14 12 14C13.1 14 14 13.1 14 12C14 10.9 13.1 10 12 10ZM12 17C10.9 17 10 17.9 10 19C10 20.1 10.9 21 12 21C13.1 21 14 20.1 14 19C14 17.9 13.1 17 12 17Z" fill="#2d2d2d"></path></svg></button></div></a> 

My code is:

divs = soup.find_all('a', class_ = 'tapItem')
for item in divs:
   for people in item.find_all('a'):
       print(people)   
       for ok in people.find_all('a', class_ = 'tapItem'):
           linkJob1 = ok.get('href')
   print(linkJob1)

`people` does not contain the first 'a' tag, only the other 'a' tags nested inside it. How do I fix this? Thanks.

URL: https://ca.indeed.com/jobs?q=software+developer&l=Toronto%2C+ON&start=0

The expected result is the href of each job posting/card.

【Comments】:

  • What is the URL, and can you give an example of the expected result?
  • ca.indeed.com/… The expected result is the href of each job posting/card.

Tags: python web-scraping beautifulsoup href


【Solution 1】:

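If you loop at the level of the elements with class result, all you need is one ID (the job ID), which you can extract from the data-jk attribute. You can then build the URL dynamically, the same way the site does: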

import requests
from bs4 import BeautifulSoup as bs

# Fetch the search results page and parse it (the 'lxml' parser must be installed)
r = requests.get('https://ca.indeed.com/jobs?q=software+developer&l=Toronto,+ON&start=0')
soup = bs(r.content, 'lxml')

# Each job card has the class "result" and stores its job ID in data-jk
for job in soup.select('.result'):
    print(job.select_one('.jobTitle').get_text(' '))
    print(f'https://ca.indeed.com/viewjob?jk={job["data-jk"]}')
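
As a side note on why the loop in the question comes up empty: `find_all` searches only the descendants of a tag, so `item.find_all('a')` never returns `item` itself, and the href being looked for sits on `item`. The following is a minimal offline sketch of reading it directly, assuming a trimmed-down copy of the card markup from the question. The `HTML` sample, `BASE_URL`, and the `card_links` helper are illustrative names, not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one job card from the question.
HTML = """
<a class="tapItem result" data-jk="e0fb3e5f520856c0"
   href="/company/Acentury/jobs/New-Graduate-Software-Developer-e0fb3e5f520856c0?fccid=5c6453896b020232&amp;vjs=3">
  <h2 class="jobTitle"><span>New Graduate Software Developer</span></h2>
  <ul><li><a href="/Acentury-jobs">Acentury jobs</a></li></ul>
</a>
"""

# Assumed base for resolving the relative hrefs found on the page.
BASE_URL = "https://ca.indeed.com"


def card_links(html, base_url=BASE_URL):
    """Return the absolute href of every job card (outer 'a.tapItem') in html."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.find_all("a", class_="tapItem"):
        # The href is an attribute of the card itself, so no inner search is needed.
        links.append(urljoin(base_url, card["href"]))
    return links


if __name__ == "__main__":
    for link in card_links(HTML):
        print(link)
```

Against the live page, the same function could be given `r.content` from the request above in place of `HTML`, provided the markup still matches what the question shows.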

【Comments】:

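  • How can I get the full job description for a posting/card? When you click a job card preview during normal (non-scraping) browsing, it just opens the full job description on the right. I'd really like to get the full description when scraping.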