【Question Title】: Extract data from html using beautifulsoup
【Posted】: 2019-08-28 12:14:49
【Question】:

I am trying to extract the data under the EXPERIENCE heading. I am using beautifulsoup to extract the data. Below is my html:

<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS

</br></span></div><div><span>Developed and optimized many application, etc...

What I have tried so far:

with open('E:/cvparser/test.html','rb') as h:

    dh = h.read().splitlines()

    out = str(dh)

    soup = BeautifulSoup(out,'html.parser')

    for tag in soup.select('div:has(span:contains("EXPERIENCE"))'):

        final = (tag.get_text(strip = True, separator = '\n'))

    print(final)

Expected output:

I worked in XYZ company from 2016 - 2018

I worked on JAVA platform

From then i worked in ABC company

2018- Till date

I got handson on Python Language

My code returns null. Could someone help me?
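A likely cause, for reference: opening the file in binary mode and wrapping `splitlines()` in `str()` feeds the parser a Python list literal (`[b'<div>...', ...]`), so the extracted text is littered with `b'...'` noise and the selector match can fail. A minimal sketch of parsing the markup directly, with an inline string standing in for the file contents:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the contents of test.html
html = '<div><span>EXPERIENCE<br/></span></div>'

# Parse the raw markup directly -- no splitlines(), no str() of a list
soup = BeautifulSoup(html, 'html.parser')

# Match the <span> whose text mentions EXPERIENCE, then take its parent <div>
div = soup.find(lambda t: t.name == 'span' and 'EXPERIENCE' in t.get_text()).parent
print(div.get_text(strip=True))
```

For a real file, `open('E:/cvparser/test.html', encoding='utf-8')` and passing `h.read()` straight to `BeautifulSoup` keeps the markup intact.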

【Comments】:

  • Just to clarify, EXPERIENCE is not a tag. The tag you are interested in is the <span> tag, so you are looking for the data under the span tags whose text/content contains EXPERIENCE.
  • This is almost certainly a duplicate. I have seen the same question three times recently, just in slightly different forms.
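One more detail worth knowing: the `:contains()` pseudo-class used in the question is deprecated in newer releases of soupsieve (the CSS engine behind `select()`) in favour of `:-soup-contains()`. A small sketch of the first comment's suggestion using the current spelling, against an inline stand-in for the question's markup:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the question's html
html = '<div><span>EXPERIENCE<br/></span></div><div><span>other</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains matches elements whose text contains the given string
for div in soup.select('div:has(span:-soup-contains("EXPERIENCE"))'):
    print(div.get_text(strip=True))
```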

Tags: python beautifulsoup


【Solution 1】:

My understanding is that you want the text in the spans between EXPERIENCE and PROJECTS.

Here is what you need:

from bs4 import BeautifulSoup as soup

html = """<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""

page = soup(html, "html.parser")

save = False
final = ''
for div in page.find_all('div'):
    text = div.get_text()

    if text and text.strip().replace('\n','') == 'PROJECTS':
        save = False

    if save and text and text.strip().replace('\n', ''):
        # skip divs that are only whitespace so blank lines do not end up in final
        final = '{0}\n{1}'.format(final, text.replace('\n', ''))
    else:
        if text and 'EXPERIENCE' in text:
            save = True

print(final)

Output:

 I worked in XYZ company from 2016 - 2018
 I worked on JAVA platform
From then i worked in ABC company
I got handson on Python Language

【Discussion】:

    【Solution 2】:

    I am not sure about your html sample, but try this:

    import requests
    from bs4 import BeautifulSoup

    result2 = requests.get("") # your url here
    src2 = result2.content
    soup = BeautifulSoup(src2, 'lxml')


    for item in soup.find_all('div', {'span': 'Experience'}):
        print(item.text)
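A note on the snippet above: the dict passed to `find_all` filters tag attributes, so `{'span': 'Experience'}` looks for a literal `span="Experience"` attribute, which no `<div>` in the question's html has. Matching a `<div>` by the text of a child `<span>` needs a function filter instead. A sketch against an inline stand-in for the question's markup:

```python
from bs4 import BeautifulSoup

html = '<div><span>EXPERIENCE<br/></span></div><div><span>other</span></div>'
soup = BeautifulSoup(html, 'html.parser')

def has_experience_span(tag):
    # True for a <div> with at least one <span> child mentioning EXPERIENCE
    return tag.name == 'div' and any(
        'EXPERIENCE' in s.get_text() for s in tag.find_all('span'))

for item in soup.find_all(has_experience_span):
    print(item.get_text(strip=True))
```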
    

    【Discussion】:

      【Solution 3】:

      You can use itertools.groupby to match all relevant sub-content with its corresponding heading:

      import itertools, re
      from bs4 import BeautifulSoup as soup
      # flatten the parse tree into its text nodes, depth-first
      d = lambda x: [i for b in x.contents for i in ([b] if b.name is None else d(b))]
      data = list(filter(None, map(lambda x: re.sub(r'\n+|^\s+', '', x), d(soup(html, 'html.parser')))))
      # group runs of lines by whether they are an all-caps heading
      new_d = [list(b) for _, b in itertools.groupby(data, key=lambda x: x.isupper())]
      result = {new_d[i][0]: new_d[i+1] for i in range(0, len(new_d), 2)}
      

      Output:

      {'EXPERIENCE': ['\uf0b7', 'I worked in XYZ company from 2016 - 2018', 'I worked on JAVA platform', 'From then i worked in ABC company', 'I got handson on Python Language'], 'PROJECTS': ['Developed and optimized many application, etc...']}
      

      To get the desired output:

      print('\n'.join(result['EXPERIENCE']))
      

      Output:

      
      I worked in XYZ company from 2016 - 2018
      I worked on JAVA platform
      From then i worked in ABC company
      2018- Till date
      I got handson on Python Language
      

      【Discussion】:

      • It gives me list index out of range. @Ajax1234