【Question title】: Regular expression in Beautiful Soup output
【Posted】: 2018-08-28 13:02:18
【Question】:

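I am trying to get the lines that contain the word "billion" from an
HTML page processed with BS, but I get an empty list. By the way, the lines sit
between <li> tags, and I also tried soup.findAll("<li>", {"class": "tabcontent"}),
but that gave me an empty list as well.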

import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.worldstopexports.com/united-states-top-10-exports/'

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

page = requests.get(url, headers=header)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find_all(class_='tabcontent')[0].text

print(re.findall(r'^.*? billion', table))
print(table)



Machinery including computers: US$201.7 billion (13% of total exports)
Electrical machinery, equipment: $174.2 billion (11.3%)
Mineral fuels including oil: $138 billion (8.9%)
Aircraft, spacecraft: $131.2 billion (8.5%)
Vehicles: $130.1 billion (8.4%)
Optical, technical, medical apparatus: $83.6 billion (5.4%)
Plastics, plastic articles: $61.5 billion (4%)
Gems, precious metals: $60.4 billion (3.9%)
Pharmaceuticals: $45.1 billion (2.9%)
Organic chemicals: $36.2 billion (2.3%)

【Question comments】:

    Tags: python regex python-3.x beautifulsoup


    【Solution 1】:

    You can use select() to grab the tab first, then take its li children and filter on their text:

    # ... right under soup = BeautifulSoup (page.text, 'lxml') ...
    # select the first tab
    tab = soup.select('div.tabcontent')[0]
    
    # select its items
    items = [text 
        for item in tab.select('li') 
        for text in [item.text] 
        if "billion" in text]
    print(items)
    

    This yields

    ['Machinery including computers: US$201.7 billion (13% of total exports)', 'Electrical machinery, equipment: $174.2 billion (11.3%)', 'Mineral fuels including oil: $138 billion (8.9%)', 'Aircraft, spacecraft: $131.2 billion (8.5%)', 'Vehicles: $130.1 billion (8.4%)', 'Optical, technical, medical apparatus: $83.6 billion (5.4%)', 'Plastics, plastic articles: $61.5 billion (4%)', 'Gems, precious metals: $60.4 billion (3.9%)', 'Pharmaceuticals: $45.1 billion (2.9%)', 'Organic chemicals: $36.2 billion (2.3%)']
    

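    【Comments】:

      【Solution 2】:

      Your mistake lies in how ^ and .* interact: without flags, ^ matches only at the very start of the string, and the dot does not match newlines, yet the table string contains newlines between its start and the word billion. If you are going to use a regular expression, at least pass the re.MULTILINE flag so that ^ also matches right after each newline: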

      >>> re.findall(r'^.*billion', table, flags=re.MULTILINE)
      ['Machinery including computers: US$201.7 billion',
       'Electrical machinery, equipment: $174.2 billion',
       'Mineral fuels including oil: $138 billion',
       'Aircraft, spacecraft: $131.2 billion',
       'Vehicles: $130.1 billion',
       'Optical, technical, medical apparatus: $83.6 billion',
       'Plastics, plastic articles: $61.5 billion',
       'Gems, precious metals: $60.4 billion',
       'Pharmaceuticals: $45.1 billion',
       'Organic chemicals: $36.2 billion']
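
To see the effect of the flag in isolation, here is a minimal sketch that runs on a hard-coded string instead of the live page. The sample text is an assumption: a shortened copy of the printed table from the question, starting with a newline as that output appears to.

```python
import re

# Assumed sample: a shortened copy of the printed table, starting with a newline.
table = (
    "\n"
    "Machinery including computers: US$201.7 billion (13% of total exports)\n"
    "Electrical machinery, equipment: $174.2 billion (11.3%)\n"
    "Mineral fuels including oil: $138 billion (8.9%)\n"
)

# Without flags, ^ anchors only at the very start of the string, and the dot
# cannot cross the newline found there, so nothing matches.
no_flag = re.findall(r'^.*? billion', table)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r'^.*? billion', table, flags=re.MULTILINE)

print(no_flag)
print(with_flag)
```

The pattern is the one from the question; only the flag changes between the two calls.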
      

      However, since you want to find text inside li elements, why not select just those?

      soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
      

      Passing a regular expression pattern as string lets you filter on the element's content. This gives you the matching elements:

      >>> soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
      [<li>Machinery including computers: US$201.7 billion (13% of total exports)</li>,
       <li>Electrical machinery, equipment: $174.2 billion (11.3%)</li>,
       <li>Mineral fuels including oil: $138 billion (8.9%)</li>,
       <li>Aircraft, spacecraft: $131.2 billion (8.5%)</li>,
       <li>Vehicles: $130.1 billion (8.4%)</li>,
       <li>Optical, technical, medical apparatus: $83.6 billion (5.4%)</li>,
       <li>Plastics, plastic articles: $61.5 billion (4%)</li>,
       <li>Gems, precious metals: $60.4 billion (3.9%)</li>,
       <li>Pharmaceuticals: $45.1 billion (2.9%)</li>,
       <li>Organic chemicals: $36.2 billion (2.3%)</li>]
      

      If you only want their content, you can always call .get_text() on those elements.
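
The element-based approach can be tried without fetching the page. The sketch below works under stated assumptions: the HTML snippet is made up to resemble the page (a div with class tabcontent wrapping plain li items), and it uses the built-in html.parser instead of lxml so that nothing beyond bs4 is required. It also suggests why the call from the question returned an empty list: the first argument of find_all is a tag name, so it has to be "li" rather than "<li>".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the page; the real page is larger.
html = """
<div class="tabcontent">
  <ol>
    <li>Vehicles: $130.1 billion (8.4%)</li>
    <li>Pharmaceuticals: $45.1 billion (2.9%)</li>
    <li>A line without the keyword</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# "<li>" is not a tag name, so this finds nothing.
wrong = soup.find_all("<li>", {"class": "tabcontent"})

# Find the container first, then keep only the li elements whose text matches.
matches = soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
texts = [li.get_text() for li in matches]

print(wrong)
print(texts)
```

Note that the string filter compares against each element's .string, which is None when an li holds nested tags, so such items would likely need a different check, for example on get_text().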

      【Comments】:

        【Solution 3】:

        Another approach could look like this:

        import requests
        from bs4 import BeautifulSoup
        
        URL = 'http://www.worldstopexports.com/united-states-top-10-exports/'
        soup = BeautifulSoup(requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}).text, 'lxml')
        table = soup.find(class_='tabcontent')
        data = '\n'.join(item.text for item in table.find_all("li"))
        print(data)
        

        Output:

        Machinery including computers: US$201.7 billion (13% of total exports)
        Electrical machinery, equipment: $174.2 billion (11.3%)
        Mineral fuels including oil: $138 billion (8.9%)
        Aircraft, spacecraft: $131.2 billion (8.5%)
        Vehicles: $130.1 billion (8.4%)
        Optical, technical, medical apparatus: $83.6 billion (5.4%)
        Plastics, plastic articles: $61.5 billion (4%)
        Gems, precious metals: $60.4 billion (3.9%)
        Pharmaceuticals: $45.1 billion (2.9%)
        Organic chemicals: $36.2 billion (2.3%)
        

        【Comments】:
