【Posted】:2018-08-28 13:02:18
【Question】:
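I'm trying to extract the lines that contain the word "billion" from an HTML page parsed with BeautifulSoup, but I get an empty list. By the way, these lines sit between <li> tags. I also tried soup.findAll("<li>", {"class": "tabcontent"}), but that returned an empty list as well.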
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.worldstopexports.com/united-states-top-10-exports/'
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
page = requests.get(url, headers=header)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find_all(class_='tabcontent')[0].text
print(re.findall(r'^.*? billion', table))  # prints [] (the empty list in question)
print(table)  # prints the text shown below
Machinery including computers: US$201.7 billion (13% of total exports)
Electrical machinery, equipment: $174.2 billion (11.3%)
Mineral fuels including oil: $138 billion (8.9%)
Aircraft, spacecraft: $131.2 billion (8.5%)
Vehicles: $130.1 billion (8.4%)
Optical, technical, medical apparatus: $83.6 billion (5.4%)
Plastics, plastic articles: $61.5 billion (4%)
Gems, precious metals: $60.4 billion (3.9%)
Pharmaceuticals: $45.1 billion (2.9%)
Organic chemicals: $36.2 billion (2.3%)
【Discussion】:
Tags: python regex python-3.x beautifulsoup
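One likely cause of the empty list is that, without re.MULTILINE, ^ anchors only at the very start of the string, and . does not match a newline. If the text returned by .text begins with a blank line (an assumption here, since the raw string is not shown in the question), the pattern can never match. The sketch below reproduces that with a few lines copied from the printed output; the leading newline in the sample is assumed, not taken from the real page.

```python
import re

# Sample lines copied from the question's printed output.
# The leading newline is an ASSUMPTION about what .text returns for the real page.
table = """
Machinery including computers: US$201.7 billion (13% of total exports)
Electrical machinery, equipment: $174.2 billion (11.3%)
Mineral fuels including oil: $138 billion (8.9%)
"""

# Without re.MULTILINE, ^ matches only at position 0 of the whole string,
# and . cannot cross the leading newline, so nothing is found.
without_flag = re.findall(r'^.*? billion', table)

# With re.MULTILINE, ^ also matches right after every newline,
# so each line is tried separately.
with_flag = re.findall(r'^.*? billion', table, flags=re.MULTILINE)

print(without_flag)
print(with_flag)
```

As for the other attempt, the first argument of findAll / find_all is a tag name such as "li", without angle brackets, so "<li>" will not match any tag. Whether the <li> elements themselves carry the tabcontent class is not visible from the question, so that filter may need to be dropped or applied to the parent element instead.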