Web crawler to extract the content between list tags

【Question title】: Web crawler to extract in between the list
【Posted】: 2015-03-05 13:10:04
【Question】:

I am writing a web crawler in Python. I want to get everything between the <li> </li> tags. For example:

<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

So here I want to:

a.) Extract the date and convert it to dd/mm/yyyy format.

b.) Extract the number in front of "people".

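from bs4 import BeautifulSoup

# page1 holds the HTML of the downloaded page
soup = BeautifulSoup(page1, 'html.parser')
for li in soup.find_all("li"):
    # drop non-ASCII characters, then print the text of each <li>
    print(li.get_text().encode('ascii', 'ignore').decode('ascii'))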

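So far I can only extract the text.

【Comments】:

Tags: python parsing web-scraping beautifulsoup html-parsing

【Answer 1】:

Get the text with .text and split the string at the first occurrence of ":". Convert the date string to a datetime with strptime(), specifying the existing %B %d, %Y format, then format it back to a string with strftime(), specifying the desired %d/%m/%Y format. Extract the number with the regular expression At least (\d+), where (\d+) is a capturing group that matches one or more digits: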

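from datetime import datetime
import re

from bs4 import BeautifulSoup


data = '<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>'
soup = BeautifulSoup(data, 'html.parser')

# split at the first colon: date on the left, description on the right
date_string, rest = soup.li.text.split(':', 1)

# parse "January 13, 1991", then format it as dd/mm/yyyy
print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
# the capturing group (\d+) picks up the number after "At least"
print(re.match(r'At least (\d+)', rest.strip()).group(1))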

Prints:

13/01/1991
40
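
The same steps can be wrapped in a small helper that tolerates items which do not follow the expected layout. This is only a sketch under assumptions: the name parse_item, the choice to return None for text that does not fit, and returning the number as an int are illustrative and not part of the answer above.

```python
from datetime import datetime
import re


def parse_item(text):
    """Return (dd/mm/yyyy date, count) for text such as
    'January 13, 1991: At least 40 people'.

    Hypothetical helper: returns None when the text does not fit that layout.
    """
    date_string, sep, rest = text.partition(':')
    if not sep:
        return None  # no colon, so there is no date/description split
    try:
        date = datetime.strptime(date_string.strip(), '%B %d, %Y')
    except ValueError:
        return None  # the part before the colon is not in the expected date format
    match = re.match(r'At least (\d+)', rest.strip())
    if match is None:
        return None  # the description does not start with "At least <number>"
    return date.strftime('%d/%m/%Y'), int(match.group(1))


print(parse_item('January 13, 1991: At least 40 people'))
print(parse_item('No date here'))
```

Returning None instead of raising lets a crawler skip list items such as navigation links without wrapping every call in its own try/except.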

【Discussion】:

What if the data object holds the complete HTML code, and not just the <li> tag?

@AbhishekBhatia Then you need to locate the elements with find(), find_all(), select(), or another method that BeautifulSoup provides, if you have difficulty finding the ones you need. Thanks.
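
As a rough sketch of what locating the elements in a full page could look like: the HTML below, the id "events", and the CSS selector are made up for illustration, and the real page will need its own selector.

```python
from datetime import datetime
import re

from bs4 import BeautifulSoup

# Made-up page for illustration; the real markup will differ.
html = """
<html><body>
  <h2>Incidents</h2>
  <ul id="events">
    <li>January 13, 1991: At least 40 people <a href="#">source</a></li>
    <li>March 2, 1992: At least 7 people <a href="#">source</a></li>
  </ul>
  <ul id="nav"><li>Home</li></ul>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

results = []
# select() takes a CSS selector; this one matches only <li> inside the "events" list,
# so the unrelated navigation item is never parsed.
for li in soup.select('ul#events li'):
    date_string, rest = li.get_text().split(':', 1)
    date = datetime.strptime(date_string.strip(), '%B %d, %Y').strftime('%d/%m/%Y')
    number = re.match(r'At least (\d+)', rest.strip()).group(1)
    results.append((date, number))

print(results)
```

Narrowing the selection first matters because a plain find_all("li") would also return list items that have no date, and the split on ":" would then fail for them.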

Please take a look at this question: stackoverflow.com/questions/27884490/…