Data extraction from source with lots of white space: answers

【Question title】: Data extraction from source with lots of white space
【Posted】: 2011-02-27 07:06:15
【Question】:

I am trying to extract data from http://www.phillysheriff.com/old_site/properties.html

Ideally, I would end up with a CSV file containing the address, ward, price, and square footage. Is there an easy way to do this?

【Question comments】:

Tags: csv screen-scraping text-processing

【Solution 1】:

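The process of extracting this kind of information from a web page is commonly called "scraping". If it were me, I would use Python with the "Beautiful Soup" package. That said, a Google search for "screen scrape" or "web scrape" plus your favorite programming language should turn up a package that does the hard work for you.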
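
As a rough illustration of the scraping idea, here is a minimal sketch that uses only the Python standard library (`html.parser`, `re`, `csv`) rather than Beautiful Soup, so it runs without extra installs. It assumes that listings are separated by `<hr>` tags, which is inferred from the `<hr sep>` in the HTQL query in Solution 2 and not verified against the live page. The class `TextChunks`, the function `records_to_csv`, and the `SAMPLE` markup are all hypothetical names and data made up for this example.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect page text, starting a new record at every <hr> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = [[]]

    def handle_starttag(self, tag, attrs):
        # Assumption: each listing on the page is delimited by <hr>.
        if tag == "hr":
            self.records.append([])

    def handle_data(self, data):
        # Collapse runs of whitespace, including non-breaking spaces
        # (&nbsp; arrives here as "\xa0"), into single spaces.
        text = re.sub(r"[\s\xa0]+", " ", data).strip()
        if text:
            self.records[-1].append(text)


def records_to_csv(html):
    """Return CSV text with one row per <hr>-separated record."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for chunks in parser.records:
        if chunks:  # skip empty records
            writer.writerow(chunks)
    return out.getvalue()


# Hypothetical sample markup, not copied from the real page.
SAMPLE = """
<center>101   EXAMPLE   ST</center>
1st   Ward &nbsp;  1000 Sq Ft
<hr>
<center>202 SAMPLE AVE</center>
2nd Ward &nbsp; 2000 Sq Ft
"""

print(records_to_csv(SAMPLE))
```

Each text fragment becomes one CSV cell with its white space normalized. Splitting a cell further into ward, price, and square footage would need patterns matched to the real page's wording, which this sketch does not attempt.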

【Comments】:

【Solution 2】:

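You can run the IRobotSoft web scraper, open the page in its browser window, and then use the menu: Design -> Practice HTQL. Enter the following HTQL query in the input box to convert the page into a standard HTML table: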

<hr sep>2-0{
a=<center>1 &tx &trim;
b=<center>1:xx ./'nbsp'/1 &tx &trim('&; ');
c=<center>1:xx ./'nbsp'/3 ./'\n'/1 &tx &trim('&; ');
d=<center>1:xx ./'nbsp'/3 ./'Ward'~'BRT#'/1 &tx;
e=<center>1:xx ./'nbsp'/3 ./'BRT#'~'Improvements:'/1 &tx;
f=<center>1:xx ./'nbsp'/3 ./'Improvements:'/2 &tx;
g=<br sep>2. /'nbsp'/1 &tx &trim('&; ');
h=<br sep>2. /'nbsp'/3 &tx &trim('&; '); 
i=<br sep>2. /'nbsp'/5 &tx &trim('&; ');
j=<br sep>2. /'nbsp'/7 &tx &trim('&; ');
}

【Comments】: