【Posted】: 2016-03-19 05:41:31
【Question】:
I have this HTML:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>
In Scrapy, I want to get all of the HTML inside <div id="content"> except the <div class="infobox"> block, so the expected result is:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
</div>
How can I modify my current selector to achieve this:
item['article_html'] = hxs.select("//div[@id='content']").extract()[0]
【Discussion】:
Tags: python html xpath web-scraping scrapy
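An XPath 1.0 expression can only select nodes that already exist; it cannot return a copy of <div id="content"> with one child missing. One possible approach is therefore to select the container, remove the unwanted block from the parsed tree, and serialize what remains. The sketch below uses lxml directly (the library that Scrapy's selectors are built on) rather than the `hxs` object from the question. The helper name `content_without_infobox` and the `SAMPLE_HTML` constant are made up for illustration, and the code assumes the page contains exactly one <div id="content">.

```python
import lxml.html

# Sample markup copied from the question.
SAMPLE_HTML = """<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>"""


def content_without_infobox(html_text):
    """Return the HTML of div#content with any div.infobox removed.

    Hypothetical helper: assumes a single <div id="content"> exists.
    """
    root = lxml.html.fromstring(html_text)
    content = root.xpath("//div[@id='content']")[0]
    # Collect the matches first, then remove each one from the tree.
    for box in content.xpath(".//div[@class='infobox']"):
        # drop_tree() removes the element and its children but keeps
        # any text that follows the element (its tail).
        box.drop_tree()
    return lxml.html.tostring(content, encoding="unicode")


if __name__ == "__main__":
    print(content_without_infobox(SAMPLE_HTML))
```

Inside a spider, the same function could be given the raw page source (for example `response.body`, assuming a `response` object is in scope) and its return value assigned to `item['article_html']`. If the outer <div id="content"> wrapper is not needed, an alternative that stays within the selector is to extract the child nodes with `//div[@id='content']/node()[not(self::div[@class='infobox'])]` and join the resulting strings.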