Extract text from HTML with XPath

【Question title】: Extract text from html with xpath
【Posted】: 2015-07-12 07:38:20
【Question】:

I want to extract text from HTML like this:

<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />
Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> (as  Stephen Kapur) and Ervin Barrington Woolley<br />
Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />
Courtesy of Island Records Ltd.<br />
Under license from Universal Music Enterprises<br />

into the form described below.

If I use the following XPath:

//*[@id="soundtracks_content"]/div[2]/div[1]/node()[count(preceding-sibling::br)=1][normalize-space()]

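then it should extract a single piece of text, "Written by Apache Indian (as Stephen Kapur) and Ervin Barrington Woolley". Instead, the expression above extracts three separate items: "Written by", "Apache Indian", and "(as Stephen Kapur) and Ervin Barrington Woolley". Could you suggest another XPath that extracts this as a single text from the HTML above? I have been practicing my XPath on this URL: http://www.imdb.com/title/tt2096672/soundtrack?ref_=tt_ql_trv_7

I am using import.io to scrape the data with XPath, but it does not let me enter the whole XPath, so I entered only this part: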

node()[count(preceding-sibling::br)=1][normalize-space()]

I have attached a picture of what I am actually doing. Please note that I also need the anchor text.

【Comments】:

Tags: xpath imdb

【Solution 1】:

Use XPath 2.0:

string-join(//*[@id="soundtracks_content"]/div[2]/div[1]/node()[count(preceding-sibling::br)=1][normalize-space()], "")
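
To check the expected result without an XPath 2.0 engine, here is a minimal Python sketch (standard library only) of the same idea: split the element's content at each `<br>` and join all the text in a segment, including text nested inside `<a>`. The names `BrSegmentParser`, `extract_lines`, and `SAMPLE_HTML` are made up for this illustration, and the sample markup is the snippet from the question with a closing `</div>` added as an assumption. It does not reflect how import.io evaluates XPath.

```python
from html.parser import HTMLParser

# Snippet from the question; the closing </div> is added here as an assumption.
SAMPLE_HTML = """<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />
Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> (as  Stephen Kapur) and Ervin Barrington Woolley<br />
Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />
Courtesy of Island Records Ltd.<br />
Under license from Universal Music Enterprises<br />
</div>"""


class BrSegmentParser(HTMLParser):
    """Collect text from a fragment, starting a new segment at every <br>."""

    def __init__(self):
        super().__init__()
        self.segments = [[]]  # one list of text pieces per <br>-delimited segment

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.segments.append([])  # each <br> opens a new segment

    def handle_data(self, data):
        # Text nested in <a> arrives here as well, so anchor text stays in its segment.
        self.segments[-1].append(data)

    def lines(self):
        # Join each segment, collapse whitespace (similar to normalize-space()),
        # and drop segments that end up empty.
        joined = (" ".join("".join(parts).split()) for parts in self.segments)
        return [line for line in joined if line]


def extract_lines(html):
    """Return one string per <br>-delimited segment of the given HTML fragment."""
    parser = BrSegmentParser()
    parser.feed(html)
    parser.close()
    return parser.lines()


if __name__ == "__main__":
    # Index 1 is the segment that has exactly one <br> before it.
    print(extract_lines(SAMPLE_HTML)[1])
    # Written by Apache Indian (as Stephen Kapur) and Ervin Barrington Woolley
```

One thing to keep in mind when adapting the XPath: `preceding-sibling::br` is counted from each selected node, so a text node nested inside `<a>` has no `br` siblings of its own. The anchor text therefore has to be picked up through its parent `<a>` element, which is what joining the string values of the `div`'s child nodes achieves.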

【Comments】: