【Question title】: How to not select some data using XPath in Web-Harvest
【Posted】: 2013-09-19 08:51:37
【Question】:

I am using Web-Harvest and XQuery to fetch data from a website.

I have two XQuery variables holding the following data:

$text:

<p> <strong>Psoria-Shield Inc.</strong> (<a href="http://www.psoria-shield.com/"></a><a href="/Tracker?data=gB90UgQvS9bs99znBBkklh-mudx4NTcPFIy_wiP7zUJ-qBXYABNid0GYgW4g7qVsjn3_dv2FPGzaYgKnhq_Ujg%3D%3D" target="_top">www.psoria-shield.com</a>) is a Tampa FL based company specializing in design, manufacturing, and distribution of medical devices to domestic and international
                  markets. PSI employs full-time engineering, production, sales staff, and manufactures within an ISO 13485 certified quality
                  system. PSI's flagship product, Psoria-Light&#174;, is FDA-cleared and CE marked and delivers targeted UV phototherapy for
                  the treatment of certain skin disorders. Psoria-Shield Inc., was acquired by Wellness Center USA Inc. ("WCUI") in August 2012,
                  and is now a wholly-owned subsidiary.
               </p> 
               <p> <strong>AminoFactory</strong> (<a href="http://www.aminofactory.com/"></a><a href="/Tracker?data=O0xbFRJiVuWDzRDq7SVwVR9xAPYLIGQyBw4mDziUrH4KB3DIYUasiO_O78eteJsv2doAGtg4kRhAqmnvkQ-9LA%3D%3D" target="_top">www.aminofactory.com</a>), a division of Wellness Center USA, Inc., is an online supplement store that markets and sells a wide range of high-quality
                  nutritional vitamins and supplements. By utilizing AminoFactory's online catalog, bodybuilders, athletes, and health conscious
                  consumers can choose and purchase the highest quality nutritional products from a wide array of offerings in just a few clicks.
               </p> 
                <pre>At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a></pre> </span><span class="dt-green">

$contact:

At Wellness Center Usa, Inc.
Tel: (847) 925-1885 <a href="/Tracker?data=rhuzXSqaPgDJ--ByIIMSm7wrtVUZmqiD7wl78d4gUHajkKceardtmAscrHABzvo360XXBJCWn_Rb_s-yPMVXTw_XJrSieD88bIXbE9snPn4%3D" target="_top">www.wellnescenterusa.com</a> Investor Relations Contact:
Arthur Douglas &amp; Associates, Inc.
Arthur Batson
Phone: 407-478-1120 <a href="/Tracker?data=9uKwR5tr9QwjFw830lvFTIWgz-s_eHaywZHwDl3el2RfYe5VuQZd_8sJU4J7HoFgOdyCn8br77RK60SIqLZkCy468cEKHpGUgE-nanwYfHo%3D" target="_top">www.arthurdouglasinc.com</a>

(The text above is just an example.)

What I want is to remove the content of $contact from $text. So far I have come up with the following code:

{
    for $x in $text
        return if(matches($contact, '')) then $x
            else if(matches($contact, $x)) then  '' else $x 
}

It does not work, and I don't know where I went wrong. Please let me know the correct way to do this.

【Question comments】:

    Tags: javascript html web-scraping xquery webharvest


    【Solution 1】:

    Do not use matches(...) for exact string comparison; it is meant for regular expressions, so you would have to escape a number of special characters.
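
To illustrate the point, here is a minimal sketch using one line of the sample contact data. The variable names $line and $needle are made up for this example. It also suggests why the first branch in the question's code always fires: an empty pattern appears to match any string.

```xquery
(: Hypothetical values copied from one line of the sample contact block :)
let $line   := "Tel: (847) 925-1885"
let $needle := "Tel: (847) 925-1885"
return (
  (: As a regex, "(847)" is a group that matches "847", not the literal "(847)" :)
  matches($line, $needle),
  (: Plain string equality needs no escaping :)
  $line = $needle,
  (: Literal substring test, also no escaping :)
  contains($line, $needle),
  (: An empty pattern matches every input string :)
  matches($line, '')
)
```

In a conforming XQuery processor this should evaluate to false, true, true, true.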

    If the HTML subtrees are exactly identical, use:

    $text[not(deep-equal(., <pre>{ $contact }</pre>))]
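
As a rough, self-contained sketch of that filter, the query below uses shortened stand-ins for the asker's variables; the sample values are assumptions, not the real page data.

```xquery
(: Hypothetical, shortened stand-ins for $contact and $text :)
let $contact := "Tel: (847) 925-1885"
let $text := (
  <p>Company description</p>,
  <pre>Tel: (847) 925-1885</pre>
)
(: Keep every item that is not structurally identical to <pre>{ $contact }</pre> :)
return $text[not(deep-equal(., <pre>{ $contact }</pre>))]
```

With these inputs only the p element should remain. deep-equal compares element names, attributes, and children, so any difference in markup or whitespace inside the pre element would make the comparison fail and the node would be kept.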
    

    If you only want to compare the contents, use data(...):

    $text[not(data(.) = string-join(data($contact)))]
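
A small sketch of the content comparison, assuming $contact is a sequence of text and element nodes without the surrounding pre tag. The sample values and the www.example.com link are placeholders. The sketch passes an explicit empty separator to string-join, because the one-argument form was only added in XPath 3.0 and older processors may reject it.

```xquery
(: Hypothetical stand-in: $contact holds the child nodes of a pre element :)
let $contact :=
  <pre>Tel: (847) 925-1885 <a href="#">www.example.com</a> Investor Relations</pre>/node()
let $text := (
  <p>Company description</p>,
  <pre>Tel: (847) 925-1885 <a href="#">www.example.com</a> Investor Relations</pre>
)
(: data($contact) yields three separate values; joining them gives one string
   that can be compared with the string value of each item in $text :)
return $text[not(data(.) = string-join(data($contact), ''))]
```

With these inputs only the p element should remain. Without string-join, data(.) would be compared against each of the three fragments separately, none of which equals the whole text of the pre element, so nothing would be filtered out.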
    

    Given the data you posted, though, you could simply remove all <pre/> nodes:

    $text[local-name() != 'pre']
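
For completeness, a minimal sketch of the name-based filter, assuming every item in $text is an element node (local-name() would raise a type error on plain strings). The sample values are made up.

```xquery
(: Hypothetical, shortened stand-in for $text :)
let $text := (
  <p>Company description</p>,
  <pre>Tel: (847) 925-1885</pre>
)
(: Drop every top-level element named "pre", regardless of its namespace :)
return $text[local-name() != 'pre']
```

This only filters the top-level items of $text; a pre element nested inside another element would be left in place.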
    

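    【Comments】:

    • return $text[not(data(.) = data($contact))] returns everything as text, including the content of $contact?
    • Actually, it returns all nodes that are not equal to the nodes in $contact, but in the end, you are right.
    • $contact actually contains only the data, without the pre tag. I have edited my post. I am using "$text[not(data(.) = data($contact))]" and it does not remove the contact information.
    • data($contact) now returns multiple values, so you need to string-join(...) them. I updated my answer and also added the deep-equal version. The last alternative stays the same.
    • Given the information in your question, all of these queries should work; the problem must lie outside that code. Unless you post more information (especially code), we will not be able to help you.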