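【Question title】: Excluding URLs from path links?
【Posted】: 2011-10-26 19:49:15
【Question】:

In the function below, I want to specify a list of domains to exclude from the results. What are my options? An array holding the domains to exclude?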

class KeywordSearch
{       
    const GOOGLE_SEARCH_XPATH = "//a[@class='l']";
    public $searchQuery;
    public $numResults ;
    public $sites;
    public $finalPlainText = '';
    public $finalWordList = array();
    public $finalKeywordList = array();

    function __construct($query,$numres=7){
        $this->searchQuery = $query;
        $this->numResults = $numres;
        $this->sites = array();
    }

    protected static $_excludeUrls  = array('wikipedia.com','amazon.com','youtube.com','zappos.com');//JSB NEW

    private function getResults($searchHtml){

        $results = array();
        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        $dom->formatOutput = false;
        @$dom->loadHTML($searchHtml);
        $xpath = new DOMXpath($dom);
        $links = $xpath->query(self::GOOGLE_SEARCH_XPATH);

        foreach($links as $link)
        {
            $results[] = $link->getAttribute('href');           
        }

        $results = array_filter($results,'self::kwFilter');//JSB NEW
        return $results;
    }

    protected static function kwFilter($value)
    {
        return !in_array($value,self::$_excludeUrls);
    }   

【Comments】:

    Tags: php xpath domdocument


    【Solution 1】:
    protected static $_banUrls = array('foo.com', 'bar.com');

    private function getResults($searchHtml)
    {
        $results = array();

        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        $dom->formatOutput = false;
        // Suppress warnings caused by malformed markup in the fetched page.
        @$dom->loadHTML($searchHtml);

        $xpath = new DOMXPath($dom);
        $links = $xpath->query(self::GOOGLE_SEARCH_XPATH);

        foreach ($links as $link) {
            $results[] = $link->getAttribute('href');
        }

        // Filter out specific links here. The array form of the callback
        // refers to the static method defined below.
        $results = array_filter($results, array(__CLASS__, 'myFilter'));

        return $results;
    }

    protected static function myFilter($value)
    {
        // Keep a value only if it is not listed in $_banUrls (exact string match).
        return !in_array($value, self::$_banUrls);
    }
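
One thing to keep in mind with the filter above: `in_array` compares whole strings, so an entry such as `foo.com` only matches an `href` that is exactly `foo.com`. If the links are absolute URLs, comparing the host part may be closer to what the question asks for. The following is only a sketch under that assumption; `isBannedUrl` and the sample URLs are hypothetical names and data, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when the
// URL's host equals a banned domain or is a subdomain of one.
function isBannedUrl($url, array $bannedDomains)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // relative or malformed URL: no host to compare
    }
    $host = strtolower($host);

    foreach ($bannedDomains as $domain) {
        $domain = strtolower($domain);
        $suffix = '.' . $domain;
        // Match "foo.com" itself or anything ending in ".foo.com".
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Sample data, assumed for illustration only.
$banned = array('foo.com', 'bar.com');
$links = array(
    'http://www.foo.com/page',
    'http://bar.com/',
    'http://notfoo.com/',
    'http://example.org/?ref=foo.com',
);

$kept = array();
foreach ($links as $link) {
    if (!isBannedUrl($link, $banned)) {
        $kept[] = $link;
    }
}
print_r($kept);
```

Inside the class, `myFilter` could delegate to a check like this instead of calling `in_array`. Matching on the host also avoids dropping a link merely because a banned domain appears in its query string.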
    

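    【Comments】:

    • @Scott B Glad it helped. If you found it useful, please accept it.
    • I get an error > "The second argument, 'myFilter', should be a valid callback"
    • In case it matters, all of this is inside a class.
    • Still no joy. I have pasted the modified code into the original question. Error > "Fatal error: Cannot call method self::kwFilter() or method does not exist"
    • @Scott B What is kwFilter()? I think you forgot to rename the function to myFilter().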
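
The errors quoted in the comments above both concern how the callback is passed to `array_filter`. A bare `'myFilter'` names a global function, not a method, and the `'self::kwFilter'` string form was evidently not accepted in the asker's environment (relative callables of that kind are also deprecated as of PHP 8.2). A minimal sketch of the array form, which names the class and the static method separately, is shown below; `FilterDemo`, `run`, and `keep` are hypothetical names used only for illustration.

```php
<?php
class FilterDemo
{
    // Assumed sample values; the filter does an exact string comparison.
    protected static $banned = array('http://foo.com/', 'http://bar.com/');

    public static function run(array $urls)
    {
        // array(class name, method name) identifies a static method.
        // Because array_filter is called from inside the class, the
        // protected method is reachable.
        $kept = array_filter($urls, array(__CLASS__, 'keep'));

        // array_filter preserves keys, so reindex the result.
        return array_values($kept);
    }

    protected static function keep($value)
    {
        return !in_array($value, self::$banned);
    }
}

$result = FilterDemo::run(array('http://foo.com/', 'http://baz.com/'));
var_dump($result);
```

Whatever name is used for the filter method, the callback and the method definition have to agree, which is the mismatch the last comment points at.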
    【Solution 2】:

    Since you tagged this XPath, here is how to do it with the XPath contains function:

    $html = <<<HTML
    <ul>
        <li><a href="http://foo.example.com">foo</a></li>
        <li><a href="http://bar.example.com">bar</a></li>
        <li><a href="http://baz.example.com">baz</a></li>
    </ul>
    HTML;
    
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xp = new DOMXPath($dom);
    // Select only href attributes that contain neither excluded domain.
    $query = '//a/@href[
        not(contains(., "foo.example.com")) and
        not(contains(., "bar.example.com"))
    ]';
    foreach ($xp->query($query) as $hrefAttr) {
        echo $hrefAttr->nodeValue;
    }
    

    This will output:

    http://baz.example.com
    

    See the XPath 1.0 specification for other string functions that can be used to test node sets.
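
Since the question asks about keeping the exclusions in an array, the `contains` predicate can also be generated from a list rather than written by hand. This is only a sketch: `buildExcludeQuery` is a hypothetical helper, and it assumes the domain strings contain no double quotes, because XPath 1.0 has no escape syntax for quotes inside a string literal.

```php
<?php
// Hypothetical helper: builds an XPath that selects href attributes
// containing none of the given domains.
function buildExcludeQuery(array $domains)
{
    $conditions = array();
    foreach ($domains as $domain) {
        // Assumes $domain has no double quotes (see the note above).
        $conditions[] = sprintf('not(contains(., "%s"))', $domain);
    }
    if (count($conditions) === 0) {
        return '//a/@href'; // nothing to exclude
    }
    return '//a/@href[' . implode(' and ', $conditions) . ']';
}

$html = '<ul>'
    . '<li><a href="http://foo.example.com">foo</a></li>'
    . '<li><a href="http://bar.example.com">bar</a></li>'
    . '<li><a href="http://baz.example.com">baz</a></li>'
    . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$exclude = array('foo.example.com', 'bar.example.com');
$kept = array();
foreach ($xp->query(buildExcludeQuery($exclude)) as $hrefAttr) {
    $kept[] = $hrefAttr->nodeValue;
}
print_r($kept);
```

Note that `contains` matches the domain anywhere in the attribute value, including the path or query string, so it is a looser test than comparing the host of each URL.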

    【Comments】:
