How to get a value inside an <a tag using preg match all? Answers

【Question title】: How to get value inside <a tag using preg match all?
【Posted】: 2013-04-17 09:09:23
【Question description】:

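I have HTML content from which I need to extract the values inside hyperlink tags using preg_match_all. I tried the following, but I get no data back. I have included a sample of the input. Can you help me fix this code so that it prints every value that follows play.asp?ID= (for example, I want to get 12345 from play.asp?ID=12345)?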

Sample input HTML data:

<A HREF="http://www.somesite.com/play.asp?ID=12345&Selected_ID=&PhaseID=123" class="space"><span id="Img_1"></span></A></TD>

And the code:

$regexp = "<A\s[^>]*HREF=\"play.asp(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/A>";

if(preg_match_all("/$regexp/siU", $input, $matches)) 
{ 


$url=str_replace('?ID=', '', $matches[2]); 

$url2=str_replace('&Selected_ID=&PhaseID=123', '', $url);

print_r($url2);
}

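【Question discussion】:

Can you post a complete example of what you need (input and result)?
Thanks for the reply, Jbrtrnd. I want to get 12345 from ==>play.asp?ID=12345. Note: my real input contains many sets of hyperlinks, so I want to get all the values that follow play.asp?ID=?????

Tags: php regex parsing preg-match-all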

【Solution 1】:

$str = '<A HREF="http://www.somesite.com/play.asp?ID=12345&Selected_ID=&PhaseID=123" class="space"><span id="Img_1"></span></A>';

preg_match_all( '/<\s*A[^>]HREF="(.*?)"\s?(.*?)>/i', $str, $match);
print_r( $match );

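Try this.

【Discussion】:

Thanks, it works, but it returns every href value in the HTML. I am only interested in the href values whose class name is "space". How can I filter for that?
$x = preg_match_all( '/<\s*A[^>]HREF="(.*?)"\s?(.*?)class="space"(.*?)\s*>/i', $str, $match); var_dump( $x ); print_r( $match)
preg_match_all returns the number of matches, so for this one-link sample $x is 1 if class="space" is present and 0 otherwise.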
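The class filter in the comment above can be combined with a capture group so that the ID comes back directly. The following is only a sketch under assumptions that are not in the original thread: the ID is numeric, attributes are double-quoted, and class="space" appears after HREF inside the same tag, as it does in the sample input. The second link in $html is made up for illustration.

```php
<?php
// Hypothetical sample input: two links, only the first has class="space".
$html = '<A HREF="http://www.somesite.com/play.asp?ID=12345&Selected_ID=&PhaseID=123" class="space"><span id="Img_1"></span></A>'
      . '<A HREF="http://www.somesite.com/play.asp?ID=67890&Selected_ID=&PhaseID=123" class="other">x</A>';

// Group 1 captures the digits after play.asp?ID=, but only when
// class="space" follows later inside the same opening tag.
$pattern = '/<a\s[^>]*href="[^"]*play\.asp\?ID=(\d+)[^"]*"[^>]*class="space"[^>]*>/i';

$count = preg_match_all($pattern, $html, $matches);
$ids = $matches[1];
print_r($ids);
```

Because [^>]* cannot cross the end of a tag, a link with a different class is skipped rather than merged into a neighbouring match. If the attribute order varies in the real pages, this pattern will miss links, which is one reason the later answers recommend a parser.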

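【Solution 2】:

Don't! Regular expressions are a (bad) way of processing text. This is not text; it is HTML source code. The tool for processing it is called an HTML parser. PHP's DOMDocument can also load HTML, although it can glitch in rare cases. A poorly built regular expression (and you are mistaken if you think there is any other kind) will break on almost any change to the page.

【Discussion】:

【Solution 3】:

Isn't this enough?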

/<a href="(.*?)?"/i

Edit:

This seems to work:

'/<a href="(.*?)\?/i'

【Discussion】:

I replaced my regex with the example you gave me, but it outputs nothing! $regexp = '/
You don't need str_replace. Try '/]*href=\"(.+)\?/i'.
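To illustrate the point about not needing str_replace: a capture group placed right after play.asp?ID= returns the value itself, for every matching link at once. This is a sketch, not the commenter's pattern (which is truncated above). The sample $input and the choice to stop the value at the next & or double quote are assumptions.

```php
<?php
// Hypothetical input with several links; only some point at play.asp.
$input = '<a href="play.asp?ID=12345&Selected_ID=&PhaseID=123">one</a>'
       . '<a class="nav" href="about.asp">about</a>'
       . '<A HREF="http://www.somesite.com/play.asp?ID=678&PhaseID=9">two</A>';

// Group 1 is the ID value: everything after "play.asp?ID="
// up to the next & or closing quote.
$regexp = '/<a\s[^>]*href="[^"]*play\.asp\?ID=([^&"]*)/i';

$ids = array();
if (preg_match_all($regexp, $input, $matches)) {
    $ids = $matches[1]; // one entry per matching link, in document order
}
print_r($ids);
```

With the default flags, $matches[0] holds the full matched text and $matches[1] holds the first capture group, so no post-processing of the URL is needed.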

【Solution 4】:

This should achieve the desired result. It combines an HTML parser with a content extraction function:

// Return the text between $start and $end (case-insensitive),
// or null if $start does not occur in $string.
function extractContents($string, $start, $end)
{
    $pos = stripos($string, $start);
    if ($pos === false) {
        return null; // e.g. a link that has no play.asp?ID= part
    }
    $rest = substr($string, $pos + strlen($start));
    $endPos = stripos($rest, $end);
    // If $end is missing, the value runs to the end of the string.
    return trim($endPos === false ? $rest : substr($rest, 0, $endPos));
}

include('simple_html_dom.php');
$html = file_get_html('http://siteyouwantlinksfrom.com');
$playIDs = array();
foreach ($html->find('a') as $link)
{
    $id = extractContents($link->href, 'play.asp?ID=', '&');
    if ($id !== null) {
        $playIDs[] = $id; // skip links that are not play.asp links
    }
}

print_r($playIDs);

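You can download simple_html_dom.php from here.

【Discussion】:

I am fetching the content with curl, but I can't get your example to work! I put simple_html_dom.php in the same folder, and now there is no data in the textarea where I use curl either!

【Solution 5】:

You shouldn't use regular expressions to parse HTML.
Here is a solution with DOMDocument: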

<?php
    $input = '<A HREF="http://www.somesite.com/play.asp?ID=12345&Selected_ID=&PhaseID=123" class="space"><span id="Img_1"></span></A>';
    // Clean "&" element in href
    $cleanInput = str_replace('&','&amp;',$input);
    // Load HTML

    $domDocument = new DOMDocument();
    $domDocument->loadHTML($cleanInput);

    // Retrieve <a /> tags
    $aTags = $domDocument->getElementsByTagName('a');
    foreach($aTags as $aTag)
    {   

        $href = $aTag->getAttribute('href');
        $url  =  parse_url($href);
        $vars = array();
        parse_str($url['query'], $vars);

        var_dump($vars);
    }
?>

Output:

array (size=3)
  'ID' => string '12345' (length=5)
  'Selected_ID' => string '' (length=0)
  'PhaseID' => string '123' (length=3)
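The DOMDocument approach above extends naturally to the two follow-up requirements in this thread: many links in one page, and only those with class "space". The sketch below assumes PHP's dom extension is available. The function name extractPlayIds, the exact-match test on the class attribute, and the second link in the sample are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the ID query parameter of every <a class="space"> link.
function extractPlayIds(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world pages rarely validate.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $ids = array();
    // Exact match on the class attribute; an assumption that fits the sample input.
    foreach ($xpath->query('//a[@class="space"]') as $a) {
        $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // this href has no query string
        }
        parse_str($query, $vars);
        if (isset($vars['ID'])) {
            $ids[] = $vars['ID'];
        }
    }
    return $ids;
}

$html = '<A HREF="http://www.somesite.com/play.asp?ID=12345&Selected_ID=&PhaseID=123" class="space"><span id="Img_1"></span></A>'
      . '<a href="play.asp?ID=999" class="menu">skip</a>';
print_r(extractPlayIds($html));
```

Since the function takes an HTML string, it can be fed the body returned by curl directly. If a link carries several classes (for example class="space big"), the XPath test would need to be loosened.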

【Discussion】: