【问题标题】:PHP Remove URL from stringPHP 从字符串中删除 URL
【发布时间】:2009-07-11 14:38:06
【问题描述】:

如果我有一个包含 url 的字符串(例如,我们称之为 $url),例如;

$url = "Here is a funny site http://www.tunyurl.com/34934";

如何从字符串中删除 URL? 困难的是,url 也可能不带 http:// 就显示出来,比如 ;

$url = "Here is another funny site www.tinyurl.com/55555";

不存在 HTML。如果 http 或 www 存在,我将如何开始搜索,然后删除文本/数字/符号直到第一个空格?

【问题讨论】:

  • 我们是在谈论从字符串中提取 url 还是删除实际链接本身? $url = "这是另一个有趣的网站 www.tinyurl.com/55555"; (extracting) $url = "这是另一个有趣的网站 www.tinyurl.com/55555";和 $someVar = 'www.tinyurl.com/55555'; (删除) $url = "这里是另一个有趣的网站";

标签: php


【解决方案1】:

我重新阅读了这个问题,这是一个可以按预期工作的函数:

function cleaner($url) {
  $U = explode(' ',$url);

  $W =array();
  foreach ($U as $k => $u) {
    if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
      unset($U[$k]);
      return cleaner( implode(' ',$U));
    }
  }
  return implode(' ',$U);
}

$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);

编辑#2/#3(我一定很无聊)。这是一个验证 URL 中是否存在 TLD 的版本:

function containsTLD($string) {
  preg_match(
    "/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
    $string,
    $M);
  $has_tld = (count($M) > 0) ? true : false;
  return $has_tld;
}

function cleaner($url) {
  $U = explode(' ',$url);

  $W =array();
  foreach ($U as $k => $u) {
    if (stristr($u,".")) { //only preg_match if there is a dot    
      if (containsTLD($u) === true) {
      unset($U[$k]);
      return cleaner( implode(' ',$U));
    }      
    }
  }
  return implode(' ',$U);
}


$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);

返回:

Cleaned: Here is another funny site badurl.badone but this and and

【讨论】:

  • 感谢您抽出宝贵时间对此进行扩展。
  • “this and and”应该是蓝色的吗?我不想编辑实际输出 ;-)
  • 这不会捕获与引号等其他字符相邻的 URL。所以<a href="http://www.google.com"> 不会被正确过滤以删除 URL。您可以使用 strip_tags,但如果这不是您想要的,则需要对其进行调整。
  • 由于某种原因,如果网址前有换行符,则会中断。例如:“Some text and some more i.imgur.com/aaa.png”可以正常工作,但如果在单词“more”(而不是空格)之后有一个 \n,则结果是“Some text and some”。有什么建议么?谢谢!
  • 我替换了 $U = explode(' ',$url); $U = preg_split('/\s+/', $url);这似乎可以解决问题。
【解决方案2】:
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);  

【讨论】:

    【解决方案3】:

    解析 URL 的文本很困难,寻找已经为您执行此操作的预先存在的、经过大量测试的代码比编写自己的代码和缺少边缘情况要好。例如,我会看一下 Django 的 urlize 中的过程,它将 URL 包装在锚点中。您可以将其移植到 PHP 中,并且——而不是将 URL 包装在锚中——只需从文本中删除它们。

    【讨论】:

      【解决方案4】:

      谢谢迈克,

      更新一下,返回通知错误,

      '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i'

      $string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);

      【讨论】:

        【解决方案5】:
        $url = "Here is a funny site http://www.tunyurl.com/34934";
        $replace = 'http www .com .org .net';
        $with = '';
        
        $clean_url = clean($url,$replace,$with);
        echo $clean_url;
        
        function clean($url,$replace,$with) {   
        
          $replace = explode(" ",$replace);
          $new_string = '';
          $check = explode(" ",$url);
        
          foreach($check AS $key => $value) {
             foreach($replace AS $key2 => $value2 ) {
                if (-1 < strpos( strtolower($value), strtolower($value2) )  ) {
                    $value = $with;
                    break;
                }
             }
            $new_string .= " ".$value;
          }
         return $new_string;
        }
        

        【讨论】:

        • 你能用你的代码解释一下吗?它可能对 OP 或未来的用户有更多帮助。
        【解决方案6】:

        您需要编写一个正则表达式来提取网址。

        【讨论】:

          猜你喜欢
          • 2014-08-26
          • 2012-07-18
          • 2011-09-14
          • 1970-01-01
          • 1970-01-01
          • 2014-03-02
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          相关资源
          最近更新 更多