【问题标题】:Regex to allow only set of HTML Tags and Attributes正则表达式只允许一组 HTML 标签和属性
【发布时间】:2012-03-18 15:28:57
【问题描述】:

如何使用常规正则表达式只允许特定的 HTML 标记集和特定的属性集?

允许的 HTML 标签:

p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label |div|sup|sub|标题

允许的 HTML 属性:

alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang |相对

为了测试这个正则表达式,我使用 RegExr 网站。

下面的正则表达式用于定位属性:

((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)

下面的正则表达式用于定位 HTML 标签:

<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P][^\s>/])[^>]*>

我试图合并两个这样的东西,但它没有正确过滤:-

<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P]|((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)[^\s>/])[^>]*>

我的意图是只允许这组属性和 HTML 标记。

应删除其余的标签和属性,并保留内容。

示例:

输入 HTML:

<h2 class="callout" cufid="2"><cufon style="width: 88px; height: 18px" class="cufon cufon-vml" alt="Lorem  "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 107px; height: 29px" path=" m39,-257 l75,-257,75,0,39,0,39,-257 x e m-41,-394 l2097,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-144,-394 l1994,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-144,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-326,-394 l1812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-326,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-427,-394 l1711,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-427,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-548,-394 l1590,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-548,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-728,-394 l1410,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-728,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m62,-51 c61,-9,125,-16,132,-47 l132,-189,166,-189,166,0,132,0,132,-28 c125,-10,106,1,81,2,-2,5,36,-116,28,-189 l62,-189,62,-51 x e m-909,-394 l1229,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-909,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m150,-6 c86,20,16,-20,16,-89,16,-163,78,-215,150,-182 l150,-158 c112,-211,48,-154,55,-94,49,-36,110,12,149,-31 x e m-1093,-394 l1045,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1093,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-1248,-394 l890,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1248,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-1336,-394 l802,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1336,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-88 c6,-164,80,-221,134,-175 l134,-189,168,-189 c159,-82,207,86,87,80,64,79,45,75,31,66 l31,39 c68,87,150,59,134,-18,94,37,6,-25,15,-88 x m96,-178 c35,-178,36,-9,106,-14,119,-15,128,-21,133,-31 l133,-156 c121,-171,108,-178,96,-178 x e m-1518,-394 l620,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1518,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m-1693,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1693,-394" coordsize="2138,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-41,-394 l1256,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-201,-394 l1096,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-201,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-360,-394 l937,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-360,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-483,-394 l814,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-483,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-571,-394 l726,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-571,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-731,-394 l566,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-731,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-852,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-852,-394" coordsize="1297,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 65px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1251,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-292,-394 l1000,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-472,-394 l820,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-472,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-593,-394 l699,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-593,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-666,-394 l626,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-666,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-847,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-847,-394" coordsize="1292,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 44px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 64px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-41,-394 l1226,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-142,-394 l1125,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-142,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-263,-394 l1004,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-263,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m95,-27 c108,-97,122,-119,145,-189 l174,-189 c155,-128,125,-66,113,0 l70,0,7,-189,41,-189 x e m-422,-394 l845,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-422,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-589,-394 l678,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-589,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-749,-394 l518,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-749,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m-822,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-822,-394" coordsize="1267,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 36px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 55px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-292,-394 l812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-376,-394 l728,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-376,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m61,-160 c90,-207,171,-202,171,-141 l171,2,136,2,136,-140 c133,-179,79,-176,61,-145 l61,2,26,2,26,-257,61,-257,61,-160 x e m-477,-394 l627,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-477,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m-659,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-659,-394" coordsize="1104,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 60px; height: 18px" class="cufon cufon-vml" alt="Lorem ipsum"><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 78px; height: 29px" path=" m185,-35 c185,-12,172,0,146,0 l27,0,27,-257 c84,-255,173,-268,179,-221,169,-240,103,-239,63,-238 l63,-163 c102,-164,138,-164,148,-136,141,-144,90,-143,63,-143 l63,-20 c103,-22,170,-12,185,-35 x e m-41,-394 l1514,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m160,-158 c176,-210,259,-200,259,-141 l259,0,225,0,225,-138 c224,-175,174,-179,161,-142 l161,0,126,0,126,-139 c127,-180,62,-176,62,-141 l62,0,27,0,27,-189,62,-189,62,-158 c73,-198,152,-205,160,-158 x e m-217,-394 l1338,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-217,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-492,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-492,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-569,-394 l986,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-569,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-690,-394 l865,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-690,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-849,-394 l706,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-849,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-950,-394 l605,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-950,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-1110,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1110,-394" coordsize="1555,577"></cvml:shape></cufoncanvas><cufontext>Lorem ipsum</cufontext><cvml:shape coordsize="1000,1000"></cvml:shape></cufon></h2>

<div class="contentContainer">
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>

预期过滤输出

<h2>Lorem Lorem Lorem Lorem Lorem Lorem ipsum</h2>

<div>
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>

再举一个例子,说得更清楚:-

输入

  1. &lt;abc id="test"&gt;new tag and known attribute&lt;/abc&gt;
  2. &lt;a id="test" href="http://www.google.com/" xyz="testattr"&gt;known tag, attribute and one unknown attr&lt;/a&gt;

输出

  1. 新标签和已知属性
  2. &lt;a id="test" href="http://www.google.com/"&gt;known tag, attribute and one unknown attr&lt;/a&gt;

感谢您的帮助。

【问题讨论】:

  • 您能向我们展示示例输入和预期输出吗?
  • @Shekhar:我已经更新了示例 HTML 和预期的过滤输出。
  • 如果有一个您不想过滤的标签但如果该标签包含不需要的属性怎么办?
  • @Shekhar:我已经用另一个例子更新了我的问题。
  • 标签有效但结构无效的字符串应该怎么办?喜欢&lt;b&gt;hi&lt;/h2&gt; &lt;p&gt;bai&lt;/b&gt;。如果无论结构如何,这些都应该通过,那么正则表达式是一个有效的选择。另外,要使用什么语言/正则表达式风格?

标签: regex


【解决方案1】:

这似乎与我不久前发布的一个问题非常相似:

How do I filter all HTML tags except a certain whitelist?

【讨论】:

  • 感谢发帖。在此链接中,它仅适用于 HTML 标记。但我也需要定位属性。如果有任何想法或建议,请发表。
【解决方案2】:

You can't parse HTML with regex (这是 Stackoverflow 上票数最高的帖子之一的原因)

【讨论】:

  • 感谢发帖。我已经看到了这个问题,但我觉得在我的场景中。应该可以的。
  • 这不是解析 HTML,这是标记 HTML。
【解决方案3】:

这是一个使用 PCRE 兼容正则表达式的 Perl 解决方案。它不知道 cmets、doctype、CDATA 等。应该添加这些以获得更完整的解决方案。

# allowed tag and attribute names

my $allowed_tags_open = 'p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|a|tr|td|table|tbody|label|div|sup|sub|caption';

my $allowed_tags_self_closing = 'img|br|hr';

my $allowed_attributes = 'alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel';

$allowed_attributes .= '|style'; # for testing


# definitions for matching allowed tag and attribute names

my $re_tags = qr~(?(DEFINE)
    (?<tags_open>
        /?+
        (?>
            (?: $allowed_tags_open )
            (?! [^\s>/] )       # from (?&tagname)
        )
    )
    (?<tags_self_closing>
        (?>
            (?: $allowed_tags_self_closing )
            (?! [^\s>/] )       # from (?&tagname)
        )
    )
    (?<tags>    (?> (?&tags_open) | (?&tags_self_closing) )    )
    (?<attribs>
        (?>
            (?: $allowed_attributes )
            (?! [^\s=/>] )      # from (?&attname)
        )
    )
)~xi;


# definitions for matching the tags
# trying to follow compatible tokenization characteristics of modern browsers

my $re_defs = qr~(?(DEFINE)
    (?<tagname> [a-z/][^\s>/]*+    )    # will match the leading / in closing tags
    (?<attname> [^\s>/][^\s=/>]*+    )  # first char can be pretty much anything, including =
    (?<attval>  (?>
                    "[^"]*+" |
                    \'[^\']*+\' |
                    [^\s>]*+            # unquoted values can contain quotes, = and /
                )
    ) 
    (?<attrib>  (?&attname)
                (?: \s*+
                    = \s*+
                    (?&attval)
                )?+
    )
    (?<crap>    (?!/>)[^\s>]    )       # most crap inside tag is ignored, but don't eat the last / in self closing tags
    (?<tag>     <(?&tagname)
                (?: \s*+                # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
                    (?>
                        (?&attrib) |    # order matters
                        (?&crap)        # if not an attribute, eat the crap
                    )
                )*+
                \s*+ /?+
                >
    )
)~xi;



sub sanitize_html{
    my $str = shift;
    $str =~ s/(?&tag) $re_defs/ sanitize_tag($&) /gexo;
    return $str;
}


sub sanitize_tag{
    my $tag = shift;

    my ($name, $attr, $end) =
        $tag =~ /^ < ((?&tags)) (.*?) ( \/?+ > ) $   $re_tags/xo
        or return '';  # return empty string if not allowed tag

    # return a new clean closing tag if it's a closing tag
    return "<$name>" if substr($name, 0, 1) eq '/';

    # clean attributes
    return "<$name" . sanitize_attributes($attr) . $end;
}


sub sanitize_attributes{
    my $attr = shift;
    my $new = '';

    $attr =~ s{
        \G
        \s*+                 # spaces between attributes not required
        (?>
            ( (?&attrib) ) | # order matters
            (?&crap)         # if not an attribute, eat the crap
        )

        $re_defs
    }{
        my $att = $1;
        $new .= " $att" if $att && $att =~ /^(?&attribs) $re_tags/xo;
        '';
    }gexo;

    return $new;
}

测试(ideone):

my $test = <<'_TEST_';
<b>simple</b>
self <img>closing</img>

<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">

<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
_TEST_

print $test, "\n";
print '-' x 70, "\n";
print sanitize_html $test;

输出:

<b>simple</b>
self <img>closing</img>

<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">

<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold

----------------------------------------------------------------------
<b>simple</b>
self <img>closing

new tag and known attribute
<a id="test" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a>

<b>crap be gone</b> not bold<br/>
<b style=color:red;background:url("x.gif");/*="still.CSS*/ id="x" class="x">tricky</b> not bold

查看您的浏览器如何解析棘手的标签:jsFiddle

可能相关:

【讨论】:

  • 您能再添加一个通用格式的正则表达式吗?由于我无法将 perl 正则表达式更改为通用正则表达式格式。
  • @SivaCharan,没有“通用格式”。您到底需要什么口味(例如,您使用什么语言)?如果您使用 PHP(或其他任何带有 PCRE 的东西),这些表达式将在编写时工作,您只需要更改引用和周围的代码。
  • 其实我是在Classic ASP上使用的
  • “通用格式”的意思是“正则表达式应该能够以最小的变化在其他语言中使用”。因此其他用户可以轻松使用我们的正则表达式。如果您可以以通用格式提供更好的解决方案,那么我可以将您的答案标记为已接受。但请确保您提供单行正则表达式。感谢您抽出宝贵时间。
【解决方案4】:

最后我分两步实现了:-

//Allowed list of HTML Tags

<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>

//Allowed list of HTML Attributes

\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel))\w+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?)

使用以上两个正则表达式,我过滤了整个 html。

编辑:

现在我将其简化为一个正则表达式,用于过滤所有必需的 HTML 标记和属性

(<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>)|(\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))

【讨论】:

  • 不要使用这些。例如 &lt;script/fail&gt;alert(1);&lt;/script/&gt;&lt;a href="#"onclick=alert(1)&gt;fail&lt;/a&gt; 失败。请告诉我答案中的表达有什么问题。
  • @Qtax:不幸的是我没有遇到这种类型的示例格式。感谢您更新失败的场景。但这是我可能遇到的罕见情况。
  • 是的,这是您没有遇到过的“罕见”案例。还有更多。无论您如何修复您的正则表达式,仍然会有更多它没有涵盖的情况。这是因为您无法使用正则表达式解析 HTML。不可能。看到这个:codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
  • @AHM:为什么这被否决了?如果您将输入作为字符串传递,我的答案将达到最大程度。
  • But this is rare case which I may encounter.。这是一个跨站点脚本漏洞(请参阅owasp.org/index.php/…),它可能允许恶意用户(例如)添加 JavaScript 以侦听输入字段(例如密码字段)的更改并将该数据发送到远程服务器.因此,这可能会给您网站的用户带来严重风险。
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2011-02-01
  • 1970-01-01
  • 1970-01-01
  • 2017-07-15
相关资源
最近更新 更多