【Question title】: Parse HTML string with Nokogiri
【Posted】: 2015-10-19 17:25:50
【Question】:

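I am trying to write a Ruby script that parses an HTML string and extracts some values from specific nodes.

Right now I am stuck on reading the string into a Nokogiri document.

This code: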

#!/usr/bin/ruby

html_doc = Nokogiri::HTML("<html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>")

produces this error:

$ ruby emailParser.rb 
emailParser.rb:3: syntax error, unexpected tIDENTIFIER, expecting ')'
...ML("<html>  <meta content="text/html; charset=UTF-8"/>  <bod...
...                               ^
emailParser.rb:3: syntax error, unexpected tSTRING_BEG, expecting end-of-input
...tent="text/html; charset=UTF-8"/>  <body style='margin:20px'...
...                               ^

Please note that I have already tried the solution from this question:

"syntax error, unexpected tIDENTIFIER, expecting $end"

【Comments】:

  • Either use single quotes inside the HTML or use single quotes around the HTML. As written, text/html; charset=UTF-8 is not part of the string.

Tags: html ruby nokogiri


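【Solution 1】:

The problem is that your string contains double quotes, which confuses the parser because you are also using double quotes to delimit the string. To illustrate: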

puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
#    puts "foo"bar"
#                 ^

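You probably meant to print foo"bar, but when the parser reaches the second " (the one after foo), it thinks the string has ended, so whatever follows causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: see how the first line, "foo"bar", is colored differently? A text editor with good syntax highlighting will do the same.)

One solution is to use single quotes instead: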

puts 'bar"baz'
# => bar"baz

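That fixes the problem in this case, but it does not actually help you, because your string also contains single quotes!

Another solution is to escape the quotes by putting a \ in front of each one, like this: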

puts "foo\"bar"
# => foo"bar

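...but for a long string like yours that is tedious (and sometimes tricky). A better solution is to use a special kind of string called a "heredoc" (short for "here document", for what it's worth):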

# Nokogiri must be loaded before Nokogiri::HTML can be called.
require 'nokogiri'

str = <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

html_doc = Nokogiri::HTML(str)

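The delimiter "END_OF_HTML" is arbitrary. You could use EOF or XYZZY or whatever suits you, although it is a good idea to pick something meaningful. (You will notice that Stack Overflow's syntax highlighting struggles a bit with heredocs; most code editors handle them fine, though.)

You can make it more compact like this: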

Nokogiri::HTML <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

Or with parentheses (it looks a bit odd, but it works and is sometimes necessary):

Nokogiri::HTML(<<-END_OF_HTML)
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
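
As a quick check of the point above, here is a minimal sketch showing that a heredoc yields the same string as the backslash-escaped form. The one-line snippet is a hypothetical stand-in for the long HTML in the question, and the variable names are only for illustration:

```ruby
# Hypothetical short snippet standing in for the long HTML string.
# It contains both a double-quoted attribute and an apostrophe.
escaped = "<p class=\"note\">Tom's iPad</p>\n"

# The same text as a heredoc: no escaping is needed for either quote style.
# The heredoc body ends with a newline, which is why `escaped` ends in \n.
heredoc = <<-END_OF_HTML
<p class="note">Tom's iPad</p>
END_OF_HTML

puts heredoc == escaped
```

Note that `<<-` only lets the closing delimiter be indented; any leading whitespace on the content lines stays in the string.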

You can read more about heredocs and other ways of representing strings in the Literals section of the Ruby documentation.
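
Two of those other literal forms can be sketched as follows. This is not part of the original answer: it assumes Ruby 2.3 or later for the `<<~` ("squiggly") heredoc, and the one-line snippet is again a hypothetical stand-in for the real HTML:

```ruby
# %q(...) behaves like a single-quoted string, but with delimiters you choose,
# so both " and ' can appear inside without escaping.
percent = %q(<p class="note">Tom's iPad</p>)

# <<~ strips the common leading indentation from the content lines
# (Ruby 2.3+), so the heredoc can be indented along with the code.
squiggly = <<~END_OF_HTML
  <p class="note">Tom's iPad</p>
END_OF_HTML

puts percent
# The heredoc ends with a newline; chomp removes it before comparing.
puts squiggly.chomp == percent
```

With `%q(...)`, unbalanced parentheses inside the text would need escaping, so for arbitrary HTML a heredoc is usually the less fragile choice.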

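【Comments】:

    【Solution 2】:

    You have to change the quotes around the HTML string from " to ' and change the quotes inside the HTML to ". Any apostrophe that remains in the text, such as the one in Tom's, then has to be escaped as \'. Something like this should work: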

    #!/usr/bin/ruby
    
    html_doc = Nokogiri::HTML('<html>  <meta content="text/html; charset=UTF-8"/>  <body style="margin:20px">    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style="list-style-type:none; margin:25px 15px;">      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom\'s iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style="height=2px; color:#aaa"/>        <p>We hope you enjoy the app store experience!</p>        <p style="font-size:18px; color:#999">Powered by App47</p>      <img src="https://cirrus.app47.com/notifications/562506219ac25b1033000904/img" alt=""/></body></html>')
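
The escaping rule this approach relies on can be seen in a small sketch, using one list item from the HTML above (the variable names are only for illustration):

```ruby
# Inside single quotes, \' is one of the few recognized escapes,
# so the apostrophe in Tom's has to be written as \'.
single = '<li><b>Description:</b> Tom\'s iPad</li>'

# Inside double quotes the apostrophe needs no escaping at all.
double = "<li><b>Description:</b> Tom's iPad</li>"

puts single
puts single == double
```

Both literals hold exactly the same characters; the backslash is consumed by the Ruby parser and never reaches Nokogiri.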
    

    【Comments】:
