【Question title】: Javascript variable with html code regex email matching
【Posted】: 2014-12-29 13:59:28
【Question】:

This Python script fails to output the email address example@email.com for this case.

This is my previous post:

How can I use BeautifulSoup or Slimit on a site to output the email address from a javascript variable

#!/usr/bin/env python

from bs4 import BeautifulSoup
import re

soup = '''
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
'''


soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
    reg = '(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)'
    reg2 = 'mailto:.*'
    secondHalf = re.search(reg, script.text)
    firstHalf = re.search(reg2, script.text)
    secondHalfEmail = secondHalf.group()
    firstHalfEmail = firstHalf.group()
    firstHalfEmail = firstHalfEmail.replace('mailto:', '')
    firstHalfEmail = firstHalfEmail.replace('";', '')
    if firstHalfEmail == secondHalfEmail:
        email = secondHalfEmail
    else:
        if '>' not in firstHalfEmail:
            if '>' not in secondHalfEmail:
                if firstHalfEmail != secondHalfEmail:
                    email = firstHalfEmail + secondHalfEmail
            else:
                email = firstHalfEmail
        else:
            email = secondHalfEmail

    print email

If anyone could help me, that would be great.

Thanks

【Comments】:

    Tags: javascript python regex email beautifulsoup


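    【Solution 1】:

    Here is a rather interesting (I think) approach.

    Instead of parsing this javascript code, execute it.

    Get the ptr value, load it with BeautifulSoup, and read the href attribute value from the a tag. An example using the V8 engine: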

    from bs4 import BeautifulSoup
    from pyv8 import PyV8
    
    data = """
    <script LANGUAGE="JavaScript">
    function something()
    {
    var ptr;
    ptr = "";
    ptr += "<table><td class=france></td></table>";
    ptr += "<table><td class=france><a href=mail";
    ptr += "to:example@email.com>email</a></td></table>";
    document.all.something.innerHTML = ptr;
    }
    </script>
    """
    
    soup = BeautifulSoup(data)
    
    # prepare the function to return a value and add a function call
    js_code = soup.script.text.strip().replace('document.all.something.innerHTML = ptr;', 'return ptr;') + "; something()"
    
    ctxt = PyV8.JSContext()
    ctxt.enter()
    
    soup = BeautifulSoup(ctxt.eval(str(js_code)))
    print soup.a['href'].split('mailto:')[1]
    

    Prints:

    example@email.com
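The last step of this approach (reading the href of the a tag once ptr is known) does not depend on the JavaScript engine. Below is a minimal Python 3 sketch of just that step using only the standard library's html.parser. It assumes ptr already holds the string that the function in the question builds; the class name HrefCollector is made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Assumption: this is the value ptr has after the JavaScript function has run.
ptr = ('<table><td class=france></td></table>'
       '<table><td class=france><a href=mailto:example@email.com>email</a></td></table>')

collector = HrefCollector()
collector.feed(ptr)
collector.close()

# Keep only mailto links and strip the scheme prefix.
emails = [h.split('mailto:', 1)[1] for h in collector.hrefs if h.startswith('mailto:')]
print(emails)
```

BeautifulSoup does the same job with less code; the standard-library parser is used here only so the sketch runs without extra packages.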
    

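    【Comments】:

      【Solution 2】:

      Your problem is that you cannot find "mailto" in the text, because the first half, "mail", and the second half, "to", are not on the same line. To solve this properly, you only need to know the value of ptr at the end of that program.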
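The diagnosis above can be checked without running any JavaScript. The Python 3 sketch below searches the raw script text for "mailto:" (which fails), then joins the string literals assigned to ptr and searches again. It assumes every assignment has the form ptr = "..." or ptr += "..." with double quotes and no escaped quotes inside the literal, as in the question.

```python
import re

# Script source copied from the question.
script_text = '''
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
'''

# "mail" and "to:" sit in different string literals, so this finds nothing.
raw_match = re.search(r'mailto:', script_text)

# Collect the literal on the right-hand side of each `ptr = "..."` / `ptr += "..."`.
# Assumption: the literals contain no escaped double quotes.
pieces = re.findall(r'ptr\s*\+?=\s*"([^"]*)"', script_text)
ptr = ''.join(pieces)

# In the joined string the address follows "mailto:" directly.
joined_match = re.search(r'mailto:([^>\s"]+)', ptr)
email = joined_match.group(1) if joined_match else None

print(raw_match)
print(email)
```

This only concatenates literals; it would not follow conditionals, loops, or other string operations in the script.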

      I know the approach below is a bad one, but if you are sure the structure always looks like this:

      soup = """
      <script LANGUAGE="JavaScript"> function ...() 
      { var ptr; 
      ptr = ""; 
      ptr += "..."; 
      ptr += "..."; 
      ptr += "...";
      document.all.something.innerHTML = ptr; 
      }
      </script> 
      """
      

      then you can use this:

      soup = BeautifulSoup(soup)
      
      for script in soup.find_all('script'):
          #This matches everything between "{ var ptr;" 
          #and "document"
          regex = "{ var ptr;(.*)document"
          code = re.search(regex, script.text, flags=re.DOTALL).groups()[0]
          #This is actually dangerous because anything 
          #in the code will be executed here, but if
          #it's like your example everything will 
          #work fine and you can access the value of ptr
          exec(code)
          print ptr
      

      Now you can use BeautifulSoup or re to parse ptr. If you don't know its structure, you can use this:

          mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
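The pattern above expects an unquoted href that starts right after "<a ". As a hedged variation, the Python 3 sketch below also tolerates quoted values, other attributes before href, and a trailing query such as ?subject=. The names MAILTO_RE and extract_mailto and the extra sample addresses are made up for illustration; only the first sample comes from the question.

```python
import re

# Hypothetical pattern: an <a> tag whose href starts with mailto:,
# with or without quotes around the attribute value.
MAILTO_RE = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*["']?mailto:([^"'>\s?]+)''',
    re.IGNORECASE,
)


def extract_mailto(html):
    """Return every address found in mailto: links, in document order."""
    return MAILTO_RE.findall(html)


samples = (
    '<a href=mailto:example@email.com>email</a>'
    '<a class="x" href="mailto:bob@example.org?subject=hi">bob</a>'
    "<a href='mailto:amy@example.net'>amy</a>"
)

print(extract_mailto(samples))
```

A regular expression is still a rough tool for HTML; for anything less regular than the question's markup, an HTML parser is the safer choice.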
      

      【Comments】:
