【Question title】: Finding unique file names from an html file
【Posted】: 2010-12-14 06:15:27
【Question】:
$ cat downloaded_file.html

1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM  

From my shell script, how can I search the html file and pick out the unique file names that start with STDMON and end with _company.txt?

【Comments】:

    Tags: regex shell sed awk grep


    【Solution 1】:

    If there are only digits between STDMON and _company.txt, you can do:

    grep -o 'STDMON[0-9]*_company\.txt' input.txt | sort -u
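
Since the question asks about using this from a shell script, here is a minimal sketch of wrapping the command above in a function and looping over its results. The function name `unique_names` and the two-line sample are assumptions made for illustration; `grep -o` is supported by GNU and BSD grep but is not part of POSIX.

```shell
#!/bin/sh
# unique_names [FILE...]: print each distinct STDMON<digits>_company.txt
# name once, sorted. With no FILE argument it reads standard input.
unique_names() {
    grep -o 'STDMON[0-9]*_company\.txt' "$@" | sort -u
}

# Hypothetical sample in the same shape as the question's input.
sample='1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br>
1373 <A HREF="http://site.com/STDMON14959440_company.txt">STDMON14959440_company.txt</A><br>'

# The names contain no whitespace, so reading line by line is safe here.
printf '%s\n' "$sample" | unique_names | while IFS= read -r name; do
    printf 'found: %s\n' "$name"
done
```

In a real script the function would be called with the file name instead, for example `unique_names downloaded_file.html`.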
    


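    If anything can appear between them, you can use a non-greedy match (this needs a grep built with -P, i.e. Perl-compatible regex support):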

    grep -oP 'STDMON.*?_company\.txt' input.txt | sort -u
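
Where `grep -P` is not available, a negated bracket expression can stand in for the non-greedy `.*?`. This is a sketch under the assumption that file names never contain a double quote or an angle bracket; the function name `unique_names_any` and the sample line are hypothetical. A plain greedy `.*` would not work here, because on a line like the question's it would match from the first STDMON through the last _company.txt.

```shell
#!/bin/sh
# unique_names_any [FILE...]: like the -P version, but without PCRE.
# Assumption: names never contain ", < or >, so a match cannot run
# across the HREF attribute boundary or a tag boundary.
unique_names_any() {
    grep -o 'STDMON[^"<>]*_company\.txt' "$@" | sort -u
}

# Hypothetical sample line; the name appears in both the HREF and the link text.
printf '%s\n' '1373 <A HREF="http://site.com/STDMONab12_company.txt">STDMONab12_company.txt</A><br>' |
    unique_names_any
# prints: STDMONab12_company.txt
```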
    

    【Comments】:

      【Solution 2】:
      awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html
      

      Input

      $ cat downloaded_file.html
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON12342440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
      

      Output

      $ awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html
      STDMON11202010_company.txt
      STDMON14959440_company.txt
      STDMON12342440_company.txt
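
The one-liner is compact, so here is an expanded sketch of the same idea for readability. It assumes, as in the sample input, one link per line with the `<A ...>` tag as the first tag on the line, so that the link text is field 3 when splitting on `<` and `>`. The function name `link_text_names`, the array name `seen`, and the `^`/`$` anchors are additions for illustration, not part of the original answer.

```shell
#!/bin/sh
# link_text_names [FILE...]: print each distinct matching link text once,
# in order of first appearance. With no FILE it reads standard input.
link_text_names() {
    awk -F'[<>]' '
        # With < and > as separators, field 3 is the text between
        # the opening <A ...> tag and the closing </A>.
        $3 ~ /^STDMON[0-9]+_company\.txt$/ {
            if (!seen[$3]++)   # true only the first time this name appears
                print $3
        }
    ' "$@"
}
```

Called as `link_text_names downloaded_file.html`, this keeps names in the order they first appear, which is why the output above is not sorted the way `sort -u` would sort it. It also looks only at the link text, not at the HREF value.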
      

      【Comments】:
