【问题标题】:Find unique URLs in a file在文件中查找唯一的 URL
【发布时间】:2018-05-08 00:53:29
【问题描述】:

情况

我在一个文件中有许多 URL,我需要找出存在多少个唯一 URL。

我想运行 bash 脚本或命令。

myfile.log

/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13

期望的结果

Unique Urls: 4

【问题讨论】:

    标签: bash shell unix unique


    【解决方案1】:
    cut -d "|" -f 2 file | cut -d "," -f 1  | sort -u | wc -l
    

    输出:

    4

    见:man cutman sort

    【讨论】:

    • [root]# time cut -d "|" -f 2 myfile.log |剪切 -d "," -f 1 |排序-u | wc -l 2010 real 0m0.242s user 0m0.258s sys 0m0.019s [root]# time echo 唯一网址:$(sed 's/^.*|([^,]*),.*$/\1/ ' myfile.log | uniq | wc -l) 唯一 URL:2010 真实 0m3.676s 用户 0m4.136s sys 0m0.070s [root]# time awk '{sub(/^[^|]*\|/,"" );gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' myfile.log 2010 真实 0m0.943s 用户 0m0.934s 系统 0m0 .006s [root]# 时间 tr , '|'
    • 有点乱,但如果您知道如何使评论更易于阅读,您的回答是最快的,请帮忙。
    【解决方案2】:

    awk 解决方案是

    awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
    4
    

    如果你碰巧使用GNU awk,那么下面也会给你同样的结果

    awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
    4
    

    或者甚至像this cracker 评论中指出的那样简短,@cyrus

    awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
    4
    

    它使用 awk 多字段分隔符功能和更惯用的 awk。


    注意:请参阅[ awk manual ]了解更多信息。

    【讨论】:

    • 或者根据您的想法:awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
    • @Cyrus 赞赏这个建议。已附加。
    【解决方案3】:
    1. 使用sed 解析,并且由于 file 似乎已经排序, (相对于 URL),只需运行 uniq 并计算它:

      echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
      
    2. 使用 GNU grep 提取 URL:

      echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
      

    输出(任一方法):

    Unique URLs: 4
    

    【讨论】:

      【解决方案4】:
      tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
      
      • tr , '|' &lt; myfile.log 将所有逗号转换为管道字符
      • sort -u -t '|' -k 2,2 排序唯一 (-u),管道分隔 (-t '|'),仅在第二个字段中 (-k 2,2)
      • wc -l 计算唯一行数

      【讨论】:

        猜你喜欢
        • 2020-06-11
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多