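I. Custom InputFormat/OutputFormat

  1. Requirements

  Some raw logs need to be parsed and enriched. The workflow is:

    1. Read the data from the raw log files.

    2. Use a URL field in each log record to look up information in an external knowledge base, and append that information to the raw record.

    3. If enrichment succeeds, write the record to the enriched-results directory; if it fails, extract the URL field from the raw record and write it to the to-crawl list directory.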

1374609560.11    1374609560.16    1374609560.16    1374609560.16    110    5    8615038208365    460023383869133    8696420056841778    2    460    0    14615            54941    10.188.77.252    61.145.116.27    35020    80    6    cmnet    1    221.177.218.34    221.177.217.161    221.177.218.34    221.177.217.167    ad.veegao.com    http://ad.veegao.com/veegao/iris.action        Apache-HttpClient/UNAVAILABLE (java 1.4)    POST    200    593    310    4    3    0    0    4    3    0    0    0    0    http://ad.veegao.com/veegao/iris.action    5903903079251243019    5903903103500771339    5980728
1374609558.91    1374609558.97    1374609558.97    1374609559.31    112    461    8615038208365    460023383869133    8696420056841778    2    460    0    14615            54941    10.188.77.252    101.226.76.175    37293    80    6    cmnet    1    221.177.218.34    221.177.217.161    221.177.218.34    221.177.217.167    short.weixin.qq.com    http://short.weixin.qq.com/cgi-bin/micromsg-bin/getcdndns        Android QQMail HTTP Client    POST    200    543    563    2    3    0    0    2    3    0    0    0    0    http://short.weixin.qq.com/cgi-bin/micromsg-bin/getcdndns    5903903079251243019    5903903097240039435    5980728
1374609514.70    1374609514.75    1374609514.75    1374609515.58    110    5    8613674976196    460004901700207    8623350100353878    2    460    0    14694            58793    10.184.80.32    111.13.13.222    36181    80    6    cmnet    1    221.177.156.4    221.177.217.145    221.177.156.4    221.177.217.156    retype.wenku.bdimg.com    http://retype.wenku.bdimg.com/img/97308d2b7375a417866f8f09        AMB_400    GET    200    345    4183    5    5    0    0    5    5    0    0    0    0    http://retype.wenku.bdimg.com/img/97308d2b7375a417866f8f09    5903900710696611851    5903902908140003339    5937307
1374609511.98    1374609512.02    1374609512.02    1374609512.48    110    362    8613674976196    460004901700207    8623350100353878    2    460    0    14694            58793    10.184.80.32    120.204.207.160    33548    80    6    cmnet    1    221.177.156.4    221.177.217.145    221.177.156.4    221.177.217.156    t4.qpic.cn    http://t4.qpic.cn/mblogpic/217cf24d43f1f19255e2/120        AMB_400    GET    200    346    3184    4    4    0    0    4    4    0    0    0    0    http://t4.qpic.cn/mblogpic/217cf24d43f1f19255e2/120    5903900710696611851    5903902896317288459    5937307
1374609518.14    1374609518.24    1374609518.24    1374609518.72    110    362    8613674976196    460004901700207    8623350100353878    2    460    0    14694            58793    10.184.80.32    120.204.207.160    33548    80    6    cmnet    1    221.177.156.4    221.177.217.145    221.177.156.4    221.177.217.156    t4.qpic.cn    http://t4.qpic.cn/mblogpic/96e02ad781c9be6f5ad2/120        AMB_400    GET    200    346    3328    4    4    0    0    4    4    0    0    0    0    http://t4.qpic.cn/mblogpic/96e02ad781c9be6f5ad2/120    5903900710696611851    5903902896317288459    5937307
Sample log records
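
The three-step workflow above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class `LogEnhancer`, its methods, and the in-memory `Map` standing in for the external knowledge base are all hypothetical names, and the sketch assumes that fields are tab-separated and that the URL field is the first field starting with `http://` or `https://`. In a real MapReduce job the routing decision would more likely sit in the mapper together with a custom OutputFormat, which this section's title points to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class LogEnhancer {

    /** Which output directory a record should go to. */
    enum Target { ENHANCED, TO_CRAWL }

    /** A routing decision: the target directory plus the line to write there. */
    static final class Routed {
        final Target target;
        final String line;

        Routed(Target target, String line) {
            this.target = target;
            this.line = line;
        }
    }

    /**
     * Returns the first field that looks like an HTTP(S) URL.
     * Assumption: fields are tab-separated; the real field index is not given in the text.
     */
    static Optional<String> extractUrl(String rawLine) {
        for (String field : rawLine.split("\t")) {
            if (field.startsWith("http://") || field.startsWith("https://")) {
                return Optional.of(field);
            }
        }
        return Optional.empty();
    }

    /**
     * Step 2 and 3 of the workflow: look the URL up in the knowledge base and
     * decide where the record goes. Lines without any URL yield an empty result.
     */
    static Optional<Routed> route(String rawLine, Map<String, String> knowledgeBase) {
        Optional<String> url = extractUrl(rawLine);
        if (!url.isPresent()) {
            return Optional.empty(); // malformed record: nothing to enrich or crawl
        }
        String info = knowledgeBase.get(url.get());
        if (info != null) {
            // Enrichment succeeded: append the looked-up info to the raw record.
            return Optional.of(new Routed(Target.ENHANCED, rawLine + "\t" + info));
        }
        // Enrichment failed: only the URL goes to the to-crawl list.
        return Optional.of(new Routed(Target.TO_CRAWL, url.get()));
    }

    public static void main(String[] args) {
        // Hypothetical knowledge base content, for demonstration only.
        Map<String, String> knowledgeBase = new HashMap<>();
        knowledgeBase.put("http://ad.veegao.com/veegao/iris.action", "advertising");

        String[] lines = {
            "1374609560.11\tad.veegao.com\thttp://ad.veegao.com/veegao/iris.action\tPOST\t200",
            "1374609511.98\tt4.qpic.cn\thttp://t4.qpic.cn/mblogpic/217cf24d43f1f19255e2/120\tGET\t200"
        };
        for (String line : lines) {
            route(line, knowledgeBase)
                .ifPresent(r -> System.out.println(r.target + " <- " + r.line));
        }
    }
}
```

Keeping the routing decision in a pure function like `route` makes it easy to unit-test separately from the job wiring; the job itself would then only need to map `Target.ENHANCED` and `Target.TO_CRAWL` to the two output directories.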
