Grep a large number of patterns from a huge log file: answers

【Question title】: Grep large number of patterns from a huge log file
【Posted】: 2017-08-22 15:30:00
【Question description】:

I have a shell script, invoked hourly by a cron job, that searches the Asterisk logs and gives me the unique IDs of the calls that ended with cause 31.

while read ref
do
cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> cause_temp.log
done < callref.log

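The problem is that the while loop is too slow, and for accuracy I have included 4 while loops like the one above to perform various checks.

The callref.log file consists of call identifier values, roughly 50,000 to 90,000 of them each hour, and the script takes about 45-50 minutes to finish and email me the report.

It would be very helpful if I could cut down the execution time of the loops. Since sample.log is about 20 GB and every loop iteration opens the file and searches it, I think the while loop is the bottleneck here.

I have done some research and found a few useful links, such as Link 1 and Link 2.

But I could not implement the suggested solutions, or did not know how to. Any suggestions would be helpful. Thanks.

Since sample.log contains sensitive information I cannot share any of the logs, but below are some sample logs that I got from the internet.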

Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"503"<sip:503@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:03:13 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"502"<sip:502@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"1737245082"<sip:1737245082@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"100"<sip:100@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Jun 27 18:09:47 host asterisk[31774]: ERROR[27910]: chan_zap.c:10314 setup_zap: Unable to register channel '1-2'
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:414 __load_resource: chan_zap.so: load_module failed, returning -1
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:554 load_modules: Loading module chan_zap.so failed!

The callref.log file consists of a list of lines that look like this:

C-001ec22d
C-001ec23d
C-001ec24d
C-001ec31d
C-001ec80d

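The desired output of the while loop above looks like C-001ec80d

Also, my main concern is making the while loop run faster, for example by loading all the values of callref.log into an array and, if possible, searching for all of them simultaneously in a single pass over sample.log.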

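【Comments】:

It might be worth looking into grep's -F flag, which may improve the performance of the first two greps since you are using fixed strings (but don't use it on the last one). There are some good tips here that should help.
Do you mean you can't use awk?
How about posting some of sample.log and callref.log along with the expected output? I'm sure we could help you then.
@hnefatl - It is not the query that takes the time, it is the while loop.
@Maurice Perry - I can use awk, but I'm not familiar with it. Could you suggest something with an example?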
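
The -F suggestion can be combined with grep's -f option, which reads the patterns from a file, so that sample.log is scanned once instead of once per call reference. What follows is a minimal sketch rather than a tested fix for the asker's data. It assumes GNU grep and that every line of callref.log is a literal identifier. The three log lines are made up for illustration and only mimic the "C-xxxxxxxx ... got hangup request, cause 31" shape implied by the question's grep commands.

```shell
# Work in a scratch directory so no real logs are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up inputs, for illustration only.
printf '%s\n' C-001ec22d C-001ec80d > callref.log
cat > sample.log <<'EOF'
Dec 16 18:02:04 host asterisk[31774]: VERBOSE[31787]: C-001ec22d got hangup request, cause 16
Dec 16 18:02:05 host asterisk[31774]: VERBOSE[31787]: C-001ec80d got hangup request, cause 31
Dec 16 18:02:06 host asterisk[31774]: VERBOSE[31787]: C-001ec99d got hangup request, cause 31
EOF

# The first grep keeps only the cause-31 lines. The second grep prints (-o)
# just the matching identifier, treating every line of callref.log (-f) as a
# fixed string (-F) rather than a regular expression.
result=$(grep -F 'got hangup request, cause 31' sample.log | grep -o -F -f callref.log)
echo "$result"
```

Because -F matches substrings, this relies on all identifiers having the same length, as they do in the callref.log excerpt in the question.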

Tags: shell

【Solution 1】:

Since you couldn't produce adequate sample logs to test with even when asked, I put together some test material myself:

$ cat callref.log
a
b
$ cat sample.log
a 1
b 2
c 1

Using awk:

$ awk 'NR==FNR {             # hash callrefs
    a[$1]
    next
}
{                            # check callrefs from sample records and output when match
    for(l in a)
        if($0 ~ l && $0 ~ 1) # 1 is the static string you look for along a callref
            print l
}' callref.log sample.log
a

HTH
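
The awk program above tests every stored callref against every log line. A possible variant, not part of the original answer, extracts the identifier from each line and checks it with a single hash lookup, so the cost per line no longer depends on the size of callref.log. This is a sketch that assumes each relevant line carries one C-xxxxxxxx token and the literal text "got hangup request, cause 31". The input lines are invented for illustration.

```shell
# Work in a scratch directory so no real logs are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented inputs, for illustration only.
printf '%s\n' C-001ec23d C-001ec80d > callref.log
cat > sample.log <<'EOF'
Dec 16 18:02:04 host asterisk[31774]: VERBOSE[31787]: C-001ec23d got hangup request, cause 16
Dec 16 18:02:05 host asterisk[31774]: VERBOSE[31787]: C-001ec80d got hangup request, cause 31
Dec 16 18:02:06 host asterisk[31774]: VERBOSE[31787]: C-00aaaaaa got hangup request, cause 31
EOF

result=$(awk '
NR == FNR { refs[$1]; next }       # first file: remember every call reference
/got hangup request, cause 31/ {   # second file: look only at cause-31 lines
    # Pull out the first C-xxxxxxxx token and test it with one hash lookup.
    if (match($0, /C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]/)) {
        id = substr($0, RSTART, RLENGTH)
        if (id in refs) print id
    }
}' callref.log sample.log)
echo "$result"
```

One caveat: the NR == FNR idiom misbehaves if callref.log is empty, because the condition then also holds for sample.log.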

【Comments】:

【Solution 2】:

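I spent a day building a test framework and testing variations of different commands, and I think you already have the fastest one.

This leads me to think that if you want better performance you should look into a log-digesting framework such as ossec (where your log samples came from) or perhaps splunk. These may be too clunky for what you want. Alternatively, you should consider designing and building something better suited to parsing in java/C/perl/awk.

Running your existing script more frequently would also help.

Good luck! If you like, I can package up the work I did and post it here, but I think that would be overkill.

As requested; CalFuncs.sh: a library I use in most of my scripts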

#!/bin/bash

LOGDIR="/tmp"
LOG=$LOGDIR/CalFunc.log
[ ! -d "$LOGDIR" ] && mkdir -p $(dirname $LOG)

SSH_OPTIONS="-o StrictHostKeyChecking=no -q -o ConnectTimeout=15"
SSH="ssh $SSH_OPTIONS -T"
SCP="scp $SSH_OPTIONS"
SI=$(basename $0)

Log() {
    echo "`date` [$SI] $@" >> $LOG
}

Run() {
    Log "Running '$@' in '`pwd`'"
  $@ 2>&1 | tee -a $LOG
}

RunHide() {
    Log "Running '$@' in '`pwd`'"
    $@ >> $LOG 2>&1
}

PrintAndLog() {
    Log "$@"
    echo "$@"
}

ErrorAndLog() {
    Log "[ERROR] $@ "
    echo "$@" >&2
}

# Despite its name, this returns whole seconds since the epoch.
showMilliseconds(){
  date +%s
}

runMethodForDuration(){
  local startT=$(showMilliseconds)
  $1
  local endT=$(showMilliseconds)
  local totalT=$((endT-startT))
  PrintAndLog "that took $totalT seconds to run $1"
  echo $totalT
}

genCallRefLog.sh - generates a fictitious callref.log whose size is set by the argument

#!/bin/bash
#Script to make N sequential lines of callref.log (N is the first argument); this should suffice for a POC
if [ -z "$1" ] ; then
  echo "genCallRefLog.sh requires an integer of the number of lines to pump out of callref.log"
  exit 1
fi
file="callref.log"
[ -f "$file" ] && rm -f "$file"  # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of callref.log"
for ((  a=i ;  a < j;  a++  ))
do
  printf 'C-%08x\n' "$a" >> $file
done

genSampleLog.sh - generates a fictitious sample.log whose size is set by the argument

#!/bin/bash
#Script to make N sequential lines of sample.log (N is the first argument); this should suffice for a POC
if [ -z "$1" ] ; then
  echo "genSampleLog.sh requires an integer of the number of lines to pump out of sample.log"
  exit 1
fi
file="sample.log"
[ -f "$file" ] && rm -f "$file"  # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of sample.log"
for ((  a=i ;  a < j;  a++  ))
do
  printf 'Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: C-%08x got hangup request, cause 31\n' "$a" >> $file
done

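Finally, here is the actual test script I used. Normally I comment out the build scripts, since they only need to run when the log sizes change. I usually run only one test function at a time and record the result.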

test.sh

#!/bin/bash
source "./CalFuncs.sh"

targetLogFile="cause_temp.log"
Log "Starting"

checkTargetFileSize(){
  expectedS="$1"
  hasS=$(cat $targetLogFile | wc -l)
  if [ "$expectedS" != "$hasS" ] ; then
    ErrorAndLog "Got $hasS but expected $expectedS, when inspecting $targetLogFile"
    exit 244
  fi
}

standard(){
  iter=0
  while read ref
  do
    cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> $targetLogFile
  done < callref.log
}

subStandardVarient(){
  iter=0
  while read ref
  do
    cat sample.log | grep 'got hangup request, cause 31' | grep -o "$ref"  >> $targetLogFile
  done < callref.log
}

newFunction(){
  grep -f callref.log sample.log | grep 'got hangup request, cause 31'  >> $targetLogFile
}

newFunction4(){
  grep 'got hangup request, cause 31' sample.log | grep -of 'callref.log'>> $targetLogFile
}

newFunction5(){
  #splitting grep
  grep 'got hangup request, cause 31' sample.log > /tmp/somefile
  grep -of 'callref.log' /tmp/somefile >> $targetLogFile
}

newFunction2(){
  iter=0

  while read ref
  do
    ((iter++))
    echo "$ref" | grep 'got hangup request, cause 31' | grep -of 'callref.log' >> $targetLogFile
  done < sample.log
}

newFunction3(){
  iter=0
  pat=""
  while read ref
  do
    if [[ "$pat." != "." ]] ; then
      pat="$pat|"
    fi
    pat="$pat$ref"
  done < callref.log
  # Log "Have pattern $pat"
  while read ref
  do
    ((iter++))
    echo "$ref" | grep 'got hangup request, cause 31' | grep -oP "$pat" >> $targetLogFile
  done < sample.log
  #grep: regular expression is too large
}

[ -f "$targetLogFile" ] && rm -f "$targetLogFile"

numLines="100000"
Log "testing algorithms with $numLines in each log file."

setupCallRef(){
  ./genCallRefLog.sh $numLines
}

setupSampleLog(){
  ./genSampleLog.sh $numLines
}

setupCallRef
setupSampleLog

runMethodForDuration standard > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration subStandardVarient > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction > /dev/null
checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction2 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction3 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction4 > /dev/null
# checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction5 > /dev/null
checkTargetFileSize "$numLines"

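The above showed that the existing approach was always faster than anything I came up with. I think someone took care to optimize it.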
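
The harness above calls grep -f without -F, so every line of callref.log is treated as a regular expression. Whether that affects the timings is not something this answer measured. The sketch below is one way to check it on small generated data in the same format as genCallRefLog.sh and genSampleLog.sh. The function names perRefLoop and singlePass are made up for this example, and no claim is made about which one wins on a 20 GB log.

```shell
#!/bin/bash
# Work in a scratch directory so no real logs are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

n=200   # small on purpose; raise it to see how the two variants scale
for ((a = 0; a < n; a++)); do
  printf 'C-%08x\n' "$a"
done > callref.log
for ((a = 0; a < n; a++)); do
  printf 'Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: C-%08x got hangup request, cause 31\n' "$a"
done > sample.log

# One scan of sample.log per call reference, as in the question.
perRefLoop() {
  while read -r ref; do
    grep -F "$ref" sample.log | grep -F 'got hangup request, cause 31' | grep -o -F "$ref"
  done < callref.log
}

# One scan of sample.log in total, with all references as fixed strings.
singlePass() {
  grep -F 'got hangup request, cause 31' sample.log | grep -o -F -f callref.log
}

# Time each variant in whole seconds, the same resolution as CalFuncs.sh.
start=$(date +%s); perRefLoop > loop.out;   loopSecs=$(( $(date +%s) - start ))
start=$(date +%s); singlePass > single.out; singleSecs=$(( $(date +%s) - start ))

loopLines=$(wc -l < loop.out)
singleLines=$(wc -l < single.out)
echo "per-ref loop: $loopLines lines in ${loopSecs}s; single pass: $singleLines lines in ${singleSecs}s"
```

Both variants should produce the same identifiers, so comparing loop.out with single.out is a quick sanity check before trusting either timing.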

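【Comments】:

Thanks for your suggestion, but I am already extracting the last hour of logs from the main log file, using sed to match on the timestamps, and then processing that. The callref.log file consists of call identifier values, roughly 50,000 to 90,000 of them each hour. Also, my main concern is making the while loop run faster, for example by loading all the values of callref.log into an array and searching for all of them simultaneously in a single pass over sample.log.
So your search is dynamic? E.g. the values in your callref.log change?
Yes, the values in callref.log change every hour.
Thanks for your efforts, it means a lot to me. Please share your findings, they may be valuable to me. I also tried changing the script from an hourly cron job to one every 30 minutes, but that is not a permanent solution. I am now trying to put all the logs into Elasticsearch and query them from there, which would eliminate the hourly cron jobs, and to use a Kibana dashboard to hopefully watch the activity in real time.
I didn't rerun anything, I just updated the answer with the scripts I used. I'm sending my resume to Elasticsearch. =) Thanks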