How to print the matching line, the 3 lines after it, and the matching URLs

【Question title】: How to print matching line, 3 lines after, and matching URL
【Posted】: 2019-10-14 03:52:36
【Question】:

I am trying to extract some text information from an SMTP mail, namely:

Date (e.g.: Wed, 9 Oct 2019 01:55:58 -0700 (PDT))
Sender (e.g.: from xxx.yyy.com (zzz:com. [111.222.333.444]))
URLs contained in the mail (e.g.: http://some.thing)

Here is a sample input:

SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML

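The sample is deliberately truncated, since the real headers are longer, but it is enough for the example.

I have already tried sed and awk and managed to get something done, but not what I want.

SED: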

sed -e 's/http/\nhttp/g' -n -e '/Received: from/{h;n;n;n;H;x;s/\n \+/;/;p}' a.txt

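The first expression puts each URL on its own line, but I did not manage to use that afterwards. In any case, the result is not in the right order.

AWK: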

BEGIN{
    RS = "\n";
    FS = "";
}
/Received: from/{
    from = $0;
    getline;
    getline;
    getline;
    date = $0
}
/"\"https?://[^\"]+"/
{
    FS="\"";
    print $0;
}
END{
    print date";"from;
};

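This works, except for the URLs: the regexp does not work in the script, although it does in a one-liner. I also tried to find a more elegant way to get the date by using the value of NR+3, but that did not work.

The result should be displayed in CSV format:

Date;Sender;URL1;URL2;...

I would prefer pure sed or pure awk: I think I could do it with a combination of grep, tail, sed and awk, but I want to learn, so I would rather use one of the two, or both :)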
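
One way to express the NR+3 idea from the question is to record the target line number when the header is seen and act once NR reaches it, instead of calling getline three times. The sketch below is only an illustration under the question's simplifying assumption (a single relevant "Received: from" header with the date exactly three lines below it); the function name received_date is made up for this example.

```shell
# Hypothetical helper: print "date;from" for the first "Received: from" header,
# assuming the date sits exactly 3 lines below that header.
received_date() {
    awk '
        # remember the header and the line number where the date is expected
        /^Received: from/ && !target {
            from = $0
            target = NR + 3
        }
        # three lines later: strip the indentation and print the pair
        NR == target {
            sub(/^[ \t]+/, "")
            print $0 ";" from
            exit
        }
    ' "$@"
}
```

Called as `received_date a.txt`, it prints a single `date;from` line. If the date is not always three lines below the header, the fixed offset has to be replaced by a search for the line that ends the header.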

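【Question comments】:

What is the expected output for the short input you provided? I do not know which URLs should be picked. Also, it is better to parse HTML with an html/xml-aware tool than with sed.
It should be Wed, 9 Oct 2019 01:55:58 -0700 (PDT);Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X]);http://URL1;http://URL2
Can we assume there is _only_(!) one Received: from and that the date is always on its fourth line? (I ask because SMTP headers are much more complicated.) Well, you have http://URL1><img and extracted http://URL1 from it, so [^"]+ will not work. Or is a " missing from the input?
For the URLs I do not know either, but since they are inside HTML, a typical regular expression like this one should work with sed, for example: sed -rne 's#.+?(https?://[^"]+).*#\1#p'
Yes, my code is based on that assumption (a single "Received: from"). If I remember correctly, I found an example where the date was on the second line, but I am not sure... For simplicity, though, assume the 3 following lines.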
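
A note on the sed expression suggested in the comments above: sed regular expressions have no lazy quantifiers, so `.+?` does not stop at the first URL, and a single s command without a loop yields at most one URL per line. A minimal awk sketch that prints every double-quoted http(s) URL, including several on one line, could look like this. The name list_urls is hypothetical, and the assumption that URLs are always wrapped in double quotes is taken from the sample input.

```shell
# Hypothetical helper: print every double-quoted http(s) URL, one per line.
list_urls() {
    awk '
        {
            line = $0
            # match() sets RSTART and RLENGTH; consume the line one URL at a time
            while (match(line, /"https?:\/\/[^"]+"/)) {
                print substr(line, RSTART + 1, RLENGTH - 2)   # drop both quotes
                line = substr(line, RSTART + RLENGTH)         # keep the unparsed rest
            }
        }
    ' "$@"
}
```

Usage would be `list_urls a.txt`. The loop over match() plays the role that the `t again` loop plays in a sed solution.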

Tags: awk sed

【Solution 1】:

Well, here is a long sed script, with comments:

sed -nE '
/Received: from /{
    # hold my line!
    h

    # ach, spaghetti code, here we go again
    : notdate
    ${
        s/.*/ERROR: INVALID INPUT: DATE NOT FOUND/
        p
        q1
    }
    # the next line after the line ending with ; should be the date
    /;$/!{
        # so search for a line ending with ;
        n
        b notdate
    }
    # the next line is the date
    n
    # remove leading spaces
    s/^[[:space:]]*//
    # grab the Received: from line
    G
    # and save it for later
    h
}

# headers end with an empty line
/^$/{
    # loop over lines
    : read_next_line
    n

    # flag with \x01<URL>\x02 all occurrences of URLs
    s/"(http[^"]+)"/\x01\1\x02/g

    # we found at least one URL if there is \x01 in the pattern space
    /\x01/{

        # move each occurrence to the end of the pattern space, after a newline
        : again
        s/^([^\x01]*)\x01([^\x02]*)\x02(.*)$/\1\3\n\2/
            t again

        # remove everything in front of separator - the unparsed part of line
        s/^[^\n]*\n//
        # add URLs to hold space
        H
    }

    # if this is the last line, we should finally print something!, and, exit
    ${
        # grab the hold space
        x
        # replace the separator for a ;
        s/\n/;/g
        # print and exit successfully
        p
        q 0
    }

    # here we go again!
    b read_next_line
}

'

For the following input:

Delivered-To: SOME@ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
        Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
        Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME@ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
        by SOME.THIRD.URL.COM with ESMTP id SOMEID
        for <SOME@ADDRESS.COM>;
        Wed, 09 Oct 2019 01:55:58 -0700 (PDT)

SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
SOMEHTML src="http://URL3" SOMEHTML src="http://URL4"

Output:

Wed, 09 Oct 2019 01:55:58 -0700 (PDT);Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X]);http://URL1;http://URL2;http://URL3;http://URL4

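【Discussion】:

Two more questions: if the same URL appears several times, is it possible not to repeat it? I also tried putting your commands into a script file, but it returns an error: sed -f test.sed a.txt sed: file test.sed line 36: invalid reference \1 on `s' command's RHS. Do you have any idea? Anyway, thank you for your script and your explanations!
to not duplicate these ones - that is a very hard job for sed, and this script is already complex and already uses the hold space. Rather than doing it inside one sed, just read the sed output line by line in bash, run it through sort -u and output the result again. invalid reference \1 - I wrote the sed script with extended regular expressions, so you have to pass the -E, -r or --regexp-extended flag to sed. In any case you need GNU sed (or at least a reasonably capable sed) to run this script, because I use \n, \x01 and \x02, so it may not work with every sed implementation.
to not duplicate these ones: OK, I will try. Otherwise, I think I can handle it with awk... invalid reference \1: you are right, that works :) Thanks again!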
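
For the awk route mentioned in the last comment, the following is a rough sketch, not the answerer's method. It mirrors the logic of the sed script (the date is on the line after the header line ending with ;, and URLs are taken from the body after the first empty line) and skips URLs already seen by using an associative array. The function name mail_to_csv and the variable names are invented for this illustration; it assumes double-quoted URLs and a single relevant "Received: from" header.

```shell
# Hypothetical helper: awk counterpart of the sed script, with duplicate URLs dropped.
mail_to_csv() {
    awk '
        # first "Received: from" header: remember it and start looking for its date
        /^Received: from / && from == "" { from = $0; in_received = 1 }

        # the header ends at the line finishing with ";" and the date is on the next line
        in_received && /;[ \t]*$/ { in_received = 0; date_next = 1; next }
        date_next { date = $0; sub(/^[ \t]+/, "", date); date_next = 0; next }

        # headers end with an empty line; everything after it is the body
        /^$/ { body = 1; next }

        # collect each double-quoted URL once, in order of first appearance
        body {
            line = $0
            while (match(line, /"https?:\/\/[^"]+"/)) {
                url = substr(line, RSTART + 1, RLENGTH - 2)
                if (!(url in seen)) { seen[url] = 1; urls = urls ";" url }
                line = substr(line, RSTART + RLENGTH)
            }
        }

        END { print date ";" from urls }
    ' "$@"
}
```

Run as `mail_to_csv a.txt`. On the sample input of the answer it should print the same line as the sed script, and a URL that appears more than once is listed only the first time.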