How to cross-link ("wikify") lines in different files with sed?

【Question title】: How to cross-link ("wikify") lines in different files with sed?
【Posted】: 2014-08-12 20:43:44
【Question】:

I have files containing one-line notes, some of which include links to other notes in the form >filename_without_extension:line_nr:

m01.txt:
Line 1. >m02:2
Line 2. >m02:3
Line 3.

m02.txt:
Line 1.
Line 2. >m01:3
Line 3. >m01:1 >m01:3

I would like to add wiki-like automatic "backlinks" to every linked-to line that does not have one yet. So the desired output would be:

m01.txt:
Line 1. >m02:2 >m02:3
Line 2. >m02:3
Line 3. >m02:3 >m02:2

m02.txt:
Line 1.
Line 2. >m01:3 >m01:1
Line 3. >m01:1 >m01:3 >m01:2

I came up with something pretty awful with sed that doesn't work. It is supposed to go through all the files in my notes directory:

link_regex=$(sed -e '/(\>m[0-9]+\:[0-9]+?)+?/p')
linenr_from_link_regex=$(sed -e '/\>m[0-9]+?\:/d')
fname_from_cur_link=$(sed -e '/\:[0-9]+?\b/d;/\.txt/a')
link_from_f=$(sed -e '/^/\>/g;/\.txt$/d;/\:=/a' < "$f")
new_link_to_cur_f=$(sed -i "${linenr_fom_cur_link}a\ ${link_from_f}" ${fname_from_cur_link})

function create-cross-references () {
    while read line; do
        echo "$link_regex" | \          # look up links 
        echo "$linenr_from_link_regex"      # pipe to get line number from current link 
        echo "$fname_from_cur_link"         # turn current link to new file name
        echo "$link_from_f"                 # turn current file name name to new link
        echo "$new_link_to_cur_f"           # add new link to current fname
    done
}

for f in *.txt; do
    create-cross-references
done

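Where did I go wrong? Also, what would be a more sensible solution (preferably still using sed) that avoids going through every line in my notes folder (including lines without links) each time? Thanks for your help!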

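【Comments】:

Could you post the desired output? I mean m01 and m02 after processing. If an item in m01 is linked from 7 files, how would you handle that? How do you decide which file to link "back" to?
I've added the output. I guess "backlinks" isn't quite the right word here; they are mutually cross-linked items. If item 1 in file 1 has a link to item 2 in file 2, then item 2 in file 2 should also get a link to item 1 in file 1. If an item has 7 links, they are simply listed one after another after the item (line). Thanks for thinking along!
I don't think you can store commands in variables the way you do in your example. var=$(command) saves the output of command to var.
@whereswalden Yes, I wasn't sure about that either. It didn't work, so you're probably right. :)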
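
To make the point about var=$(command) concrete, here is a minimal sketch. The sed expression and the function name link_target are made up for illustration and are not taken from the question; the only claim is that command substitution stores output, while a function stores something you can run again later.

```shell
#!/bin/bash
# var=$(command) runs the command right away and keeps its OUTPUT.
# (With nothing piped in, sed would just sit waiting on stdin,
# so this demo feeds it a sample line explicitly.)
captured=$(printf 'note. >m02:3\n' | sed -E 's/.*>(m[0-9]+):([0-9]+).*/\1.txt \2/')
echo "$captured"    # plain text produced by sed, not a reusable command

# To reuse a command later, wrap it in a function instead.
# link_target is a hypothetical helper name.
link_target() {
    sed -E 's/.*>(m[0-9]+):([0-9]+).*/\1.txt \2/'
}
printf 'other note. >m01:1\n' | link_target
```

So in the question's script, link_regex and friends would each hold whatever sed printed at assignment time (or the script would hang waiting for input), rather than a step that can be applied to each line.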

Tags: regex bash replace sed

【Solution 1】:

You could try something like this:

#!/bin/bash

function getlinks() {
    # $1 must be something like >m01:1
    grep "$1" *.txt | sed -e 's/\(.*\)\.\(.*\):Line \([0-9]\+\)..*/>\1:\3 /' | \
    # all matches in one single line
    tr -d '\n'
}
for fileName in *.txt;do
    echo "$fileName:"
    while read -r line;do
        #Line 1. whatever ==> 1
        lineNumber=$( echo "$line" | grep -Po '(?<=(Line )).*(?=\.)' )
        #m01.txt ==> >m01
        fileNameFormatted=$( echo "$fileName" | sed -e 's/\(.*\)\..*/>\1/'  )
        links=$( getlinks "$fileNameFormatted:$lineNumber" )
        echo "$line $links"
    done < "$fileName"
done

Output:

m01.txt:
Line 1. >m02:2 >m02:3 
Line 2. >m02:3 
Line 3. >m02:2 >m02:3 
m02.txt:
Line 1. 
Line 2. >m01:3 >m01:1 
Line 3. >m01:1 >m01:3 >m01:2
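
One caveat with getlinks: grep "$1" matches substrings, so once a file has ten or more lines, >m01:1 would also match >m01:10. Below is a hedged sketch of a stricter lookup. It is not part of the answer's script: backlinks_for is a hypothetical helper, it relies on grep's own line numbers instead of the "Line N." prefix, and it assumes file names of the form name.txt with no dots or colons in the name.

```shell
#!/bin/bash
# backlinks_for TARGET DIR
# Prints a link to every line in DIR/*.txt that mentions TARGET (e.g. ">m01:3").
backlinks_for() {
    local target=$1 dir=$2
    # -H always prints the file name, -n the line number,
    # -w rejects partial matches such as >m01:30,
    # -F treats the target as a literal string rather than a regex.
    ( cd "$dir" && grep -HnwF -- "$target" *.txt ) |
    # m02.txt:2:Line 2. >m01:3  ==>  >m02:2
    sed -E 's/^([^.:]+)\.txt:([0-9]+):.*/>\1:\2/' |
    # join the results into one space-separated line
    paste -sd ' ' -
}

# Try it on throwaway copies of the sample notes from the question.
demo=$(mktemp -d)
printf '%s\n' 'Line 1. >m02:2' 'Line 2. >m02:3' 'Line 3.' > "$demo/m01.txt"
printf '%s\n' 'Line 1.' 'Line 2. >m01:3' 'Line 3. >m01:1 >m01:3' > "$demo/m02.txt"
backlinks_for '>m01:3' "$demo"
rm -r "$demo"
```

Working on copies in a temporary directory keeps the real notes untouched while experimenting.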

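Edit: following @marttt's comment,

[...] Could you remove the Line 1. prefix from the regex? The lines actually just contain random text + links (like Blablalbla. >m01:1; that was a misleading example of mine). Also, how do I echo the changes to the real files?

I made some changes to the original script.

Line numbers are no longer present in the text files, so a variable is needed to track them ($lineNumber).
If the script is run more than once, the cross-links would be duplicated, so that has to be avoided.
The result must be stored in the same file.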

#!/bin/bash


for fileName in *.txt;do
    # The "Line 1" prefix is no longer present, so we keep our own count of processed lines
    let lineNumber=1
    while IFS= read -r line; do
        # transform m01.txt into >m01
        fileNameFormatted=$( echo "$fileName" | sed -E 's/(.*)\..*/>\1/'  )
        links=$( \
        # search for occurrences of >filename:line ; grep -nHw returns something like
        # m02.txt:3:whatever. >m01:1 >m01:3
        # in this example,
        # we take the filename (m02) and the line number (3),
        # adding '>' and ':'. Result: >m02:3
        # -w keeps >m01:1 from also matching >m01:10; -H always prints the file name
        grep -nHw "$fileNameFormatted:$lineNumber" *.txt  | \
        sed -E 's/(.*)\.(.*):([0-9]+):(.*).(.*)/>\1:\3/' | \
        # replace new lines with spaces
        tr '\n' ' ')
        # skipping duplicates :
        links=$( \
        #merge existing line with links found
        echo "$line $links" | \
        #strip all before the dot
        sed -E 's/(.*)\.(.*)/\2/' | \
        # replace spaces with new line
        tr ' ' '\n' | \
        # remove duplicates: >m02:2 >m02:2 >m03:3
        # ==> >m02:2 >m03:3
        sort -u | \
        # replace newlines with spaces.
        tr '\n' ' ')
        # remove all before the last dot: 
        # Line 1. >m02:2 >m03:3 ==> Line 1
        # (quoted, so a * or ? in the note text is not expanded by the shell)
        line=$(printf '%s\n' "$line" | sed 's/\(.*\)\..*/\1/')
        #merge both strings and append them to a temporary file
        echo "$line.$links" >> "$fileName.tmp"
        let lineNumber++
    done < "$fileName"
    # replace the original file
    mv "$fileName.tmp" "$fileName"
done
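
The duplicate-removal step from the script above can also be tried in isolation. This is only a sketch under assumptions: dedupe_links is a hypothetical helper name, and it uses paste to rejoin the links instead of the script's second tr, which avoids the stray leading and trailing spaces.

```shell
#!/bin/bash
# dedupe_links reads whitespace-separated links on stdin and prints
# each distinct link once, sorted, on a single line.
dedupe_links() {
    tr -s ' ' '\n' |      # one link per line, squeezing repeated separators
    grep . |              # drop empty lines
    sort -u |             # sort and remove duplicates
    paste -sd ' ' -       # join back into one space-separated line
}

printf '%s\n' '>m02:3 >m02:2 >m02:3' | dedupe_links
```

Note that sort -u reorders the links, so the original order on the line is not preserved. That is also true of the script above.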

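【Comments】:

Thanks a lot! Could you remove the Line 1. prefix from the regex? The lines actually just contain random text + links (like Blablalbla. >m01:1; that was a misleading example of mine). Also, how do I echo the changes to the real files?
@marttt I'm not at my computer right now. Do you mind if I do it in a few hours? If you're in a hurry, it might take an hour.
Of course, I'm in no hurry at all, take your time! I tried to do it myself, but I'm a complete amateur and the regex is rather complicated, etc. Thanks again!
@marttt Changes done! I added more explanation so you can understand the code better.
Awesome, now it does what it needs to do and more (I thought I would have to come up with a separate function to remove duplicate links, but now you've done it for me :). Also, your comments are very valuable and helpful, so this is a great answer. Thanks a lot again!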