How to count occurrences of a string in a CSV file: answers

【Question title】: How to count occurrences of a string in a CSV file
【Posted】: 2017-03-02 23:52:12
【Question】:

I have a CSV file:

author,host,authority,contents
_angelsuman,http://twitter.com/_angelsuman,5,green tea piyo :( #kicktraileron6thjune
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: grapefruit  watermelon  berries  hot peppers  celery  greek yogurt  eggs  fish  green tea  coffee  water  oatmeal.
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: â´ grapefruit â´ watermelon â´ berries â´ hot peppers â´ celery â´ greek yogurt â´ eggs â´ fish â´ green tea â´ oatmeal
anukshp,http://twitter.com/anukshp,4,rt @_angelsuman dear green tea u suck..:/ but i need to sip uh for myh rsn :( zindagi ka kdwa such :/ :(

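I want to count how many times the values of the first column, author, occur in the fourth column, contents.

For example: find "_angelsuman" in contents.

Please advise: how can I do this?

【Comments】:

Can contents contain commas?
For now we can assume it cannot; contents contains no commas.
I don't see any "author" in the fourth field. Are you sure your question is worded correctly? Are you looking for any first-column value in the fourth column?
Yes; I am looking for the first-column values in the fourth column (anywhere in the string).

Tags: csv unix text awk sed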

【Solution 1】:

Using Perl:

use strict;
use warnings;
use Text::CSV;

my $col = 4; # 4th column

my $count = 0;
my @rows;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<:encoding(utf8)", "/tmp/test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    if ($row->[$col -1] eq 'author') {
        $count++;
    }
}
$csv->eof or $csv->error_diag();
close $fh;
print "There's $count occurrences of 'author'\n";

Output:

There's 1 occurrences of 'author'

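Notes:

This parses the CSV properly, using a Perl module.

Replace /tmp/test.csv with your own file.

【Discussion】:

I am trying to find occurrences of the first column's values in the fourth column.

【Solution 2】:

You can do it as follows (assuming, as you said, that the values contain no commas).

One-liner: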

awk -F, 'NR>1 {author[$1]=0; content[NR]=$4} END {for (a in author) {for (c in content) {count[a]+=gsub(a,"",content[c])} print a, count[a]}}' file

Expanded:

awk -F, '
  NR>1 { author[$1]=0; content[NR]=$4 }
  END {
    for (a in author) {
      for (c in content) { count[a] += gsub(a,"",content[c]) }
      print a, count[a]
    }
  }' file

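How it works

Read the file using the comma delimiter -F, and skip the first line with NR>1

awk -F, 'NR>1
Store the first column as keys of the array author, so that each unique value is stored once. Store the contents in the array content, keyed by the line number NR, so that the contents of every line are kept.
{ author[$1]=0; content[NR]=$4 }
Finally, iterate over each unique author with for (a in author) and, for each author, iterate over the contents with for (c in content), adding the number of times the author occurs in that content: count[a]+=gsub(a,"",content[c]). Once the counting for a given author is done, print the result with print a, count[a].
END { for (a in author) { for (c in content) { count[a]+=gsub(a,"",content[c]) } print a, count[a] } }' file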

Output

_angelsuman 1
anukshp 0
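
One caveat with the gsub approach: gsub treats its first argument as a regular expression and removes every match from the stored contents, so a name containing a character such as "." may match more than intended, and later authors are counted against already edited text. The following is a minimal sketch of a variant that counts plain substrings with index() and leaves the contents untouched. The names count_authors, seen and order are illustrative and do not come from the thread; it still assumes that contents holds no commas.

```shell
# count_authors FILE: for every distinct value in column 1, print how many
# times it occurs as a literal substring anywhere in column 4.
# Assumption: no field contains a comma, as stated in the question.
count_authors() {
    awk -F, '
    NR > 1 {
        # remember each author once, in order of first appearance
        if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
        # keep every contents field
        content[++m] = $4
    }
    END {
        for (i = 1; i <= n; i++) {
            a = order[i]; total = 0
            for (j = 1; j <= m; j++) {
                s = content[j]
                # index() is a plain substring search, so regex characters in a name are harmless
                while (length(a) > 0 && (p = index(s, a)) > 0) {
                    total++
                    s = substr(s, p + length(a))   # continue after this match
                }
            }
            print a, total
        }
    }' "$1"
}
```

Called as count_authors file on the sample data from the question, this should print the same two lines as the output above, with the authors listed in the order they first appear in the file rather than in awk's unspecified array order.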

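【Discussion】:

For _angelsuman a count of 1 is expected, because it occurs in row 4. A column 1 value can appear anywhere in column 4.
OK, got it. I have improved my answer to meet these requirements, please check it.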
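
Both the question's comments and Solution 2 assume that contents holds no commas. If that assumption ever fails, $4 would only hold the text up to the first comma inside contents. Because contents is the last column, one possible workaround is to strip the first three fields from the raw line and keep whatever remains. The sketch below shows only that step; the function name contents_field is hypothetical, and it assumes the first three columns never contain commas or quotes.

```shell
# contents_field FILE: print the contents column of every data row, commas included.
# Assumption (not from the thread): the first three columns never contain commas or quotes.
contents_field() {
    awk 'NR > 1 {
        rest = $0
        # strip the first three fields together with the comma that ends each of them
        for (k = 1; k <= 3; k++) rest = substr(rest, index(rest, ",") + 1)
        print rest
    }' "$1"
}
```

Inside a counting program, the same three-step strip could take the place of $4 when the contents are stored. For CSV files with quoted fields, a real CSV parser such as the Perl module used in Solution 1 remains the safer choice.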