【Question title】:Filter text files from source code by filetype in Linux
【Posted】:2020-07-01 12:43:02
【Question】:

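I have a list of scraped raw files that contains both text and source code. The file types are listed below. I want to delete every file identified as C source, Python script, HTML, or empty, and keep only the ASCII and Unicode text files.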

file *
1dW6WJMN.txt:  Python script, ASCII text executable
9dJbZ3Vv.txt:  ASCII text, with CRLF line terminators
9dQsmVU4.txt:  Python script, UTF-8 Unicode text executable, with CRLF line terminators
A5hENB7D.txt:  C source, ASCII text, with CRLF line terminators
cidREdJG.txt:  UTF-8 Unicode text, with very long lines, with CRLF line terminators
exhjw1gK.txt:  UTF-8 Unicode text, with CRLF line terminators
iu7LPrqz.txt:  ASCII text, with very long lines, with CRLF line terminators
LsDHarjD.txt:  ASCII text
nLABt1a6.txt:  C source, ASCII text, with CRLF line terminators
nqMDtVuz.txt:  ASCII text, with CRLF line terminators
nqPuYb23.txt:  UTF-8 Unicode text, with CRLF line terminators
nQtzxhfQ.txt:  ASCII text, with CRLF line terminators
NQuLWwpt.txt:  ASCII text, with CRLF line terminators
nQXeJeED.txt:  ASCII text, with CRLF line terminators
nqXGv6ws.txt:  UTF-8 Unicode text, with CRLF line terminators
nQxr4Hwi.txt:  ASCII text, with CRLF line terminators
nQxr4Hwii.txt: empty
VQjrxevh.txt:  HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
yfDEfn4L.txt:  C source, ASCII text, with CRLF line terminators
yydAEDRn.txt:  HTML document, ASCII text, with very long lines, with CRLF line terminators

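I tried a simple grep for ASCII, but the descriptions of all the source code files also contain the term ASCII. Is there another way to filter out these source code files? Sometimes there are also PHP and JavaScript files that I want to get rid of. I am very new to Linux, so any help would be greatly appreciated. Thanks in advance.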

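【Comments】:

  • Short answer: grep alone cannot do this, because the file type has to be determined from the file contents. Once you have a tool that determines the file type from the contents, all you need is a script that loops over the files and their type descriptions and filters them.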

Tags: linux file text grep find


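【Solution 1】:

I am just going to expand on the comment under "Filter text files from source code by filetype in linux" about using grep. I would start by looking at the output of file for my own files, reduced to the distinct descriptions: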

find -type f -exec file {} \; | sed 's/^.*: //' | sort -u

If I grep that for text, I get something like this:

find -type f -exec file {} \; | sed 's/^.*: //' | sort -u | grep text

ASCII text
ASCII text, with no line terminators
ASCII text, with very long lines
ASCII text, with very long lines, with no line terminators
assembler source, ASCII text
a /usr/bin/env php script, ASCII text executable
Bourne-Again shell script, ASCII text executable
C source, ASCII text
C++ source, ASCII text
C++ source, ASCII text, with very long lines
C++ source, ISO-8859 text
C++ source, UTF-8 Unicode text
C++ source, UTF-8 Unicode text, with very long lines
exported SGML document, ASCII text
GNU gettext message catalogue, ASCII text
GNU gettext message catalogue, UTF-8 Unicode text
HTML document, ASCII text
HTML document, ASCII text, with very long lines
HTML document, UTF-8 Unicode text
HTML document, UTF-8 Unicode text, with very long lines
PHP script, ASCII text
PHP script, ASCII text executable
PHP script, ASCII text, with very long lines
PHP script, ISO-8859 text
PHP script, UTF-8 Unicode text
POSIX shell script, ASCII text executable
UTF-8 Unicode text
UTF-8 Unicode text, with very long lines
UTF-8 Unicode text, with very long lines, with no line terminators
XML 1.0 document, ASCII text

...so I would probably exclude anything whose description contains source, script, or document.
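As a rough sketch of that exclusion (not the answerer's exact command), the helper below reads file-style lines on stdin and prints only the names whose description mentions text but none of source, script, or document. The name keep_plain_text is hypothetical, and the sketch assumes file names contain no colon. The description strings can also differ between versions of file, so check them against your own output first.

```shell
#!/bin/sh
# keep_plain_text: hypothetical helper, not part of file or grep.
# Reads "name: description" lines (as printed by `file`) on stdin and
# prints the names whose description looks like plain text.
# Assumption: file names do not contain a colon.
keep_plain_text() {
    awk '{
        name = $0; sub(/:.*$/, "", name)            # everything before the first colon
        desc = $0; sub(/^[^:]*:[ \t]*/, "", desc)   # everything after it
        # Keep "text" descriptions that are not source, script or document.
        if (desc ~ /text/ && desc !~ /source|script|document/) print name
    }'
}
```

It could be used as file * | keep_plain_text. Matching on the description only, rather than on the whole line, avoids dropping a plain text file that merely has "script" in its name.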

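【Comments】:

  • Thanks, Tim. To exclude source, script, and document, do you think I also have to read the contents of each file and filter on that, or is a simple filter on the file type enough?
  • I think in your example you are looking for all the "ASCII text" files and probably not any of the "Unicode text" files, so something like: file * | grep "ASCII text" | grep -v -e "source," -e "script," -e "document,"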
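
Since the question asks to keep both ASCII and Unicode text, the pipeline in the last comment could be widened a little. The sketch below is one possible way to do that, not something from the thread: is_plain_text and list_unwanted are hypothetical names, and the patterns assume the description strings shown in the question. It only prints the files that would be removed, so the list can be reviewed before anything is deleted.

```shell
#!/bin/sh
# is_plain_text: hypothetical helper. Succeeds when a `file` description
# names ASCII or Unicode text and does not mention source, script or document.
is_plain_text() {
    case $1 in
        *source,*|*script,*|*document,*) return 1 ;;   # code and markup
        *'ASCII text'*|*'Unicode text'*) return 0 ;;   # plain text
        *) return 1 ;;                                 # empty, binary, anything else
    esac
}

# list_unwanted: hypothetical dry run. Prints every regular file in the
# directory given as $1 whose description is not plain text.
list_unwanted() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # file -b prints the description without the leading file name.
        is_plain_text "$(file -b "$f")" || printf '%s\n' "$f"
    done
}
```

Running list_unwanted . in the directory from the question should list the Python, C, HTML, and empty files, assuming file reports the same descriptions there. Only once that list looks right would it make sense to replace the printf with rm.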