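How to remove a file if it is not used by another file

【Question title】: How to remove a file if it is not used by another file
【Posted】: 2019-08-02 03:47:21
【Question description】:

I have to clean up a directory and its subdirectories by deleting all unused files. (A file is considered unused if it is not linked from any of the HTML files and has not been explicitly marked as being in use.) A file can be linked from an HTML file through either href or img src.

For example, I have I.html, 1.html, 2.html, and a folder named 1. I.html uses 1.html and the 1 directory in href attributes, but 2.html is not used by any other file. So how do I delete the unused 2.html file?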

use strict;
use warnings;
my($path,$regexExpression) = @ARGV;
my $fileNames = "data.txt";
my @abc= ();
if(not defined $path){
  die "File directory not given, please  try again \n"
}
print "added file ";  
if (not defined $regexExpression) {  
  $regexExpression="*";
  print "--Taking default Regular Expression. \n"
}
if (defined $regexExpression) {
  print "The regular Expression : $regexExpression \n";
  my $directorypathx= `pwd`;
  my ($listofFileNames) = findFilesinDir($path); 
  my ($listofLinks) = readallHrefInaFile();
  my ($listofImage) = readImageFile();
  print $listofLinks; 
 }
sub findFilesinDir{
  print "inside subroutines ", $path,"\n";
  my($pathName) = @_;
  my $fileNames =`find '$pathName' -name '$regexExpression' | sort -h -r > $fileNames ` ;
  if (-l $fileNames){
    return $fileNames;
  } 
 }
sub readallHrefInaFile{
  my $getAllLinks = ` grep -Eo "<a .*href=.*>" $path*.html | uniq ` ;
  push (@abc,$getAllLinks);
}

sub readImageFile{
  print "image files \n";
  my $getAllImage = ` grep -Eo "<img .*src=.*>" $path*.html | uniq `;
  push (@abc,$getAllImage);
}
print @abc;

I.html

<html>
  <head>
    <title>Index</title>
  </head>

  <body>
    <h1>Index</h1>

    <a href="1.html">1</a>

    <h1>Downloads</h1>

    <a href="downloads/s.zip">Compressed craters</a>

    <hr>
  </body>
</html>

1.html

<html>
  <head>
    <title>1</title>
  </head>

  <body>
    <h1>1</h1>

    <img src="images/1-1.gif" />
    <img src="images/1-2.gif" />


    <hr>
  </body>
</html>

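【Question discussion】:

You can improve your chances of getting a good answer if you provide a short but complete example (including any input data). You could also show the expected output.
In the output, the 2.html file will be moved to another folder, because that file is not linked from any other file.
Note that push returns the number of elements in the array, not a list of file names. For example, readImageFile does not return a list of file names.
@HåkonHægland It returns file names, but it also returns the 2.html file. I only want the file names I.html and 1.html, and the 1 directory.
@Jack What is the purpose of the -l test on $fileNames? You are redirecting the output of the backticks to the file named in $fileNames, while at the same time collecting the backticks' STDOUT in a variable of the same name. But STDOUT will be empty, because you redirected it to the file.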

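Tags: html perl

【Solution 1】:

The overall approach you show is reasonable, but there is a lot to be said about the code itself. The place for that is Code Review, and I encourage you to submit your code there as well.

One overall comment is that there is no reason to reach for external tools so often; your program uses the external grep, find, sort, and pwd. We can practically always do the whole job with the wealth of tools that Perl provides.

Here is a simple example of what you need, where most of the work is done using modules.

The list of files to search for in our HTML is assembled with File::Find::Rule, recursively under $dir. Another option is the core File::Find module.

Even though the HTML parsing looks simple in this case, it is still better to use a module than a regex. HTML::TreeBuilder is the standard choice for what you need. That module itself relies on others, the workhorse being HTML::Element.

The following program works with one HTML file ($source_file). For that file, it finds the files under a given directory ($dir) that are not used in an href attribute or in the src attribute of an img tag. Those files are the ones to delete (the line that does so is commented out).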
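For comparison, here is a minimal sketch of the core-module route using File::Find. The helper name files_under and the demo tree built in a temporary directory are assumptions made for illustration; they are not part of the answer's program.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper (not from the answer): list every plain file found
# recursively under $dir, using only the core File::Find module.
sub files_under {
    my ($dir) = @_;
    my @files;
    find(
        {
            # no_chdir keeps the working directory fixed, so the full
            # path in $File::Find::name can be tested with -f directly.
            no_chdir => 1,
            wanted   => sub {
                push @files, $File::Find::name if -f $File::Find::name;
            },
        },
        $dir
    );
    @files = sort @files;
    return @files;
}

# Small demo tree in a temporary directory (removed at program exit).
my $demo = tempdir( CLEANUP => 1 );
mkdir "$demo/images" or die "Can't mkdir $demo/images: $!";
for my $name ( 'used.html', 'images/used.jpg' ) {
    open my $fh, '>', "$demo/$name" or die "Can't create $demo/$name: $!";
    close $fh;
}

print "$_\n" for files_under($demo);
```

File::Find::Rule expresses the same search in one chained call, which is why the answer prefers it; File::Find only saves installing a non-core module.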

use warnings;
use strict;
use feature 'say';

use File::Find::Rule;
use HTML::TreeBuilder;

my ($dir, $source_file) = @ARGV;    
die "Usage: $0 dir-name file-name\n" if not $dir or not $source_file;

my @files = File::Find::Rule->file->in($dir);
#say for @files;

foreach my $file (@files) {
    next if $file eq $source_file;  # not the file itself!
    say "Processing $file...";
    my $tree = HTML::TreeBuilder->new_from_file($source_file);

    my $esc_file = quotemeta $file;    
    my @in_href    = $tree->look_down(                'href', qr/$esc_file/ );
    my @in_img_src = $tree->look_down( _tag => 'img', 'src',  qr/$esc_file/ );

    if (@in_href == 0 and @in_img_src == 0) {
        say "\tthis file is not used in 'href' or 'img-src' in $source_file";
        # To delete it uncomment the next line -- after all is fully tested
        #unlink $file or warn "Can't unlink $file: $!";
    }
}

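The statement that actually deletes files, using unlink, is of course commented out. Enable it only after you have thoroughly checked the final version of your script and made backups.

Notes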
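A more cautious alternative to unlink, in line with the asker's comment about moving 2.html to another folder, is to set unused files aside in a holding directory. The following is only a sketch: the helper name set_aside, the dry-run flag, and the holding directory name unused are all assumptions for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: move $file into $holding_dir instead of deleting it.
# With a true $dry_run it only reports what it would do.
sub set_aside {
    my ( $file, $holding_dir, $dry_run ) = @_;
    my $target = File::Spec->catfile( $holding_dir, basename($file) );

    if ($dry_run) {
        print "Would move $file to $target\n";
        return $target;
    }

    make_path($holding_dir);    # does nothing if the directory exists
    die "Refusing to overwrite $target\n" if -e $target;
    move( $file, $target ) or die "Can't move $file to $target: $!";
    return $target;
}

# Demo in a temporary directory (removed at program exit).
my $root   = tempdir( CLEANUP => 1 );
my $unused = File::Spec->catfile( $root, '2.html' );
open my $fh, '>', $unused or die "Can't create $unused: $!";
close $fh;

my $holding = File::Spec->catdir( $root, 'unused' );
set_aside( $unused, $holding, 1 );                # dry run: only reports
my $moved = set_aside( $unused, $holding, 0 );    # really moves the file
print "Moved to $moved\n";
```

Because the target is built from the base name only, two unused files with the same name in different subdirectories would collide; the sketch dies in that case rather than overwrite.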

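Refine which files you are looking for by adding "rules" with File::Find::Rule.
I use quotemeta on the file name, which escapes all special characters in it; otherwise something could sneak in and break the regex given to look_down.
For criteria more involved than an attribute match, look_down also accepts a sub { }.
The script must be invoked with a directory name and the name of the main HTML file. Consider Getopt::Long for handling the command-line arguments.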
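To see why the quotemeta note matters, here is a small sketch with made-up file names (www/usedxhtml and notes[1.html are hypothetical). It compares a pattern built from a raw name with one built from an escaped name.

```perl
use strict;
use warnings;

my $file = 'www/used.html';

# Unescaped: the '.' in the name is a regex metacharacter (any character).
my $raw = qr/$file/;

# Escaped with quotemeta: the '.' matches only a literal dot.
my $quoted  = quotemeta $file;
my $escaped = qr/$quoted/;

my $lookalike   = 'www/usedxhtml';
my $raw_hit     = $lookalike =~ $raw     ? 1 : 0;
my $escaped_hit = $lookalike =~ $escaped ? 1 : 0;
print "unescaped pattern matches '$lookalike': $raw_hit\n";
print "escaped pattern matches '$lookalike': $escaped_hit\n";

# A name with an unbalanced '[' is not even a valid pattern unless escaped;
# \Q...\E inside the pattern has the same effect as quotemeta.
my $odd          = 'notes[1.html';
my $compiled_raw = eval { my $re = qr/$odd/;       1 } ? 1 : 0;
my $compiled_esc = eval { my $re = qr/\Q$odd\E/;   1 } ? 1 : 0;
print "unescaped '$odd' compiles: $compiled_raw\n";
print "escaped '$odd' compiles: $compiled_esc\n";
```

The unescaped pattern both over-matches (the lookalike name) and can fail to compile at all, which is the kind of thing that could otherwise "sneak in" through file names.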

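A lot more fine-tuning is possible here, both in searching for files and in parsing the HTML; there is plenty of information in the modules' documentation, and even more in many posts on this site.

The code has been tested for simple cases; please adapt it to your actual needs.

Here is a complete usage example.

I put this script (script.pl) in a directory that contains a file I.html and a directory www.

The I.html file: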

<!DOCTYPE html>
<html> <head> <title>Test handling of unused files</title> </head>
<body>
<a href="www/used.html">Used file from www</a>
<img src="www/images/used.jpg" alt="no_image_really">
</body>
</html>

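The directory www has the files used.html and another.html, and also a subdirectory images with the files used.jpg and another.jpg, so altogether we have

.
├── script.pl
├── I.html
└── www
    ├── used.html
    ├── another.html
    └── images
        ├── used.jpg
        └── another.jpg

This test doesn't need any content in any of the files in www. This is only a minimal setup; for testing I added more files and directories, and more tags in I.html.

Then I run script.pl www I.html and get the expected output.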
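The run above passes two positional arguments. If named options are preferred, as the Getopt::Long suggestion in the notes hints, the argument handling could look roughly like this. The option names --dir, --source, and --delete, and the helper parse_args, are hypothetical and not part of the answer's script.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse named options from a list of arguments.
# Returns a hash reference, or nothing if parsing fails or a required
# option is missing.
sub parse_args {
    my @args = @_;
    my %opt  = ( delete => 0 );    # deleting is off unless asked for
    GetOptionsFromArray( \@args, \%opt, 'dir=s', 'source=s', 'delete!' )
        or return;
    return unless defined $opt{dir} and defined $opt{source};
    return \%opt;
}

# In a real script this would be parse_args(@ARGV), followed by a usage
# message and exit when it returns nothing.
my $opt = parse_args( '--dir', 'www', '--source', 'I.html' );
print "dir=$opt->{dir} source=$opt->{source} delete=$opt->{delete}\n";
```

Keeping deletion behind an explicit --delete flag mirrors the answer's caution of leaving unlink commented out until the script has been checked.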

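【Discussion】:

I ran your program, but it doesn't get past line 10, die "Usage: $0 dir-name source-file-name\n" if not $dir or not $source_file;.
@Jack That means you need to run it as: program directory html-file. directory is the name of the directory where all of this happens (you can use . if it is the current directory), and html-file is the name of the HTML file in which you want to check for unused files.
@Jack In the example you give in the question, that would seem to be program . I.html (if all of that is in the current directory)
@Jack Added a statement to the answer about how to use the program
Now I get an error on the line my $tree = HTML::TreeBuilder->new_from_file($source_file);. The error is unable to parse file: No such file or directory at new.pl line 17. The file does exist.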