High-res images from PDFs: answers

【Question title】: High-res images from PDFs
【Posted】: 2012-01-11 21:12:56
【Question】:

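I'm working on a project where I need to extract one TIFF per page from multi-page PDFs. The PDFs contain only images, one image per page (I believe they were produced on some kind of copier/scanner, but I haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document, so the higher the resolution, the better.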

I've found two recipes. Both are helpful, but neither is ideal. I'm hoping someone can help me tweak one of them, or suggest a third option.

Recipe 1, pdfimages and ImageMagick:

First do:

$ pdfimages $MY_PDF.pdf foo

This produces several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).

Then for each *.pbm do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif

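Pros: the resulting TIFFs are a healthy 3300+ px on the long dimension (-resize is only used to normalize everything).

Cons: the orientation of the pages is lost; they come out rotated in different directions (they follow logical patterns, so it is probably the direction in which they were fed into the scanner?).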
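
As a rough sketch, the two steps of recipe 1 can be combined into one loop. The function name pbm_to_tiff and the CONVERT override are made up for illustration; the convert flags are the ones from the command above, with the `\>` escape written as a quoted '3200x3200>'.

```shell
#!/bin/bash
# Sketch: convert every pdfimages output file in a directory to TIFF.
# CONVERT may be overridden, e.g. to point at a different ImageMagick binary.
CONVERT=${CONVERT:-convert}

pbm_to_tiff() {
    local dir=$1 each new_name
    for each in "$dir"/*.pbm; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$each" ] || continue
        new_name=${each%.pbm}
        # '>' only shrinks images larger than 3200x3200; smaller ones are left alone.
        $CONVERT "$each" -resize '3200x3200>' -quality 100 "$new_name.tif"
    done
}

# Usage (assumes pdfimages has already written foo-000.pbm, ... into the
# current directory):
#   pbm_to_tiff .
```

Each foo-NNN.pbm ends up next to a foo-NNN.tif; the orientation problem described above is untouched by this loop.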

Recipe 2, ImageMagick solo:

convert +adjoin $MY_PDF.pdf pages.tif

This gives me one TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pros: the orientation is preserved!

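Cons: the long dimension of the resulting files comes out far smaller than with recipe 1, so some scaling is evidently being applied.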

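How can I avoid the scaling of the image stream in the PDF but keep the orientation? Is there some more magic in ImageMagick that I'm missing? Something else entirely?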

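【Question comments】:

Are you open to a non-free solution?
Maybe. It would need to have an API (no GUI) and be reasonably easy to integrate; I'm working with tens of thousands of documents. What do you have in mind?
Write to me with the details and I'll see whether I can help (bitbank@pobox.com).
I don't mean to sound hostile, but is your solution really so secret that you can't post it here so it can help others too?
It's not a secret solution. I've written my own imaging code and, depending on what you need, I could probably put something together fairly quickly. For example, if you need a Windows x86/ARM command-line tool that takes PDF files and splits them into TIFF files without recompressing them, I can help you.

Tags: pdf imagemagick image-manipulation tiff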

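【Solution 1】:

Sorry for the noise on this old topic, but Google led me here as one of the top results and others may need this too, so I thought I'd post the solution to the original poster's problem that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick

In short: you have to tell ImageMagick at what density it should scan the PDF.

So convert -density 600x600 foo.pdf foo.png tells ImageMagick to treat the PDF as having a resolution of 600 dpi and therefore to output a much larger PNG. In my case, the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000, or whatever size you need, and it will be scaled down.

Note that as long as your PDF contains only vector images or text, you can set the density as high as you like. If the PDF contains rasterized images, it won't look good if you set the density higher than the dpi of those images. Surprise! :)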
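
To make the density arithmetic concrete: PDF page sizes are measured in points (1/72 inch), so the rasterized size is roughly points × density / 72. The sketch below assumes a US Letter page (612×792 pt), which is an assumption and not something stated in this answer; the helper name pixels_at_density is made up, and ImageMagick's own rounding may differ by a pixel.

```shell
#!/bin/bash
# Sketch: estimate the pixel size of a rasterized PDF page edge.
# points: edge length in PDF points (1/72 inch); dpi: the -density value.
pixels_at_density() {
    local points=$1 dpi=$2
    # Integer arithmetic, rounded to the nearest pixel.
    echo $(( (points * dpi + 36) / 72 ))
}

pixels_at_density 612 600   # US Letter width at 600 dpi  -> 5100
pixels_at_density 792 600   # US Letter height at 600 dpi -> 6600
pixels_at_density 600 600   # a 600 pt edge at 600 dpi    -> 5000
```

By the same arithmetic, the 5000x6600 px result quoted above is consistent with a page of about 600×792 pt.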

Chris

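【Comments】:

Awesome, thanks! This is hardly noise, since I never did get an answer. For completeness, here is my final recipe for making one TIFF per page, normalizing the size, and converting to grayscale: convert +adjoin -density 300x300 -depth 8 -resize 3200x3200\> in.pdf out_prefix.tif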
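
One way to sanity-check a density choice like the 300x300 above is to work backwards from the desired long side. PDF page sizes are in points (1/72 inch), so density ≈ target pixels × 72 / page length in points. The sketch assumes the scans are 11-inch (792 pt) pages, which the thread does not state; the helper name density_for_long_side is hypothetical.

```shell
#!/bin/bash
# Sketch: density (dpi) needed so that a page edge of `points` PDF points
# rasterizes to about `target_px` pixels.
density_for_long_side() {
    local target_px=$1 points=$2
    # Integer arithmetic, rounded to the nearest dpi.
    echo $(( (target_px * 72 + points / 2) / points ))
}

density_for_long_side 3300 792   # 3300 px on an 11-inch edge -> 300
density_for_long_side 3200 792   # 3200 px on an 11-inch edge -> 291
```

Under that assumption, 300 dpi reproduces the roughly 3300 px long side that pdfimages extracted, and -resize 3200x3200\> then only trims the result slightly.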

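【Solution 2】:

I wanted to share my solution... it may not work for everyone, but since nothing else has turned up, maybe it will help someone else. I ended up going with the first option from my question, which was to use pdfimages to get large images that are rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from an (estimated) 25% correctly rotated to above 90%.

The process is as follows:

Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
For each file:
1. Make four versions, rotated by 0, 90, 180, and 270 degrees (I call them "north", "east", "south", and "west" in my code).
2. OCR each one. The two with the lowest word counts are likely the right-side-up and upside-down versions. This was more than 99% accurate on the set of images I have processed so far.
3. Of the two with the lowest word counts, run the OCR output through a spell checker. The file with the fewest spelling errors (i.e. the most recognizable words) is likely the correct one. For my data set this was about 93% accurate (up from 25%), based on a sample of 500.

YMMV. My files are bitonal and heavily textual. The source images average 3300 px on the long side. I can't speak for grayscale or color, or for files with lots of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and throw the OCR output away. Re: performance, my nothing-special Linux desktop machine runs the whole script at 2-3 files per second.

Here is the implementation as a simple bash script: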

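#!/bin/bash
# Rotates a pbm file in place.

# Pass a .pbm as the only arg.
file="$1"
if [ ! -f "$file" ]; then
    echo "usage: $0 FILE.pbm" >&2
    exit 1
fi

# Private scratch directory, so stale files from an aborted run cannot leak in.
TMP=$(mktemp -d /tmp/rotation-calc.XXXXXX) || exit 1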

# Dependencies:
# convert: apt-get install imagemagick
# ocrad: apt-get install ocrad
# aspell: apt-get install aspell aspell-en
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
$CONVERT "$file" -rotate 90 "$east_file"
$CONVERT "$file" -rotate 180 "$south_file"
$CONVERT "$file" -rotate 270 "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

# -f overwrites an existing output file, -F utf8 sets the output format.
$OCRAD -f -F utf8 -o "$north_text" "$north_file"
$OCRAD -f -F utf8 -o "$east_text" "$east_file"
$OCRAD -f -F utf8 -o "$south_text" "$south_file"
$OCRAD -f -F utf8 -o "$west_text" "$west_file"

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
# Each row is: <word count> <txt path> <image path>
wc_table="$TMP/wc_table"
echo "$($WC -w "$north_text") $north_file" > "$wc_table"
echo "$($WC -w "$east_text") $east_file" >> "$wc_table"
echo "$($WC -w "$south_text") $south_file" >> "$wc_table"
echo "$($WC -w "$west_text") $west_file" >> "$wc_table"

# Take the two lowest counts; these are likely right side up and upside down,
# but generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n "$wc_table" | $HEAD -n 2 > "$bottom_two_wc_table"

# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
# Each row is: <misspelled count> <word count> <txt path> <image path>
misspelled_words_table="$TMP/misspelled_words_table"
while read -r record; do
    txt=$(echo "$record" | $AWK '{ print $2 }')
    misspelled_word_count=$($ASPELL -l en list < "$txt" | $WC -w)
    echo "$misspelled_word_count $record" >> "$misspelled_words_table"
done < "$bottom_two_wc_table"

# Sort, pick the winner, and overwrite the input file with it.
winner=$($SORT -n "$misspelled_words_table" | $HEAD -n 1)
rotated_file=$(echo "$winner" | $AWK '{ print $4 }')

mv "$rotated_file" "$file"

# Clean up.
if [ -d "$TMP" ]; then
    rm -r "$TMP"
fi
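
To apply the script to a whole directory of pdfimages output, a loop along these lines might do. This is only a sketch: the script name rotate-pbm.sh is hypothetical (the answer does not name the file), rotate_all is a made-up helper, and it simply processes the files one at a time.

```shell
#!/bin/bash
# Sketch: run a per-file rotation command on every .pbm in a directory.
# script: the command to run (e.g. the script above, saved under any name)
# dir:    directory holding the pdfimages output
rotate_all() {
    local script=$1 dir=$2 pbm failed=0
    for pbm in "$dir"/*.pbm; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$pbm" ] || continue
        if ! "$script" "$pbm"; then
            echo "rotation failed: $pbm" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file was processed without error.
    [ "$failed" -eq 0 ]
}

# Usage (hypothetical file name):
#   rotate_all ./rotate-pbm.sh /path/to/pbm-files
```

Failures are reported per file rather than aborting the run, which seems the safer default when working through a large batch.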

【Comments】: