How to extract images from Word documents using JavaScript? Answers

【Question title】: How to extract images from Word documents using JavaScript?
【Posted】: 2015-11-01 09:36:13
【Question description】:

I am trying to extract images from a Word document using an ActiveXObject in JavaScript (IE only).

I could not find any API reference for the Word object, only a few hints from the Internet:

var filename = 'path/to/word/doc.docx'
var word = new ActiveXObject('Word.Application')
var doc = word.Documents.Open(filename)
// Displays the text
var docText = doc.Content

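How can I access the images in a Word document in a way similar to doc.Content?

Also, if anyone has a definitive source for the API (preferably from Microsoft), that would be very helpful.

【Comments】:

msdn.microsoft.com/en-us/office/aa905496.aspx
Ken, thank you for the link! I knew it was out there somewhere, but I could not find it for the life of me. I'll see if I can work out the answer to this question.

Tags: javascript activexobject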

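【Solution 1】:

After a few weeks of research, I found that the easiest way to extract the images is the SaveAs function of the Word ActiveXObject. If the file is saved as an HTML document, Word creates a folder containing the images.

From there, you can fetch the HTML file with XMLHttp and create new IMG tags that the browser can display (I am using IE 9, because ActiveXObject only works in Internet Explorer).

Let's start with the SaveAs part: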

// Define the path to the file
var filepath = 'path/to/the/word/doc.docx'
// Make a new ActiveXWord application
var word = new ActiveXObject('Word.Application')
// Open the document
var doc = word.Documents.Open(filepath)
// Save the DOCX as an HTML file (the 8 specifies you want to save it as an HTML document)
doc.SaveAs(filepath + '.htm', 8)
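
The snippet above leaves the document open and Word running after the save, which can add up when many documents are converted. As a rough sketch only, the same steps could be wrapped in a function that also closes the document and quits Word. The names saveDocAsHtml and createWord are hypothetical and not part of the original answer; createWord is assumed to be a factory that returns a Word.Application object.

```javascript
// wdFormatHTML from Word's WdSaveFormat enumeration (the "8" used above)
var WD_FORMAT_HTML = 8
// wdDoNotSaveChanges from Word's WdSaveOptions enumeration
var WD_DO_NOT_SAVE_CHANGES = 0

// Hypothetical helper: opens docPath, saves an HTML copy next to it,
// then closes the document and quits Word even if a step throws.
// createWord is an assumed factory returning a Word.Application object.
function saveDocAsHtml(createWord, docPath) {
    var word = createWord()
    var htmlPath = docPath + '.htm'
    try {
        var doc = word.Documents.Open(docPath)
        try {
            doc.SaveAs(htmlPath, WD_FORMAT_HTML)
        } finally {
            doc.Close(WD_DO_NOT_SAVE_CHANGES)
        }
    } finally {
        word.Quit()
    }
    return htmlPath
}
```

In IE this might be called as saveDocAsHtml(function () { return new ActiveXObject('Word.Application') }, filepath). Passing the factory in keeps the function usable with a stand-in object outside IE.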

We should now have a folder in the same directory containing the image files.

Note: in Word HTML, images use <v:imagedata> tags, which are nested inside <v:shape> tags; for example:

<v:shape style="width: 241.5pt; height: 71.25pt;">
     <v:imagedata src="path/to/the/word/doc.docx_files/image001.png">
         ...
     </v:imagedata>
</v:shape>

I have removed the extraneous attributes and tags that Word saves.
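
To illustrate the markup shown above, here is a minimal sketch that pulls the src values out of a string of Word HTML with a regular expression, before anything is inserted into the DOM. The function name findImageSources is hypothetical. The sketch assumes every src value is quoted, and a regular expression is not a general HTML parser, so treat it as an illustration rather than a robust solution.

```javascript
// Hypothetical helper: returns the src value of every <v:imagedata> tag
// found in a string of Word-generated HTML, in document order.
function findImageSources(html) {
    // Matches <v:imagedata ... src="..."> with single or double quotes
    var pattern = /<v:imagedata\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi
    var sources = []
    var match
    while ((match = pattern.exec(html)) !== null) {
        sources.push(match[2])
    }
    return sources
}
```

Reading the paths from the raw text is one way to see the src values exactly as Word wrote them.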

To access the HTML from JavaScript, use an XMLHttpRequest object.

 var xmlhttp = new XMLHttpRequest()
 var html_text = ""

Because I have hundreds of Word documents to process, I found it best to define the XMLHttp onreadystatechange callback before calling send.

// Define the onreadystatechange callback function
xmlhttp.onreadystatechange = function() {
    // Check to make sure the response has fully loaded
    if (xmlhttp.readyState==4 && xmlhttp.status==200) {
        // Grab the response text
        var html_text=xmlhttp.responseText
        // Load the HTML into the innerHTML of a DIV to add the HTML to the DOM
        document.getElementById('doc_html').innerHTML=html_text.replace("<html>", "").replace("</html>","")
        // Define a new array of all HTML elements with the "v:imagedata" tag
        var images =document.getElementById('doc_html').getElementsByTagName("v:imagedata")
        // Loop through each image
        for(var j=0;j<images.length;j++) {
            // Grab the source attribute to get the image name
            var src = images[j].getAttribute('src')
            // Check to make sure the image has a 'src' attribute
            if(src!=undefined) {
                ...

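I had a lot of trouble getting the correct src attribute, because IE escapes HTML attributes when the markup is loaded into the innerHTML of the doc_html div. In the example below I therefore use a pseudo-path and src.split('/')[1] to get the image name (this method will not work if the src contains more than one forward slash!):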

                ...
                images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/'+src.split('/')[1])
                ...
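
The one-slash limitation mentioned above can be avoided by taking the last path segment instead of the second one. A minimal sketch, assuming the image name is always the final segment of the src; the function name imageFileName is hypothetical and not from the original answer:

```javascript
// Hypothetical helper: returns the last path segment of an image src,
// accepting both "/" and "\" as separators.
function imageFileName(src) {
    var parts = src.split(/[\\\/]/)
    return parts[parts.length - 1]
}
```

For a src such as 'path/to/the/word/doc.docx_files/image001.png', src.split('/')[1] yields 'to', while imageFileName(src) yields 'image001.png'.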

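Here we use the parent (a p object, as it happens) of the parent (the v:shape object) to add a new img tag to the HTML div. We append the new img tag to the innerHTML, taking the src attribute from the image and the style information from the v:shape element: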

                ...
                images[j].parentElement.parentElement.innerHTML+="<img src='"+images[j].getAttribute('src')+"' style='"+images[j].parentElement.getAttribute('style')+"'>"

            }
        }       
    }
}
// Read the HTML Document using XMLHttpRequest
xmlhttp.open("POST", filepath + '.htm', false)
xmlhttp.send()
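
The img markup in the callback above is built by plain string concatenation with single-quoted attributes, so a style value containing a single quote would break the tag, and a v:shape without a style attribute would produce style='null'. As a hedged sketch, the markup could be built by a small helper that escapes attribute values and skips a missing style. Both function names below are hypothetical.

```javascript
// Hypothetical helper: escapes a value for use inside a quoted HTML attribute
function escapeAttribute(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
}

// Hypothetical helper: builds the <img> markup from an image src and the
// style string copied from the enclosing v:shape element
function buildImgTag(src, style) {
    var tag = '<img src="' + escapeAttribute(src) + '"'
    // getAttribute returns null when the v:shape has no style attribute
    if (style !== null && style !== undefined && style !== '') {
        tag += ' style="' + escapeAttribute(style) + '"'
    }
    return tag + '>'
}
```

Inside the loop, the concatenated string could then be replaced with buildImgTag(images[j].getAttribute('src'), images[j].parentElement.getAttribute('style')).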

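Although somewhat specific, the approach above successfully adds img tags to the HTML at the positions the images occupied in the original document.

【Comments】: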