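【Question title】: Quickly Convert (.rtf|.doc) Files to Markdown Syntax with PHP
【Posted】: 2010-11-05 20:07:05
【Question】:

I've been manually converting articles to Markdown syntax for a few days now, and it's getting rather tedious. Some of them are 3 or 4 pages long, with italics and other emphasized text. Is there a faster way to convert (.rtf|.doc) files to clean Markdown syntax that I can take advantage of?

【Question discussion】:

    Tags: php automation markdown file-conversion .doc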


    【Solution 1】:

    If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to html, and pandoc does a good job of converting the resulting html to markdown:

    $ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
    

    I have a script that I cobbled together a while back; it tries to convert whatever I throw at it to markdown using textutil, pdf2html, and pandoc.
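For batch use, the same textutil-to-pandoc pipeline can be driven from a short script. The following is a minimal Python sketch, not the answerer's script: the function names `build_pipeline` and `convert` are hypothetical, and it assumes macOS with both `textutil` and `pandoc` on the `PATH`. It only uses the flags shown in the command above.

```python
import subprocess
from pathlib import Path


def build_pipeline(src: str) -> tuple[list[str], list[str]]:
    """Return the textutil and pandoc argument lists for one input file."""
    # Output goes next to the input, with the extension swapped for .md.
    out = str(Path(src).with_suffix(".md"))
    textutil_cmd = ["textutil", "-convert", "html", src, "-stdout"]
    pandoc_cmd = ["pandoc", "-f", "html", "-t", "markdown", "-o", out]
    return textutil_cmd, pandoc_cmd


def convert(src: str) -> None:
    """Feed textutil's HTML output to pandoc on standard input."""
    textutil_cmd, pandoc_cmd = build_pipeline(src)
    # check=True raises CalledProcessError if either tool exits non-zero.
    html = subprocess.run(textutil_cmd, check=True, capture_output=True).stdout
    subprocess.run(pandoc_cmd, input=html, check=True)
```

Calling `convert("file.doc")` should write `file.md` beside the input, provided both tools are installed.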

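    【Comments】:

    • I just tried this for a Word to Markdown conversion and it worked really well. Thanks.
    • Thanks a lot! I found I had to use: textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
    • textutil is only available on Mac OS X. Some Linux alternatives are listed on unix.stackexchange.com: GNU unrtf, Pandoc, unoconv.
    【Solution 2】: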

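    ProgTips has a possible solution in the form of a Word macro (source download):

    A simple macro (source download) for converting the most trivial things automatically. This macro:

    • Replaces bold and italics
    • Replaces headings (marked as Heading 1-6)
    • Replaces numbered and bulleted lists

    It's very buggy, and I believe it hangs on larger documents, but I never claimed it was a stable release anyway! :-) Experimental use only; recode and reuse it as you like, and post a comment if you've found a better solution.

    Source: ProgTips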
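To make the three replacements concrete, here is a minimal Python sketch of the same mapping applied to an in-memory model rather than to a Word document. The `Run` class and both function names are hypothetical and exist only for illustration; the macro itself works through Word's Find object. The emphasis markers (`_text_`, `**text**`) and the `#` heading prefixes follow the macro's own header comments.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text sharing one character format (hypothetical model)."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap a run in _..._ for italics, then **...** for bold."""
    text = run.text
    if run.italic:
        text = f"_{text}_"
    if run.bold:
        text = f"**{text}**"
    return text


def paragraph_to_markdown(runs: list[Run], heading_level: int = 0) -> str:
    """Join the converted runs; prefix 1-6 '#' characters for Heading 1-6."""
    body = "".join(run_to_markdown(r) for r in runs)
    if 1 <= heading_level <= 6:
        return "#" * heading_level + " " + body
    return body
```

Italics are applied before bold here because the macro runs its italic pass first, so a run with both formats comes out as `**_text_**`.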

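    Macro source

    Installation

    • Open WinWord,
    • Press Alt+F11 to open the VBA editor,
    • Right-click the first project in the project browser,
    • Choose Insert -> Module,
    • Paste in the code from the file,
    • Close the macro editor,
    • Go to Tools > Macro > Macros and run the macro named MarkDown

    Source: ProgTips

    Source

    The macro source, preserved here in case ProgTips deletes the post or the site is wiped: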

    '*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
    '*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
    '*** the most simple things. These are:
    '*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
    '*** 2) Converts tables to text. In fact, tables get lost.
    '*** 3) Adds a single indent to all indented paragraphs
    '*** 4) Replaces all the text in italics to _text_
    '*** 5) Replaces all the text in bold to **text**
    '*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
    '*** 7) Replaces bulleted lists with ^p *  listitem ^p*  listitem2...
    '*** 8) Replaces numbered lists with ^p 1. listitem ^p2.  listitem2...
    '*** Feel free to use and redistribute this code
    Sub MarkDown()
        Dim bReplace As Boolean
        Dim i As Integer
        Dim oPara As Paragraph
        
            
        'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
        Call RemoveBoldEnters
        
        
        For i = Selection.Document.Tables.Count To 1 Step -1
                Call Selection.Document.Tables(i).ConvertToText
        Next
        
        'simple text indent + extra paragraphs for non-numbered paragraphs
        For i = Selection.Document.Paragraphs.Count To 1 Step -1
            Set oPara = Selection.Document.Paragraphs(i)
            If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
                If oPara.LeftIndent > 0 Then
                    oPara.Range.InsertBefore (">")
                End If
                oPara.Range.InsertBefore (vbCrLf)
            End If
            
            
        Next
        
        'italic -> _italic_
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceOneItalic  'first replacement
        While bReplace 'other replacements
            bReplace = ReplaceOneItalic
        Wend
    
        'bold-> **bold**
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceOneBold 'first replacement
        While bReplace
            bReplace = ReplaceOneBold 'other replacements
        Wend
        
       
        
        'Heading -> ##heading
        For i = 1 To 6 'heading1 to heading6
            Selection.HomeKey Unit:=wdStory
            bReplace = ReplaceH(i) 'first replacement
            While bReplace
                bReplace = ReplaceH(i) 'other replacements
            Wend
        Next
        
        Call ReplaceLists
        
        
        Selection.HomeKey Unit:=wdStory
    End Sub
    
    
    '***************************************************************
    ' Function to replace bold with **bold**, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '***************************************************************
    Function ReplaceOneBold() As Boolean
        Dim bReturn As Boolean
    
        Selection.Find.ClearFormatting
        With Selection.Find
            .Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Font.Bold = True
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        
        bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Text = "**" & Selection.Text & "**"
            Selection.Font.Bold = False
            Selection.Find.Execute
        Wend
        
        ReplaceOneBold = bReturn
    End Function
    
    '*******************************************************************
    ' Function to replace italic with _italic_, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '********************************************************************
    Function ReplaceOneItalic() As Boolean
        Dim bReturn As Boolean
    
            Selection.Find.ClearFormatting
        
        With Selection.Find
            .Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Font.Italic = True
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        
        bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Text = "_" & Selection.Text & "_"
            Selection.Font.Italic = False
            Selection.Find.Execute
        Wend
        ReplaceOneItalic = bReturn
    End Function
    
    '*********************************************************************
    ' Function to replace headingX with #heading, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '*********************************************************************
    Function ReplaceH(ByVal ipNumber As Integer) As Boolean
        Dim sReplacement As String
        Dim bReturn As Boolean 'was undeclared; fails to compile under Option Explicit
        
        Select Case ipNumber
        Case 1: sReplacement = "#"
        Case 2: sReplacement = "##"
        Case 3: sReplacement = "###"
        Case 4: sReplacement = "####"
        Case 5: sReplacement = "#####"
        Case 6: sReplacement = "######"
        End Select
        
        Selection.Find.ClearFormatting
        Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
        With Selection.Find
            .Text = ""
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        
       
         bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
            Selection.Style = ActiveDocument.Styles("Normal")
            Selection.Find.Execute
        Wend
        
        ReplaceH = bReturn
    End Function
    
    
    
    '***************************************************************
    ' A fix-up for paragraph marks that are bold or italic
    '***************************************************************
    Sub RemoveBoldEnters()
        Selection.HomeKey Unit:=wdStory
        Selection.Find.ClearFormatting
        Selection.Find.Font.Italic = True
        Selection.Find.Replacement.ClearFormatting
        Selection.Find.Replacement.Font.Bold = False
        Selection.Find.Replacement.Font.Italic = False
        With Selection.Find
            .Text = "^p"
            .Replacement.Text = "^p"
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
        End With
        Selection.Find.Execute Replace:=wdReplaceAll
        
        Selection.HomeKey Unit:=wdStory
        Selection.Find.ClearFormatting
        Selection.Find.Font.Bold = True
        Selection.Find.Replacement.ClearFormatting
        Selection.Find.Replacement.Font.Bold = False
        Selection.Find.Replacement.Font.Italic = False
        With Selection.Find
            .Text = "^p"
            .Replacement.Text = "^p"
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
        End With
        Selection.Find.Execute Replace:=wdReplaceAll
    End Sub
    
    '***************************************************************
    ' Sub to replace bulleted and numbered lists with Markdown list
    ' markers, then remove the Word list formatting
    '***************************************************************
    Sub ReplaceLists()
        Dim i As Integer
        Dim j As Integer
        Dim Para As Paragraph
            
        Selection.HomeKey Unit:=wdStory
        
        'iterate through all the lists in the document
        For i = Selection.Document.Lists.Count To 1 Step -1
            'check each paragraph in the list
            For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
                Set Para = Selection.Document.Lists(i).ListParagraphs(j)
                'if it's a bulleted list
                If Para.Range.ListFormat.ListType = wdListBullet Then
                            Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
                'if it's a numbered list
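                'compare ListType against each constant; OR-ing the bare constants is always true
                ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
                       Para.Range.ListFormat.ListType = wdListMixedNumbering Or _
                       Para.Range.ListFormat.ListType = wdListListNumOnly Then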
                    Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ".  ")
                End If
            Next j
            'inserts paragraph marks before and after, removes the list itself
            Selection.Document.Lists(i).Range.InsertParagraphBefore
            Selection.Document.Lists(i).Range.InsertParagraphAfter
            Selection.Document.Lists(i).RemoveNumbers
        Next i
    End Sub
    
    '***********************************************************
    ' Returns the MarkDown indent text
    '***********************************************************
    Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
        Dim i  As Integer
        For i = 1 To ipNumber - 1
            ListIndent = ListIndent & "    "
        Next
        ListIndent = ListIndent & spChar & "    "
    End Function
    

    Source: ProgTips
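The bullet handling in `ReplaceLists` relies on `ListIndent`, which emits four spaces per nesting level below the first, then the marker, then four more spaces. As a rough Python equivalent (the names `list_indent` and `bullets_to_markdown` are hypothetical, and the `(level, text)` input format is an assumption made for the example):

```python
def list_indent(level: int, marker: str) -> str:
    """Mirror the macro's ListIndent: indent by (level - 1) * 4 spaces."""
    return "    " * (level - 1) + marker + "    "


def bullets_to_markdown(items: list[tuple[int, str]]) -> str:
    """Render (level, text) pairs as a nested bulleted list, one item per line."""
    return "\n".join(list_indent(level, "*") + text for level, text in items)
```

The four spaces after the marker are kept only to match the macro; most Markdown processors also accept a single space there.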

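    【Comments】:

      【Solution 3】:

      If you're open to using the .docx format, you can use this PHP script I put together; it extracts the XML, runs a few XSL transformations, and outputs a pretty decent Markdown equivalent: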

      https://github.com/matb33/docx2md

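      Note that it is meant to be run from the command line, and its interface is fairly basic. However, it gets the job done!

      If the script doesn't work well enough for you, I encourage you to send me your .docx file so I can reproduce your issue and fix it. Log an issue on GitHub or contact me directly if you prefer.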
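The "extract the XML" step works because a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The sketch below is not docx2md: it uses Python's standard library instead of PHP and walks the tree with ElementTree instead of applying XSL transformations, and the function name `docx_paragraphs` is hypothetical. It only shows how paragraph text can be pulled out of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file) -> list[str]:
    """Return the plain text of each w:p paragraph in a .docx archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph's text is spread over w:t nodes inside its runs.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs
```

Formatting such as bold, italics, and heading styles is stored in sibling property elements, which is the part a real converter would have to map to Markdown.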

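      【Comments】:

      • +1, works great! In my case, better than textutil + pandoc (especially at preserving headings)
      • I think the OP asked about .doc files. Presumably this doesn't work with .doc, only with .docx
      【Solution 4】:

      Pandoc is a good command-line conversion tool, but again, you'll first need to get the input into a format that Pandoc can read, which is one of: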

      • markdown
      • reStructuredText
      • textile
      • HTML
      • LaTeX
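Once a file is in one of these formats, the pandoc call only differs in the `-f` reader name. A minimal Python sketch that picks the reader from the file extension follows; the `READERS` table and the `pandoc_command` function are hypothetical helpers, and the extension-to-reader mapping is an assumption, while `markdown`, `rst`, `textile`, `html`, and `latex` are pandoc's own reader names.

```python
from pathlib import Path

# Assumed mapping from file extension to pandoc reader name.
READERS = {
    ".md": "markdown",
    ".rst": "rst",
    ".textile": "textile",
    ".html": "html",
    ".htm": "html",
    ".tex": "latex",
}


def pandoc_command(src: str, dst: str) -> list[str]:
    """Build the pandoc argument list that converts src to Markdown at dst."""
    suffix = Path(src).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no pandoc reader configured for {suffix!r}")
    return ["pandoc", "-f", READERS[suffix], "-t", "markdown", "-o", dst, src]
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine where pandoc is installed.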

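      【Comments】:

      • Pandoc can now read Microsoft Word DOCX, ODT (OpenDocument), and other formats.
      【Solution 5】:

      We had the same problem and needed to convert Word documents to markdown. Some were more complex and (very) large documents, with math equations, images, and so on. So I made this script, which converts using a number of different tools: https://github.com/Versal/word2markdown

      Because it uses a chain of tools it is a bit more error-prone, but it can be a good starting point if you have more complex documents. Hope it helps! :)

      Update: It currently only works on Mac OS X, and you need a few requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). To get it working, you also need to open an empty Word document and go to: File -> Save as Web Page -> Compatibility -> Encoding -> UTF-8. Then save this encoding as the default. See the README for more details on how to set it up.

      Then run this in the console: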

      $ git clone git@github.com:Versal/word2markdown.git
      $ cd word2markdown
      $ npm install
      (copy over the Word files, for example, "document.docx")
      $ ./doc-to-md.sh document.docx document_files > document.md
      

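      You can then find the Markdown in document.md and the images in the document_files directory.

      It's probably a bit complicated right now, so I welcome any contributions that make this easier or make it work on other operating systems! :)

      【Comments】:

      • Please include more detail in your answer: how do you use this tool to answer this specific question?
      • @Calimo Done, thanks for the suggestion. First time answering like this. ;-)
      【Solution 6】:

      Have you tried this? I'm not sure how feature-rich it is, but it works for simple text. http://markitdown.medusis.com/

      【Comments】:

        【Solution 7】:

        As part of a university Ruby course, I developed a tool that converts OpenOffice word-processor files (.odt) to markdown. A lot of assumptions have to be made to convert them to the right format. For example, it is hard to decide what text size should be treated as a heading. However, the only thing you can lose in this conversion is formatting; any text that is encountered is always appended to the markdown document. The tool I developed supports lists, bold and italic text, and has a syntax for tables.

        Give http://github.com/bostko/doc2text a try and please send me your feedback.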
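The heading-size problem mentioned above can be approached with a simple heuristic. The sketch below is one possible assumption, not how doc2text works, and it is written in Python rather than Ruby: it treats the most frequent font size as body text and ranks every larger size as a heading level, largest first. The function names and the `(font_size, text)` input format are hypothetical.

```python
from collections import Counter


def sizes_to_heading_levels(sizes: list[float]) -> dict[float, int]:
    """Map each font size larger than the body size to a level (1 = largest)."""
    if not sizes:
        return {}
    # Assume the most common size in the document is ordinary body text.
    body = Counter(sizes).most_common(1)[0][0]
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    # Markdown has only six heading levels, so cap the rank at 6.
    return {size: min(level, 6) for level, size in enumerate(larger, start=1)}


def to_markdown(paragraphs: list[tuple[float, str]]) -> str:
    """Render (font_size, text) paragraphs, prefixing headings with '#'."""
    levels = sizes_to_heading_levels([size for size, _ in paragraphs])
    lines = []
    for size, text in paragraphs:
        prefix = "#" * levels[size] + " " if size in levels else ""
        lines.append(prefix + text)
    return "\n\n".join(lines)
```

A heuristic like this misfires on documents that use large text for emphasis rather than structure, which is the kind of assumption the answer warns about.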

        【Comments】:
