【Question title】: Scala: Parsing concatenated XML documents
【Posted】: 2012-04-29 01:25:43
【Question】:

My question is almost identical to this previous StackOverflow question, but I'm asking again because I don't like the accepted answer.

I have a file of concatenated XML documents:

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
...
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

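I want to parse out each one.

As far as I can tell, I can't use scala.xml.XML, since it relies on a one-document-per-file/string model.

Is there a subclass of Parser that I can use to parse XML documents out of an input source? Then I could just do something like many1 xmldoc.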

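【Comments】:

  • This question is a duplicate unless you explain why you don't like the other answer. Saying that no parser of the type you suggest exists isn't enough, IMO, to make a complete question/answer.
  • @RexKerr: Fair point. I find the accepted answer there unacceptable because "break on <?xml" smells to me like parsing XML with regular expressions, and so does tag counting (because of the danger of <![CDATA[).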
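
The CDATA concern raised in the comment above can be made concrete. The following is a minimal sketch, not code from the thread: `NaiveSplitDemo`, `input`, and `naiveSplit` are hypothetical names, and the input is a made-up two-document string whose first document carries a literal `<?xml` inside a CDATA section.

```scala
object NaiveSplitDemo {
  // Two concatenated documents; the first has a literal "<?xml" inside CDATA.
  val input: String =
    """<?xml version="1.0" encoding="UTF-8"?>
      |<someData><![CDATA[<?xml not a declaration ?>]]></someData>
      |<?xml version="1.0" encoding="UTF-8"?>
      |<someData>second</someData>""".stripMargin

  // Naive approach: cut the text wherever the literal "<?xml" occurs.
  def naiveSplit(s: String): List[String] =
    s.split(java.util.regex.Pattern.quote("<?xml")).toList.filter(_.trim.nonEmpty)

  def main(args: Array[String]): Unit = {
    val pieces = naiveSplit(input)
    println(s"expected 2 documents, naive split produced ${pieces.size}")
  }
}
```

The split cuts through the CDATA section, so the piece count no longer matches the number of documents.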

Tags: xml scala


【Solution 1】:

If your concern is safety, you can wrap your chunks in unique tags:

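import scala.annotation.tailrec
import scala.io.Source
import scala.util.Random
import scala.xml.XML

// A fresh random tag per chunk, so document content cannot close the wrapper by accident
def mkTag = "block" + Random.alphanumeric.take(20).mkString
def isXmlDecl(s: String) = s.startsWith("<?xml") && s.endsWith("?>")

@tailrec
def mkChunk(it: Iterator[String], chunks: Vector[String] = Vector.empty): Vector[String] = {
  // Skip declaration lines, then take everything up to the next declaration
  val (chunk, extra) = it.dropWhile(isXmlDecl).span(s => !isXmlDecl(s))
  val body = chunk.mkString("\n") // consume chunk fully before touching extra
  val tag = mkTag
  val next = if (body.trim.isEmpty) chunks else chunks :+ s"<$tag>$body</$tag>"
  if (extra.hasNext) mkChunk(extra, next) else next
}

val reader = Source.fromFile("my.xml")
val chunks = try mkChunk(reader.getLines()) finally reader.close()
val answers = XML.loadString("<everything>" + chunks.mkString + "</everything>")
// Now take apart the resulting parse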

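Because you supply unique enclosing tags, you may get a parse error if someone has embedded a literal XML tag somewhere in the middle, but you won't accidentally end up with the wrong number of parsed documents.

(Warning: this code was typed in without being checked at all. It is meant to convey the idea, not to be exactly correct.)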
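
The final "take apart the resulting parse" step is left open in the answer. One way it might look is sketched below, with the JDK's DOM parser swapped in for scala.xml so the example has no external dependency. `UnwrapDemo` and `rootNames` are hypothetical names, and the wrapped string is a hand-written stand-in for the `<everything>` document built above.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node
import org.xml.sax.InputSource

object UnwrapDemo {
  // Returns the root element name of each wrapped document, in input order.
  def rootNames(wrapped: String): List[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(wrapped)))
    val blocks = doc.getDocumentElement.getChildNodes
    for {
      i <- (0 until blocks.getLength).toList
      block = blocks.item(i)
      if block.getNodeType == Node.ELEMENT_NODE // one random-tag wrapper per document
      kids = block.getChildNodes
      j <- (0 until kids.getLength).toList
      kid = kids.item(j)
      if kid.getNodeType == Node.ELEMENT_NODE // the original document root
    } yield kid.getNodeName
  }

  def main(args: Array[String]): Unit = {
    val wrapped =
      "<everything><blockA1><someData>1</someData></blockA1>" +
      "<blockB2><otherData>2</otherData></blockB2></everything>"
    println(rootNames(wrapped))
  }
}
```

With scala.xml, the same traversal would presumably walk `answers.child` and then each wrapper's `child`, keeping only element nodes.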

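【Comments】:

    【Solution 2】:

    OK, I came up with an answer that I'm happier with.

    Basically, I try to parse the XML with a SAXParser, just as scala.xml.XML.load does, but I watch for SAXParseExceptions that indicate the parser ran into a <?xml in the wrong place.

    Then I grab whatever root element has been parsed so far, rewind the input just far enough, and restart parsing from there.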

    // An input stream that can recover from a SAXParseException 
    object ConcatenatedXML {
      // A reader that can be rolled back to the location of an exception
      class Relocator(val re : java.io.Reader)  extends java.io.Reader {
        var marked = 0
        var firstLine : Int = 1
        var lineStarts : IndexedSeq[Int] = Vector(0)
        override def read(arr : Array[Char], off : Int, len : Int) = { 
          // forget everything but the start of the last line in the
          // previously marked area
          val pos = lineStarts(lineStarts.length - 1) - marked
          firstLine += lineStarts.length - 1
    
          // read the next chunk of data into the given array
          re.mark(len)
          marked = re.read(arr,off,len)
    
          // find the line starts for the lines in the array
          lineStarts = pos +: (for (i <- 0 until marked if arr(i+off) == '\n') yield (i+1))
    
          marked
        }
        override def close(): Unit = re.close()
        override def markSupported = false
        def relocate(line: Int, col: Int, off: Int): Unit = {
          re.reset()
          val skip = lineStarts( line - firstLine ) + col + off
          re.skip(skip)
          marked = 0
          firstLine = 1
          lineStarts = Vector(0)
        }
      }
    
      def parse( str : String ) : List[scala.xml.Node] = parse(new java.io.StringReader(str))
      def parse( re : java.io.Reader ) : List[scala.xml.Node] = parse(new Relocator(re))
    
      // parse all the concatenated XML docs out of a file
      def parse( src : Relocator ) : List[scala.xml.Node] = {
        val parser = javax.xml.parsers.SAXParserFactory.newInstance.newSAXParser
        val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    
        adapter.scopeStack.push(scala.xml.TopScope)
        try {
    
          // parse this, assuming it's the last XML doc in the string
          parser.parse( new org.xml.sax.InputSource(src), adapter )
          adapter.scopeStack.pop
          adapter.rootElem.asInstanceOf[scala.xml.Node] :: Nil
    
        } catch {
          case (e : org.xml.sax.SAXParseException) => {
            // we found the start of another xmldoc
            if (e.getMessage != """The processing instruction target matching "[xX][mM][lL]" is not allowed."""
                || adapter.hStack.length != 1 || adapter.hStack(0) == null){
              throw(e)
            }
    
            // tell the adapter we reached the end of a document
            adapter.endDocument
    
            // grab the current root node
            adapter.scopeStack.pop
            val node = adapter.rootElem.asInstanceOf[scala.xml.Node]
    
            // reset to the start of this doc
            src.relocate(e.getLineNumber, e.getColumnNumber, -6)
    
            // and parse the next doc
            node :: parse( src )
          }
        }
      }
    }
    
    println(ConcatenatedXML.parse(new java.io.BufferedReader(
      new java.io.FileReader("temp.xml")
    )))
    println(ConcatenatedXML.parse(
      """|<?xml version="1.0" encoding="UTF-8"?>
         |<firstDoc><inner><innerer><innermost></innermost></innerer></inner></firstDoc>
         |<?xml version="1.0" encoding="UTF-8"?>
         |<secondDoc></secondDoc>
         |<?xml version="1.0" encoding="UTF-8"?>
         |<thirdDoc>...</thirdDoc>
         |<?xml version="1.0" encoding="UTF-8"?>
         |<lastDoc>...</lastDoc>""".stripMargin
    ))
    try {
      ConcatenatedXML.parse(
        """|<?xml version="1.0" encoding="UTF-8"?>
           |<firstDoc>
           |<?xml version="1.0" encoding="UTF-8"?>
           |</firstDoc>""".stripMargin
      )
      throw(new Exception("That should have failed"))
    } catch {
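      // Catch only the parse failure, so the "That should have failed" exception is not swallowed
      case _: org.xml.sax.SAXParseException => println("catches really incomplete docs")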
    }
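
The signal this approach depends on can be observed in isolation. The sketch below uses only the JDK's SAX parser and reports where it gives up on a two-document input. `SecondDeclDemo` and `errorLocation` are hypothetical names. The exact error message and column are parser-specific, which is presumably why the code above matches on the message text and uses a fixed -6 offset.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

object SecondDeclDemo {
  // Some((line, column)) of the parse error, or None if the input parsed cleanly.
  def errorLocation(text: String): Option[(Int, Int)] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      parser.parse(new InputSource(new StringReader(text)), new DefaultHandler)
      None
    } catch {
      case e: SAXParseException => Some((e.getLineNumber, e.getColumnNumber))
    }
  }

  def main(args: Array[String]): Unit = {
    val two =
      """<?xml version="1.0" encoding="UTF-8"?>
        |<firstDoc/>
        |<?xml version="1.0" encoding="UTF-8"?>
        |<secondDoc/>""".stripMargin
    println(errorLocation(two)) // the reported line is where the second declaration starts
  }
}
```

A single well-formed document yields `None`; the second declaration is what triggers the exception, and its line number is what `relocate` uses to rewind.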
    

    【Comments】:
