【Question title】: How to validate a big XML file against an XSD schema?
【Posted】: 2012-03-20 14:36:54
【Question】:

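I need to validate a large XML file with limited memory usage. Every piece of code I have found so far fails with an out-of-memory error.

Methods I have tried: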

    // method 1
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);
    factory.setNamespaceAware(true);

    SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    factory.setSchema(schemaFactory.newSchema(new Source[] {
        new StreamSource(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile())
    }));
    SAXParser parser = factory.newSAXParser();
    XMLReader reader = parser.getXMLReader();
    reader.setErrorHandler(new SimpleErrorHandler());
    reader.parse(new InputSource(inputXml));

    // method 2
    XMLValidationSchemaFactory sf = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA);
    XMLValidationSchema vs = sf.createSchema(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd"));
    XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory2.newInstance().createXMLStreamReader(new FileInputStream(inputXml));
    sr.validateAgainst(vs);
    try {
        while (sr.hasNext()) {
            sr.next();
        }
        System.out.println("Validated ok!");
    } catch (XMLValidationException ve) {
        System.err.println("Validation problem: " + ve);
        isValid = false;
    }
    sr.close();

    // method 3
    SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    String fileName = Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile();

    Schema schema = factory.newSchema(new File(fileName));
    Validator validator = schema.newValidator();

    // create a source from a file
    StreamSource source = new StreamSource(new File(inputXml));

    // check input
    validator.validate(source);

I get an OutOfMemory error every time.

Edit

Using XOM:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);
    factory.setNamespaceAware(true);

    SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    factory.setSchema(schemaFactory.newSchema(new Source[] {
        new StreamSource(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile())
    }));
    SAXParser parser = factory.newSAXParser();
    XMLReader reader = parser.getXMLReader();
    reader.setErrorHandler(new SimpleErrorHandler());

    Builder builder = new Builder(reader);
    builder.build(new FileInputStream(new File(inputXml)));

Memory usage is still high: 250 MB of heap for a 15 MB XML file. Stack trace:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleCharacters(XMLSchemaValidator.java:1574)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.characters(XMLSchemaValidator.java:789)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:441)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
at nu.xom.Builder.build(Unknown Source)
at nu.xom.Builder.build(Unknown Source)

Edit: my XML contains very large base64 strings.

【Comments】:

    Tags: java xml validation xsd


    【Answer 1】:

    Take a look at Marco Tedone's article on XML unmarshalling. Based on its conclusions, I suggest using StAX, which has low memory consumption:

        XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(fileInputStream);
        Validator validator = schema.newValidator();
        validator.validate(new StAXSource(xmlStreamReader));
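
    The snippet above leaves `schema` and `fileInputStream` undefined. A self-contained sketch of the same idea, using only JDK classes, might look like the following. The class name `StaxValidationSketch`, the helper `isValid`, and the tiny inline schema and documents are illustrative assumptions, not part of the original answer; a real run would read the XSD and the XML from files instead of strings.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class StaxValidationSketch {

    // Hypothetical sample schema: one element holding a base64 payload.
    static final String SAMPLE_XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='doc'><xs:complexType><xs:sequence>"
          + "<xs:element name='payload' type='xs:base64Binary'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Hypothetical helper: true if xml validates against xsd, false if the
    // validator reports an error.
    static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema did not compile", e);
        }
        try {
            // The document is pulled through a StAX reader instead of being
            // loaded into a tree first.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                Validator validator = schema.newValidator();
                validator.validate(new StAXSource(reader));
                return true;
            } catch (SAXException e) {
                // With no ErrorHandler set, validation errors are thrown.
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException | IOException e) {
            throw new IllegalStateException("could not read the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(SAMPLE_XSD, "<doc><payload>aGVsbG8=</payload></doc>"));
        System.out.println(isValid(SAMPLE_XSD, "<doc><other/></doc>"));
    }
}
```

    This only changes how the document is fed to the validator. The validator itself is still the JDK's built-in implementation, so it does not by itself guarantee lower memory use for very large text nodes.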
    

    【Comments】:

    • Thanks for the reply. This still uses Xerces, so I still get OutOfMemory with -Xmx250m. So far, Woodstox has worked best for me.
    【Answer 2】:

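    The memory may be going to the schema rather than the source document. You haven't said anything about the schema. Some schemas can use a lot of memory, for example if the content model has large finite values for minOccurs or maxOccurs. At what point does the out-of-memory exception occur?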
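
    One rough way to tell the two cases apart is to compile the schema on its own, validate a tiny document with it, and watch the heap before the large document is ever opened. The sketch below assumes a hypothetical class `SchemaCompileProbe` and a made-up schema with a bounded `maxOccurs`. The heap figures from `Runtime` are approximate, and an implementation may build content models either when the schema is compiled or lazily on first use, which is why the probe also validates a small document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SchemaCompileProbe {

    // Hypothetical schema with a finite maxOccurs, standing in for the real XSD.
    static final String SAMPLE_XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='list'><xs:complexType><xs:sequence>"
          + "<xs:element name='item' type='xs:string' minOccurs='1' maxOccurs='100'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Compiles the schema without touching any instance document.
    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema did not compile", e);
        }
    }

    // Validates a small document so that lazily built structures get created.
    static boolean validateTiny(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("could not read the document", e);
        }
    }

    // Approximate heap in use; good enough for a before/after comparison only.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        Schema schema = compile(SAMPLE_XSD);
        long afterCompile = usedHeapBytes();
        boolean ok = validateTiny(schema, "<list><item>a</item></list>");
        long afterValidate = usedHeapBytes();
        System.out.println("Tiny document valid: " + ok);
        System.out.println("Approx. heap delta after compile: " + (afterCompile - before));
        System.out.println("Approx. heap delta after tiny validation: " + (afterValidate - before));
    }
}
```

    If the heap already grows sharply at this stage, the schema is the likely culprit; if it stays small, the memory is probably being spent on the document content.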

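    【Comments】:

    • Thanks for the reply. The XSD has a number of minOccurs/maxOccurs constraints, but it is not complex. My XML contains base64 strings, and the out-of-memory error shows up in AbstractStringBuilder.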