Parsing large JSON files (answers)

【Question title】: Parsing large json files
【Posted】: 2014-12-24 02:58:09
【Question】:

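I have a question about converting JSON to CSV, in particular about a memory problem (at least I think it is one). I wrote some functions that should handle this, and they work well for small JSON files. For large JSON files, the JFrame freezes and nothing happens for several minutes (I killed the process with the Task Manager after about 5 minutes). The source JSON file has about 30,000 lines.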

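What I am doing:

Read the (large) JSON file.
Correct it (some values are not standard JSON, e.g. "actor" : ObjectId("12345") should be corrected to "actor" : "12345").
Split the large JSON file into smaller files.
Process the small JSON files.

What I have so far: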

public void mongoExportAndSplitFilter() {
    ReadFileAndSave reader = new ReadFileAndSave();
    String jsonFilePath = this.converterView.sourceTextField.getText();
    //String targetFilePath = this.converterView.targetTextField.getText();
    File jsonFile = new File(jsonFilePath);
    Scanner scanner = new Scanner(reader.readFileAndCorrectOutput(jsonFile));
    int j = 0;
    StringBuffer sb = new StringBuffer();
    reader.readPartOfFileAndSave("src/main/resources", scanner, j, sb);
    //System.out.println("STEP 1: INPUT FILE (" + jsonFilePath + ") HAS BEEN CORRECTED!");
    //System.out.println("STEP 2: INPUT FILE (" + jsonFilePath + ") HAS BEEN SPLITTED WHILE PARSING!");
    this.filterView.setVisible(false);
    this.filterView.dispose();
    this.filterFlag = 1;
}

/**
 * Utility function to correct the MongoExport-JSON-Output.
 *
 * @param file The file which should be corrected.
 * @return Returns the correct JSON-String.
 */
public String readFileAndCorrectOutput(File file) {
    String jsonStringCorrected = "";
    StringBuffer sb = new StringBuffer();
    try {
        Scanner scanner = new Scanner(file);
        while (scanner.hasNext()) {
            String next = scanner.next();

            if (next.contains("ObjectId") || next.contains("ISODate")) {
                Matcher m = Pattern.compile(this.regEx)
                        .matcher(next);

                if (m.find()) {
                    next = next.replaceAll(this.regEx, this.innerString);
                }
            }
            //jsonStringCorrected += next;
            sb.append(next);
        }
        scanner.close();

        jsonStringCorrected = sb.toString();
        JSONObject jsonObject = new JSONObject(jsonStringCorrected);
        jsonStringCorrected = jsonObject.toString(2);
    } catch (FileNotFoundException ex) {
        Logger.getLogger(ReadFileAndSave.class.getName()).log(Level.SEVERE, null, ex);
    }
    return jsonStringCorrected;
}

/**
 * Utility function to read a json file part by part and save the parts to a separate json file.
 *
 * @param filepath The directory the parts are written to.
 * @param scanner  The scanner which contains the file and which returns the tokens from the file.
 * @param j        The counter of the file. As the file should change whenever the counter changes.
 * @param sb       The buffer that accumulates the tokens of the current part.
 * @return An empty string once the scanner has no more tokens.
 */
public String readPartOfFileAndSave(String filepath, Scanner scanner, int j, StringBuffer sb) {


    String jsonString = "";
    int i = 0;
    ++j;
    while (scanner.hasNext()) {
        String token = scanner.next();

        //jsonString += token;
        sb.append(token);
        if (token.contains("{")) {
            i++;
        }
        if (token.contains("}")) {
            i--;
        }
        if (i == 0) {
            jsonString = sb.toString();
            JSONObject jsonObject = new JSONObject(jsonString);
            jsonString = jsonObject.toString(2);
            saveFile(filepath, "actor", j, jsonString);
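            // Pass a fresh buffer so the next part starts empty (the method takes four arguments).
            jsonString = readPartOfFileAndSave(filepath, scanner, j, new StringBuffer());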
        }
    }
    return "";
}

Does anyone know how to solve this problem?

Edit

Here is a snippet of the file (the first 3 lines):

{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da0e", "inquiryPhase" : "Orientation", "displayName" : "Orientation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:45.409Z", "publishedClient" : "2014-12-08T13:40:45.409Z", "publishedServer" : { "$date" : 1418046045490 }, "_id" : { "$oid" : "5485aa5dc372cdbb21daea33" } }
{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da13", "inquiryPhase" : "Conceptualisation", "displayName" : "Conceptualisation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da13", "inquiryPhase" : "Conceptualisation", "displayName" : "Conceptualisation", "objectType" : "phase" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:46.867Z", "publishedClient" : "2014-12-08T13:40:46.867Z", "publishedServer" : { "$date" : 1418046046952 }, "_id" : { "$oid" : "5485aa5ec372cdbb21daea34" } }
{ "verb" : "access", "target" : { "id" : "5485a7050ac61b1339a4da1e", "inquiryPhase" : "Investigation", "displayName" : "Investigation", "objectType" : "phase" }, "generator" : { "id" : "5485a7050ac61b1339a4da09", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "provider" : { "id" : "5485a7050ac61b1339a4da09", "inquiryPhase" : "ils", "displayName" : "LochemC", "objectType" : "ils", "url" : "http://graasp.eu/spaces/5485a7050ac61b1339a4da09" }, "object" : { "id" : "5485a7050ac61b1339a4da1e", "inquiryPhase" : "Investigation", "displayName" : "Investigation", "objectType" : "phase" }, "actor" : { "id" : "Bas Kollöffel (UT)@5485a7050ac61b1339a4da09", "displayName" : "Bas Kollöffel (UT)", "objectType" : "person" }, "published" : "2014-12-08T13:40:48.582Z", "publishedClient" : "2014-12-08T13:40:48.582Z", "publishedServer" : { "$date" : 1418046048662 }, "_id" : { "$oid" : "5485aa60c372cdbb21daea35" } }

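【Comments】:

See docs.oracle.com/javase/7/docs/api/javax/swing/SwingWorker.html
A SAX/StAX-style streaming JSON parser looks useful for huge JSON files.
First problem: you are using string concatenation in a loop.
@Jon Skeet: Why is that actually a problem?
@X-Fate: See yoda.arachsys.com/java/strings.html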
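
The SwingWorker pointer in the first comment addresses the frozen JFrame: long-running work started from a listener blocks the event dispatch thread (EDT), so the UI cannot repaint. A minimal sketch of moving the conversion off the EDT follows. The class name ConversionWorker and the onFinished callback are hypothetical and not part of the question's code; the conversion itself is passed in as a Callable.

```java
import java.util.concurrent.Callable;
import javax.swing.SwingWorker;

// Hypothetical wrapper: runs a long conversion on a background thread.
class ConversionWorker extends SwingWorker<Integer, Void> {
    private final Callable<Integer> conversion; // the slow job, e.g. correct and split the file
    private final Runnable onFinished;          // UI update to perform afterwards

    ConversionWorker(Callable<Integer> conversion, Runnable onFinished) {
        this.conversion = conversion;
        this.onFinished = onFinished;
    }

    @Override
    protected Integer doInBackground() throws Exception {
        // Executed on a worker thread, so the EDT stays free to repaint the frame.
        return conversion.call();
    }

    @Override
    protected void done() {
        // Executed on the EDT once doInBackground() has completed:
        // a safe place to hide or dispose a view.
        onFinished.run();
    }
}
```

In the question's mongoExportAndSplitFilter(), the file handling would go into the Callable and the setVisible(false)/dispose() calls into onFinished, with execute() called on the worker instead of doing the work inline.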

Tags: java json swing

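【Solution 1】:

Do not read the whole file at once. Read it line by line, apply the corrections, and write the output as you go.

Also, it does not look like you need to parse and re-create the JSON here. You should be able to do all the processing you need at the raw text level.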
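
As a rough sketch of these two points, the correction can be applied to one line at a time and written straight to the target file, so that only a single line is held in memory. The class name MongoExportCorrector and the regular expression are assumptions: the question does not show its regEx field, so the pattern below simply unwraps ObjectId("...") and ISODate("...") into plain quoted strings.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper: corrects a mongoexport file without loading it whole.
class MongoExportCorrector {
    // Assumed pattern: matches ObjectId("...") or ISODate("...") and captures the inner value.
    private static final Pattern WRAPPER =
            Pattern.compile("(?:ObjectId|ISODate)\\(\\s*\"([^\"]*)\"\\s*\\)");

    // Works purely on text; no JSON parsing or re-serialization.
    static String correctLine(String line) {
        return WRAPPER.matcher(line).replaceAll("\"$1\"");
    }

    // Streams source to target line by line and returns the number of lines written.
    static long correctFile(Path source, Path target) throws IOException {
        long lines = 0;
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(correctLine(line));
                out.newLine();
                lines++;
            }
        }
        return lines;
    }
}
```

Because the sample file holds one JSON object per line, the same loop could also write each corrected line to its own part file if the split is still needed.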

And I do not think you need recursion in readPartOfFileAndSave(); everything can be done inside the outer loop.
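
One way the loop-only version might look is sketched below. It is not the asker's method: the class JsonObjectSplitter and its sink parameter are hypothetical, and it reads characters rather than Scanner tokens so that whitespace inside string values is preserved and braces inside strings are not counted. Each completed top-level object is handed to the sink, which could call something like the question's saveFile.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical splitter: finds top-level JSON objects with a single loop, no recursion.
class JsonObjectSplitter {
    // Passes every top-level object to sink and returns how many were found.
    static int split(Reader reader, Consumer<String> sink) throws IOException {
        StringBuilder current = new StringBuilder();
        int depth = 0;            // current nesting level of braces
        int count = 0;
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true directly after a backslash inside a string
        int c;
        while ((c = reader.read()) != -1) {
            char ch = (char) c;
            if (depth == 0 && ch != '{') {
                continue; // skip newlines and other text between objects
            }
            current.append(ch);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (ch == '\\') {
                    escaped = true;
                } else if (ch == '"') {
                    inString = false;
                }
            } else if (ch == '"') {
                inString = true;
            } else if (ch == '{') {
                depth++;
            } else if (ch == '}') {
                depth--;
                if (depth == 0) {
                    // One complete object: emit it and reuse the buffer for the next one.
                    sink.accept(current.toString());
                    current.setLength(0);
                    count++;
                }
            }
        }
        return count;
    }
}
```

Only the object currently being read is kept in memory. For real files the Reader should be buffered (for example a BufferedReader), since this sketch reads one character per call.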

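【Discussion】:

Actually, I do read the file token by token. I put it into Scanner scanner = new Scanner(file) and iterate over the whole file with scanner.hasNext(). Or what do you mean?
You iterate over the whole file, put the entire content into one String, then iterate over that String and write the output. What I mean is that you should get rid of that String (it is what consumes all your memory) and write the output as you go, rather than accumulating more in memory at once than you need.
Yes, now I see what you mean, but how should I get rid of the String? I need it to process the data in some other functions. Passing the pieces on directly does not make sense in my context because, for example, I need the whole jsonString (one jsonString is expected to be a valid JSON file) to convert it to XML and then process the XML string recursively to build a new schema from the nodes, a so-called XES file.
Well, I can only comment on the code you have shown. To do what the snippet in the question does, you do not need the whole file content. Are there other tasks for which reading the whole file into memory is unavoidable? I highly doubt it, but without knowing what you have in mind that is only an educated guess based on experience (there is plenty of software that processes terabytes of data in all sorts of ways without ever loading it all into memory at once, and it is unlikely that your case is a rare exception).
Of course I do not need the whole file, but the code I provided should not read the whole file either. I intended to scan the file token by token until i == 0. When that is the case, that part of the file should be saved. Then the scanner should continue until i == 0 again, and so on until there are no more tokens and the scanner is closed. After that, I intend to convert each file into another temporary file. Once that is done, I will merge all the files and create one large file again. I hope this is clearer.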