Fetching all document URIs in MarkLogic using the Java Client API

【Question title】: Fetching all the document URIs in MarkLogic using the Java Client API
【Posted】: 2015-10-20 10:30:47
【Question】:

I am trying to fetch all documents from a database without knowing their exact URIs. I have this query:

DocumentPage documents = docMgr.read();
while (documents.hasNext()) {
    DocumentRecord document = documents.next();
    System.out.println(document.getUri());
}

But I don't have specific URIs; I want all of the documents.

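【Comments】:

What are you actually trying to accomplish? If you want to export content, MLCP would be easier. If you want to do some number crunching, it may be easier to do that inside MarkLogic.

Tags: java marklogic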

【Solution 1】:

The first step is to enable the URI lexicon on the database.

You can evaluate some XQuery and run cts:uris() (or server-side JavaScript and run cts.uris()):

    ServerEvaluationCall call = client.newServerEval()
        .xquery("cts:uris()");
    for ( EvalResult result : call.eval() ) {
        String uri = result.getString();
        System.out.println(uri);
    }

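The two drawbacks are: (1) you need a user with the required privileges; (2) there is no pagination.

If you have a small number of documents, pagination is not needed. For a large number of documents, however, pagination is recommended. Here is some code that uses the Search API with pagination: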

    // do the next eight lines just once
    String options =
        "<options xmlns='http://marklogic.com/appservices/search'>" +
        "  <values name='uris'>" +
        "    <uri/>" +
        "  </values>" +
        "</options>";
    QueryOptionsManager optionsMgr = client.newServerConfigManager().newQueryOptionsManager();
    optionsMgr.writeOptions("uriOptions", new StringHandle(options));

    // run the following each time you need to list all uris
    QueryManager queryMgr = client.newQueryManager();
    long pageLength = 10000;
    queryMgr.setPageLength(pageLength);
    ValuesDefinition query = queryMgr.newValuesDefinition("uris", "uriOptions");
    // the following "and" query just matches all documents
    query.setQueryDefinition(new StructuredQueryBuilder().and());
    int start = 1;
    boolean hasMore = true;
    Transaction transaction = client.openTransaction();
    try {
        while ( hasMore ) {
            CountedDistinctValue[] uriValues =
                queryMgr.values(query, new ValuesHandle(), start, transaction).getValues();
            for (CountedDistinctValue uriValue : uriValues) {
                String uri = uriValue.get("string", String.class);
                //System.out.println(uri);
            }
            start += uriValues.length;
            // this is the last page if uriValues is smaller than pageLength
            hasMore = uriValues.length == pageLength;
        }
    } finally {
        transaction.commit();
    }

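The transaction is only needed if you want a guaranteed "snapshot" list that is isolated from adds and deletes happening concurrently with this process. Since it adds some overhead, feel free to remove it if you don't need that level of exactness.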
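
The termination rule used in the loop above (stop as soon as a page comes back shorter than the page length) can be tried out without a server. The following is only a minimal sketch under stated assumptions: `UriPager`, `PageSource` and `ListSource` are hypothetical names introduced for illustration and are not part of the MarkLogic Java Client API; `ListSource` is an in-memory stand-in for the paged values call.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "short page means last page" loop, with an in-memory stand-in for the server. */
class UriPager {

    /** Hypothetical stand-in for a paged values call; start is 1-based, as in the answer. */
    interface PageSource {
        List<String> fetch(long start, long pageLength);
    }

    /** Collects every item by requesting pages until one comes back shorter than pageLength. */
    static List<String> fetchAll(PageSource source, long pageLength) {
        if (pageLength <= 0) {
            throw new IllegalArgumentException("pageLength must be positive");
        }
        List<String> all = new ArrayList<>();
        long start = 1;
        boolean hasMore = true;
        while (hasMore) {
            List<String> page = source.fetch(start, pageLength);
            all.addAll(page);
            start += page.size();
            // a short (or empty) page means there is nothing left to read
            hasMore = page.size() == pageLength;
        }
        return all;
    }

    /** In-memory source that serves slices of a fixed list and counts how often it is called. */
    static final class ListSource implements PageSource {
        private final List<String> items;
        int calls = 0;

        ListSource(List<String> items) {
            this.items = items;
        }

        @Override
        public List<String> fetch(long start, long pageLength) {
            calls++;
            // convert the 1-based start into a 0-based index, clamped to the list size
            int from = (int) Math.min(start - 1, items.size());
            int to = (int) Math.min(from + pageLength, items.size());
            return new ArrayList<>(items.subList(from, to));
        }
    }
}
```

One consequence of this rule: when the total number of URIs is an exact multiple of the page length, the loop makes one extra request that returns an empty page before it stops. That is harmless, but it is worth knowing when counting round trips.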

【Comments】:

Can you tell me whether we can specify the collection from which it should fetch the URIs?
Instead of StructuredQueryBuilder.and, you can use StructuredQueryBuilder.collection docs.marklogic.com/javadoc/client/com/marklogic/client/query/… to specify the query.
Found a better option: ServerEvaluationCall call = client.newServerEval().xquery("for $x in collection(\"RexUserProfiles\") return(fn:document-uri($x) )");
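
If the collection name in that last comment comes from a variable rather than a literal, the XQuery string has to be assembled with some care. The sketch below is one possible way to do it, not something taken from the thread: `CollectionUriQuery` and both of its methods are hypothetical helpers. It relies only on the XQuery rule that, inside a double-quoted string literal, `"` is written as `""` and `&` as `&amp;`.

```java
/** Sketch: build the collection-scoped XQuery from the comment without breaking on odd names. */
class CollectionUriQuery {

    /** Escapes text for use inside a double-quoted XQuery string literal. */
    static String escapeXQueryString(String value) {
        // '&' starts an entity reference in XQuery literals, and '"' is escaped by doubling it
        return value.replace("&", "&amp;").replace("\"", "\"\"");
    }

    /** Returns an XQuery expression that lists the URI of every document in one collection. */
    static String urisInCollection(String collection) {
        return "for $x in collection(\"" + escapeXQueryString(collection) + "\") "
             + "return fn:document-uri($x)";
    }
}
```

The resulting string can then be passed to client.newServerEval().xquery(...) exactly as in the comment, for example with urisInCollection("RexUserProfiles").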

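【Solution 2】:

Find out the page length and specify the starting point in queryMgr. Keep increasing the starting point and loop through all the URIs. I was able to fetch all URIs this way. It may not be a great approach, but it works.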

    List<String> uriList = new ArrayList<>();
    QueryManager queryMgr = client.newQueryManager();
    StructuredQueryBuilder qb = new StructuredQueryBuilder();
    // match documents that belong to all of the listed collections
    StructuredQueryDefinition querydef = qb.and(qb.collection("xxxx"), qb.collection("whatever"), qb.collection("whatever")); // outputs 241152
    // the first search is only used to learn the page length and the total result count
    SearchHandle results = queryMgr.search(querydef, new SearchHandle(), 1);
    long pageLength = results.getPageLength();
    long totalResults = results.getTotalResults();
    System.out.println("Total Results: " + totalResults);
    // start offsets are 1-based; advance by one page per iteration
    for (long start = 1; start <= totalResults; start += pageLength) {
        System.out.println("Printing Results from: " + start + " to: " + (start + pageLength - 1));
        results = queryMgr.search(querydef, new SearchHandle(), start);
        MatchDocumentSummary[] summaries = results.getMatchResults(); // 10 results because page length is 10
        for (MatchDocumentSummary summary : summaries) {
            // System.out.println("Extracted from URI-> " + summary.getUri());
            uriList.add(summary.getUri());
        }
        if (start >= 1000) { // cap on the number of URIs to retrieve, plus one page
            break;
        }
    }
    // drop any duplicate URIs before returning
    uriList = uriList.stream().distinct().collect(Collectors.toList());
    return uriList;
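
The paging arithmetic behind this approach can be checked in isolation. Search result offsets in the Java Client API are 1-based, and integer division of the total by the page length rounds down, so a page count needs a ceiling. The following is a minimal sketch; `PageStarts` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the start-offset arithmetic for paging through search results. */
class PageStarts {

    /** Returns the 1-based start offset of every page needed to cover totalResults items. */
    static List<Long> pageStarts(long totalResults, long pageLength) {
        if (pageLength <= 0) {
            throw new IllegalArgumentException("pageLength must be positive");
        }
        List<Long> starts = new ArrayList<>();
        for (long start = 1; start <= totalResults; start += pageLength) {
            starts.add(start);
        }
        return starts;
    }

    /** Number of pages, rounding up so that a partial last page is counted. */
    static long pageCount(long totalResults, long pageLength) {
        if (pageLength <= 0) {
            throw new IllegalArgumentException("pageLength must be positive");
        }
        return (totalResults + pageLength - 1) / pageLength;
    }
}
```

For example, 25 results with a page length of 10 need three searches, starting at offsets 1, 11 and 21, whereas 25 / 10 in integer arithmetic gives only 2.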
