How to pass the options JSON in the "Create Project" POST request of the OpenRefine REST API? Answers

【Question title】: How to pass the options JSON in the "Create Project" POST request of the OpenRefine REST API?
【Posted】: 2019-04-10 16:55:28
【Question】:

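I am currently trying to upload an Excel sheet (as .xls) to the OpenRefine (or OntoRefine) module of Ontotext's GraphDB. Because I ran into problems uploading the xls file, I decided to convert it to a csv file first and upload that instead. Unfortunately, OpenRefine does not automatically recognize the file as CSV every time, so all the data in each row ends up in a single column. For example: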

--------------------------------------------------
|      Col1,     Col2,     Col3,     Col4        |
--------------------------------------------------
|      Row11,     Row12,     Row13,     Row14    |
--------------------------------------------------
|      Row21,     Row22,     Row23,     Row24    |
--------------------------------------------------

instead of:

--------------------------------------------------
|      Col1    |  Col2    |  Col3    |  Col4     |
--------------------------------------------------
|      Row11   |  Row12   |  Row13   |  Row14    |
--------------------------------------------------
|      Row21   |  Row22   |  Row23   |  Row24    |
--------------------------------------------------

With the POST request

POST /command/core/create-project-from-upload

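the file format can be supplied in the 'format' parameter and a JSON containing the separator in the 'options' parameter. However, this does not work either, and the official OpenRefine documentation (https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API) gives no hint about the syntax of the 'options' JSON.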

My current code looks like this:

import os
import xlrd
import csv
import requests
import re

xls_file_name_ext = os.path.basename('excel_file.xls')

# build the path of the new csv file (same location and base name, new extension);
# os.path.dirname() of a bare file name is '', so joining with '/' would point at the filesystem root
csv_file_path = os.path.splitext(xls_file_name_ext)[0] + '.csv'

# replace every comma in the cell values with a space so it cannot be mistaken for a delimiter
xls_wb = xlrd.open_workbook(xls_file_name_ext)
xls_sheet = xls_wb.sheet_by_index(0)
for col in range(xls_sheet.ncols):
    for row in range(xls_sheet.nrows):
        _new_cell_val = str(xls_sheet.cell(row, col).value).replace(",", " ")
        xls_sheet._cell_values[row][col] = _new_cell_val

# write to csv
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
    c_w = csv.writer(csv_file, delimiter=',')
    for row in range(xls_sheet.nrows):
        c_w.writerow(xls_sheet.row_values(row))

ontorefine_server = 'http://localhost:7200/orefine'

# filename of csv as project name in OntoRefine
onterefine_project_name = os.path.splitext(os.path.basename(csv_file_path))[0]

# the required parameters for the post request
ontorefine_data = {"project-name": onterefine_project_name,
                   "format": "text/line-based/*sv",
                   "options": {
                       "separator": ","
                                }
                   }
ontorefine_file = {'project-file': open(csv_file_path, "rb")}

# execute the post request
ontorefine_response = requests.post(
    ontorefine_server + '/command/core/create-project-from-upload', data=ontorefine_data, files=ontorefine_file
)

I assume that I am passing the POST request parameters incorrectly.

【Comments】:

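Since your POST data looks fine, this may actually have to do with your format. Could you post a sample row from your dataset?
As mentioned, the data is in xls format and is then converted to CSV. Because it is sensitive data, I have made it unrecognizable. The data in CSV format (header and one data row) looks like this:
[C1];[C2];C3;C4;[C5];C6;C7;C8;C9;C10;C11;C12;C13;C14;C15;C16;C17;C18;C19;C20;C21;C22;C23;C24;C25;[C26];C27;C28;[C29];[C30];[C31];C32;C33;C34;C35;C36;C37;C38;C39;C40;C41;C42;C43;C44;C45;C46;C47;C48;C49;[C50];C51
ABC;1234;0A1; A AA 13 BB 13 CC;FOO, BAR;FOO_123;100;2;foo bar ;1f4+5b8+9;9000; FO876 ;01.01.1900;;;;;;;1.0.0;AB;;;;;ZY;1234;;1;ZY;;;;;;;;;;;;A;;;A1B;987;65;B;Z0; A AA 13 BB 13 CC;123456
In addition, the xls files usually have 2,000 to 10,000 rows, and files with more rows tend to be recognized correctly more often.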
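
The sample above is delimited with semicolons, while the script in the question writes commas and passes "," as the separator, so it may be worth checking which delimiter the uploaded file really contains. A minimal sketch of such a check follows; `guess_delimiter` is a hypothetical helper and the sample string is a shortened copy of the header shown above.

```python
from collections import Counter


def guess_delimiter(header_line, candidates=(",", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the header line.

    Returns None when none of the candidates appears at all.
    """
    counts = Counter({c: header_line.count(c) for c in candidates})
    delimiter, hits = counts.most_common(1)[0]
    return delimiter if hits > 0 else None


# Shortened version of the header line from the comment above.
sample_header = "[C1];[C2];C3;C4;[C5];C6"
print(guess_delimiter(sample_header))  # prints ;
```

The detected character could then be used as the "separator" value in the request instead of a hard-coded comma.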

Tags: openrefine graphdb

【Solution 1】:

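Of course this all depends on your input data, but the format looks fine. If you import from the UI, this is what OntoRefine does "behind the scenes". You can see the same payload for yourself by intercepting your network traffic: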

{
    "format": "text/line-based/*sv",
    "options": {
        "project-name": "Your-project-here",
        "separator": ","
    }
}
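
Form fields are flat strings, and `requests` does not turn a nested dict passed via `data=` into JSON, so the options object probably has to be serialized before it is sent. The following is a minimal sketch under that assumption; `build_form_fields` is a hypothetical helper, and whether OntoRefine expects the project name inside the options or as a separate field is not confirmed here.

```python
import json


def build_form_fields(project_name, separator=","):
    """Build flat form fields; the nested options are serialized to a JSON string."""
    options = {"separator": separator}
    return {
        "project-name": project_name,
        "format": "text/line-based/*sv",
        # A plain string value, so the form encoder sends the JSON text unchanged.
        "options": json.dumps(options),
    }


fields = build_form_fields("excel_file")
print(fields["options"])  # prints {"separator": ","}
```

The resulting dict could be passed as `data=fields` to the `requests.post` call from the question, alongside the unchanged `files` argument.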

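Compared with the request in the question, the placement of the project name appears to be the only difference. Here is a curl command that sends the same request: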

curl 'http://localhost:7200/orefine/command/core/importing-controller?controller=core%2Fdefault-importing-controller&jobID=1&subCommand=create-project' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data 'format=text%2Fline-based%2F*sv&options=%7B%22separator%22%3A%22%2C%22%2C%22projectName%22%3A%22Your-project-name%22%7D'
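
A URL-encoded body like the one in the curl command can also be assembled in Python instead of by hand. This is a sketch using only the standard library; `build_create_project_body` is a hypothetical helper, and the field names are taken from the curl command above rather than from official documentation.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_create_project_body(project_name, separator=","):
    """Return an application/x-www-form-urlencoded body with JSON-encoded options."""
    options = {"separator": separator, "projectName": project_name}
    return urlencode({
        "format": "text/line-based/*sv",
        # Compact JSON, then percent-encoded by urlencode together with the other field.
        "options": json.dumps(options, separators=(",", ":")),
    })


body = build_create_project_body("Your-project-name")

# Decode the body again to check that both fields survive the round trip.
decoded = parse_qs(body)
print(decoded["format"][0])
print(json.loads(decoded["options"][0]))
```

Only the values are percent-encoded; the `=` and `&` that separate the fields stay literal, which is what lets the server tell "format" and "options" apart.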

【Comments】: