Python / NiFi: ExecuteScript (Python), converting UTF-16 text files to UTF-8 - Answers

【Question title】: Python / NiFi: ExecuteScript python, to convert UTF-16 text files to UTF-8
【Posted】: 2018-12-10 21:13:42
【Question description】:

I have an ExecuteScript processor, and I am trying to convert every file that passes through it to UTF-8 if it was originally UTF-16.

What I have so far:

flowFileList = session.get(100)
if not flowFileList.isEmpty():
  for flowFile in flowFileList: 
     # Process each FlowFile here:
     flowFileList.decode("utf-16").encode("utf-8")

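I feel this should be a fairly simple operation, as described in the following answers: here, here, and here.

This throws an error saying that the object has no attribute "decode".

If this is a silly question, feel free to say so. Thanks.

NiFi ExecuteScript recipes: Cookbook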

【Question discussion】:

Tags: python utf-8 apache-nifi

【Solution 1】:

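The problem is that you are calling decode on the flowFileList object (a list) rather than on an individual flowfile.

In addition, you need to actually read the flowfile content and then write the content back using the new encoding. Right now you are treating the flowfile object as a string, which it is not. I'm not at a computer right now, but I'll post working example code later.

Update

I will provide working Python code to demonstrate this, but why can't you just use the ConvertCharacterSet processor? It accepts an input character set and an output character set.

Here is working code that converts incoming flowfile content from UTF-16 to UTF-8. You should either filter out content that is already UTF-8 so that it skips this processor, or add code to detect it and treat it as a no-op. You may also be interested in following NIFI-4550 - Add InferCharacterSet processor, which would provide the same behavior.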

import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_16)
        outputStream.write(bytearray(text.encode('utf-8')))
# end class

flowFileList = session.get(100)
if not flowFileList.isEmpty():
    for flowFile in flowFileList:
        flowFile = session.write(flowFile, PyStreamCallback())
        flowFile = session.putAttribute(flowFile, 'script_character_set', 'UTF-8')
        session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end
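
The "detect it and treat it as a no-op" idea mentioned above could be sketched roughly as follows. This is plain Python that runs outside NiFi, not part of the answer's script; the helper name to_utf8 is hypothetical, and it assumes UTF-16 input always starts with a byte-order mark (BOM). BOM-less UTF-16 would pass through unconverted, and UTF-32 LE input (whose BOM begins with the same two bytes) would be misdetected.

```python
import codecs


def to_utf8(raw):
    """Re-encode raw bytes as UTF-8 if they start with a UTF-16 BOM.

    Hypothetical helper: anything without a UTF-16 BOM is returned unchanged.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM to choose the byte order
        # and drops it from the decoded text.
        return raw.decode("utf-16").encode("utf-8")
    # No UTF-16 BOM: assume the content is already usable and do nothing.
    return raw
```

Inside the NiFi script, the same check would have to be applied to the bytes read from inputStream in the callback before deciding whether to convert.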

【Discussion】:

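Unfortunately, I know nothing about Python. Thanks for the help, this is a good learning opportunity. I'll test it tomorrow
TL;DR: if you know which incoming content is UTF-16 and which is not, just route the UTF-16 content to a ConvertCharacterSet processor configured with explicit input and output character sets. If you don't, you will have to use code to determine the character set and then selectively convert it with the code above.
To answer why ConvertCharacterSet didn't work: it returns something completely beyond the pale, hence ExecuteScript
It throws an error at line 18, in the for loop, flowFile = session.write(flowFile, PyStreamCallback()), saying TypeError: write(): 1st arg can't be coerced to byte[]. Does it have something to do with the class? I think
Interestingly, I removed the bytearray before text.encode, which let the file through. However, just like ConvertCharacterSet, it returns random Chinese characters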
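
One possible explanation for the "random Chinese characters" (an assumption, not something confirmed in this thread) is that the incoming file was not actually UTF-16, or used a different byte order than the decoder assumed. Decoding ASCII or UTF-8 bytes as UTF-16 pairs them up into 16-bit code units, and pairs of lowercase Latin letters land in the CJK ideograph range. A minimal plain-Python sketch, using big-endian decoding because Java's UTF_16 charset assumes big-endian when no BOM is present:

```python
# Bytes that are really UTF-8 (here plain ASCII), 8 bytes long.
utf8_bytes = u"utfeight".encode("utf-8")

# Misinterpret them as big-endian UTF-16: every two bytes become one character.
garbled = utf8_bytes.decode("utf-16-be")

# Check whether every resulting character is a CJK Unified Ideograph.
is_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in garbled)

print(len(garbled), is_cjk)  # prints: 4 True
```

If the output looks like this, inspecting the first few bytes of the file for a BOM is a reasonable first step before choosing the input character set.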