【Posted】: 2019-12-09 20:35:30
【Question】:
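I'm working with HTML documents that were produced (programmatically) with Word's "save as HTML" option. The resulting HTML text file is windows-1252 encoded. (Yes, I've read a lot about bytes and Unicode code points; I know that in UTF-8 code points above 127 take 2, 3, or up to 4 bytes, and so on.) I added a lot of non-printable characters to my Word document template and wrote code to evaluate each CHARACTER by its decimal value. I already know I don't want to allow decimal #160, which is how MS Word writes a non-breaking space into the HTML. I expect that people will soon put more of these "illegal" constructs into templates, and I need to catch and handle them, because they render oddly in the browser. Here is a dump to the Eclipse console (I put all the document lines into a map):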
DataObj.paragraphMap : {1=, 2=Introduction and Learning Objective, 3=? ©®™§¶…‘’“”????, 4=, 5=, 6=,
7=This is paragraph 1 no formula, 8=,
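For reference, here is a minimal sketch of how individual windows-1252 bytes decode to Unicode code points. It is not part of my project; the class name `Cp1252Demo` and the helper `decode` are made up for illustration. It shows why decimal 160 shows up for the non-breaking space, and that "smart" punctuation from Word lands well above 255 once decoded.

```java
import java.nio.charset.Charset;

class Cp1252Demo {
    /** Decodes one windows-1252 byte and returns the resulting Unicode code point. */
    static int decode(int b) {
        String s = new String(new byte[] {(byte) b}, Charset.forName("windows-1252"));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Sample bytes: no-break space, copyright sign, ellipsis, left double quote, trademark sign
        int[] samples = {0xA0, 0xA9, 0x85, 0x93, 0x99};
        for (int b : samples) {
            System.out.printf("byte 0x%02X -> U+%04X%n", b, decode(b));
        }
    }
}
```

Bytes in the 0x80 to 0x9F range are the ones that differ from ISO-8859-1, so comparing decoded values against small decimal numbers only works for characters such as #160 that keep the same value in Unicode.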
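I replace decimal #160 with #32 (a regular space) and then write the characters to a new file using UTF-8 encoding. Is my thinking sound? Can I use this technique to replace specific characters, or to decide not to write them back at all, based on their decimal value? I want to avoid Strings because I may process multiple documents and don't want to run out of memory, so I'm doing it file to file: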
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public static void convert1252toUFT8(String fileName) throws IOException {
    int charsIn = 0;
    int charsOut = 0;
    // try-with-resources closes the reader and the writer even if an exception is thrown
    try (Reader r = new BufferedReader(
             new InputStreamReader(new FileInputStream(fileName), "windows-1252"));
         Writer writer = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8))) {
        int intch;
        while ((intch = r.read()) != -1) { // reads a single char and returns its integer value
            charsIn++;
            //System.out.println("intch=" + intch + " isValidCodePoint()=" + Character.isValidCodePoint(intch)
            //    + " isDefined()=" + Character.isDefined(intch) + " charCount()=" + Character.charCount(intch)
            //    + " char=" + (char) intch);
            // Always true for a value returned by read() (0..65535), so the else branch never runs
            if (Character.isValidCodePoint(intch)) {
                if (intch == 160) { // U+00A0, no-break space
                    intch = 32;     // regular space
                }
                writer.write(intch); // written straight through instead of being collected in a list
                charsOut++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } finally {
        System.out.println("Chars read in=" + charsIn + " Chars read out=" + charsOut);
    }
    // check that #160 was replaced
    //File f2 = new File(fileName + "x");
    //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8"));
    //int intch2;
    //while ((intch2 = r2.read()) != -1) { // reads a single char and returns its integer value
    //    System.out.println("intch2=" + intch2 + " isValidCodePoint()="
    //        + Character.isValidCodePoint(intch2) + " char=" + (char) intch2);
    //}
}
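To make the "replace or don't write back" idea concrete, here is a minimal sketch of what I have in mind, kept separate from the file handling so it can be tried on in-memory text. The class `CharFilter`, the methods `mapChar` and `filter`, and the choice of the soft hyphen (decimal 173) as a dropped character are all hypothetical and only illustrate the rule table; #160 to #32 is the one rule I actually use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class CharFilter {
    /**
     * Returns the character to write for ch, or -1 if ch should be dropped.
     * The rules here are illustrative, not a complete policy.
     */
    static int mapChar(int ch) {
        switch (ch) {
            case 160:  return 32;  // no-break space -> regular space
            case 173:  return -1;  // soft hyphen -> dropped (hypothetical rule)
            default:   return ch;  // everything else passes through unchanged
        }
    }

    /** Copies in to out one char at a time and returns the number of chars written. */
    static int filter(Reader in, Writer out) throws IOException {
        int written = 0;
        int ch;
        while ((ch = in.read()) != -1) {
            int mapped = mapChar(ch);
            if (mapped == -1) {
                continue; // skip characters that should not be written back
            }
            out.write(mapped);
            written++;
        }
        return written;
    }
}
```

For real files the same `filter` call would presumably take an `InputStreamReader` opened with "windows-1252" and an `OutputStreamWriter` opened with `StandardCharsets.UTF_8`, so memory use stays constant regardless of document size.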
【Comments】:
Tags: java file ms-word character-encoding character