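Although the appendReplacement solution proposed by @zeppelin was impressively fast on the "heaviest" data, it turned out to be a nightmare for bigger maps.
The best solution so far is a combination of what we already had (StringUtils.replaceEach) and the suggested approach: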

    protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) {
        // Nothing to replace: return the input unchanged.
        if (replacementMap.isEmpty()) {
            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return str;
                }
            };
        }
        // Large maps: a huge regex alternation gets slow, so use StringUtils.replaceEach.
        if (replacementMap.size() > MAX_SIZE_FOR_REGEX) {
            final String[] searchStrings = new String[replacementMap.size()];
            final String[] replacementStrings = new String[replacementMap.size()];
            int counter = 0;
            for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) {
                searchStrings[counter] = replacementEntry.getValue();
                replacementStrings[counter] = replacementEntry.getKey().getValue();
                counter++;
            }
            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return StringUtils.replaceEach(str, searchStrings, replacementStrings);
                }
            };
        }
        // Small maps: build one alternation and replace in a single pass.
        final Map<String, String> replacements = new HashMap<>();
        StringBuilder patternBuilder = new StringBuilder();
        patternBuilder.append('(');
        for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) {
            replacements.put(entry.getValue(), entry.getKey().getValue());
            // Quote each search string so regex metacharacters are matched literally,
            // as they are in the replaceEach branch.
            patternBuilder.append(Pattern.quote(entry.getValue())).append('|');
        }
        // Drop the trailing '|'.
        patternBuilder.setLength(patternBuilder.length() - 1);
        patternBuilder.append(')');
        final Pattern pattern = Pattern.compile(patternBuilder.toString());
        return new BackReplacer() {
            @Override
            public String backReplace(String str) {
                if (str.isEmpty()) {
                    return str;
                }
                StringBuffer sb = new StringBuffer(str.length());
                Matcher matcher = pattern.matcher(str);
                while (matcher.find()) {
                    // quoteReplacement keeps '$' and '\' in the replacement text literal.
                    matcher.appendReplacement(sb,
                            Matcher.quoteReplacement(replacements.get(matcher.group(0))));
                }
                matcher.appendTail(sb);
                return sb.toString();
            }
        };
    }

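The regex branch above is only safe when search strings and replacement values are treated as literal text. The following is a minimal stand-alone sketch of that idea, not the module's actual code: the class name `QuotedAlternationReplacer` and its API are hypothetical, and sorting the search strings longest-first is an extra assumption (so that a search string is not shadowed by one of its own prefixes, since regex alternation takes the first alternative that matches).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the original module. */
public class QuotedAlternationReplacer {
    private final Map<String, String> replacements;
    private final Pattern pattern; // null when there is nothing to search for

    public QuotedAlternationReplacer(Map<String, String> searchToReplacement) {
        this.replacements = new HashMap<>(searchToReplacement);
        List<String> keys = new ArrayList<>();
        for (String key : replacements.keySet()) {
            if (!key.isEmpty()) { // an empty search string would match everywhere
                keys.add(key);
            }
        }
        // Assumption: prefer longer search strings over their own prefixes.
        keys.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(key)); // match the key literally
        }
        this.pattern = keys.isEmpty() ? null : Pattern.compile(sb.toString());
    }

    public String replace(String input) {
        if (pattern == null || input.isEmpty()) {
            return input;
        }
        StringBuffer out = new StringBuffer(input.length());
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            // Escape '$' and '\' so the replacement is inserted verbatim.
            String replacement = replacements.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("__A.1__", "price: $5"); // '.' and '$' are misread without quoting
        map.put("__B__", "C:\\temp");    // '\' is misread without quoting
        QuotedAlternationReplacer replacer = new QuotedAlternationReplacer(map);
        System.out.println(replacer.replace("x __A.1__ y __B__ z"));
    }
}
```

Without `Pattern.quote`, a search string such as `a.b` would also match `axb`; without `Matcher.quoteReplacement`, a replacement containing `$` is read as a group reference and can throw an exception.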
StringUtils algorithm (MAX_SIZE_FOR_REGEX=0):
type=TIMER, name=*.run, count=8127, min=4.239809, max=4235197.925261,
mean=645.736554, stddev=47197.97968925558, duration_unit=milliseconds
appendReplacement algorithm (MAX_SIZE_FOR_REGEX=1000000):
type=TIMER, name=*.run, count=8155, min=4.374516, max=7806145.439165999,
mean=1145.757953, stddev=86668.38562815856, duration_unit=milliseconds
Hybrid solution (MAX_SIZE_FOR_REGEX=5000):
type=TIMER, name=*.run, count=8155, min=3.5862789999999998, max=376242.25076799997,
mean=389.68986564688714, stddev=11733.9997814448, duration_unit=milliseconds
Our data:
type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023
type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023
This change halved the time spent in StringUtils.replaceEach compared with the previous solution and gave a 25% performance boost to our module, which is mostly IO-bound.