【Question title】: Hibernate Search sorting with collation
【Posted】: 2020-03-16 09:38:54
【Question】:

I upgraded Hibernate Search from 4.3.0.Final to the latest stable version, 5.4.12.Final. Everything works fine except sorting of Norwegian words. In the old version, SortField had a constructor that took a locale:

/** Creates a sort, possibly in reverse, by terms in the given field sorted
   * according to the given locale.
   * @param field  Name of field to sort by, cannot be <code>null</code>.
   * @param locale Locale of values in the field.
   */
  public SortField (String field, Locale locale, boolean reverse) {
    initFieldType(field, STRING);
    this.locale = locale;
    this.reverse = reverse;
  }

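But in the new Hibernate Search, SortField has no locale. According to the Hibernate reference documentation (https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#_analysis), to sort words in a foreign language we should use CollationKeyFilterFactory together with a normalizer. But this version of Hibernate Search does not have that class. Maven pom: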

<dependency>
   <groupId>org.hibernate</groupId>
   <artifactId>hibernate-search-orm</artifactId>
   <version>5.11.5.Final</version>
</dependency>

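Question: what should I use or create to sort Norwegian words in Hibernate Search?

Right now I get this sort order:

test, btest, ctest, ztest, åtest, ætest, øtest

The correct order:

test, btest, ctest, ztest, ætest, øtest, åtest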
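
For illustration, here is a minimal sketch of where the two orders come from. A plain string sort compares UTF-16 code units, and å (U+00E5) comes before æ (U+00E6) and ø (U+00F8), which gives the order currently observed. A collator puts å last instead. The class name and the hand-written rule string are assumptions made for this sketch (lowercase letters only), so that the result does not depend on the JDK's locale data; real code would normally obtain a collator for the Norwegian locale rather than write rules by hand.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NorwegianOrderSketch {

    // Hypothetical, simplified rules: a-z followed by ae-ligature, o-slash, a-ring.
    // \u00e6 = æ, \u00f8 = ø, \u00e5 = å (escaped to avoid source-encoding issues).
    static final String RULES =
            "< a < b < c < d < e < f < g < h < i < j < k < l < m"
            + " < n < o < p < q < r < s < t < u < v < w < x < y < z"
            + " < \u00e6 < \u00f8 < \u00e5";

    /** Sorts by UTF-16 code unit, which is what String.compareTo does. */
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts with a collator built from the RULES string above. */
    static List<String> sortByCollation(List<String> words) {
        try {
            List<String> copy = new ArrayList<>(words);
            copy.sort(new RuleBasedCollator(RULES));
            return copy;
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "\u00f8test", "ztest", "\u00e5test", "btest", "\u00e6test", "ctest");
        // Code-unit order: btest, ctest, ztest, åtest, ætest, øtest
        System.out.println(sortByCodeUnit(words));
        // Collation order: btest, ctest, ztest, ætest, øtest, åtest
        System.out.println(sortByCollation(words));
    }
}
```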

There is a CollationKeyAnalyzer class, but I don't know how to use it for sorting:

  public final class CollationKeyAnalyzer extends Analyzer {
  private final CollationAttributeFactory factory;

  /**
   * Create a new CollationKeyAnalyzer, using the specified collator.
   *
   * @param collator CollationKey generator
   */
  public CollationKeyAnalyzer(Collator collator) {
    this.factory = new CollationAttributeFactory(collator);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    KeywordTokenizer tokenizer = new KeywordTokenizer(factory, KeywordTokenizer.DEFAULT_BUFFER_SIZE);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
}

A very similar question has no answers: How to do case insensitive sorting of Norwegian characters (Æ, Ø, and Å) using Hibernate Lucene Search?

【Comments】:

    Tags: java lucene hibernate-search


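    【Answer 1】:

    I'm not sure how much this will help you, but CollationKeyFilterFactory was deprecated and has in fact been removed.

    The class's Javadoc says:

    Deprecated.
    Use CollationKeyAnalyzer instead

    You can find the Javadoc here.

    【Comments】: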

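      【Answer 2】:

      But this version of Hibernate Search does not have that class.

      This part of the documentation looks outdated; I'll look into updating it.

      I found CollationKeyAnalyzer, but its javadoc states that it is obsolete and that ICUCollationKeyAnalyzer should be used instead.

      Try adding this dependency to your POM: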

      <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-analyzers-icu</artifactId>
         <version>5.5.5</version>
      </dependency>
      

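      Then create your own analyzer class that reimplements ICUCollationKeyAnalyzer with a hard-coded locale:

      import java.util.Locale;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.core.KeywordTokenizer;
      import org.apache.lucene.collation.ICUCollationAttributeFactory;
      import org.apache.lucene.util.Version;

      import com.ibm.icu.text.Collator;

      public class MyCollationKeyAnalyzer extends Analyzer {
          private final ICUCollationAttributeFactory factory;

          // luceneVersion is not used in this class
          public MyCollationKeyAnalyzer(Version luceneVersion) {
              // ICU's Collator (com.ibm.icu.text), hard-coded to Norwegian Bokmål
              this.factory = new ICUCollationAttributeFactory( Collator.getInstance( new Locale( "nb", "NO" ) ) );
          }

          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
              // Emit the whole field value as one token, backed by the collating attribute factory
              KeywordTokenizer tokenizer = new KeywordTokenizer(factory, KeywordTokenizer.DEFAULT_BUFFER_SIZE);
              return new TokenStreamComponents(tokenizer, tokenizer);
          }
      }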
      

      Then declare your field:

      @Entity
      @Indexed
      public class MyEntity {
      
          // ...
      
          @Field(name = "title_sort", index = Index.NO, normalizer = @Normalizer(impl = MyCollationKeyAnalyzer.class))
          @SortableField(forField = "title_sort")
          private String title;
      
         // ...
      }
      

      Then sort on that field like this:

      FullTextEntityManager ftEm = Search.getFullTextEntityManager( entityManager );
      QueryBuilder qb = ...; // The usual
      Query luceneQuery = ...; // The usual
      FullTextQuery ftQuery = ftEm.createFullTextQuery( luceneQuery, MyEntity.class );
      ftQuery.setSort( qb.sort().byField( "title_sort" ).createSort() );
      ftQuery.setMaxResults( 20 );
      List<MyEntity> hits = ftQuery.getResultList();
      

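      I haven't tried this myself, though, so please let us know whether it works for you.

      【Comments】:

      • ICUCollationKeyAnalyzer is final and cannot be extended.
      • You can copy the code from ICUCollationKeyAnalyzer into your own class; it's very simple. I updated the answer.
      【Answer 3】:

      To solve the sorting problem I created my own NorwegianCollationFactory. It is not a perfect solution, because I copied code from the old version of Hibernate Search (IndexableBinaryStringTools.class), but it works fine.
      NorwegianCollationFactory class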

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.util.TokenFilterFactory;
      
      import java.text.Collator;
      import java.util.Locale;
      import java.util.Map;
      
      public final class NorwegianCollationFactory extends TokenFilterFactory {
      
          public NorwegianCollationFactory(Map<String, String> args) {
              super(args);
          }
      
          @Override
          public TokenStream create(TokenStream input) {
              Collator norwegianCollator = Collator.getInstance(new Locale("no", "NO"));
              return new CollationKeyFilter(input, norwegianCollator);
          }
      
      }
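
The factory above relies on a property of collation keys: comparing the raw bytes of two keys, as unsigned values and most significant byte first, gives the same result as comparing the original strings with the collator. That is why indexing the key in place of the text yields a locale-aware sort. The following is a minimal sketch of that property using only java.text; the class and method names are made up for the illustration, and `Locale.forLanguageTag("no-NO")` is used as an equivalent of `new Locale("no", "NO")`.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationKeyBytesSketch {

    /** Compares two byte arrays as unsigned values, most significant byte first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter array sorts first
        return a.length - b.length;
    }

    /** Returns true if comparing the key bytes agrees in sign with Collator.compare. */
    static boolean keyBytesAgree(Collator collator, String left, String right) {
        byte[] leftKey = collator.getCollationKey(left).toByteArray();
        byte[] rightKey = collator.getCollationKey(right).toByteArray();
        return Integer.signum(compareUnsigned(leftKey, rightKey))
                == Integer.signum(collator.compare(left, right));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("no-NO"));
        // \u00e5 = å, \u00f8 = ø
        System.out.println(keyBytesAgree(collator, "ztest", "\u00e5test"));
        System.out.println(keyBytesAgree(collator, "\u00f8test", "\u00e5test"));
    }
}
```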
      

      CollationKeyFilter class

      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      
      import java.io.IOException;
      import java.text.Collator;
      import java.util.Objects;
      
      public final class CollationKeyFilter extends TokenFilter {
      
          // This code is copied from IndexableBinaryStringTools.class from the old version of hibernate search  4.3.0.Final
          private static final CollationKeyFilter.CodingCase[] CODING_CASES = {
                  new CollationKeyFilter.CodingCase(7, 1),
                  new CollationKeyFilter.CodingCase(14, 6, 2),
                  new CollationKeyFilter.CodingCase(13, 5, 3),
                  new CollationKeyFilter.CodingCase(12, 4, 4),
                  new CollationKeyFilter.CodingCase(11, 3, 5),
                  new CollationKeyFilter.CodingCase(10, 2, 6),
                  new CollationKeyFilter.CodingCase(9, 1, 7),
                  new CollationKeyFilter.CodingCase(8, 0)
          };
      
          private final Collator collator;
          private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      
          public CollationKeyFilter(TokenStream input, Collator collator) {
              super(input);
              this.collator = (Collator) collator.clone();
          }
      
          @Override
          public boolean incrementToken() throws IOException {
              if (input.incrementToken()) {
                  byte[] collationKey = collator.getCollationKey(termAtt.toString()).toByteArray();
                  int encodedLength = getBinaryStringEncodedLength(collationKey.length);
                  termAtt.resizeBuffer(encodedLength);
                  termAtt.setLength(encodedLength);
                  encodeToBinaryString(collationKey, collationKey.length, termAtt.buffer());
                  return true;
              } else {
                  return false;
              }
          }
      
          // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search  4.3.0.Final
          private void encodeToBinaryString(byte[] inputArray, int inputLength, char[] outputArray) {
              if (inputLength > 0) {
                  int inputByteNum = 0;
                  int caseNum = 0;
                  int outputCharNum = 0;
                  CollationKeyFilter.CodingCase codingCase;
                  for (; inputByteNum + CODING_CASES[caseNum].numBytes <= inputLength; ++outputCharNum) {
                      codingCase = CODING_CASES[caseNum];
                      if (codingCase.numBytes == 2) {
                          outputArray[outputCharNum] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                                  + (((inputArray[inputByteNum + 1] & 0xFF) >>> codingCase.finalShift) & codingCase.finalMask) & (short) 0x7FFF);
                      } else {
                          outputArray[outputCharNum] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                                  + ((inputArray[inputByteNum + 1] & 0xFF) << codingCase.middleShift)
                                  + (((inputArray[inputByteNum + 2] & 0xFF) >>> codingCase.finalShift) & codingCase.finalMask) & (short) 0x7FFF);
                      }
                      inputByteNum += codingCase.advanceBytes;
                      if (++caseNum == CODING_CASES.length) {
                          caseNum = 0;
                      }
                  }
                  codingCase = CODING_CASES[caseNum];
                  if (inputByteNum + 1 < inputLength) {
                      outputArray[outputCharNum++] = (char) ((((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift)
                              + ((inputArray[inputByteNum + 1] & 0xFF) << codingCase.middleShift)) & (short) 0x7FFF);
                      outputArray[outputCharNum] = (char) 1;
                  } else if (inputByteNum < inputLength) {
                      outputArray[outputCharNum++] = (char) (((inputArray[inputByteNum] & 0xFF) << codingCase.initialShift) & (short) 0x7FFF);
                      outputArray[outputCharNum] = caseNum == 0 ? (char) 1 : (char) 0;
                  } else {
                      outputArray[outputCharNum] = (char) 1;
                  }
              }
          }
      
          // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search 4.3.0.Final
          private int getBinaryStringEncodedLength(int inputLength) {
              return (int) ((8L * inputLength + 14L) / 15L) + 1;
          }
      
          // This code is copied from IndexableBinaryStringTools class from the old version of hibernate search 4.3.0.Final
          private static class CodingCase {
              int numBytes;
              int initialShift;
              int middleShift;
              int finalShift;
              int advanceBytes = 2;
              short middleMask;
              short finalMask;
      
              CodingCase(int initialShift, int middleShift, int finalShift) {
                  this.numBytes = 3;
                  this.initialShift = initialShift;
                  this.middleShift = middleShift;
                  this.finalShift = finalShift;
                  this.finalMask = (short) ((short) 0xFF >>> finalShift);
                  this.middleMask = (short) ((short) 0xFF << middleShift);
              }
      
              CodingCase(int initialShift, int finalShift) {
                  this.numBytes = 2;
                  this.initialShift = initialShift;
                  this.finalShift = finalShift;
                  this.finalMask = (short) ((short) 0xFF >>> finalShift);
                  if (finalShift != 0) {
                      advanceBytes = 1;
                  }
              }
          }
      
          @Override
          public boolean equals(Object o) {
              if (this == o) {
                  return true;
              }
              if (o == null || getClass() != o.getClass()) {
                  return false;
              }
              if (!super.equals(o)) {
                  return false;
              }
              CollationKeyFilter that = (CollationKeyFilter) o;
              return Objects.equals(collator, that.collator) &&
                      Objects.equals(termAtt, that.termAtt);
          }
      
          @Override
          public int hashCode() {
              return Objects.hash(super.hashCode(), collator, termAtt);
          }
      
      }
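
As a worked example of the length calculation in the filter above: each output char is masked with 0x7FFF, so it carries 15 bits of the collation key, and the encoder writes one extra char at the end. The sketch below repeats the same formula in a standalone class (the class name is made up for the illustration) and prints the encoded length for a few key sizes.

```java
final class EncodedLengthSketch {

    /** Same formula as getBinaryStringEncodedLength in the filter above. */
    static int encodedLength(int inputLength) {
        // ceil(8 * inputLength / 15) chars of 15-bit payload, plus one trailing char
        return (int) ((8L * inputLength + 14L) / 15L) + 1;
    }

    public static void main(String[] args) {
        // 1 byte -> 2 chars, 2 bytes -> 3 chars, 15 bytes -> 9 chars, 16 bytes -> 10 chars
        for (int bytes : new int[] {1, 2, 15, 16}) {
            System.out.println(bytes + " bytes -> " + encodedLength(bytes) + " chars");
        }
    }
}
```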
      

      Example entity mapping:

      @Entity
      @NormalizerDef(name = "textSortNormalizer",
              filters = {
                      @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                      @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                              @Parameter(name = "pattern", value = "('-&\\.,\\(\\))"),
                              @Parameter(name = "replacement", value = " "),
                              @Parameter(name = "replace", value = "all")
                      }),
                      @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                              @Parameter(name = "pattern", value = "([^0-9\\p{L} ])"),
                              @Parameter(name = "replacement", value = ""),
                              @Parameter(name = "replace", value = "all")
                      }),
                      @TokenFilterDef(factory = NorwegianCollationFactory.class)
              }
      )
      public class Entity {
      
          @Field(name = "name_for_sort", normalizer = @Normalizer(definition = "textSortNormalizer"))
          @SortableField(forField = "name_for_sort")
          private String name;
      
      }
      

      【Comments】:
