Indexing and searching with the same analyzer returns no hits #1851

@SxunS

The following is the official Lucene 9.7 example; only the stored value has been changed.

    @org.junit.jupiter.api.Test
    public void test3() throws IOException, ParseException {
        Analyzer analyzer = new HanLPAnalyzer();

        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "中国人";
        doc.add(new TextField("fieldname", text, Field.Store.YES));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse(text);
        ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
        assertEquals(1, hits.length);
        // Iterate through the results:
        StoredFields storedFields = isearcher.storedFields();
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = storedFields.document(hits[i].doc);
            assertEquals("中国人", hitDoc.get("fieldname"));
        }
        ireader.close();
        directory.close();
        IOUtils.rm(indexPath);
    }

Output:

org.opentest4j.AssertionFailedError: 
Expected :1
Actual   :0

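While debugging, I found that at query time the analyzer tokenizes 中国人 as 中国, so the query finds nothing.
However, indexing and searching both use the same analyzer.

When I change the search string to "A 中国人", the document is found; in that case the query is tokenized correctly, into A and 中国人.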
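
To make the reasoning above concrete, here is a minimal toy model of exact term matching in plain Java. It is not Lucene or HanLP code: the class `TermMatchSketch` and its methods are hypothetical, and the indexed term 中国人 is an assumption inferred from the fact that the "A 中国人" query hits. It only shows why a query token 中国 cannot match a document indexed under the single term 中国人, while a query containing the token 中国人 can.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index; it only illustrates exact term matching.
class TermMatchSketch {
    // term -> ids of the documents that contain exactly that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document under the terms produced by index-time tokenization.
    void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // OR semantics: a document matches if it contains any of the query terms.
    Set<Integer> search(List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        TermMatchSketch index = new TermMatchSketch();
        // Assumption: index-time tokenization kept 中国人 as one term.
        index.addDocument(0, List.of("中国人"));

        // Query tokenized as 中国: no posting list for that term, so no hits.
        System.out.println(index.search(List.of("中国")));        // prints []
        // Query tokenized as A, 中国人: the second term matches document 0.
        System.out.println(index.search(List.of("A", "中国人"))); // prints [0]
    }
}
```

In this model the lookup is by whole term, so the shorter token 中国 is simply a different key from 中国人; that matches the behaviour described above, where only the query that yields the token 中国人 finds the document.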

Is this a bug or a feature?

System information

  • WIN11
  • HanLP-portable:1.8.4
  • hanlp-lucene-plugin:1.1.7
  • I've completed this form and searched the web for solutions.
