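Trying Out Elasticsearch
I had no idea what Elasticsearch was, so I installed it and tried it out.
Preparation
First, set up a local environment.
OS: Windows 11 Home 22H2
Engine: elasticsearch 8.10.2
Tool: kibana 8.10.2
Language: C#
Download elasticsearch from the following site and extract it to a suitable folder.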
https://www.elastic.co/jp/downloads/elasticsearch
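Next, download kibana from the following site and extract it to any folder.
kibana is reportedly a tool that works together with Elasticsearch to explore, visualize, and analyze data.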
https://www.elastic.co/jp/downloads/kibana
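Because this is only a local environment for experimenting, I set security to false in the config file (config\elasticsearch.yml). Elasticsearch also failed with an OutOfMemory error at startup, so I edited config\jvm.options.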
xpack.security.enabled: false
-Xms4g
-Xmx4g
Next, open a command prompt, change to the elasticsearch-8.10.2\bin directory, and run the following commands to install the plugins.
elasticsearch-plugin install analysis-kuromoji
elasticsearch-plugin install analysis-icu
In the same elasticsearch-8.10.2\bin directory, start Elasticsearch.
elasticsearch.bat
Open a browser and enter the following URL. If cluster information is displayed, Elasticsearch is running.
http://localhost:9200/
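Next, open a command prompt, change to the bin directory under the folder where you extracted kibana, and start Kibana.
kibana.bat
Open a browser and go to the following URL. If the page is displayed correctly, Kibana is running.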
http://localhost:5601/
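Trying NGram
Running the following C# program creates an index whose settings define a tokenizer that uses NGram. The project file is shown first, followed by the program.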
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net7.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Elastic.Clients.Elasticsearch" Version="8.10.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="7.0.0" />
</ItemGroup>
</Project>
using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);
var isa = new IndexSettingsAnalysis();
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
MinGram = 1,
MaxGram = 10
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_ngram_tokenizer"
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
.Settings(s => s
.Index(
new IndexSettings()
{
Analysis = isa,
MaxNgramDiff = 10
})
)
);
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
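Check the resulting settings by running the following command in Kibana Dev Tools.
GET /my-index/_settings
Then tokenize "ロキソプロフェン錠60mg" with custom_ngram_analyzer.
GET /my-index/_analyze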
{
"analyzer": "custom_ngram_analyzer",
"text": "ロキソプロフェン錠60mg"
}
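To make the _analyze output easier to interpret, here is a minimal C# sketch of what an n-gram tokenizer with min_gram 1 and max_gram 10 produces. It is an approximation for illustration only: NGramSketch and Tokenize are hypothetical names, not part of Elasticsearch or its .NET client, and the sketch works on UTF-16 chars, which is sufficient for this sample text.

```csharp
using System;
using System.Collections.Generic;

// Same sample text and gram sizes as the index settings above.
var grams = NGramSketch.Tokenize("ロキソプロフェン錠60mg", 1, 10);
Console.WriteLine($"{grams.Count} grams");
Console.WriteLine(string.Join(" | ", grams.GetRange(0, 10)));

// Hypothetical helper, not part of Elasticsearch or its .NET client.
static class NGramSketch
{
    // For every start position, emit the substrings of length
    // minGram..maxGram that still fit inside the text.
    public static List<string> Tokenize(string text, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (int start = 0; start < text.Length; start++)
        {
            for (int len = minGram; len <= maxGram && start + len <= text.Length; len++)
            {
                grams.Add(text.Substring(start, len));
            }
        }
        return grams;
    }
}
```

For the 13-character sample text this sketch yields 85 grams, which gives a feel for how quickly an NGram field grows as max_gram increases.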
Trying kuromoji
Next, try kuromoji, a tokenizer that supports Japanese. Running the following C# program creates custom_kuromoji_analyzer.
using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);
var isa = new IndexSettingsAnalysis();
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
MinGram = 1,
MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
Mode = KuromojiTokenizationMode.Search,
DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_kuromoji_tokenizer",
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
.Settings(s => s
.Index(
new IndexSettings()
{
Analysis = isa,
MaxNgramDiff = 10
})
)
);
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
As before, tokenize "ロキソプロフェン錠60mg", this time with custom_kuromoji_analyzer.
Run the following command in Kibana Dev Tools.
GET /my-index/_analyze
{
"analyzer": "custom_kuromoji_analyzer",
"text": "ロキソプロフェン錠60mg"
}
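To compare the two analyzers side by side without switching to Kibana, the same _analyze request can also be sent from C#. The following is a minimal sketch using plain HttpClient rather than the Elastic client. It assumes Elasticsearch is running on localhost:9200 with security disabled and that my-index was created by the program above. AnalyzeSketch and BuildBody are hypothetical names introduced for this example.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Assumes a local Elasticsearch with security disabled, as set up earlier.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

foreach (var analyzer in new[] { "custom_ngram_analyzer", "custom_kuromoji_analyzer" })
{
    var body = AnalyzeSketch.BuildBody(analyzer, "ロキソプロフェン錠60mg");
    using var content = new StringContent(body, Encoding.UTF8, "application/json");
    using var response = await http.PostAsync("/my-index/_analyze", content);
    var json = await response.Content.ReadAsStringAsync();

    using var doc = JsonDocument.Parse(json);
    if (doc.RootElement.TryGetProperty("tokens", out var tokens))
    {
        // Each entry of "tokens" carries the token text in its "token" property.
        var texts = tokens.EnumerateArray().Select(t => t.GetProperty("token").GetString());
        Console.WriteLine($"{analyzer}: {string.Join(" | ", texts)}");
    }
    else
    {
        // Not a token list (for example an error response): print it as-is.
        Console.WriteLine(json);
    }
}

// Hypothetical helper that builds the same JSON body used in Dev Tools.
static class AnalyzeSketch
{
    public static string BuildBody(string analyzer, string text) =>
        JsonSerializer.Serialize(new { analyzer, text });
}
```

Printing one line per analyzer makes the difference between character-level n-grams and kuromoji's word-level tokens easy to see.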
Converting to kana with kuromoji
Next, use a kuromoji TokenFilter to convert the tokens to their kana readings. Running the following C# program creates custom_kuromoji_yomi_analyzer.
using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);
var isa = new IndexSettingsAnalysis();
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
MinGram = 1,
MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
Mode = KuromojiTokenizationMode.Search,
DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_kuromoji_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_kuromoji_tokenizer",
Filter = new List<string>{ "kuromoji_readingform" }
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
.Settings(s => s
.Index(
new IndexSettings()
{
Analysis = isa,
MaxNgramDiff = 10
})
)
);
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
As before, run the following command in Kibana Dev Tools to tokenize "ロキソプロフェン錠60mg" with custom_kuromoji_yomi_analyzer.
GET /my-index/_analyze
{
"analyzer": "custom_kuromoji_yomi_analyzer",
"text": "ロキソプロフェン錠60mg"
}
Removing unnecessary characters
Next, use a CharFilter to remove unnecessary characters before tokenization.
using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);
var isa = new IndexSettingsAnalysis();
// Charfilters
isa.CharFilters = new CharFilters();
isa.CharFilters.Add("custom_char_filter", new PatternReplaceCharFilter()
{
Pattern = "[0-9]|mg",
Replacement = ""
});
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
MinGram = 1,
MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
Mode = KuromojiTokenizationMode.Search,
DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
Tokenizer = "custom_kuromoji_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_kuromoji_tokenizer",
Filter = new List<string>{ "kuromoji_readingform" }
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
.Settings(s => s
.Index(
new IndexSettings()
{
Analysis = isa,
MaxNgramDiff = 10
})
)
);
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
As before, run the following command in Kibana Dev Tools to tokenize "ロキソプロフェン錠60mg" with custom_kuromoji_yomi_analyzer.
GET /my-index/_analyze
{
"analyzer": "custom_kuromoji_yomi_analyzer",
"text": "ロキソプロフェン錠60mg"
}
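The effect of the pattern-replace CharFilter can be previewed locally. The sketch below applies the same pattern with .NET's Regex class. It assumes the pattern behaves the same in .NET as in the Java regular expressions that Elasticsearch uses, which should hold for a pattern this simple. CharFilterSketch is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(CharFilterSketch.Apply("ロキソプロフェン錠60mg"));

// Hypothetical helper that mimics custom_char_filter outside Elasticsearch.
static class CharFilterSketch
{
    // Same pattern as the CharFilter: any half-width digit, or the literal "mg".
    public const string Pattern = "[0-9]|mg";

    // Replace every match with the empty string, as Replacement = "" does.
    public static string Apply(string text) => Regex.Replace(text, Pattern, "");
}
```

For the sample text the result is ロキソプロフェン錠. Note that [0-9] only matches half-width digits, so full-width digits such as 60 would pass through unchanged; if the data contains them, the pattern would need to be extended.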
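Next, add an EdgeNGram TokenFilter and append it to custom_kuromoji_yomi_analyzer, and apply the CharFilter to the other two analyzers as well. Only the changed parts are shown.
isa.TokenFilters.Add("custom_edge_ngram_filter", new EdgeNGramTokenFilter()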
{
MinGram = 1,
MaxGram = 10
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_kuromoji_tokenizer",
Filter = new List<string>{ "kuromoji_readingform", "custom_edge_ngram_filter" }
});
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_ngram_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_kuromoji_tokenizer",
});
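To illustrate what custom_edge_ngram_filter adds, here is a minimal C# sketch of edge n-grams: for each token, only the prefixes of length min_gram to max_gram are kept. EdgeNGramSketch and Prefixes are hypothetical names for this illustration, and the sample reading is simply a katakana string chosen as an example, not output captured from Elasticsearch.

```csharp
using System;
using System.Collections.Generic;

// Same gram sizes as custom_edge_ngram_filter (MinGram = 1, MaxGram = 10).
var prefixes = EdgeNGramSketch.Prefixes("ショウサンギン", 1, 10);
Console.WriteLine(string.Join(" | ", prefixes));

// Hypothetical helper, not part of Elasticsearch or its .NET client.
static class EdgeNGramSketch
{
    // Emit the prefixes of the token whose length is between
    // minGram and maxGram (capped at the token length).
    public static List<string> Prefixes(string token, int minGram, int maxGram)
    {
        var result = new List<string>();
        int longest = Math.Min(maxGram, token.Length);
        for (int len = minGram; len <= longest; len++)
        {
            result.Add(token.Substring(0, len));
        }
        return result;
    }
}
```

Because every prefix of a reading is stored as its own token, a search term that is only the beginning of a reading can still match, provided the search side does not split the term further.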
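Creating an index and searching
So far I have created three analyzers. Finally, use custom_ngram_analyzer and custom_kuromoji_yomi_analyzer to build a drug search index.
For the drug master, download the complete file (ZIP: 898 KB) from the following site.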
https://www.ssk.or.jp/smph/seikyushiharai/tensuhyo/kihonmasta/kihonmasta_04.html
The following program maps the fields to custom_ngram_analyzer and custom_kuromoji_yomi_analyzer, creates the index, and indexes the drug names.
Change the CSV path to match your environment.
using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);
var isa = new IndexSettingsAnalysis();
// Charfilters
isa.CharFilters = new CharFilters();
isa.CharFilters.Add("custom_char_filter", new PatternReplaceCharFilter()
{
Pattern = "[0-9]|mg",
Replacement = ""
});
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
MinGram = 1,
MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
Mode = KuromojiTokenizationMode.Search,
DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
isa.TokenFilters.Add("custom_edge_ngram_filter", new EdgeNGramTokenFilter()
{
MinGram = 1,
MaxGram = 10
});
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_ngram_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_kuromoji_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "custom_kuromoji_tokenizer",
Filter = new List<string>{ "kuromoji_readingform", "custom_edge_ngram_filter" }
});
isa.Analyzers.Add("keyword_search_analyzer", new CustomAnalyzer()
{
CharFilter = new List<string>{ "custom_char_filter" },
Tokenizer = "whitespace"
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
.Settings(s => s
.Index(
new IndexSettings()
{
Analysis = isa,
MaxNgramDiff = 10
})
)
.Mappings(map => map
.Properties(
new Properties<YakName>()
{
{ "yakNameNgram", new TextProperty()
{
Analyzer = "custom_ngram_analyzer",
SearchAnalyzer = "keyword_search_analyzer"
}
},
{ "yakNameKana", new TextProperty()
{
Analyzer = "custom_kuromoji_yomi_analyzer",
SearchAnalyzer = "keyword_search_analyzer"
}
}
}
)
)
);
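// The master file is Shift_JIS encoded; the fifth column (index 4) is used as the drug name.
using (StreamReader reader = new StreamReader(@"C:\es\y_ALL20230920.csv", Encoding.GetEncoding("Shift_JIS")))
{
// ReadLine returns null at the end of the file.
string? line;
while ((line = reader.ReadLine()) != null)
{
// Simple split: assumes no field contains a comma inside quotes.
string[] values = line.Split(',');
// Skip short or malformed rows.
if (values.Length <= 4)
{
continue;
}
var yak_name = values[4].Trim('"');
if (String.IsNullOrEmpty(yak_name))
{
continue;
}
// Store the same name in both fields; each field uses a different analyzer.
var yakName = new YakName
{
YakNameNgram = yak_name,
YakNameKana = yak_name
};
var indexRes = await client.IndexAsync(yakName, "my-index");
}
}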
Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
public class YakName
{
public string YakNameNgram { get; set; } = "";
public string YakNameKana { get; set; } = "";
}
Creating the index should take only a few minutes.
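The program above sends one request per drug name, which is likely a large part of why indexing takes minutes. As a possible alternative, the sketch below batches documents into a single request to the _bulk endpoint, using plain HttpClient rather than the Elastic client. BulkSketch and BuildBody are hypothetical names, the field names follow the mapping above, and the single sample name stands in for the names read from the CSV. Note that running it adds that sample document to my-index.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Sample input; in practice this would be the names read from the CSV.
var names = new[] { "ロキソプロフェン錠60mg" };

// Assumes a local Elasticsearch with security disabled, as set up earlier.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };
using var content = new StringContent(BulkSketch.BuildBody(names), Encoding.UTF8, "application/x-ndjson");
using var response = await http.PostAsync("/my-index/_bulk", content);
var json = await response.Content.ReadAsStringAsync();

// The bulk response reports per-item failures through the top-level "errors" flag.
using var doc = JsonDocument.Parse(json);
var hasErrors = doc.RootElement.TryGetProperty("errors", out var errors) && errors.GetBoolean();
Console.WriteLine($"HTTP {(int)response.StatusCode}, errors: {hasErrors}");

// Hypothetical helper that builds a newline-delimited JSON bulk body.
static class BulkSketch
{
    // Each document takes two lines: an action line and the document itself.
    // The body must end with a newline.
    public static string BuildBody(IEnumerable<string> drugNames)
    {
        var sb = new StringBuilder();
        foreach (var name in drugNames)
        {
            sb.Append("{\"index\":{}}\n");
            sb.Append(JsonSerializer.Serialize(new { yakNameNgram = name, yakNameKana = name }));
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

In a real run the names would probably be sent in chunks of some fixed size rather than all at once; the right chunk size is something to measure, not something this sketch decides.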
When it finishes, run the following commands in Kibana Dev Tools to search for drugs.
GET /my-index/_search
{
"query": {
"bool": {
"should": [
{ "match": { "yakNameNgram": "恵美須" } }
]
}
},
"sort" : [{"_score":"desc"}]
}
GET /my-index/_search
{
"query": {
"bool": {
"should": [
{ "match": { "yakNameKana": "ショウサンギン" } }
]
}
},
"sort" : [{"_score":"desc"}]
}
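The same search can also be run from C#. The sketch below posts a plain match query to the _search endpoint with HttpClient and prints the score and name of each hit. It is only one possible approach: SearchSketch and BuildMatchQuery are hypothetical names, the bool/should wrapper and the explicit sort are omitted on the assumption that a single match clause sorted by the default relevance order gives the same result, and it assumes the documents were indexed by the program above.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Assumes a local Elasticsearch with security disabled, as set up earlier.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

var body = SearchSketch.BuildMatchQuery("yakNameNgram", "恵美須");
using var content = new StringContent(body, Encoding.UTF8, "application/json");
using var response = await http.PostAsync("/my-index/_search", content);
var json = await response.Content.ReadAsStringAsync();

using var doc = JsonDocument.Parse(json);
if (doc.RootElement.TryGetProperty("hits", out var hits))
{
    // hits.hits is the array of matching documents, best score first by default.
    foreach (var hit in hits.GetProperty("hits").EnumerateArray())
    {
        var score = hit.GetProperty("_score").GetRawText();
        var name = hit.GetProperty("_source").GetProperty("yakNameNgram").GetString();
        Console.WriteLine($"{score}\t{name}");
    }
}
else
{
    // Not a search result (for example an error response): print it as-is.
    Console.WriteLine(json);
}

// Hypothetical helper that builds {"query":{"match":{field:term}}}.
static class SearchSketch
{
    public static string BuildMatchQuery(string field, string term) =>
        JsonSerializer.Serialize(new
        {
            query = new { match = new Dictionary<string, string> { [field] = term } }
        });
}
```

To search by reading instead, pass "yakNameKana" and a katakana term such as "ショウサンギン" to BuildMatchQuery.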