Testing Putting Data into Elasticsearch's Frozen Data Tier (2)
Introduction
Hello, I'm Sekiya, a Solutions Architect at Elastic.
In the previous article, we walked through setting up the Frozen tier. This time, we'll use a load-testing tool called Rally to see how data actually moves between tiers in operation.
Test Environment
- Elastic Cloud version 8.8.0, in the GCP asia-northeast1 region
- Hardware Profile: CPU Optimized
- Hot Tier: 45 GB Storage | 1 GB RAM | Up to 8 vCPU, in 1 Availability Zone (the smallest configuration available for Hot)
- Frozen Tier: 6.25 TB Storage | 4 GB RAM | Up to 2.5 vCPU, in 1 Availability Zone (also the smallest configuration available for Frozen)
- Rally installed and run on a local Mac
Test 1: Results with Hot Only (No Frozen)
The ILM policy was configured as follows: rollover happens in Hot only, after 50 GB accumulates or 30 days pass (these are the default policy settings for logs).
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "30d"
          }
        }
      }
    }
  }
}
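Once a policy like this is in place, you can check where each backing index sits in its lifecycle with the ILM explain API in Dev Tools (the data stream name below is the one used later in this test; substitute your own):

```
GET logs-kafka.log-default/_ilm/explain
```

The response shows, per backing index, the current phase (hot, frozen, and so on) and the action currently being executed.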
Results:

Test 2: With Frozen Enabled
Here, Frozen is enabled and the Hot index rolls over once 500 MB accumulates. To move data to Frozen immediately, min_age for the frozen phase is set to 0 days.
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "500mb"
          }
        }
      },
      "frozen": {
        "min_age": "0d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}
Results:

Searching Against Frozen
Let's check what happens when we search data that is in Frozen.
In Dev Tools, run the following commands to clear the cache already loaded onto the frozen node, and then look at the stats.
POST /_searchable_snapshots/cache/clear
GET /_searchable_snapshots/cache/stats
{
  "nodes": {
    "B4XI4QF-SCGOoccs-4-aNw": {
      "shared_cache": {
        "reads": 0,
        "bytes_read_in_bytes": 0,
        "writes": 0,
        "bytes_written_in_bytes": 0,
        "evictions": 0,
        "num_regions": 0,
        "size_in_bytes": 0,
        "region_size_in_bytes": 16777216
      }
    },
    "z1pa00yOSZK7Ita4vr8h3A": {
      "shared_cache": {
        "reads": 1464978,
        "bytes_read_in_bytes": 1504016278,
        "writes": 154,
        "bytes_written_in_bytes": 2583691264,
        "evictions": 139,
        "num_regions": 21888,
        "size_in_bytes": 367219703808,
        "region_size_in_bytes": 16777216
      }
    }
  }
}
This time we focus on the frozen node, the one with node ID z1pa00yOSZK7Ita4vr8h3A.
We'll watch the following two values:
- writes … the total number of writes from the object-storage snapshot into the frozen node's cache
- bytes_written_in_bytes … the total number of bytes written by those writes
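These two counters can also be pulled out of the stats response programmatically. A minimal sketch, using a trimmed copy of the response shown above rather than a live cluster:

```python
# Extract per-node shared-cache write counters from a
# GET /_searchable_snapshots/cache/stats response body.
# "stats" is a trimmed copy of the response shown above.
stats = {
    "nodes": {
        "B4XI4QF-SCGOoccs-4-aNw": {
            "shared_cache": {"writes": 0, "bytes_written_in_bytes": 0}
        },
        "z1pa00yOSZK7Ita4vr8h3A": {
            "shared_cache": {"writes": 154, "bytes_written_in_bytes": 2583691264}
        },
    }
}

def cache_writes(body: dict) -> dict:
    """Map node ID -> (writes, bytes_written_in_bytes)."""
    return {
        node_id: (
            node["shared_cache"]["writes"],
            node["shared_cache"]["bytes_written_in_bytes"],
        )
        for node_id, node in body["nodes"].items()
    }

print(cache_writes(stats)["z1pa00yOSZK7Ita4vr8h3A"])  # (154, 2583691264)
```

Against a real cluster you would feed in the parsed JSON body of the stats call instead of the hard-coded dict.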
Now, let's search. The load test indexed logs from 2023-01-01 through 2023-01-15, and they have been moved to Frozen in date order. (Strictly speaking, pinpointing exactly which indices have been frozen would require further investigation, but the older dates are the likely candidates.)
GET /logs-kafka.log-default/_search
{
  "track_total_hits": false,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "fields": [
    {
      "field": "*",
      "include_unmapped": "true"
    },
    {
      "field": "@timestamp",
      "format": "strict_date_optional_time"
    }
  ],
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2023-01-01T00:00:00.000Z",
              "lte": "2023-01-06T06:30:00.000Z"
            }
          }
        }
      ]
    }
  }
}
{
  "took": 4829,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": [
      {
        "_index": "partial-.ds-logs-kafka.log-default-2023.05.30-000002",
        "_id": "dbQSaogBX2reHfX1RR7s",
        "_score": null,
        "fields": {
          "host.os.name.text": [
            "CentOS Linux"
          ],
          ....
The first search took 4,829 ms. Checking the cache stats again:
"z1pa00yOSZK7Ita4vr8h3A": {
"shared_cache": {
"reads": 1571556,
"bytes_read_in_bytes": 1613129882,
"writes": 169,
"bytes_written_in_bytes": 2835349504,
"evictions": 154,
"num_regions": 21888,
"size_in_bytes": 367219703808,
"region_size_in_bytes": 16777216
}
}
- writes … 154 -> 169
- bytes_written_in_bytes … 2583691264 -> 2835349504 (an increase of about 240 MB)
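The roughly 240 MB figure comes straight from the difference between the two counter snapshots:

```python
# Bytes written to the frozen node's shared cache between
# the two stats calls shown above.
before = 2_583_691_264
after = 2_835_349_504
delta = after - before
print(delta, delta / 2**20)  # 251658240 bytes, i.e. exactly 240 MiB
```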
Below are the results of repeating the search while varying the search range.
1st search: range 2023-01-01 to 2023-01-06 … took 4829 ms, 169 writes
2nd search: range 2023-01-01 to 2023-01-06 … took 1237 ms, 169 writes
3rd search: range 2023-01-01 to 2023-01-10 … took 1992 ms, 175 writes
4th search: range 2023-01-01 to 2023-01-13 … took 1221 ms, 177 writes
5th search: range 2023-01-01 to 2023-01-16 … took 1011 ms, 177 writes
In the last search, the newly added part of the range (2023-01-13 to 2023-01-16) is Hot data, so the number of writes on the frozen node did not increase.
Performance Figures During the Load Test
Stack Monitoring was enabled during the load test to view the cluster's performance metrics. In both tests, the volume and speed of the uploaded data exceeded what the cluster was sized for, and the Elastic Cloud CPU credits were used up. With fewer CPUs then available, CPU utilization sat at 100% from partway through. In other words, utilization did not rise from 25% to 100% because the load increased; it hit 100% because the denominator, the number of available CPUs, shrank (see the Elastic Cloud documentation on CPU credits).
As a result, errors occurred in both tests: the bulk-index error rate was 41.32% in Test 1 (Hot only) and 67.57% in Test 2 (Frozen enabled).
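One rough way to compare the two runs is the error-adjusted indexing throughput. A back-of-the-envelope sketch using the "Mean Throughput" and "error rate" values from the Rally summaries further below (this is an estimate, not a metric Rally reports):

```python
# Error-adjusted mean bulk-index throughput (docs/s) for each test,
# combining Rally's mean throughput with its bulk-index error rate.
def effective_throughput(mean_docs_per_s: float, error_rate_pct: float) -> float:
    return mean_docs_per_s * (1 - error_rate_pct / 100)

hot_only = effective_throughput(3017.3, 41.32)      # Test 1: Hot only
with_frozen = effective_throughput(3108.86, 67.57)  # Test 2: Frozen enabled
print(hot_only, with_frozen)  # roughly 1770 vs 1008 docs/s
```

By this crude measure the Hot-only run indexed noticeably more documents successfully per second, consistent with the overhead of frequent rollover and movement to Frozen discussed in the summary.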
Performance in Test 1: Hot Only (No Frozen)


Performance in Test 2: Frozen Enabled



Details of the Rally Load Test
We used the elastic/logs track:
esrally race --track=elastic/logs --track-params="params.json" --target-hosts=XXX.es.asia-northeast1.gcp.cloud.es.io:443 --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:true,basic_auth_user:'XXX',basic_auth_password:'XXX'" --kill-running-processes
{
  "force_data_generation": true,
  "throttle_indexing": false,
  "bulk_indexing_clients": 4,
  "bulk_size": 5000,
  "raw_data_volume_per_day": "1GB",
  "max_total_download_gb": 2,
  "number_of_replicas": 0,
  "wait_for_status": "yellow",
  "start_date": "2023-01-01",
  "end_date": "2023-01-15",
  "integration_ratios": {
    "kafka": {
      "corpora": {
        "kafka-logs": 1.0
      }
    }
  }
}
Key Points
- bulk_indexing_clients: 4 sets the client parallelism, and bulk_size: 5000 sets the number of documents per POST
- Data for 2023-01-01 through 2023-01-15 is generated, uploading raw_data_volume_per_day: 1 GB of data for each day
- By default, several types of logs and a data stream for each are created, so integration_ratios was changed so that only one log type (kafka logs, chosen arbitrarily) is used
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "500mb"
          }
        }
      },
      "frozen": {
        "min_age": "0d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}
- Rally also generates the ILM settings it uses during the test, so unless you configure ILM in this file on the Rally side, the policy will be overwritten with this file's values when Rally runs.
% esrally race --track=elastic/logs --track-params="params.json" --target-hosts=xxx.es.asia-northeast1.gcp.cloud.es.io:443 --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:true,basic_auth_user:'xxx',basic_auth_password:'xxx'" --kill-running-processes
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [25054e47-f84d-43bb-8db7-359c87859791]
[INFO] Racing on track [elastic/logs], challenge [logging-indexing] and car ['external'] with version [8.8.0].
Running insert-pipelines [100% done]
Running insert-ilm [100% done]
Running delete-all-datastreams [100% done]
Running delete-all-composable-templates [100% done]
Running delete-all-component-templates [100% done]
Running create-all-component-templates [100% done]
Running create-all-composable-templates [100% done]
Running create-required-data-streams [100% done]
Running validate-package-template-installation [100% done]
Running update-custom-package-templates [100% done]
Running check-cluster-health [100% done]
Running wait-until-merges-finish [100% done]
Running bulk-index [100% done]
Running compression-stats [100% done]
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------------------------------:|-----------------:|-------:|
| Cumulative indexing time of primary shards | | 107.499 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0 | min |
| Max cumulative indexing time across primary shards | | 107.499 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 18.6241 | min |
| Cumulative merge count of primary shards | | 16 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 18.6241 | min |
| Cumulative merge throttle time of primary shards | | 4.39638 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 4.39638 | min |
| Cumulative refresh time of primary shards | | 0.19035 | min |
| Cumulative refresh count of primary shards | | 883 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0 | min |
| Max cumulative refresh time across primary shards | | 0.19035 | min |
| Cumulative flush time of primary shards | | 3.9492 | min |
| Cumulative flush count of primary shards | | 252 | |
| Min cumulative flush time across primary shards | | 1.66667e-05 | min |
| Median cumulative flush time across primary shards | | 1.66667e-05 | min |
| Max cumulative flush time across primary shards | | 3.94875 | min |
| Total Young Gen GC time | | 1203.2 | s |
| Total Young Gen GC count | | 107376 | |
| Total Old Gen GC time | | 4.465 | s |
| Total Old Gen GC count | | 10 | |
| Store size | | 3.53154 | GB |
| Translog size | | 0.0426439 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 48 | |
| Total Ingest Pipeline count | | 3.5673e+07 | |
| Total Ingest Pipeline time | | 5820.18 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | insert-pipelines | 41.71 | ops/s |
| Mean Throughput | insert-pipelines | 41.71 | ops/s |
| Median Throughput | insert-pipelines | 41.71 | ops/s |
| Max Throughput | insert-pipelines | 41.71 | ops/s |
| 100th percentile latency | insert-pipelines | 357.968 | ms |
| 100th percentile service time | insert-pipelines | 357.968 | ms |
| error rate | insert-pipelines | 0 | % |
| Min Throughput | insert-ilm | 12.43 | ops/s |
| Mean Throughput | insert-ilm | 12.43 | ops/s |
| Median Throughput | insert-ilm | 12.43 | ops/s |
| Max Throughput | insert-ilm | 12.43 | ops/s |
| 100th percentile latency | insert-ilm | 78.4338 | ms |
| 100th percentile service time | insert-ilm | 78.4338 | ms |
| error rate | insert-ilm | 0 | % |
| Min Throughput | validate-package-template-installation | 17.53 | ops/s |
| Mean Throughput | validate-package-template-installation | 17.53 | ops/s |
| Median Throughput | validate-package-template-installation | 17.53 | ops/s |
| Max Throughput | validate-package-template-installation | 17.53 | ops/s |
| 100th percentile latency | validate-package-template-installation | 56.8506 | ms |
| 100th percentile service time | validate-package-template-installation | 56.8506 | ms |
| error rate | validate-package-template-installation | 0 | % |
| Min Throughput | update-custom-package-templates | 34.05 | ops/s |
| Mean Throughput | update-custom-package-templates | 34.05 | ops/s |
| Median Throughput | update-custom-package-templates | 34.05 | ops/s |
| Max Throughput | update-custom-package-templates | 34.05 | ops/s |
| 100th percentile latency | update-custom-package-templates | 1086.34 | ms |
| 100th percentile service time | update-custom-package-templates | 1086.34 | ms |
| error rate | update-custom-package-templates | 0 | % |
| Min Throughput | bulk-index | 607.81 | docs/s |
| Mean Throughput | bulk-index | 3017.3 | docs/s |
| Median Throughput | bulk-index | 2891.53 | docs/s |
| Max Throughput | bulk-index | 3903.91 | docs/s |
| 50th percentile latency | bulk-index | 3521.55 | ms |
| 90th percentile latency | bulk-index | 8493.09 | ms |
| 99th percentile latency | bulk-index | 12535.7 | ms |
| 99.9th percentile latency | bulk-index | 18846.4 | ms |
| 99.99th percentile latency | bulk-index | 39834.5 | ms |
| 100th percentile latency | bulk-index | 42908.8 | ms |
| 50th percentile service time | bulk-index | 3521.55 | ms |
| 90th percentile service time | bulk-index | 8493.09 | ms |
| 99th percentile service time | bulk-index | 12535.7 | ms |
| 99.9th percentile service time | bulk-index | 18846.4 | ms |
| 99.99th percentile service time | bulk-index | 39834.5 | ms |
| 100th percentile service time | bulk-index | 42908.8 | ms |
| error rate | bulk-index | 41.32 | % |
| 100th percentile latency | compression-stats | 62054.3 | ms |
| 100th percentile service time | compression-stats | 62054.3 | ms |
| error rate | compression-stats | 100 | % |
[WARNING] Error rate is 41.32 for operation 'bulk-index'. Please check the logs.
[WARNING] Error rate is 100.0 for operation 'compression-stats'. Please check the logs.
[WARNING] No throughput metrics available for [compression-stats]. Likely cause: Error rate is 100.0%. Please check the logs.
-----------------------------------
[INFO] SUCCESS (took 14243 seconds)
-----------------------------------
% esrally race --track=elastic/logs --track-params="params.json" --target-hosts=xxx.es.asia-northeast1.gcp.cloud.es.io:443 --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:true,basic_auth_user:'xxx',basic_auth_password:'xxx'" --kill-running-processes
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
[INFO] Race id is [d851bc57-729a-401c-b6eb-1d11ca791985]
[INFO] Racing on track [elastic/logs], challenge [logging-indexing] and car ['external'] with version [8.8.0].
Running insert-pipelines [100% done]
Running insert-ilm [100% done]
Running delete-all-datastreams [100% done]
Running delete-all-composable-templates [100% done]
Running delete-all-component-templates [100% done]
Running create-all-component-templates [100% done]
Running create-all-composable-templates [100% done]
Running create-required-data-streams [100% done]
Running validate-package-template-installation [100% done]
Running update-custom-package-templates [100% done]
Running check-cluster-health [100% done]
Running wait-until-merges-finish [100% done]
Running bulk-index [100% done]
Running compression-stats [100% done]
------------------------------------------------------
_______ __ _____
/ ____(_)___ ____ _/ / / ___/_________ ________
/ /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
/ __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
------------------------------------------------------
| Metric | Task | Value | Unit |
|---------------------------------------------------------------:|---------------------------------------:|----------------:|-------:|
| Cumulative indexing time of primary shards | | 12.7184 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0 | min |
| Max cumulative indexing time across primary shards | | 12.7184 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 1.90947 | min |
| Cumulative merge count of primary shards | | 2 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 1.90947 | min |
| Cumulative merge throttle time of primary shards | | 0.187017 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0.187017 | min |
| Cumulative refresh time of primary shards | | 0.1045 | min |
| Cumulative refresh count of primary shards | | 860 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0 | min |
| Max cumulative refresh time across primary shards | | 0.104333 | min |
| Cumulative flush time of primary shards | | 1.45932 | min |
| Cumulative flush count of primary shards | | 47 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 1.66667e-05 | min |
| Max cumulative flush time across primary shards | | 1.45888 | min |
| Total Young Gen GC time | | 940.218 | s |
| Total Young Gen GC count | | 86361 | |
| Total Old Gen GC time | | 61.847 | s |
| Total Old Gen GC count | | 131 | |
| Store size | | 0.529193 | GB |
| Translog size | | 0.0852852 | GB |
| Heap used for segments | | 0 | MB |
| Heap used for doc values | | 0 | MB |
| Heap used for terms | | 0 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 6 | |
| Total Ingest Pipeline count | | 1.97065e+07 | |
| Total Ingest Pipeline time | | 3967.68 | s |
| Total Ingest Pipeline failed | | 0 | |
| Min Throughput | insert-pipelines | 60.57 | ops/s |
| Mean Throughput | insert-pipelines | 60.57 | ops/s |
| Median Throughput | insert-pipelines | 60.57 | ops/s |
| Max Throughput | insert-pipelines | 60.57 | ops/s |
| 100th percentile latency | insert-pipelines | 245.006 | ms |
| 100th percentile service time | insert-pipelines | 245.006 | ms |
| error rate | insert-pipelines | 0 | % |
| Min Throughput | insert-ilm | 12.34 | ops/s |
| Mean Throughput | insert-ilm | 12.34 | ops/s |
| Median Throughput | insert-ilm | 12.34 | ops/s |
| Max Throughput | insert-ilm | 12.34 | ops/s |
| 100th percentile latency | insert-ilm | 78.2425 | ms |
| 100th percentile service time | insert-ilm | 78.2425 | ms |
| error rate | insert-ilm | 0 | % |
| Min Throughput | validate-package-template-installation | 18.22 | ops/s |
| Mean Throughput | validate-package-template-installation | 18.22 | ops/s |
| Median Throughput | validate-package-template-installation | 18.22 | ops/s |
| Max Throughput | validate-package-template-installation | 18.22 | ops/s |
| 100th percentile latency | validate-package-template-installation | 54.335 | ms |
| 100th percentile service time | validate-package-template-installation | 54.335 | ms |
| error rate | validate-package-template-installation | 0 | % |
| Min Throughput | update-custom-package-templates | 16.01 | ops/s |
| Mean Throughput | update-custom-package-templates | 16.01 | ops/s |
| Median Throughput | update-custom-package-templates | 16.01 | ops/s |
| Max Throughput | update-custom-package-templates | 16.01 | ops/s |
| 100th percentile latency | update-custom-package-templates | 2309.3 | ms |
| 100th percentile service time | update-custom-package-templates | 2309.3 | ms |
| error rate | update-custom-package-templates | 0 | % |
| Min Throughput | bulk-index | 898.85 | docs/s |
| Mean Throughput | bulk-index | 3108.86 | docs/s |
| Median Throughput | bulk-index | 2970.43 | docs/s |
| Max Throughput | bulk-index | 3908.06 | docs/s |
| 50th percentile latency | bulk-index | 274.681 | ms |
| 90th percentile latency | bulk-index | 7161.03 | ms |
| 99th percentile latency | bulk-index | 12559.4 | ms |
| 99.9th percentile latency | bulk-index | 20234.2 | ms |
| 99.99th percentile latency | bulk-index | 25734.1 | ms |
| 100th percentile latency | bulk-index | 25973.1 | ms |
| 50th percentile service time | bulk-index | 274.681 | ms |
| 90th percentile service time | bulk-index | 7161.03 | ms |
| 99th percentile service time | bulk-index | 12559.4 | ms |
| 99.9th percentile service time | bulk-index | 20234.2 | ms |
| 99.99th percentile service time | bulk-index | 25734.1 | ms |
| 100th percentile service time | bulk-index | 25973.1 | ms |
| error rate | bulk-index | 67.57 | % |
| 100th percentile latency | compression-stats | 61464.8 | ms |
| 100th percentile service time | compression-stats | 61464.8 | ms |
| error rate | compression-stats | 100 | % |
[WARNING] Error rate is 67.57 for operation 'bulk-index'. Please check the logs.
[WARNING] Error rate is 100.0 for operation 'compression-stats'. Please check the logs.
[WARNING] No throughput metrics available for [compression-stats]. Likely cause: Error rate is 100.0%. Please check the logs.
----------------------------------
[INFO] SUCCESS (took 8542 seconds)
----------------------------------
Summary
Errors occurred because the environment we prepared did not have enough resources for the upload load. Even so, it appears that rolling data over and moving it to Frozen as frequently as in this test adds some overhead, which would explain the difference in upload errors and speed between the two runs.
In any case, we were able to confirm how data moves to Frozen as ingestion proceeds, so I consider this a worthwhile result.