1. Restoring an entire NS
% velero schedule get
NAME STATUS CREATED SCHEDULE BACKUP TTL LAST BACKUP SELECTOR
daily-full-backup Enabled 2022-05-06 16:26:40 +0900 KST @every 24h 168h0m0s 16h ago <none>
The cluster has a full backup that runs every day.
This test deletes objects from the cluster and restores them with a Velero backup.
It should come in handy when, say, an administrator accidentally wipes an NS.
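For reference, a schedule like the one above could presumably be created with something like the following sketch (name, interval, and TTL inferred from the `velero schedule get` output; not the exact command used here):
% velero schedule create daily-full-backup \
    --schedule="@every 24h" \
    --ttl 168h0m0s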
Picking a target NS
ybchoi@ybchoiui-MacBookPro ~ % k get all -n elasticsearch
NAME READY STATUS RESTARTS AGE
pod/elasticsearch-master-0 1/1 Running 0 2d
pod/elasticsearch-master-1 2/2 Running 0 22h
pod/elasticsearch-master-2 2/2 Running 0 2d4h
pod/kibana-kibana-5bd58fdc7-z6f6b 2/2 Running 0 23h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/elasticsearch-master ClusterIP 172.20.122.122 <none> 9200/TCP,9300/TCP 18d
service/elasticsearch-master-headless ClusterIP None <none> 9200/TCP,9300/TCP 18d
service/kibana-kibana ClusterIP 172.20.223.98 <none> 5601/TCP 18d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kibana-kibana 1/1 1 1 18d
NAME DESIRED CURRENT READY AGE
replicaset.apps/kibana-kibana-5bd58fdc7 1 1 1 11d
replicaset.apps/kibana-kibana-79f6c9c44 0 0 0 18d
NAME READY AGE
statefulset.apps/elasticsearch-master 3/3 18d
% k get cm,ing,secret -n elasticsearch
NAME DATA AGE
configmap/elasticsearch-master-config 1 18d
configmap/kube-root-ca.crt 1 18d
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/elastic-internal alb elastic.stg.pet-i.net,kibana.stg.pet-i.net internal-k8s-elastics-elastici-37ce2a8e4a-1352378594.ap-northeast-2.elb.amazonaws.com 80 15d
NAME TYPE DATA AGE
secret/default-token-v4v98 kubernetes.io/service-account-token 3 18d
secret/elastic-certificate-crt Opaque 1 18d
secret/elastic-certificate-pem Opaque 1 18d
secret/elastic-certificates Opaque 1 18d
secret/elastic-credentials Opaque 2 18d
secret/sh.helm.release.v1.elasticsearch.v1 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v10 helm.sh/release.v1 1 11d
secret/sh.helm.release.v1.elasticsearch.v2 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v3 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v4 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v5 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v6 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v7 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.elasticsearch.v8 helm.sh/release.v1 1 12d
secret/sh.helm.release.v1.elasticsearch.v9 helm.sh/release.v1 1 12d
secret/sh.helm.release.v1.kibana.v1 helm.sh/release.v1 1 18d
secret/sh.helm.release.v1.kibana.v2 helm.sh/release.v1 1 11d
I picked elasticsearch because it has a variety of complex settings.
Delete the elasticsearch NS to remove every related object.
% k delete ns elasticsearch --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
namespace "elasticsearch" force deleted
All objects, including PVs, were deleted.
Now let's restore the NS from a Velero backup.
% velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
daily-full-backup-20220524072758 Completed 0 0 2022-05-24 16:27:58 +0900 KST 6d default <none>
daily-full-backup-20220523072757 Completed 0 0 2022-05-23 16:27:57 +0900 KST 5d default <none>
daily-full-backup-20220522072731 Completed 0 0 2022-05-22 16:27:31 +0900 KST 4d default <none>
daily-full-backup-20220521072730 Completed 0 0 2022-05-21 16:27:30 +0900 KST 3d default <none>
daily-full-backup-20220520072729 Completed 0 0 2022-05-20 16:27:29 +0900 KST 2d default <none>
daily-full-backup-20220519072657 Completed 0 0 2022-05-19 16:26:58 +0900 KST 1d default <none>
daily-full-backup-20220518072656 Completed 0 0 2022-05-18 16:26:56 +0900 KST 7h default <none>
A backup from 5/24 exists.
Run a restore from that backup.
% velero restore create --from-backup daily-full-backup-20220524072758 --include-namespaces elasticsearch
Restore request "daily-full-backup-20220524072758-20220525084026" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525084026` or `velero restore logs daily-full-backup-20220524072758-20220525084026` for more details.
An error occurred:
Phase: PartiallyFailed (run 'velero restore logs daily-full-backup-20220524072758-20220525084026' for more information)
Total items to be restored: 56
Items restored: 56
Started: 2022-05-25 08:40:27 +0900 KST
Completed: 2022-05-25 08:40:33 +0900 KST
Errors:
Velero: <none>
Cluster: <none>
Namespaces:
elasticsearch: error restoring targetgroupbindings.elbv2.k8s.aws/elasticsearch/k8s-elastics-elastics-438cbbcb3a: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: unable to get target group IP address type: TargetGroupNotFound: One or more target groups not found
status code: 400, request id: c1725070-7bee-48b0-911d-571b184b78dc
error restoring targetgroupbindings.elbv2.k8s.aws/elasticsearch/k8s-elastics-kibanaki-8e737f4e76: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: unable to get target group IP address type: TargetGroupNotFound: One or more target groups not found
status code: 400, request id: 01f953ba-a94b-4b37-8a78-67d5dd3e61a4
This seems to be because target groups are AWS-side objects: rather than the original target groups, new ones were recreated by the ALB controller.
% k get targetgroupbindings -n elasticsearch
NAME SERVICE-NAME SERVICE-PORT TARGET-TYPE AGE
k8s-elastics-elastics-f4832a036d elasticsearch-master 9200 ip 12m
k8s-elastics-kibanaki-fdd96c45b6 kibana-kibana 5601 ip 12m
Setting that part aside, the rest of the restore went fine.
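To avoid the PartiallyFailed phase up front, the AWS-managed resource could presumably be excluded from the restore entirely (an untested sketch using velero's --exclude-resources flag):
% velero restore create --from-backup daily-full-backup-20220524072758 \
    --include-namespaces elasticsearch \
    --exclude-resources targetgroupbindings.elbv2.k8s.aws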
% k get all -n elasticsearch
NAME READY STATUS RESTARTS AGE
pod/elasticsearch-master-0 1/2 Running 0 3m15s
pod/elasticsearch-master-1 1/2 Running 0 3m15s
pod/elasticsearch-master-2 1/2 Running 0 3m14s
pod/kibana-kibana-5bd58fdc7-z6f6b 1/2 Running 0 3m14s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/elasticsearch-master ClusterIP 172.20.153.126 <none> 9200/TCP,9300/TCP 3m14s
service/elasticsearch-master-headless ClusterIP None <none> 9200/TCP,9300/TCP 3m14s
service/kibana-kibana ClusterIP 172.20.53.105 <none> 5601/TCP 3m14s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kibana-kibana 0/1 1 0 3m14s
NAME DESIRED CURRENT READY AGE
replicaset.apps/kibana-kibana-5bd58fdc7 1 1 0 3m14s
replicaset.apps/kibana-kibana-79f6c9c44 0 0 0 3m14s
NAME READY AGE
statefulset.apps/elasticsearch-master 0/3 3m14s
The deleted volumes were restored from snapshots, and the related objects were pulled from the S3 bucket where Velero keeps its backups and recreated.
Now check whether the cluster is back to normal.
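Following the Elasticsearch pod logs with something like the command below (the -c container name is an assumption based on the official helm chart) showed a transient warning:
% k logs -n elasticsearch elasticsearch-master-0 -c elasticsearch -f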
{"type": "server", "timestamp": "2022-05-24T23:42:56,695Z", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "unexpected error while indexing monitoring document", "cluster.uuid": "ucfWSHF_RFG9SdO5c6yQFA", "node.id": "Ym8Hi3E2RrOotMW5y7E61w" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2022.05.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2022.05.24][0]] containing [index {[.monitoring-es-7-2022.05.24][_doc][zsxy-IABufwlJQgR1LQd], source[{\\"cluster_uuid\\":\\"ucfWSHF_RFG9SdO5c6yQFA\\",\\"timestamp\\":\\"2022-05-24T23:41:56.586Z\\",\\"interval_ms\\":10000,\\"type\\":\\"node_stats\\",\\"source_node\\":{\\"uuid\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"host\\":\\"10.120.11.140\\",\\"transport_address\\":\\"10.120.11.140:9300\\",\\"ip\\":\\"10.120.11.140\\",\\"name\\":\\"elasticsearch-master-0\\",\\"timestamp\\":\\"2022-05-24T23:41:56.576Z\\"},\\"node_stats\\":{\\"node_id\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"node_master\\":false,\\"mlockall\\":false,\\"indices\\":{\\"docs\\":{\\"count\\":0},\\"store\\":{\\"size_in_bytes\\":0},\\"indexing\\":{\\"index_total\\":0,\\"index_time_in_millis\\":0,\\"throttle_time_in_millis\\":0},\\"search\\":{\\"query_total\\":0,\\"query_time_in_millis\\":0},\\"query_cache\\":{\\"memory_size_in_bytes\\":0,\\"hit_count\\":0,\\"miss_count\\":0,\\"evictions\\":0},\\"fielddata\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0},\\"segments\\":{\\"count\\":0,\\"memory_in_bytes\\":0,\\"terms_memory_in_bytes\\":0,\\"stored_fields_memory_in_bytes\\":0,\\"term_vectors_memory_in_bytes\\":0,\\"norms_memory_in_bytes\\":0,\\"points_memory_in_bytes\\":0,\\"doc_values_memory_in_bytes\\":0,\\"index_writer_memory_in_bytes\\":0,\\"version_map_memory_in_bytes\\":0,\\"fixed_bit_set_memory_in_bytes\\":0},\\"request_cache\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0,\\"hit_count\\":0,\\"miss_count\\":0}},\\"os\\":{\\"cpu\\":{\\"load_average\\":{\\"1m\\":1.1,\\"5m\\":0.61,\\"15m\\":0.56}},\\"cgroup\\":{\\"cpuacct\\":{\\"control_group\\":\\"/\\",\\"usage_nanos\\":44600464454},\\"cpu\\":{\\"control_group\\":\\"/\\",\\"cfs_period_micros\\":100000,\\"cfs_quota_micros\\":100000,\\"stat\\":{\\"number_of_elapsed_periods\\":489,\\"number_of_times_throttled\\":402,\\"time_throttled_nanos\\":40366716569}},\\"memory\\":{\\"control_group\\":\\"/\\",\\"limit_in_bytes\\":\\"3221225472\\",\\"usage_in_bytes\\":\\"2013106176\\"}}},\\"process\\":{\\"open_file_descriptors\\":335,\\"max_file_descriptors\\":1048576,\\"cpu\\":{\\"percent\\":84}},\\"jvm\\":{\\"mem\\":{\\"heap_used_in_bytes\\":124748288,\\"heap_used_percent\\":7,\\"heap_max_in_bytes\\":1610612736},\\"gc\\":{\\"collectors\\":{\\"young\\":{\\"collection_count\\":14,\\"collection_time_in_millis\\":720},\\"old\\":{\\"collection_count\\":0,\\"collection_time_in_millis\\":0}}}},\\"thread_pool\\":{\\"generic\\":{\\"threads\\":5,\\"queue\\":0,\\"rejected\\":0},\\"get\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"management\\":{\\"threads\\":1,\\"queue\\":0,\\"rejected\\":0},\\"search\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"watcher\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"write\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0}},\\"fs\\":{\\"total\\":{\\"total_in_bytes\\":31572529152,\\"free_in_bytes\\":30612402176,\\"available_in_bytes\\":30595624960},\\"io_stats\\":{\\"total\\":{\\"operations\\":126,\\"read_operations\\":6,\\"write_operations\\":120,\\"read_kilobytes\\":64,\\"write_kilobytes\\":1824}}}}}]}]]]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:135) ~[x-pack-monitoring-7.17.3.jar:7.17.3]",
"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]",
"at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]",
"at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:992) ~[?:?]",
"at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]",
"at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]",
"at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]",
"at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:136) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:117) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:192) [x-pack-security-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:625) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:620) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:97) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:1041) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:1013) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:1077) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:873) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:1032) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.3.jar:7.17.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2022.05.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2022.05.24][0]] containing [index {[.monitoring-es-7-2022.05.24][_doc][zsxy-IABufwlJQgR1LQd], source[{\\"cluster_uuid\\":\\"ucfWSHF_RFG9SdO5c6yQFA\\",\\"timestamp\\":\\"2022-05-24T23:41:56.586Z\\",\\"interval_ms\\":10000,\\"type\\":\\"node_stats\\",\\"source_node\\":{\\"uuid\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"host\\":\\"10.120.11.140\\",\\"transport_address\\":\\"10.120.11.140:9300\\",\\"ip\\":\\"10.120.11.140\\",\\"name\\":\\"elasticsearch-master-0\\",\\"timestamp\\":\\"2022-05-24T23:41:56.576Z\\"},\\"node_stats\\":{\\"node_id\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"node_master\\":false,\\"mlockall\\":false,\\"indices\\":{\\"docs\\":{\\"count\\":0},\\"store\\":{\\"size_in_bytes\\":0},\\"indexing\\":{\\"index_total\\":0,\\"index_time_in_millis\\":0,\\"throttle_time_in_millis\\":0},\\"search\\":{\\"query_total\\":0,\\"query_time_in_millis\\":0},\\"query_cache\\":{\\"memory_size_in_bytes\\":0,\\"hit_count\\":0,\\"miss_count\\":0,\\"evictions\\":0},\\"fielddata\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0},\\"segments\\":{\\"count\\":0,\\"memory_in_bytes\\":0,\\"terms_memory_in_bytes\\":0,\\"stored_fields_memory_in_bytes\\":0,\\"term_vectors_memory_in_bytes\\":0,\\"norms_memory_in_bytes\\":0,\\"points_memory_in_bytes\\":0,\\"doc_values_memory_in_bytes\\":0,\\"index_writer_memory_in_bytes\\":0,\\"version_map_memory_in_bytes\\":0,\\"fixed_bit_set_memory_in_bytes\\":0},\\"request_cache\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0,\\"hit_count\\":0,\\"miss_count\\":0}},\\"os\\":{\\"cpu\\":{\\"load_average\\":{\\"1m\\":1.1,\\"5m\\":0.61,\\"15m\\":0.56}},\\"cgroup\\":{\\"cpuacct\\":{\\"control_group\\":\\"/\\",\\"usage_nanos\\":44600464454},\\"cpu\\":{\\"control_group\\":\\"/\\",\\"cfs_period_micros\\":100000,\\"cfs_quota_micros\\":100000,\\"stat\\":{\\"number_of_elapsed_periods\\":489,\\"number_of_times_throttled\\":402,\\"time_throttled_nanos\\":40366716569}},\\"memory\\":{\\"control_group\\":\\"/\\",\\"limit_in_bytes\\":\\"3221225472\\",\\"usage_in_bytes\\":\\"2013106176\\"}}},\\"process\\":{\\"open_file_descriptors\\":335,\\"max_file_descriptors\\":1048576,\\"cpu\\":{\\"percent\\":84}},\\"jvm\\":{\\"mem\\":{\\"heap_used_in_bytes\\":124748288,\\"heap_used_percent\\":7,\\"heap_max_in_bytes\\":1610612736},\\"gc\\":{\\"collectors\\":{\\"young\\":{\\"collection_count\\":14,\\"collection_time_in_millis\\":720},\\"old\\":{\\"collection_count\\":0,\\"collection_time_in_millis\\":0}}}},\\"thread_pool\\":{\\"generic\\":{\\"threads\\":5,\\"queue\\":0,\\"rejected\\":0},\\"get\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"management\\":{\\"threads\\":1,\\"queue\\":0,\\"rejected\\":0},\\"search\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"watcher\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"write\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0}},\\"fs\\":{\\"total\\":{\\"total_in_bytes\\":31572529152,\\"free_in_bytes\\":30612402176,\\"available_in_bytes\\":30595624960},\\"io_stats\\":{\\"total\\":{\\"operations\\":126,\\"read_operations\\":6,\\"write_operations\\":120,\\"read_kilobytes\\":64,\\"write_kilobytes\\":1824}}}}}]}]]",
"... 11 more"] }
{"type": "server", "timestamp": "2022-05-24T23:42:56,732Z", "level": "WARN", "component": "o.e.x.m.MonitoringService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "monitoring execution failed", "cluster.uuid": "ucfWSHF_RFG9SdO5c6yQFA", "node.id": "Ym8Hi3E2RrOotMW5y7E61w" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks",
"at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:142) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:117) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:192) [x-pack-security-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:625) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:620) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:97) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:1041) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:1013) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:1077) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:873) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:1032) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.3.jar:7.17.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:131) ~[?:?]",
"... 28 more"] }
The error above did appear, but things soon settled back to normal.
% k get pod -n elasticsearch
NAME READY STATUS RESTARTS AGE
elasticsearch-master-0 2/2 Running 0 5m14s
elasticsearch-master-1 2/2 Running 0 5m14s
elasticsearch-master-2 2/2 Running 0 5m13s
kibana-kibana-5bd58fdc7-z6f6b 2/2 Running 0 5m13s
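As a final check, the cluster health API can be queried directly. A sketch, assuming the elastic-credentials secret holds a password key and the user is elastic; adjust http/https and the -k flag to match your TLS setup:
% ES_PASS=$(k get secret -n elasticsearch elastic-credentials -o jsonpath='{.data.password}' | base64 -d)
% k exec -n elasticsearch elasticsearch-master-0 -c elasticsearch -- \
    curl -sk -u "elastic:$ES_PASS" "https://localhost:9200/_cluster/health?pretty"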
Restore complete.
2. Restoring a specific app within an NS
Next, test restoring a single app inside an NS.
I ran this against grafana in the monitoring NS, which has a relatively large number of objects.
% k get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 23h
pod/event-exporter-kubernetes-event-exporter-6459f55cf9-f8ff5 1/1 Running 0 2d
pod/kube-prometheus-stack-grafana-85cd98fd6c-7d7qf 3/3 Running 0 2d
pod/kube-prometheus-stack-kube-state-metrics-d699cc95f-9zh5m 1/1 Running 0 23h
pod/kube-prometheus-stack-operator-67676bdcf9-gxvj7 1/1 Running 0 23h
pod/kube-prometheus-stack-prometheus-node-exporter-6fbtv 1/1 Running 0 23h
pod/kube-prometheus-stack-prometheus-node-exporter-9nstb 1/1 Running 0 2d
pod/kube-prometheus-stack-prometheus-node-exporter-sh8br 1/1 Running 0 2d4h
pod/loki-0 1/1 Running 0 23h
pod/prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 2d4h
pod/promtail-5wwpt 1/1 Running 0 2d
pod/promtail-8x8b9 1/1 Running 0 2d4h
pod/promtail-mgn9c 1/1 Running 0 23h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 18d
service/kube-prometheus-stack-alertmanager ClusterIP 172.20.164.56 <none> 9093/TCP 18d
service/kube-prometheus-stack-grafana ClusterIP 172.20.163.46 <none> 80/TCP 18d
service/kube-prometheus-stack-kube-state-metrics ClusterIP 172.20.55.44 <none> 8080/TCP 18d
service/kube-prometheus-stack-operator ClusterIP 172.20.142.167 <none> 443/TCP 18d
service/kube-prometheus-stack-prometheus ClusterIP 172.20.101.235 <none> 9090/TCP 18d
service/kube-prometheus-stack-prometheus-node-exporter ClusterIP 172.20.135.122 <none> 9100/TCP 18d
service/loki ClusterIP 172.20.235.215 <none> 3100/TCP 18d
service/loki-headless ClusterIP None <none> 3100/TCP 18d
service/prometheus-operated ClusterIP None <none> 9090/TCP 18d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/kube-prometheus-stack-prometheus-node-exporter 3 3 3 3 3 <none> 18d
daemonset.apps/promtail 3 3 3 3 3 <none> 18d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/event-exporter-kubernetes-event-exporter 1/1 1 1 15d
deployment.apps/kube-prometheus-stack-grafana 1/1 1 1 18d
deployment.apps/kube-prometheus-stack-kube-state-metrics 1/1 1 1 18d
deployment.apps/kube-prometheus-stack-operator 1/1 1 1 18d
NAME DESIRED CURRENT READY AGE
replicaset.apps/event-exporter-kubernetes-event-exporter-6459f55cf9 1 1 1 15d
replicaset.apps/kube-prometheus-stack-grafana-85cd98fd6c 1 1 1 18d
replicaset.apps/kube-prometheus-stack-kube-state-metrics-d699cc95f 1 1 1 18d
replicaset.apps/kube-prometheus-stack-operator-54d77bd667 0 0 0 11d
replicaset.apps/kube-prometheus-stack-operator-67676bdcf9 1 1 1 18d
NAME READY AGE
statefulset.apps/alertmanager-kube-prometheus-stack-alertmanager 1/1 18d
statefulset.apps/loki 1/1 18d
statefulset.apps/prometheus-kube-prometheus-stack-prometheus 1/1 18d
Delete the dashboards in Grafana, then restore from a backup to see whether the existing data comes back
(i.e., test whether Velero can recover data a user deleted by mistake).
I deleted the dashboard folder in its entirety.
There are several ways to restore, but I simply tried restoring everything matching a label.
Target label: app.kubernetes.io/name=grafana
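The labeled PVC, and from it the bound PV, can be listed with something like the following (assuming the grafana chart puts the standard label on the PVC):
% k get pvc -n monitoring -l app.kubernetes.io/name=grafana
% k get pv pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972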
persistentvolumeclaim/kube-prometheus-stack-grafana Bound pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972 10Gi RWO gp3 18d
persistentvolume/pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972 10Gi RWO Delete Bound monitoring/kube-prometheus-stack-grafana gp3 18d
The desired behavior is for the PV above to be restored from its snapshot.
Run the restore:
% velero restore create --from-backup daily-full-backup-20220524072758 --selector app.kubernetes.io/name=grafana
Restore request "daily-full-backup-20220524072758-20220525090753" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525090753` or `velero restore logs daily-full-backup-20220524072758-20220525090753` for more details.
Namespaces:
monitoring: could not restore, PersistentVolumeClaim "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, Pod "kube-prometheus-stack-grafana-85cd98fd6c-7d7qf" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, Service "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
The PVC was not restored.
Delete the PVC and try again.
% k delete pvc -n monitoring kube-prometheus-stack-grafana --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
persistentvolumeclaim "kube-prometheus-stack-grafana" force deleted
As the pod was re-provisioned, a new PVC was created again...
Since a PVC already exists, the restore was skipped:
Warnings:
Velero: <none>
Cluster: could not restore, PersistentVolume "pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972" already exists. Warning: the in-cluster version is different than the backed-up version.
Namespaces:
monitoring: could not restore, PersistentVolumeClaim "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, Endpoints "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, EndpointSlice "kube-prometheus-stack-grafana-62smz" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, Service "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
For now, I wiped the whole NS and restored it to get back to a known-good state.
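(Presumably via the same NS-level restore as in section 1; a sketch, not the verbatim command:)
% velero restore create --from-backup daily-full-backup-20220524072758 \
    --include-namespaces monitoring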
Testing again.
Performed the same steps up through deleting the dashboards.
Then set grafana replicas to 0 with the command below:
% k scale -n monitoring deploy kube-prometheus-stack-grafana --replicas=0
deployment.apps/kube-prometheus-stack-grafana scaled
The grafana pod is removed:
% k get deploy -n monitoring
NAME READY UP-TO-DATE AVAILABLE AGE
event-exporter-kubernetes-event-exporter 1/1 1 1 138m
kube-prometheus-stack-grafana 0/0 0 0 138m
In this state, delete the grafana PVC:
% k delete pvc -n monitoring kube-prometheus-stack-grafana
persistentvolumeclaim "kube-prometheus-stack-grafana" deleted
With no pod running, the PVC is deleted cleanly.
Run the restore with the grafana label selector:
% velero restore create --from-backup daily-full-backup-20220524072758 --selector app.kubernetes.io/name=grafana
Restore request "daily-full-backup-20220524072758-20220525114453" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525114453` or `velero restore logs daily-full-backup-20220524072758-20220525114453` for more details.
The objects were restored, but because the Deployment still has replicas=0, the ReplicaSet immediately terminates the restored pod:
% k get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 140m
event-exporter-kubernetes-event-exporter-6459f55cf9-f8ff5 1/1 Running 0 140m
kube-prometheus-stack-grafana-85cd98fd6c-7d7qf 0/3 Terminating 0 21s
Set replicas back to 1:
% k scale -n monitoring deploy kube-prometheus-stack-grafana --replicas=1
Restore complete.
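To wrap up, here is the app-level flow that finally worked, collected into one hedged shell sketch (names and backup taken from this test; adjust for your own cluster):

#!/usr/bin/env bash
# Recap of the grafana restore flow above; values are from this post.
set -euo pipefail

NS=monitoring
DEPLOY=kube-prometheus-stack-grafana
PVC=kube-prometheus-stack-grafana
BACKUP=daily-full-backup-20220524072758

# 1. Scale the app to zero so nothing holds or recreates the PVC.
kubectl scale -n "$NS" deploy "$DEPLOY" --replicas=0

# 2. Delete the now-unused PVC so Velero can recreate it from the snapshot.
kubectl delete pvc -n "$NS" "$PVC"

# 3. Restore only the objects matching the app label; --wait blocks
#    until the restore finishes instead of returning immediately.
velero restore create --from-backup "$BACKUP" \
    --selector app.kubernetes.io/name=grafana --wait

# 4. Scale back up once the restore has completed.
kubectl scale -n "$NS" deploy "$DEPLOY" --replicas=1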