
1. Restoring an entire namespace

 

% velero schedule get
NAME                STATUS    CREATED                         SCHEDULE     BACKUP TTL   LAST BACKUP   SELECTOR
daily-full-backup   Enabled   2022-05-06 16:26:40 +0900 KST   @every 24h   168h0m0s     16h ago       <none>

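The cluster has a full backup that runs every day.

This test deletes objects in the cluster and then restores them from a Velero backup.

This would likely come in handy when, for example, an administrator deletes a namespace by mistake.

Choosing the target namespace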

% k get all -n elasticsearch
NAME                                READY   STATUS    RESTARTS   AGE
pod/elasticsearch-master-0          1/1     Running   0          2d
pod/elasticsearch-master-1          2/2     Running   0          22h
pod/elasticsearch-master-2          2/2     Running   0          2d4h
pod/kibana-kibana-5bd58fdc7-z6f6b   2/2     Running   0          23h

NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/elasticsearch-master            ClusterIP   172.20.122.122   <none>        9200/TCP,9300/TCP   18d
service/elasticsearch-master-headless   ClusterIP   None             <none>        9200/TCP,9300/TCP   18d
service/kibana-kibana                   ClusterIP   172.20.223.98    <none>        5601/TCP            18d

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kibana-kibana   1/1     1            1           18d

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/kibana-kibana-5bd58fdc7   1         1         1       11d
replicaset.apps/kibana-kibana-79f6c9c44   0         0         0       18d

NAME                                    READY   AGE
statefulset.apps/elasticsearch-master   3/3     18d

% k get cm,ing,secret -n elasticsearch
NAME                                    DATA   AGE
configmap/elasticsearch-master-config   1      18d
configmap/kube-root-ca.crt              1      18d

NAME                                         CLASS   HOSTS                                        ADDRESS                                                                                 PORTS   AGE
ingress.networking.k8s.io/elastic-internal   alb     elastic.stg.pet-i.net,kibana.stg.pet-i.net   internal-k8s-elastics-elastici-37ce2a8e4a-1352378594.ap-northeast-2.elb.amazonaws.com   80      15d

NAME                                          TYPE                                  DATA   AGE
secret/default-token-v4v98                    kubernetes.io/service-account-token   3      18d
secret/elastic-certificate-crt                Opaque                                1      18d
secret/elastic-certificate-pem                Opaque                                1      18d
secret/elastic-certificates                   Opaque                                1      18d
secret/elastic-credentials                    Opaque                                2      18d
secret/sh.helm.release.v1.elasticsearch.v1    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v10   helm.sh/release.v1                    1      11d
secret/sh.helm.release.v1.elasticsearch.v2    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v3    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v4    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v5    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v6    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v7    helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.elasticsearch.v8    helm.sh/release.v1                    1      12d
secret/sh.helm.release.v1.elasticsearch.v9    helm.sh/release.v1                    1      12d
secret/sh.helm.release.v1.kibana.v1           helm.sh/release.v1                    1      18d
secret/sh.helm.release.v1.kibana.v2           helm.sh/release.v1                    1      11d

elasticsearch was chosen because it carries a variety of complex configuration.

Delete the elasticsearch namespace to remove all related objects.

% k delete ns elasticsearch --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
namespace "elasticsearch" force deleted

All objects, including the PVs, were deleted.
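Namespace deletion is asynchronous, so it can help to confirm the namespace is really gone before starting a restore. A minimal sketch, assuming `kubectl` is on the PATH; the helper name `wait_for_namespace_gone` and the 300s default timeout are hypothetical choices, not part of the original test:

```shell
# Hypothetical helper: block until the given namespace no longer exists,
# or until the timeout expires.
#   $1 = namespace name
#   $2 = timeout (optional; 300s is an arbitrary default)
wait_for_namespace_gone() {
  kubectl wait --for=delete "namespace/$1" --timeout="${2:-300s}"
}

# Usage (not executed here):
#   wait_for_namespace_gone elasticsearch
```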

Now restore the namespace from a Velero backup.

% velero backup get
NAME                               STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
daily-full-backup-20220524072758   Completed   0        0          2022-05-24 16:27:58 +0900 KST   6d        default            <none>
daily-full-backup-20220523072757   Completed   0        0          2022-05-23 16:27:57 +0900 KST   5d        default            <none>
daily-full-backup-20220522072731   Completed   0        0          2022-05-22 16:27:31 +0900 KST   4d        default            <none>
daily-full-backup-20220521072730   Completed   0        0          2022-05-21 16:27:30 +0900 KST   3d        default            <none>
daily-full-backup-20220520072729   Completed   0        0          2022-05-20 16:27:29 +0900 KST   2d        default            <none>
daily-full-backup-20220519072657   Completed   0        0          2022-05-19 16:26:58 +0900 KST   1d        default            <none>
daily-full-backup-20220518072656   Completed   0        0          2022-05-18 16:26:56 +0900 KST   7h        default            <none>

A backup from 5/24 exists.
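Picking the backup by eye works for a one-off test; for scripting, the name of the newest completed backup could be extracted from the listing. A rough sketch, assuming every backup comes from the same schedule so that the timestamp suffix sorts lexically; the function name `latest_completed_backup` is hypothetical:

```shell
# Hypothetical helper: read `velero backup get` table output on stdin and
# print the name of the newest backup whose STATUS column is "Completed".
# Relies on names sharing one prefix and ending in a sortable timestamp.
latest_completed_backup() {
  awk '$1 != "NAME" && $2 == "Completed" { print $1 }' | sort | tail -n 1
}

# Usage (not executed here):
#   velero backup get | latest_completed_backup
```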

Run the restore from that backup.

% velero restore create --from-backup daily-full-backup-20220524072758 --include-namespaces elasticsearch
Restore request "daily-full-backup-20220524072758-20220525084026" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525084026` or `velero restore logs daily-full-backup-20220524072758-20220525084026` for more details.

An error occurred.

Phase:                       PartiallyFailed (run 'velero restore logs daily-full-backup-20220524072758-20220525084026' for more information)
Total items to be restored:  56
Items restored:              56

Started:    2022-05-25 08:40:27 +0900 KST
Completed:  2022-05-25 08:40:33 +0900 KST

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    elasticsearch:  error restoring targetgroupbindings.elbv2.k8s.aws/elasticsearch/k8s-elastics-elastics-438cbbcb3a: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: unable to get target group IP address type: TargetGroupNotFound: One or more target groups not found
  status code: 400, request id: c1725070-7bee-48b0-911d-571b184b78dc
      error restoring targetgroupbindings.elbv2.k8s.aws/elasticsearch/k8s-elastics-kibanaki-8e737f4e76: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: unable to get target group IP address type: TargetGroupNotFound: One or more target groups not found
  status code: 400, request id: 01f953ba-a94b-4b37-8a78-67d5dd3e61a4
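The `Phase:` line in the describe output above is the quickest signal of whether a restore needs attention. A small sketch that pulls just that field out, assuming the describe output keeps the `Phase:` layout shown here; the function name `restore_phase` is hypothetical:

```shell
# Hypothetical helper: print only the phase (e.g. Completed, PartiallyFailed)
# of a restore, taken from the "Phase:" line of `velero restore describe`.
#   $1 = restore name
restore_phase() {
  velero restore describe "$1" | awk '/^Phase:/ { print $2; exit }'
}

# Usage (not executed here):
#   phase=$(restore_phase daily-full-backup-20220524072758-20220525084026)
#   [ "$phase" = "Completed" ] || velero restore logs daily-full-backup-20220524072758-20220525084026
```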

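This appears to be because target groups are AWS objects: the original target groups no longer exist, and new target groups were created through the ALB instead.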

% k get targetgroupbindings -n elasticsearch
NAME                               SERVICE-NAME           SERVICE-PORT   TARGET-TYPE   AGE
k8s-elastics-elastics-f4832a036d   elasticsearch-master   9200           ip            12m
k8s-elastics-kibanaki-fdd96c45b6   kibana-kibana          5601           ip            12m

Apart from that, the restore went through fine.
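Since fresh TargetGroupBindings were created anyway, one way to avoid the PartiallyFailed phase might be to leave that resource type out of the restore. A minimal sketch, assuming the bindings are regenerated from the restored Ingress as observed in this test; the function name `restore_namespace_without_tgb` is hypothetical, while `--exclude-resources` is a standard `velero restore create` option:

```shell
# Hypothetical helper: restore one namespace from a backup while skipping
# TargetGroupBinding objects, which point at AWS target groups that no
# longer exist after the namespace was deleted.
#   $1 = backup name
#   $2 = namespace to restore
restore_namespace_without_tgb() (
  backup="$1"
  ns="$2"
  velero restore create \
    --from-backup "$backup" \
    --include-namespaces "$ns" \
    --exclude-resources targetgroupbindings.elbv2.k8s.aws
)

# Usage (not executed here):
#   restore_namespace_without_tgb daily-full-backup-20220524072758 elasticsearch
```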

% k get all -n elasticsearch
NAME                                READY   STATUS    RESTARTS   AGE
pod/elasticsearch-master-0          1/2     Running   0          3m15s
pod/elasticsearch-master-1          1/2     Running   0          3m15s
pod/elasticsearch-master-2          1/2     Running   0          3m14s
pod/kibana-kibana-5bd58fdc7-z6f6b   1/2     Running   0          3m14s

NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/elasticsearch-master            ClusterIP   172.20.153.126   <none>        9200/TCP,9300/TCP   3m14s
service/elasticsearch-master-headless   ClusterIP   None             <none>        9200/TCP,9300/TCP   3m14s
service/kibana-kibana                   ClusterIP   172.20.53.105    <none>        5601/TCP            3m14s

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kibana-kibana   0/1     1            0           3m14s

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/kibana-kibana-5bd58fdc7   1         1         0       3m14s
replicaset.apps/kibana-kibana-79f6c9c44   0         0         0       3m14s

NAME                                    READY   AGE
statefulset.apps/elasticsearch-master   0/3     3m14s

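The deleted volumes were restored from snapshots, and the related objects were recreated from the backup that Velero had stored in S3.

Now check whether the cluster has returned to normal.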

{"type": "server", "timestamp": "2022-05-24T23:42:56,695Z", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "unexpected error while indexing monitoring document", "cluster.uuid": "ucfWSHF_RFG9SdO5c6yQFA", "node.id": "Ym8Hi3E2RrOotMW5y7E61w" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2022.05.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2022.05.24][0]] containing [index {[.monitoring-es-7-2022.05.24][_doc][zsxy-IABufwlJQgR1LQd], source[{\\"cluster_uuid\\":\\"ucfWSHF_RFG9SdO5c6yQFA\\",\\"timestamp\\":\\"2022-05-24T23:41:56.586Z\\",\\"interval_ms\\":10000,\\"type\\":\\"node_stats\\",\\"source_node\\":{\\"uuid\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"host\\":\\"10.120.11.140\\",\\"transport_address\\":\\"10.120.11.140:9300\\",\\"ip\\":\\"10.120.11.140\\",\\"name\\":\\"elasticsearch-master-0\\",\\"timestamp\\":\\"2022-05-24T23:41:56.576Z\\"},\\"node_stats\\":{\\"node_id\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"node_master\\":false,\\"mlockall\\":false,\\"indices\\":{\\"docs\\":{\\"count\\":0},\\"store\\":{\\"size_in_bytes\\":0},\\"indexing\\":{\\"index_total\\":0,\\"index_time_in_millis\\":0,\\"throttle_time_in_millis\\":0},\\"search\\":{\\"query_total\\":0,\\"query_time_in_millis\\":0},\\"query_cache\\":{\\"memory_size_in_bytes\\":0,\\"hit_count\\":0,\\"miss_count\\":0,\\"evictions\\":0},\\"fielddata\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0},\\"segments\\":{\\"count\\":0,\\"memory_in_bytes\\":0,\\"terms_memory_in_bytes\\":0,\\"stored_fields_memory_in_bytes\\":0,\\"term_vectors_memory_in_bytes\\":0,\\"norms_memory_in_bytes\\":0,\\"points_memory_in_bytes\\":0,\\"doc_values_memory_in_bytes\\":0,\\"index_writer_memory_in_bytes\\":0,\\"version_map_memory_in_bytes\\":0,\\"fixed_bit_set_memory_in_bytes\\":0},\\"request_cache\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0,\\"hit_count\\":0,\\"miss_count\\":0}},\\"os\\":{\\"cpu\\":{\\"load_average\\":{\\"1m\\":1.1,\\"5m\\":0.61,\\"15m\\":0.56}},\\"cgroup\\":{\\"cpuacct\\":{\\"control_group\\":\\"/\\",\\"usage_nanos\\":44600464454},\\"cpu\\":{\\"control_group\\":\\"/\\",\\"cfs_period_micros\\":100000,\\"cfs_quota_micros\\":100000,\\"stat
\\":{\\"number_of_elapsed_periods\\":489,\\"number_of_times_throttled\\":402,\\"time_throttled_nanos\\":40366716569}},\\"memory\\":{\\"control_group\\":\\"/\\",\\"limit_in_bytes\\":\\"3221225472\\",\\"usage_in_bytes\\":\\"2013106176\\"}}},\\"process\\":{\\"open_file_descriptors\\":335,\\"max_file_descriptors\\":1048576,\\"cpu\\":{\\"percent\\":84}},\\"jvm\\":{\\"mem\\":{\\"heap_used_in_bytes\\":124748288,\\"heap_used_percent\\":7,\\"heap_max_in_bytes\\":1610612736},\\"gc\\":{\\"collectors\\":{\\"young\\":{\\"collection_count\\":14,\\"collection_time_in_millis\\":720},\\"old\\":{\\"collection_count\\":0,\\"collection_time_in_millis\\":0}}}},\\"thread_pool\\":{\\"generic\\":{\\"threads\\":5,\\"queue\\":0,\\"rejected\\":0},\\"get\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"management\\":{\\"threads\\":1,\\"queue\\":0,\\"rejected\\":0},\\"search\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"watcher\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"write\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0}},\\"fs\\":{\\"total\\":{\\"total_in_bytes\\":31572529152,\\"free_in_bytes\\":30612402176,\\"available_in_bytes\\":30595624960},\\"io_stats\\":{\\"total\\":{\\"operations\\":126,\\"read_operations\\":6,\\"write_operations\\":120,\\"read_kilobytes\\":64,\\"write_kilobytes\\":1824}}}}}]}]]]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:135) ~[x-pack-monitoring-7.17.3.jar:7.17.3]",
"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]",
"at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]",
"at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:992) ~[?:?]",
"at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]",
"at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]",
"at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]",
"at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:136) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:117) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:192) [x-pack-security-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:625) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:620) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:97) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:1041) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:1013) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:1077) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:873) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:1032) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.3.jar:7.17.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2022.05.24][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2022.05.24][0]] containing [index {[.monitoring-es-7-2022.05.24][_doc][zsxy-IABufwlJQgR1LQd], source[{\\"cluster_uuid\\":\\"ucfWSHF_RFG9SdO5c6yQFA\\",\\"timestamp\\":\\"2022-05-24T23:41:56.586Z\\",\\"interval_ms\\":10000,\\"type\\":\\"node_stats\\",\\"source_node\\":{\\"uuid\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"host\\":\\"10.120.11.140\\",\\"transport_address\\":\\"10.120.11.140:9300\\",\\"ip\\":\\"10.120.11.140\\",\\"name\\":\\"elasticsearch-master-0\\",\\"timestamp\\":\\"2022-05-24T23:41:56.576Z\\"},\\"node_stats\\":{\\"node_id\\":\\"Ym8Hi3E2RrOotMW5y7E61w\\",\\"node_master\\":false,\\"mlockall\\":false,\\"indices\\":{\\"docs\\":{\\"count\\":0},\\"store\\":{\\"size_in_bytes\\":0},\\"indexing\\":{\\"index_total\\":0,\\"index_time_in_millis\\":0,\\"throttle_time_in_millis\\":0},\\"search\\":{\\"query_total\\":0,\\"query_time_in_millis\\":0},\\"query_cache\\":{\\"memory_size_in_bytes\\":0,\\"hit_count\\":0,\\"miss_count\\":0,\\"evictions\\":0},\\"fielddata\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0},\\"segments\\":{\\"count\\":0,\\"memory_in_bytes\\":0,\\"terms_memory_in_bytes\\":0,\\"stored_fields_memory_in_bytes\\":0,\\"term_vectors_memory_in_bytes\\":0,\\"norms_memory_in_bytes\\":0,\\"points_memory_in_bytes\\":0,\\"doc_values_memory_in_bytes\\":0,\\"index_writer_memory_in_bytes\\":0,\\"version_map_memory_in_bytes\\":0,\\"fixed_bit_set_memory_in_bytes\\":0},\\"request_cache\\":{\\"memory_size_in_bytes\\":0,\\"evictions\\":0,\\"hit_count\\":0,\\"miss_count\\":0}},\\"os\\":{\\"cpu\\":{\\"load_average\\":{\\"1m\\":1.1,\\"5m\\":0.61,\\"15m\\":0.56}},\\"cgroup\\":{\\"cpuacct\\":{\\"control_group\\":\\"/\\",\\"usage_nanos\\":44600464454},\\"cpu\\":{\\"control_group\\":\\"/\\",\\"cfs_period_micros\\":100000,\\"cfs_quota_micros\\":100000,\\"stat\\":{\\"number_of_elapsed_periods\\":48
9,\\"number_of_times_throttled\\":402,\\"time_throttled_nanos\\":40366716569}},\\"memory\\":{\\"control_group\\":\\"/\\",\\"limit_in_bytes\\":\\"3221225472\\",\\"usage_in_bytes\\":\\"2013106176\\"}}},\\"process\\":{\\"open_file_descriptors\\":335,\\"max_file_descriptors\\":1048576,\\"cpu\\":{\\"percent\\":84}},\\"jvm\\":{\\"mem\\":{\\"heap_used_in_bytes\\":124748288,\\"heap_used_percent\\":7,\\"heap_max_in_bytes\\":1610612736},\\"gc\\":{\\"collectors\\":{\\"young\\":{\\"collection_count\\":14,\\"collection_time_in_millis\\":720},\\"old\\":{\\"collection_count\\":0,\\"collection_time_in_millis\\":0}}}},\\"thread_pool\\":{\\"generic\\":{\\"threads\\":5,\\"queue\\":0,\\"rejected\\":0},\\"get\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"management\\":{\\"threads\\":1,\\"queue\\":0,\\"rejected\\":0},\\"search\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"watcher\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0},\\"write\\":{\\"threads\\":0,\\"queue\\":0,\\"rejected\\":0}},\\"fs\\":{\\"total\\":{\\"total_in_bytes\\":31572529152,\\"free_in_bytes\\":30612402176,\\"available_in_bytes\\":30595624960},\\"io_stats\\":{\\"total\\":{\\"operations\\":126,\\"read_operations\\":6,\\"write_operations\\":120,\\"read_kilobytes\\":64,\\"write_kilobytes\\":1824}}}}}]}]]",
"... 11 more"] }
{"type": "server", "timestamp": "2022-05-24T23:42:56,732Z", "level": "WARN", "component": "o.e.x.m.MonitoringService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "monitoring execution failed", "cluster.uuid": "ucfWSHF_RFG9SdO5c6yQFA", "node.id": "Ym8Hi3E2RrOotMW5y7E61w" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks",
"at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:142) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:117) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:192) [x-pack-security-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:625) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:620) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:97) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:1041) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:1013) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:1077) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:873) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:1032) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.3.jar:7.17.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:131) ~[?:?]",
"... 28 more"] }

The errors above did occur, but the cluster returned to normal shortly afterwards.

% k get pod -n elasticsearch
NAME                            READY   STATUS    RESTARTS   AGE
elasticsearch-master-0          2/2     Running   0          5m14s
elasticsearch-master-1          2/2     Running   0          5m14s
elasticsearch-master-2          2/2     Running   0          5m13s
kibana-kibana-5bd58fdc7-z6f6b   2/2     Running   0          5m13s
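The difference between the 1/2 listing right after the restore and the 2/2 listing here can also be checked mechanically. A small sketch that filters a pod listing down to pods whose READY count is incomplete; the function name `not_ready_pods` is hypothetical:

```shell
# Hypothetical helper: read `kubectl get pod` table output on stdin and
# print the names of pods whose READY column (e.g. 1/2) is not full.
not_ready_pods() {
  awk '$1 == "NAME" { next }
       { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage (not executed here); empty output means every pod is fully ready:
#   kubectl get pod -n elasticsearch | not_ready_pods
```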

 

Restore complete.

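2. Restoring a specific app within a namespace

This test restores a specific app within a namespace.

The target is grafana in the monitoring namespace, which has a relatively large number of objects.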

%  k get all -n monitoring
NAME                                                            READY   STATUS    RESTARTS   AGE
pod/alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          23h
pod/event-exporter-kubernetes-event-exporter-6459f55cf9-f8ff5   1/1     Running   0          2d
pod/kube-prometheus-stack-grafana-85cd98fd6c-7d7qf              3/3     Running   0          2d
pod/kube-prometheus-stack-kube-state-metrics-d699cc95f-9zh5m    1/1     Running   0          23h
pod/kube-prometheus-stack-operator-67676bdcf9-gxvj7             1/1     Running   0          23h
pod/kube-prometheus-stack-prometheus-node-exporter-6fbtv        1/1     Running   0          23h
pod/kube-prometheus-stack-prometheus-node-exporter-9nstb        1/1     Running   0          2d
pod/kube-prometheus-stack-prometheus-node-exporter-sh8br        1/1     Running   0          2d4h
pod/loki-0                                                      1/1     Running   0          23h
pod/prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          2d4h
pod/promtail-5wwpt                                              1/1     Running   0          2d
pod/promtail-8x8b9                                              1/1     Running   0          2d4h
pod/promtail-mgn9c                                              1/1     Running   0          23h

NAME                                                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated                            ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   18d
service/kube-prometheus-stack-alertmanager               ClusterIP   172.20.164.56    <none>        9093/TCP                     18d
service/kube-prometheus-stack-grafana                    ClusterIP   172.20.163.46    <none>        80/TCP                       18d
service/kube-prometheus-stack-kube-state-metrics         ClusterIP   172.20.55.44     <none>        8080/TCP                     18d
service/kube-prometheus-stack-operator                   ClusterIP   172.20.142.167   <none>        443/TCP                      18d
service/kube-prometheus-stack-prometheus                 ClusterIP   172.20.101.235   <none>        9090/TCP                     18d
service/kube-prometheus-stack-prometheus-node-exporter   ClusterIP   172.20.135.122   <none>        9100/TCP                     18d
service/loki                                             ClusterIP   172.20.235.215   <none>        3100/TCP                     18d
service/loki-headless                                    ClusterIP   None             <none>        3100/TCP                     18d
service/prometheus-operated                              ClusterIP   None             <none>        9090/TCP                     18d

NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kube-prometheus-stack-prometheus-node-exporter   3         3         3       3            3           <none>          18d
daemonset.apps/promtail                                         3         3         3       3            3           <none>          18d

NAME                                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/event-exporter-kubernetes-event-exporter   1/1     1            1           15d
deployment.apps/kube-prometheus-stack-grafana              1/1     1            1           18d
deployment.apps/kube-prometheus-stack-kube-state-metrics   1/1     1            1           18d
deployment.apps/kube-prometheus-stack-operator             1/1     1            1           18d

NAME                                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/event-exporter-kubernetes-event-exporter-6459f55cf9   1         1         1       15d
replicaset.apps/kube-prometheus-stack-grafana-85cd98fd6c              1         1         1       18d
replicaset.apps/kube-prometheus-stack-kube-state-metrics-d699cc95f    1         1         1       18d
replicaset.apps/kube-prometheus-stack-operator-54d77bd667             0         0         0       11d
replicaset.apps/kube-prometheus-stack-operator-67676bdcf9             1         1         1       18d

NAME                                                               READY   AGE
statefulset.apps/alertmanager-kube-prometheus-stack-alertmanager   1/1     18d
statefulset.apps/loki                                              1/1     18d
statefulset.apps/prometheus-kube-prometheus-stack-prometheus       1/1     18d

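Delete dashboards in grafana, then restore from the backup and check whether the original data comes back.

(This tests whether Velero can recover data that a user deleted by mistake.)

An entire dashboard folder was deleted.

There are several ways to restore, but here the attempt is simply to restore everything matching a label.

Target label: app.kubernetes.io/name=grafana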

persistentvolumeclaim/kube-prometheus-stack-grafana                                                                          Bound    pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972   10Gi       RWO            gp3            18d
persistentvolume/pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972   10Gi       RWO            Delete           Bound    monitoring/kube-prometheus-stack-grafana                                                                          gp3                     18d

The desired behavior is for the PV above to be restored from its snapshot.
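Before restoring by label, it may be worth listing what currently carries that label, since objects that already exist in the cluster are the ones Velero reports as "already exists". A minimal sketch, assuming `kubectl` is on the PATH; the function name `list_objects_by_label` is hypothetical:

```shell
# Hypothetical helper: list the common workload objects plus PVCs in a
# namespace that match a label selector.
#   $1 = namespace
#   $2 = label selector, e.g. app.kubernetes.io/name=grafana
list_objects_by_label() {
  kubectl get all,pvc -n "$1" -l "$2"
}

# Usage (not executed here):
#   list_objects_by_label monitoring app.kubernetes.io/name=grafana
```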

Run the restore.

% velero restore create --from-backup  daily-full-backup-20220524072758 --selector app.kubernetes.io/name=grafana
Restore request "daily-full-backup-20220524072758-20220525090753" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525090753` or `velero restore logs daily-full-backup-20220524072758-20220525090753` for more details.
Namespaces:
    monitoring:  could not restore, PersistentVolumeClaim "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
                 could not restore, Pod "kube-prometheus-stack-grafana-85cd98fd6c-7d7qf" already exists. Warning: the in-cluster version is different than the backed-up version.
                 could not restore, Service "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.

The PVC was not restored.

Delete the PVC and try again.

% k delete pvc -n monitoring kube-prometheus-stack-grafana --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
persistentvolumeclaim "kube-prometheus-stack-grafana" force deleted

While the pod was being provisioned, a new PVC was created again.

Because a PVC already exists, the restore is not performed.

Warnings:
  Velero:     <none>
  Cluster:  could not restore, PersistentVolume "pvc-a7771ab4-00a6-4aaa-a6c8-fe7fa60f4972" already exists. Warning: the in-cluster version is different than the backed-up version.
  Namespaces:
    monitoring:  could not restore, PersistentVolumeClaim "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
                 could not restore, Endpoints "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.
                 could not restore, EndpointSlice "kube-prometheus-stack-grafana-62smz" already exists. Warning: the in-cluster version is different than the backed-up version.
                 could not restore, Service "kube-prometheus-stack-grafana" already exists. Warning: the in-cluster version is different than the backed-up version.

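For now, the whole namespace was deleted and restored to bring things back to a normal state.

Test again.

Repeat the same steps up to and including deleting the dashboards.

Set the grafana replicas to 0 with the command below.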

% k scale -n monitoring deploy kube-prometheus-stack-grafana --replicas=0
deployment.apps/kube-prometheus-stack-grafana scaled

The grafana pod was deleted.

% k get deploy -n monitoring
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
event-exporter-kubernetes-event-exporter   1/1     1            1           138m
kube-prometheus-stack-grafana              0/0     0            0           138m

In this state, delete the grafana PVC.

% k delete pvc -n monitoring kube-prometheus-stack-grafana
persistentvolumeclaim "kube-prometheus-stack-grafana" deleted

Since no pod is currently running, the PVC is deleted cleanly.

Run the restore with the grafana label selected.

% velero restore create --from-backup  daily-full-backup-20220524072758 --selector app.kubernetes.io/name=grafana
Restore request "daily-full-backup-20220524072758-20220525114453" submitted successfully.
Run `velero restore describe daily-full-backup-20220524072758-20220525114453` or `velero restore logs daily-full-backup-20220524072758-20220525114453` for more details.

The restore went through, but the pod was terminated by the ReplicaSet, since the Deployment is still scaled to 0.

% k get pod -n monitoring
NAME                                                        READY   STATUS        RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running       0          140m
event-exporter-kubernetes-event-exporter-6459f55cf9-f8ff5   1/1     Running       0          140m
kube-prometheus-stack-grafana-85cd98fd6c-7d7qf              0/3     Terminating   0          21s

Set replicas back to 1.

% k scale -n monitoring deploy kube-prometheus-stack-grafana --replicas=1

 

Restore complete.
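The sequence that finally worked (scale to 0, delete the PVC, restore by label, scale back to 1) could be collected into one function. A sketch under the assumptions of this test: the namespace, Deployment, PVC, and label names are the ones from this cluster, the function name `restore_grafana_from_backup` is hypothetical, and `--wait` is added so the scale-up only happens after the restore finishes:

```shell
# Hypothetical helper: restore grafana's PVC and objects from a backup.
# Each step runs only if the previous one succeeded.
#   $1 = backup name
restore_grafana_from_backup() (
  backup="$1"
  ns="monitoring"
  deploy="kube-prometheus-stack-grafana"
  pvc="kube-prometheus-stack-grafana"
  selector="app.kubernetes.io/name=grafana"

  # Stop the pod first so nothing holds the PVC.
  kubectl scale -n "$ns" deploy "$deploy" --replicas=0 &&
  # Remove the existing PVC so the restore does not skip it as "already exists".
  kubectl delete pvc -n "$ns" "$pvc" &&
  # Restore everything carrying the grafana label and wait for completion.
  velero restore create --from-backup "$backup" --selector "$selector" --wait &&
  # Bring grafana back up on the restored volume.
  kubectl scale -n "$ns" deploy "$deploy" --replicas=1
)

# Usage (not executed here):
#   restore_grafana_from_backup daily-full-backup-20220524072758
```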

 

 