This document provides details for troubleshooting and restoring the InfluxDB Cluster.
- InfluxDB Problems
  - InfluxDB Installation Problems
    - PVC in Pending Status
    - Pod in Pending Status
    - Two Relay Pods with Different Deployments
  - InfluxDB Pod is Down
    - Pod is Down
    - Pod Restarting
  - Read from InfluxDB Cluster Not Working
    - Returns "503 Unavailable" Http Status Code for /query
    - InfluxDB Pod Absent from the List of HAProxy Endpoints
  - Write to InfluxDB Cluster Not Working
    - Returns "Unable to write points"
    - Returns "503 Unavailable" Http Status Code for /write
    - InfluxDB Pod Absent from the List of Relay Endpoints
  - InfluxDB Pod Does Not Return into Cluster after Failover
  - No Metrics Available
    - Telegraf Has No Settings for Collecting Metrics
    - Component Does Not Expose Metrics
  - Corrupted TSM Files
    - Verify TSM Files
    - Fix Corrupted TSM Files
- Alarms
  - InfluxDB Component is Down
    - Possible Component Down Problems
    - Component Down Solution
  - CPU Usage by InfluxDB Component is Above {any}%
    - Possible CPU Usage Problems
    - CPU Usage Solution
  - RAM Usage by InfluxDB Component is Above {any}%
    - Possible RAM Usage Problems
    - RAM Usage Solution
  - Smoke Test Failed for InfluxDB Cluster
    - Possible Smoke Test Failed Problems
    - Smoke Test Failed Solution
InfluxDB Problems
The following sections describe the problems that can occur with the InfluxDB Cluster.
InfluxDB Installation Problems
This section describes the problems that occur during the deployment of the InfluxDB Cluster and their solutions.
PVC in Pending Status
During the deployment, a situation can occur when the Persistent Volume Claim (PVC) is created but cannot bind the Persistent Volume (PV).
There are three common causes:
- An incorrect PV name is specified in the deployment parameters.
- An incorrect storage-class does not allow patching an already created PV into the PVC.
- The PV is not in the Available status. For example, it can be in the Retain status.
Check PV Name or Mask
For deploying the InfluxDB cluster, specify the following two parameters related to PV:
- INFLUXDB_PV_MASK (earlier name PV_MASK, now deprecated) specifies the PV name mask that is used for creating PV claims.
- BACKUP_STORAGE_PV specifies the Persistent Volume name that is used to store backups.
Usually, problems are related to INFLUXDB_PV_MASK. A common mistake is to specify the deprecated PV_MASK variable with the latest version, where it is no longer supported. To solve this issue, use the INFLUXDB_PV_MASK deploy parameter.
Another issue occurs when the INFLUXDB_PV_MASK value is incorrect or is not specified for your environment. By default, this parameter has the following value:
INFLUXDB_PV_MASK = pv-influx-data-
But for a specific environment, the PV name may differ.
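To check which PVs exist in the environment and whether their names match the configured mask, you can list them and filter by the mask prefix (a minimal check; replace the prefix with your INFLUXDB_PV_MASK value):
# List PVs whose names match the configured mask
$ oc get pv | grep pv-influx-data-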
Check Storage Class
For deploying InfluxDB pods with provisioned PVs, the INFLUXDB_PV_CLASS parameter can be used. By default, it is not set. If a storage-class was specified when the PVC was created, then specify the same storage-class in the INFLUXDB_PV_CLASS parameter.
If you want to update an already installed environment without cleaning the project, you also need to specify this parameter. Otherwise, the installation scripts try to patch the PVC without the storage-class and it may lead to the following error:
Error from server (Invalid): PersistentVolumeClaim "pv-influxdb-data-1" is invalid:
spec: Forbidden: field is immutable after creation
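To see which storage class an existing PV or PVC already uses before setting INFLUXDB_PV_CLASS, you can query it directly (the object name below is taken from the error example above):
# Storage class of the PV and of the PVC
$ oc get pv pv-influxdb-data-1 -o jsonpath='{.spec.storageClassName}'
$ oc get pvc pv-influxdb-data-1 -o jsonpath='{.spec.storageClassName}'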
Check PV Status
During the deployment of the InfluxDB Cluster, the deploy scripts check the PV status and try to patch the PV to set the Available status. This step is skipped if the deploy job is run by a user who has no rights to work with PVs.
In this case, the PV already exists, but has a Retain status. This can occur when the PV was bound to the PVC, but the PVC was removed. The pod which should use this PV cannot start because it is unable to bind the PV.
The PV status can be displayed using the following command:
oc get pv
For example:
$ oc get pv
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM REASON AGE
pv-influxdb-backup 10Gi RWO Retain Bound influxdb-cluster/influx-backup-pvc 241d
pv-influxdb-data-1 10Gi RWO Retain Bound influxdb-cluster/pv-influxdb-data-1 241d
pv-influxdb-data-2 10Gi RWO Retain Bound influxdb-cluster/pv-influxdb-data-2 241d
To change the PV status manually:
- Use a profile that has permissions to work with the PV.
- Execute the following command:
oc patch pv "<pv_name>" --type json -p '[{"op": "replace", "path": "/spec/claimRef", "value": null},{"op": "replace", "path": "/status/phase", "value": "Available"}]'
Where <pv_name> is the name of the PV that is to be patched.
Pod in Pending Status
During deployment, a situation can occur where the deployment is stuck and OpenShift cannot create the pod. There are many reasons for such issues. For example:
- The deployment has incorrect selectors for the pod and the OpenShift scheduler cannot select a node for the deployment
- The deployment has an incorrect image
- The PVC is stuck in the Pending status (see the check after this list)
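For the PVC case, you can inspect the claim and its events directly (the PVC name below is taken from the error example in the previous section):
$ oc get pvc
$ oc describe pvc pv-influxdb-data-1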
Check Pod Selectors and Affinity Rules
A pod may get stuck in the Pending status for different reasons. Start by viewing the list of pods using the following command:
$ oc get pods
NAME READY STATUS RESTARTS AGE
influxdb-1-1-deploy 0/1 Error 0 3d
influxdb-2-1-deploy 0/1 Error 0 3d
influxdb-backup-daemon-2-16jc7 1/1 Running 0 1d
influxdb-relay-3-7jh47 1/1 Running 0 3d
influxdb-relay-3-cw4cc 1/1 Running 0 3d
influxdb-router-2-32gtt 1/1 Running 0 3d
influxdb-router-2-zvmbz 1/1 Running 0 3d
influxdb-service-1-qdcfs 1/1 Running 0 3d
influxdb-tests-shlnn 0/1 Completed 0 3d
If some pods are not ready or do not have the Running status, you can check the events using the following command:
$ oc get events
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
22s 22s 1 influxdb-1-3-deploy Pod Normal Scheduled {default-scheduler
} Successfully
assigned influxdb-1-3-deploy to node-1-1
19s 19s 1 influxdb-1-3-deploy Pod spec.containers{deployment} Normal Pulled {kubelet
node-1-1} Container image
"openshift/origin-deployer:v1.5.1" already present on machine
19s 19s 1 influxdb-1-3-deploy Pod spec.containers{deployment} Normal Created {kubelet
node-1-1} Created container with docker id
b12b43040734; Security:[seccomp=unconfined]
19s 19s 1 influxdb-1-3-deploy Pod spec.containers{deployment} Normal Started {kubelet
node-1-1} Started container with docker id
b12b43040734
18s 18s 1 influxdb-1-3-hxj5h Pod Normal Scheduled {default-scheduler
} Successfully
assigned influxdb-1-3-hxj5h to node-1-1
15s 15s 1 influxdb-1-3-hxj5h Pod spec.containers{influxdb-1} Normal Pulling {kubelet
node-1-1} pulling image
"influxdb:1.7.9"
15s 15s 1 influxdb-1-3-hxj5h Pod spec.containers{influxdb-1} Normal Pulled {kubelet
node-1-1} Successfully pulled image
"influxdb:1.7.9"
15s 15s 1 influxdb-1-3-hxj5h Pod spec.containers{influxdb-1} Normal Created {kubelet
node-1-1} Created container with docker id
720fe51ee505; Security:[seccomp=unconfined]
15s 15s 1 influxdb-1-3-hxj5h Pod spec.containers{influxdb-1} Normal Started {kubelet
node-1-1} Started container with docker id
720fe51ee505
7s 7s 1 influxdb-1-3-hxj5h Pod spec.containers{influxdb-1} Warning Unhealthy {kubelet
node-1-1} Readiness probe failed: Get
http://1.2.3.4:8086/ping: net/http: request canceled while waiting for
connection (Client.Timeout exceeded while awaiting headers)
18s 18s 1 influxdb-1-3 ReplicationController Normal SuccessfulCreate {replication-controller
} Created pod:
influxdb-1-3-hxj5h
22s 22s 1 influxdb-1 DeploymentConfig Normal DeploymentCreated {deploymentconfig-controller
} Created new replication controller "influxdb-1-3" for version
3
.....
...
Verify that this list of events does not contain events such as the following:
failed to fit in any node fit failure summary on nodes: CheckServiceAffinity (9), MatchNodeSelector (9), InsufficientMemory (2)
or
Error scheduling default influxdb-1-deploy: pod (influxdb-1-deploy) failed to fit in any node
If you see similar messages, you need to do the following:
- Check that OpenShift has the capacity to deploy pods with specified resources.
- Check that the pod has affinity or anti-affinity rules which allow running new pods in this environment. To get the pod configuration and see the affinity or anti-affinity rules, execute the following command:
$ oc describe pod <pod_name>
Name:            influxdb-relay-3-7jh47
Namespace:       influxdb-cluster
Security Policy: restricted
...
- Check that the pod's nodeSelector has correct labels and that the specified labels are present on the OpenShift nodes. To get the nodeSelector for the pod, execute the following command:
$ oc get pod <pod_name> -o jsonpath="{.spec.nodeSelector}{.name}"
map[region:primary]
To get node labels, execute the following command:
$ oc get nodes --show-labels
NAME      STATUS    AGE       LABELS
infra-1   Ready     1y        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=infra-1,node=infra-1,region=infra,role=infra,site=left,type=compute,zone=default
node-1    Ready     1y        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node-1,node=node-1,region=primary,role=compute,site=right,type=compute,zone=default
...
If a required label is missing, see the command after this list.
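A missing label can be added to a node manually. The example below uses the region=primary label from the output above; make sure the key and value match your deployment parameters:
$ oc label node node-1 region=primary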
Two Relay Pods with Different Deployments
During or after the installation, you can see the deployment for influxdb-relay with the following types:
- ReplicationController
- DeploymentConfig
It means that this environment was upgraded from version 3.x to version 4.x without a clean upgrade.
Redeploy InfluxDB Cluster with Clean Mode
Before installing, you can select the option for a clean install. The clean mode removes the project and all entities in it, but keeps the Persistent Volumes. It means that all the data is kept.
Remove Obsolete Deployment
If the project already has two pods with different deployments, you can manually remove the obsolete deployment.
To remove the obsolete deployment:
- Find the ReplicationController for influxdb-relay.
- Scale it down to replicas = 0.
- Remove the ReplicationController.
For example:
$ oc get rc
NAME DESIRED CURRENT READY AGE
influxdb-relay 0 0 0 3d
$ oc scale --replicas=0 rc influxdb-relay
$ oc delete rc influxdb-relay
InfluxDB Pod is Down
This section describes problems which can occur while the InfluxDB Cluster is running.
Pod is Down
There are many reasons why a pod is down, but usually they can be classified by the pod's previous status:
- The pod failed during deploy/update and OpenShift stopped or removed the pod. For more details, see InfluxDB Installation Problems
- The pod failed after some continuous restarts. For more details, see Pod Restarting
- The pod failed because the OpenShift node failed
Pod Restarting
InfluxDB is designed to store many metrics with a high write rate. InfluxDB uses a small amount of memory when it has only write traffic (even with a high rate).
InfluxDB uses an index to quickly access metrics. Moderate and complex queries can cause high memory usage, because usually an in-memory index is used.
It means that if you deploy InfluxDB with fewer resources than the expected load requires, or try to store more data than specified in the load profile, you can encounter an Out Of Memory error during restart.
Out Of Memory During Pod Start
During an InfluxDB Cluster update, a problem can occur when an InfluxDB pod continuously restarts with errors such as the following:
Container influxdb-1
State: Waiting (CrashLoopBackOff)
Last State Terminated at Apr 27, 2020 10:27:51 PM with exit code 137 (OOMKiller)
Ready: false
Restart Count: 7
or
Container influxdb-1
State: Waiting (CrashLoopBackOff)
Last State Terminated at Apr 27, 2020 10:27:51 PM with exit code 2 (Error)
Ready: false
Restart Count: 7
The logs may not contain any errors and just end with a regular message about opening a shard. For example:
ts=2020-04-27T20:43:46.739052Z lvl=info
msg="InfluxDB starting" log_id=0MRLxEhl000 version=1.7.9 branch=1.7
commit=23bc63d43a8dc05f53afa46e3526ebb5578f3d88
ts=2020-04-27T20:43:46.811192Z lvl=info
msg="Go runtime" log_id=0MRLxEhl000 version=go1.12.6 maxprocs=4
ts=2020-04-27T20:43:46.917793Z lvl=info
msg="Using data dir" log_id=0MRLxEhl000 service=store
path=/var/lib/influxdb/data
ts=2020-04-27T20:43:46.918880Z lvl=info
msg="Compaction settings" log_id=0MRLxEhl000 service=store
max_concurrent_compactions=2 throughput_bytes_per_second=50331648
throughput_bytes_per_second_burst=50331648
ts=2020-04-27T20:43:46.918928Z lvl=info
msg="Open store (start)" log_id=0MRLxEhl000 service=store trace_id=0MRLxFPW000
op_name=tsdb_open op_event=start
ts=2020-04-27T20:43:47.617596Z lvl=info
msg="Opened file" log_id=0MRLxEhl000 engine=tsm1 service=filestore
path=/var/lib/influxdb/data/testdb/default/692313/000000002-000000002.tsm id=0
duration=95.079ms
...
ts=2020-04-27T20:43:49.812364Z lvl=info
msg="Opened shard" log_id=0MRLxEhl000 service=store trace_id=0MRLxFPW000
op_name=tsdb_open index_version=inmem
path=/var/lib/influxdb/data/testdb/default/721947
duration=1098.246ms
Usually it means that the InfluxDB PV stores a lot of data and InfluxDB fails with Out Of Memory while reading these data to create the in-memory index.
There are two ways to solve this issue:
- Increase the RAM size. The memory size must be selected based on the Load Profile from the Hardware Sizing Guide
- Navigate to the PV (you can navigate to the PV on the OpenShift node or use a debug pod), find the biggest tsm files and remove them. For example (pay attention that it will remove some historical data!):
$ pwd
/var/lib/influxdb
$ ls -la
total 0
drwxr-xr-x. 6 root root 76 Apr 23 12:10 data
drwxr-xr-x. 2 root root 21 Apr 23 12:10 meta
drwx------. 6 root root 76 Apr 23 12:10 wal
$ du -sh data/*
23M    data/monitoring_server
180K   data/prometheus
12G    data/testdb
$ du -sh data/testdb/default/*
1,6M   data/testdb/default/664926
2,4M   data/testdb/default/672950
2,4M   data/testdb/default/684224
11,9G  data/testdb/default/692313
3,0M   data/testdb/default/705657
2,8M   data/testdb/default/721947
1,4M   data/testdb/default/738278
$ rm -rf data/testdb/default/692313
Out Of Memory Some Time After Start
This situation is similar to the previous one, when InfluxDB restarts during start. It has similar symptoms: OOM errors with continuous restarts. But unlike the previous situation, the logs contain information about the compaction process. For example:
ts=2020-04-27T21:02:42.932868Z lvl=info
msg="InfluxDB starting" log_id=0MRN1_y0000 version=1.7.4 branch=1.7
commit=ef77e72f435b71b1ad6da7d6a6a4c4a262439379
ts=2020-04-27T21:02:42.932910Z lvl=info
msg="Go runtime" log_id=0MRN1_y0000 version=go1.11 maxprocs=4
ts=2020-04-27T21:02:43.047179Z lvl=info
msg="Using data dir" log_id=0MRN1_y0000 service=store
path=/var/lib/influxdb/data
ts=2020-04-27T21:02:43.053729Z lvl=info
msg="Compaction settings" log_id=0MRN1_y0000 service=store
max_concurrent_compactions=2 throughput_bytes_per_second=50331648
throughput_bytes_per_second_burst=50331648
ts=2020-04-27T21:02:43.053776Z lvl=info
msg="Open store (start)" log_id=0MRN1_y0000 service=store trace_id=0MRN1aRG000
op_name=tsdb_open op_event=start
...
ts=2020-04-27T21:02:54.218233Z lvl=info
msg="Starting monitor service" log_id=0MRN1_y0000 service=monitor
ts=2020-04-27T21:02:54.218238Z lvl=info
msg="Registered diagnostics client" log_id=0MRN1_y0000 service=monitor
name=build
...
ts=2020-04-27T21:02:54.218368Z lvl=info
msg="Storing statistics" log_id=0MRN1_y0000 service=monitor
db_instance=monitoring_server db_rp=monitor interval=10s
ts=2020-04-27T21:02:54.218569Z lvl=info
msg="Listening on HTTP" log_id=0MRN1_y0000 service=httpd addr=[::]:8086
https=false
ts=2020-04-27T21:02:54.218583Z lvl=info
msg="Starting retention policy enforcement service" log_id=0MRN1_y0000
service=retention check_interval=30m
ts=2020-04-27T21:02:54.218932Z lvl=info
msg="Started listening on UDP" log_id=0MRN1_y0000 service=udp
addr=:8087
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Listening for signals" log_id=0MRN1_y0000
[httpd] 10.130.4.1 - - [27/Apr/2020:21:02:59
+0000] "GET /ping HTTP/1.1" 204 0 "-" "Go-http-client/1.1"
7990f47e-88ca-11ea-8001-dad7becf4de2 105576
...
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Cache snapshot (start)" log_id=09AmobhW000 engine=tsm1 trace_id=09BprT6G000
op_name=tsm1_cache_snapshot op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Snapshot for path written" log_id=09AmobhW000 engine=tsm1
trace_id=09BprT6G000 op_name=tsm1_cache_snapshot
path=/var/lib/influxdb/data/telegraf/default/1071
duration=313.801ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Cache snapshot (end)" log_id=09AmobhW000 engine=tsm1 trace_id=09BprT6G000
op_name=tsm1_cache_snapshot op_event=end op_elapsed=313.872ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="TSM compaction (start)" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group
op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Beginning compaction" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group
tsm1_files=8
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group tsm1_index=0
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000409-000000001.tsm
...
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacted file" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group tsm1_index=0
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000002.tsm.tmp
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Finished compacting files" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group groups=8
files=1 duration=1496.442ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="TSM compaction (end)" log_id=09AmobhW000 engine=tsm1 tsm1_level=1
tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group op_event=end
op_elapsed=1496.456ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="TSM compaction (start)" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group
op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Beginning compaction" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group
tsm1_files=4
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=0
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000392-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=1
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000400-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=2
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000408-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=3
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Compacted file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=0
tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000003.tsm.tmp
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Finished compacting files" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group groups=4
files=1 duration=4792.512ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="TSM compaction (end)" log_id=09AmobhW000 engine=tsm1 tsm1_level=2
tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group op_event=end
op_elapsed=4792.526ms
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="InfluxDB starting" log_id=09BpsqS0000 version=1.7.4 branch=1.5
commit=ef77e72f435b71b1ad6da7d6a6a4c4a262439379
ts=2020-04-27T21:02:54.219093Z lvl=info
msg="Go runtime" log_id=09BpsqS0000 version=go1.11 maxprocs=4
InfluxDB successfully started, loaded all data into the index and ran the compaction process. The Compactor is responsible for converting less optimized Cache and TSM data into more read-optimized formats. It does this by compressing series, removing deleted data, optimizing indices and combining smaller files into larger ones.
During the compaction process, InfluxDB can use a lot of memory if it processes a measurement with a lot of values. In this case, it tries to allocate additional memory, which leads to the Out Of Memory error.
To solve this issue, use one of the following options:
- Temporarily increase the RAM size, which allows the compaction to complete. After the compaction process completes, you can reduce the RAM size to the old value.
- Find the biggest files and remove them. For more information, see Out Of Memory During Pod Start
Out Of Memory During Query Execution
The major component that affects RAM is the series cardinality. The RAM usage has an exponential relationship to series cardinality (series count in database). A large series count leads to an increase in the depth of the index and may lead to increased RAM usage.
For example, a complex query is as follows:
SELECT MIN(value),MAX(value),SUM(value),MEAN(value)
FROM /.*/
WHERE "uid" =~ /|14df3879-8198-4840-9bf1-5e3f942ff845/
AND "indicatorName" !~ /./
AND time > 1570665518s AND time < 1570665818s
GROUP BY *, time(300s)
which:
- Uses 4 aggregated functions.
- Selects from all measurements in the current database.
- Groups by all available tags in the current database.
This complex query, when executed on a database with 100,000 series, can do the following:
- Execute in ~3 seconds.
- Use ~200-500 MB RAM.
But if this complex query is executed on a database with 600,000 series, it may:
- Take more than ~30 seconds to execute.
- Use more than ~2 GB RAM.
So in this case, you need to do one of the following:
- Increase the pod resources
- Reduce the metrics cardinality, for example by reviewing your metrics schema and redesigning it to reduce the count of unique tags (the queries after this list help to estimate the current cardinality)
- Reduce the storage period (which also reduces the index depth and the metrics cardinality)
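To estimate the current cardinality before redesigning the schema, you can run the following InfluxQL queries on an InfluxDB pod (available in InfluxDB 1.4 and later; testdb is an example database name from this guide):
# Number of unique series and measurements in the database
$ influx -execute 'SHOW SERIES CARDINALITY ON "testdb"'
$ influx -execute 'SHOW MEASUREMENT CARDINALITY ON "testdb"'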
To reduce the storage period, you need to change the Retention Policy (RP), which automatically removes data with a timestamp older than that specified in the RP.
To change an RP in the InfluxDB Cluster, you can use two options.
Option 1: Make changes in the influxdb-service-config ConfigMap.
Your databases configuration in influxdb-service-config looks like this:
databases:
- database:
name: test
retention_policies:
- retention_policy:
name: test1
duration: 20d
shard: 1d
default: yes
- retention_policy:
name: test2
duration: 20d
shard: 1d
default: no
The storage period is set by the duration parameter; change it to the new value.
For additional information about parameters, refer to the following sections in the InfluxDB Cluster Installation Procedure:
- ADDITIONAL_DATABASES
- ADDITIONAL_RETENTION_POLICY
Option 2: Execute a query to change RP on each InfluxDB node.
InfluxQL query to change RP:
ALTER RETENTION POLICY "default" ON "<database>" DURATION 20d SHARD DURATION 1d [DEFAULT]
The storage period is set by the DURATION XXd clause; change it to the new value.
Note: This query must be executed on each InfluxDB pod in the project.
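A sketch of how the change can be applied and verified from inside an InfluxDB pod using the influx CLI (testdb is an example database name):
# oc rsh <influxdb_pod> bash, then:
$ influx -execute 'ALTER RETENTION POLICY "default" ON "testdb" DURATION 20d SHARD DURATION 1d'
$ influx -execute 'SHOW RETENTION POLICIES ON "testdb"'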
Read from InfluxDB Cluster Not Working
To provide a single entry point and avoid loading the common OpenShift router, the InfluxDB Cluster uses HAProxy, which is deployed in a pod with the name influxdb-router.
The configuration of HAProxy endpoints is stored in the influxdb-router-config ConfigMap and controlled by the service pod influxdb-service. But in some cases, when InfluxDB pods are unstable and reboot, some endpoints can be absent from the config file.
It can lead to two possible issues:
- Data is always read from only one InfluxDB pod (in case the InfluxDB Cluster is deployed with 2 InfluxDB pods)
- Any request to /query returns a "503 Unavailable" response
Returns "503 Unavailable" Http Status Code for /query
When Prometheus or another tool makes a request to the /query endpoint and HAProxy has no servers to proxy the request to, it returns a "503 Unavailable" response.
Before sending the request, you need to get the influxdb-router service and use its IP or hostname to send requests. To get it, use the following command:
$ oc get services
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
influxdb-router 172.30.127.232 <none> 8086/TCP,8087/TCP,8088/TCP,1936/TCP 3d
...
In OpenShift, the hostname can be built by the following schema:
<service_name>.<project_name>.svc
For example, influxdb-router.influxdb-cluster.svc.
To check it manually, use one of the pods that has the cURL tool installed, for example influxdb-backup-daemon.
For example, a normal response:
$ curl -XGET http://influxdb-router.influxdb-cluster.svc:8086/query?q=show+databases
{"results":[{"statement_id":0,"series":[{"name":"databases","columns":["name"],"values":[["prometheus"],["monitoring_server"],["test"],["test1"],["test2"],["test4"],["test10"],["test11"],["test12"]]}]}]}
But if HAProxy has no server for processing the endpoint request, it can return the following:
$ curl -v -XGET http://influxdb-router.influxdb-cluster.svc:8086/query?q=show+databases
* About to connect() to influxdb-router.influxdb-cluster.svc port 8086 (#0)
* Trying 172.30.127.232...
* Connected to influxdb-router.influxdb-cluster.svc (172.30.127.232) port 8086 (#0)
> GET /query?q=show+databases HTTP/1.1
> User-Agent: curl/7.29.0
> Host: influxdb-router.influxdb-cluster.svc:8086
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 503 Service Unavailable
< Cache-Control: no-cache
< Connection: close
< Content-Type: text/html
<
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
* Closing connection 0
In this case, you need to check the influxdb-router-config ConfigMap using the following command:
$ oc get cm influxdb-router-config -o yaml
apiVersion: v1
data:
haproxy.cfg: ....
Check whether the backend section contains servers:
backend influxdb_query
mode http
balance roundrobin
cookie SERVERID insert indirect nocache
server influxdb-2 172.30.34.194:8086 check cookie influxdb-2 weight 10
server influxdb-1 172.30.119.114:8086 check cookie influxdb-1 weight 10
This issue can be solved in the following ways:
- Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
- Manually include the lost InfluxDB pod.
To include the lost pod manually:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
or
influxdb-s start-read <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod. A complete example is shown below.
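A complete example of the manual include, using the pod and service names from the examples in this guide:
$ oc rsh influxdb-service-35-jksc2 bash
# inside the pod: show the currently active nodes
$ influxdb-s list
# include the lost node back into read balancing
$ influxdb-s include influxdb-1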
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
InfluxDB Pod Absent from the List of HAProxy Endpoints
This situation can occur:
- when one of the InfluxDB pods is unstable and had some restarts
- when somebody manually deploys a new InfluxDB pod
In this case, the influxdb-router-config ConfigMap will have only one server. In the latest releases of InfluxDB Cluster, you can run a command in the influxdb-service pod which shows the current active nodes:
influxdb-s list
But if this command is not present, you can check the ConfigMap directly. For example:
backend influxdb_query
mode http
balance roundrobin
cookie SERVERID insert indirect nocache
server influxdb-1 172.30.119.114:8086 check cookie influxdb-1 weight 10
This is not a good situation, because read queries are not load-balanced and high availability is lost.
This issue can be solved in the following ways:
- Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
- Manually include the lost InfluxDB pod.
To include the lost pod manually:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
or
influxdb-s start-read <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
Write to InfluxDB Cluster Not Working
To synchronize writes of metrics across InfluxDB pods, the cluster uses influxdb-relay pods. The relay has the endpoint /write and proxies all traffic from this endpoint to each InfluxDB pod.
This can lead to three possible issues:
- Data is always written to only one InfluxDB pod (in case the InfluxDB Cluster is deployed with 2 InfluxDB pods)
- Any request to /write returns the "Unable to write points" response
- Any request to /write returns the "503 Unavailable" response
Returns "Unable to write points"
This issue can occur if influxdb-relay does not have an endpoint to proxy requests to, or if one of the InfluxDB pods cannot process the write request.
In this case, the influxdb-relay-config ConfigMap will have only one server. In the latest releases of InfluxDB Cluster, you can run a command in the influxdb-service pod which shows the current active nodes:
influxdb-s list
But if this command is not present, you can check the ConfigMap directly. For example:
[[http]]
name = "http"
bind-addr = ":9096"
[[http.output]]
name = "influxdb-1"
location = "http://influxdb-1.influxdb-cluster-dev.svc:8086/write"
buffer-size-mb = 100
max-batch-kb = 50
max-delay-interval = "5s"
This is not a good situation, because writes are not replicated to all InfluxDB pods, which can lead to the nodes becoming out of sync.
This issue can be solved in the following ways:
- Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
- Manually include the lost InfluxDB pod.
To include the lost pod manually:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
or
influxdb-s start-write <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
Returns "503 Unavailable" Http Status Code for /write
This is a rare case. This response can be returned by influxdb-router (HAProxy), which is used as the entry point.
The write request path is as follows:
Write metrics -> influxdb-router -> influxdb-relay -> some influxdb pods
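To check the write path manually, you can send a test point in line protocol through the router (a sketch; use an existing database instead of testdb):
$ curl -i -XPOST 'http://influxdb-router.influxdb-cluster.svc:8086/write?db=testdb' \
  --data-binary 'troubleshooting_check,source=manual value=1'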
During the first deployment, influxdb-router is deployed with an empty ConfigMap without any servers. After the service pod influxdb-service is deployed, it performs pod discovery in the project and fills in all configurations.
So when influxdb-router returns a 503 error to a /write request, it means that during deployment influxdb-service failed, or the deploy job did not get to the second step where the influxdb-service pod is deployed.
To solve this issue:
- Open the deployment job log
- Investigate why influxdb-service failed or why the job did not get to the second step (a rollout command is shown after this list)
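If influxdb-service itself failed, you can also trigger a new rollout and watch its log (this assumes influxdb-service is managed by a DeploymentConfig, as the scale commands elsewhere in this guide suggest):
$ oc rollout latest dc/influxdb-service
$ oc logs -f dc/influxdb-service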
InfluxDB Pod Absent from the List of Relay Endpoints
This situation can occur:
- when one of the InfluxDB pods is unstable and had some restarts
- when somebody manually deployed a new InfluxDB pod
In this case, the influxdb-relay-config ConfigMap will have only one server. In the latest releases of InfluxDB Cluster, you can run a command in the influxdb-service pod which shows the current active nodes:
influxdb-s list
But if this command is not present, you can check the ConfigMap directly. For example:
[[http]]
name = "http"
bind-addr = ":9096"
[[http.output]]
name = "influxdb-1"
location = "http://influxdb-1.influxdb-cluster-dev.svc:8086/write"
buffer-size-mb = 100
max-batch-kb = 50
max-delay-interval = "5s"
This is not a good situation, because the metrics are not written to all InfluxDB pods. This will lead to data getting out of sync between InfluxDB nodes and potentially some data can be lost during failover.
This issue can be solved in the following ways:
- Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
- Manually include the lost InfluxDB pod.
To include the lost pod manually:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
or
influxdb-s start-write <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
InfluxDB Pod Does Not Return into Cluster After Failover
This situation can occur:
- when one of the InfluxDB pods is unstable and had some restarts
- when somebody manually deploys a new InfluxDB pod
This issue can be solved in the following ways:
- Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
- Manually include the lost InfluxDB pod.
To include the lost pod manually:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
No Metrics Available
This situation can occur due to several problems:
- Some links for collecting metrics can be absent from the telegraf-agent ConfigMap
- The ConfigMap for telegraf-agent has some incorrect configuration and telegraf cannot parse it
- Some components do not expose metrics
Analysis of these problems and their solutions can be found below.
Telegraf Has No Settings for Collecting Metrics
Usually this issue is related to a bug or incorrect behavior of influxdb-service. This service should control the ConfigMap for telegraf-agent and add or remove links to InfluxDB pods when these pods are included into or excluded from the cluster.
But in the following cases, influxdb-service cannot add the necessary records into the ConfigMap:
- When one of the InfluxDB pods is unstable and had some restarts
- When somebody manually deploys a new InfluxDB pod
To solve this issue, execute the following actions:
- Check the telegraf-agent logs and verify that they do not contain error messages saying that the config cannot be parsed. You can print the logs using the command:
oc logs <pod_name>
- Check the telegraf-agent ConfigMap and verify that all InfluxDB pods and other components are present in it. You can print the ConfigMap using the command:
oc get configmap telegraf-config -o yaml
If some InfluxDB pods are absent from the ConfigMap, you can restore them using influxdb-service commands:
- Navigate to the terminal of the influxdb-service pod:
oc rsh <pod_name> bash
- Call a daemon influxdb-s command:
influxdb-s include <service_name>
Where <service_name> is the service name associated with the lost InfluxDB pod.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
Component Does Not Expose Metrics
Telegraf should collect metrics from the following components:
- InfluxDB Service
- InfluxDB pods
If all necessary links are specified in the ConfigMap, you need to check that the components expose the metrics.
To check InfluxDB metrics, use one of the pods that has the cURL tool installed, for example influxdb-backup-daemon.
For example, a normal response:
$ curl -XGET http://influxdb-1.influxdb-cluster.svc:8086/debug/vars
{
"system": {"currentTime":"2020-04-29T11:15:38.436999238Z","started":"2020-04-29T10:46:30.53370145Z","uptime":1747},
"cmdline": ["influxd"],
"memstats": {"Alloc":39204680,
...
To check InfluxDB Service metrics, use one of the pods that has the cURL tool installed, for example influxdb-backup-daemon.
For example, a normal response:
$ curl -XGET http://influxdb-service.influxdb-cluster.svc:8080/metrics
[{"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-1", "status": 1, "create_db_status": 1, "create_db_time": 0.01268862932920456, "write_status": 1, "write_time": 0.031825680285692215, "query_status": 1, "query_time": 0.012298475950956345, "delete_db_status": 1, "delete_db_time": 0.023207109421491623}, {"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-2", "status": 1, "create_db_status": 1, "create_db_time": 0.008592180907726288, "write_status": 1, "write_time": 0.02796272560954094, "query_status": 1, "query_time": 0.007285472005605698, "delete_db_status": 1, "delete_db_time": 0.012705259025096893}, {"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-relay", "status": 1}, {"name": "influxdb_cluster_overview", "cluster": "influxdb-cluster", "active_replicas_count": 2, "replicas_count": 2}]
If an InfluxDB pod does not respond, you can try to apply the troubleshooting steps described in the following sections:
- InfluxDB Installation Problems
- InfluxDB Pod is Down
If the InfluxDB Service pod does not respond, you can try to simply restart it. For example, using the commands:
oc scale --replicas=0 dc influxdb-service
oc scale --replicas=1 dc influxdb-service
Or just remove the pod using the following command:
$ oc get pods
NAME READY STATUS RESTARTS AGE
...
influxdb-service-35-jksc2 1/1 Running 0 17h
$ oc delete pod influxdb-service-35-jksc2
Corrupted TSM Files
There are situations when the OpenShift node, on which the InfluxDB pod is running, restarts with a Kernel Panic or for some other reason. The InfluxDB process, which stores data on Persistent Volumes (PVs), is affected and the TSM files get corrupted. The files cannot be read and InfluxDB fails during its start. The following are the symptoms of this scenario:
- The pod is constantly restarting.
- The pod does not restart with exit code 137 (OOMKiller).
- There is no valuable information in the logs.
InfluxDB can crash with logs, as shown in the following example:
2020-03-27T09:42:27.303607Z info InfluxDB
starting {"log_id": "0Lnqmpul000", "version": "1.7.4", "branch": "1.7",
"commit": "ef77e72f435b71b1ad6da7d6a6a4c4a262439379"}
2020-03-27T09:42:27.303734Z info Go runtime
{"log_id": "0Lnqmpul000", "version": "go1.11", "maxprocs": 16}
2020-03-27T09:42:27.406310Z info Using data
dir {"log_id": "0Lnqmpul000", "service": "store", "path":
"/var/lib/influxdb/data"}
2020-03-27T09:42:27.406599Z info Compaction
settings {"log_id": "0Lnqmpul000", "service": "store",
"max_concurrent_compactions": 8, "throughput_bytes_per_second": 50331648,
"throughput_bytes_per_second_burst": 50331648}
2020-03-27T09:42:27.406685Z info Open store
(start) {"log_id": "0Lnqmpul000", "service": "store", "trace_id": "0LnqmqJW000",
"op_name": "tsdb_open", "op_event": "start"}
If the influx_inspect utility is used to check the TSM files, it shows a lot of issues related to the corrupted files.
For example:
$ influx_inspect verify -dir
/var/lib/influxdb/
/var/lib/influxdb/data/monitoring_server/monitor/745240/000000001-000000001.tsm:
healthy
/var/lib/influxdb/data/prometheus/default/729781/000000001-000000001.tsm:
healthy
...
/var/lib/influxdb/data/prometheus/default/745303/000000005-000000001.tsm:
healthy
/var/lib/influxdb/data/prometheus/mano/552221/000000002-000000002.tsm:
healthy
...
/var/lib/influxdb/data/prometheus/mano/745256/000000001-000000001.tsm:
healthy
Broken Blocks: 0 / 64399, in
0.134176837s
$ influx_inspect verify-seriesfile -dir
/var/lib/influxdb/
2020-03-27T10:43:03.809148Z error Series
file does not exist {"log_id": "0LnuFn00000",
"path": "/var/lib/influxdb/data/_series"}
2020-03-27T10:43:03.809425Z error Series
file does not exist {"log_id": "0LnuFn00000",
"path": "/var/lib/influxdb/lost+found/_series"}
2020-03-27T10:43:03.809545Z error Series
file does not exist {"log_id": "0LnuFn00000",
"path": "/var/lib/influxdb/meta/_series"}
2020-03-27T10:43:03.809665Z error Series
file does not exist {"log_id": "0LnuFn00000",
"path": "/var/lib/influxdb/wal/_series"}
$ influx_inspect verify-seriesfile -dir
/var/lib/influxdb/data/
2020-03-27T10:43:32.103647Z error Error
opening segment {"log_id": "0LnuHWXG000", "path":
"/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "00",
"segment": "0000", "error": "invalid series segment"}
...
2020-03-27T10:43:32.105459Z error Error
opening segment {"log_id": "0LnuHWXG000", "path":
"/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "07",
"segment": "0000", "error": "invalid series segment"}
2020-03-27T10:43:32.479805Z error Index
inconsistency {"log_id": "0LnuHWXG000", "path":
"/var/lib/influxdb/data/monitoring_server/_series", "partition": "06", "id": 7,
"got_offset": 5, "expected_offset": 0}
2020-03-27T10:43:32.582526Z error Index
inconsistency {"log_id": "0LnuHWXG000", "path":
"/var/lib/influxdb/data/prometheus/_series", "partition": "01", "id": 410,
"got_offset": 6889, "expected_offset": 0}
Verify TSM Files
To execute the troubleshooting actions, run the pod in the debug mode in OpenShift. When the pod runs in the debug mode, the container starts with an overridden entrypoint. Instead of running the influxd process, the system runs the /bin/bash process.
To run the pod in the debug mode:
- Navigate to the pod. If the pod does not exist, run a new deploy and navigate to the created pod.
- See the Status section on the Details tab.
- Click Debug in Terminal.
- In the window that opens, execute the commands in the pod.
To check the TSM files, execute the following commands:
- To check TSM files:
influx_inspect verify -dir /var/lib/influxdb/
- To check TSM series files:
influx_inspect verify-seriesfile -dir /var/lib/influxdb/
influx_inspect verify-seriesfile -dir /var/lib/influxdb/data/
For example:
$ influx_inspect verify -dir
/var/lib/influxdb/
/var/lib/influxdb/data/monitoring_server/monitor/745240/000000001-000000001.tsm:
healthy
...
Broken Blocks: 0 / 64399, in
0.134176837s
$ influx_inspect verify-seriesfile -dir
/var/lib/influxdb/
2020-03-27T10:43:03.809148Z error Series
file does not exist {"log_id": "0LnuFn00000",
"path": "/var/lib/influxdb/data/_series"}
...
$ influx_inspect verify-seriesfile -dir
/var/lib/influxdb/data/
2020-03-27T10:43:32.103647Z error Error
opening segment {"log_id": "0LnuHWXG000", "path":
"/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "00",
"segment": "0000", "error": "invalid series segment"}
...
Fix Corrupted TSM Files
To solve the corrupted TSM files issue, remove the data and restore it from another InfluxDB pod.
To fix the issue:
- Run the pod in the debug mode. Refer to Verify TSM Files for details.
- Remove all files from the PV using the following command:
rm -rf /var/lib/influxdb/data/*
- Restart the pod.
- Restart the influxdb-service.
- Manually run the restore process by navigating to the influxdb-service pod and calling the following daemon influxdb-s command:
influxdb-s restore <service_name>
Where <service_name> is the service name of the InfluxDB pod on which the data is to be restored.
For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.
Alarms
InfluxDB Component is Down
Problem | Severity |
---|---|
InfluxDB component is down in {$NAMESPACE} project | Disaster |
This trigger checks the count of currently running components in the InfluxDB Cluster. The metrics contain statuses of the following components:
- InfluxDB pods
- InfluxDB relay
The alarm can be raised when the count of currently active components equals 0.
Possible Component Down Problems
Usually this issue can occur due to the following reasons:
- The project with the InfluxDB Cluster was removed
- The influxdb-service component is down or cannot expose metrics
- The telegraf-agent pod is down, has incorrect settings or does not collect metrics
- InfluxDB on the monitoring VM is down
Component Down Solution
First, check that all pods of the InfluxDB Cluster are alive in the project. You can check this using the following command:
$ oc get pods
NAME READY STATUS RESTARTS AGE
influxdb-1-2-1fnj8 1/1 Running 0 12h
influxdb-2-1-q6v6g 1/1 Running 0 13d
influxdb-backup-daemon-2-6qmbs 1/1 Running 0 2d
influxdb-relay-t3j55 1/1 Running 0 12h
influxdb-router-1-8blvg 1/1 Running 0 12h
influxdb-service-35-jksc2 1/1 Running 0 17h
influxdb-tests-9k3c1 0/1 Completed 0 13d
telegraf-5-ts2f3 1/1 Running 0 17h
If some pods are down, you can check the troubleshooting steps in InfluxDB Installation Problems or InfluxDB Pod is Down.
CPU Usage by InfluxDB Component is Above {any}%
Problem | Severity |
---|---|
CPU usage by InfluxDB nodes is above > {$CPU_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project | High |
CPU usage by InfluxDB relay is above > {$CPU_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project | High |
The trigger uses pod and node metrics collected by the heapster agent from OpenShift.
Alarms can be raised when one of the InfluxDB pods or the InfluxDB relay uses a lot of CPU resources. Default thresholds:
- High
  - Raise = >= 90%
  - Clear = < 90%
For example, for influxdb-relay the resources cpu.requests = 100m and cpu.limits = 200m are specified. When the influxdb-relay pod starts using more than 180m, an alarm is raised.
Possible CPU Usage Problems
Alarms about increased CPU usage can be raised in the following cases:
- The stored data has a high cardinality and requires a lot of CPU to process selections
- Some components execute very large requests for which InfluxDB should scan a lot of data and use a lot of resources
CPU Usage Solution
To solve issues related to increased CPU usage, refer to the section Pod Restarting.
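To confirm which pod actually consumes the CPU, you can check the current usage (assuming cluster metrics, for example heapster, are available in your OpenShift version):
$ oc adm top pods -n influxdb-cluster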
RAM Usage by InfluxDB Component is Above {any}%
Problem | Severity |
---|---|
RAM usage by InfluxDB nodes is above > {$RAM_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project | High |
RAM usage by InfluxDB relay is above > {$RAM_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project | High |
The trigger uses pod and node metrics collected by the heapster agent from OpenShift.
Alarms can be raised when one of the InfluxDB pods or the InfluxDB relay uses a lot of RAM. Default thresholds:
- High
  - Raise = >= 90%
  - Clear = < 90%
For example, for influxdb-relay the resources memory.requests = 100Mi and memory.limits = 200Mi are specified. When the influxdb-relay pod starts using more than 180Mi, an alarm is raised.
Possible RAM Usage Problems
Alarms about increased memory usage can be raised in the following cases:
- A lot of data is stored and too few resources are allocated for such a data amount
- The stored data has a high cardinality and requires a lot of memory to select data
- Some components execute very large requests for which InfluxDB should scan a lot of data and use a lot of resources
RAM Usage Solution
To solve issues related to increased memory usage, refer to the section Pod Restarting.
Smoke Test Failed for InfluxDB Cluster
Problem | Severity |
---|---|
Create test DB fail for InfluxDB Cluster in {$NAMESPACE} project | Average |
Delete test DB fail for InfluxDB cluster in {$NAMESPACE} project | Average |
Query data in testing DB fail for InfluxDB cluster in {$NAMESPACE} project | Average |
Write data in testing DB fail for InfluxDB cluster in {$NAMESPACE} project | Average |
These triggers check the results of the smoke tests which the service pod executes on each InfluxDB pod.
Each smoke test run executes the following logic (a manual reproduction with cURL is shown after this list):
- Create a test database with the name service_check_db
- Write some test measurements with some test data
- Read the test data which was written in the previous step
- Drop the test database
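The same sequence can be reproduced manually with cURL against a single InfluxDB pod (a sketch; the service name influxdb-1 is taken from the earlier examples):
$ curl -XPOST 'http://influxdb-1.influxdb-cluster.svc:8086/query' --data-urlencode 'q=CREATE DATABASE service_check_db'
$ curl -XPOST 'http://influxdb-1.influxdb-cluster.svc:8086/write?db=service_check_db' --data-binary 'smoke_check value=1'
$ curl -G 'http://influxdb-1.influxdb-cluster.svc:8086/query' --data-urlencode 'db=service_check_db' --data-urlencode 'q=SELECT * FROM smoke_check'
$ curl -XPOST 'http://influxdb-1.influxdb-cluster.svc:8086/query' --data-urlencode 'q=DROP DATABASE service_check_db'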
Alarms can be raised when one of these checks fails and the metrics contain the result 0.
Possible Smoke Test Failed Problems
.Possible Smoke Test Failed Problems
Alarms about problems with the smoke test can point at issues with CPU or Memory usage.
So they can be raised in the following cases:
- A lot of stored data and too small allocated resources for such data amount
- Stored data has a high cardinality and required a lot of memory to select data
- Some components execute a very huge requests for which InfluxDB should scan a lot of data and use a lot of resources
- Problems with network latency
- Problems with network connections
- Problems with metrics collections
Smoke Test Failed Solution
To solve issues related to the smoke test, refer to the sections:
- Pod Restarting
- No Metrics Available