InfluxDB Problems & Troubleshooting



This document provides details for troubleshooting and restoring the InfluxDB Cluster.

  • InfluxDB Problems
    • InfluxDB Installation Problems
      • PVC in Pending Status
      • Pod in Pending Status
      • Two Relay Pods with Different Deployments
    • InfluxDB Pod is Down
      • Pod is Down
      • Pod Restarting
    • Read from InfluxDB Cluster Not Working
      • Returns "503 Unavailable" Http Status Code for /query
      • InfluxDB Pod Absent into List of Haproxy Endpoints
    • Write to InfluxDB Cluster Not Working
      • Returns "Unable to write points"
      • Returns "503 Unavailable" Http Status Code for /write
      • InfluxDB Pod Absent into List of Relay Endpoints
    • InfluxDB Pod Does Not Return into Cluster after Failover
    • No Metrics Available
      • Telegraf Has No Settings for Collecting Metrics
      • Component Does Not Expose Metrics
    • Corrupted TSM Files
      • Verify TSM Files
      • Fix Corrupted TSM Files
  • Alarms
    • InfluxDB Component is Down
      • Possible Component Down Problems
      • Component Down Solution
    • CPU Usage by InfluxDB Component is Above {any}%
      • Possible CPU Usage Problems
      • CPU Usage Solution
    • RAM Usage by InfluxDB Component is Above {any}%
      • Possible RAM Usage Problems
      • RAM Usage Solution
    • Smoke Test Failed for InfluxDB Cluster
      • Possible Smoke Test Failed Problems
      • Smoke Test Failed Solution

InfluxDB Problems


The following sections describe the problems that can occur with the InfluxDB Cluster.

InfluxDB Installation Problems


This section describes the problems that occur during the deployment of the InfluxDB Cluster and their solutions.

PVC in Pending Status


During the deployment, a situation can occur when the Persistence Volume Claim (PVC) is created but cannot bind the Persistence Volume (PV).

There are three common causes:

  • An incorrect PV name is specified in the deployment parameters.
  • An incorrect storage class does not allow patching an already created PV into the PVC.
  • The PV is not in the Available status. For example, it could be in the Released status (with the Retain reclaim policy).
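
To identify which cause applies, inspect the pending PVC and its events. A minimal sketch (the PVC name is an example taken from the outputs below):

$ oc get pvc
$ oc describe pvc pv-influxdb-data-1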

Check PV Name or Mask


For deploying the InfluxDB Cluster, specify the following two parameters related to PV:

  • INFLUXDB_PV_MASK (formerly PV_MASK, which is now deprecated) specifies the PV name mask that is used for creating PV claims.
  • BACKUP_STORAGE_PV specifies the Persistence Volume name that is used to store backups.

Usually, problems are related to PV_MASK. There are many tickets where the deprecated variable is specified with the latest version, which no longer supports it. To solve this issue, use the INFLUXDB_PV_MASK deploy parameter.

Another issue is an incorrect INFLUXDB_PV_MASK value, or not specifying it for your environment at all. By default, this parameter has the following value:

PV_MASK = pv-influx-data-

However, for a specific environment, the PV names may differ.

Check Storage Class


For deploying InfluxDB pods with provisioned PVs, the INFLUXDB_PV_CLASS parameter can be used. By default, it is not set. If a storage class was specified when the PV was created, specify that storage class in the INFLUXDB_PV_CLASS parameter.

If you want to update an already installed environment without cleaning the project, you also need to specify this parameter. Otherwise, the installation scripts try to patch the PVC without the storage class, which may lead to the following error:

Error from server (Invalid): PersistentVolumeClaim "pv-influxdb-data-1" is invalid:
 spec: Forbidden: field is immutable after creation
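
To find the storage class that should be set in INFLUXDB_PV_CLASS, you can read it from the existing PV or PVC. A minimal sketch (the object names are examples):

$ oc get pv pv-influxdb-data-1 -o jsonpath='{.spec.storageClassName}'
$ oc get pvc pv-influxdb-data-1 -o jsonpath='{.spec.storageClassName}'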

Check PV Status


During the deployment of the InfluxDB Cluster, the deploy scripts check the PV status and try to patch the PV to set the Available status. This step is skipped if the deploy job is run by a user who has no rights to work with PVs.

In this case, the PV already exists, but is in the Released status (with the Retain reclaim policy). This can occur when the PV was bound to the PVC, but the PVC was removed. The pod that should use this PV cannot start because it is unable to bind the PV.

The PV status can be displayed using the following command:

oc get pv

For example:
$ oc get pv
NAME                 CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS     CLAIM                                 REASON    AGE
pv-influxdb-backup   10Gi       RWO           Retain          Bound      influxdb-cluster/influx-backup-pvc              241d
pv-influxdb-data-1   10Gi       RWO           Retain          Bound      influxdb-cluster/pv-influxdb-data-1             241d
pv-influxdb-data-2   10Gi       RWO           Retain          Bound      influxdb-cluster/pv-influxdb-data-2             241d


 

To change the PV status manually:

  1. Use a profile that has permissions to work with the PV.
  2. Execute the following command:
    oc patch pv "<pv_name>" --type json -p '[{"op": "replace", "path": "/spec/claimRef", "value": null},{"op": "replace", "path": "/status/phase", "value": "Available"}]
    Where, <pv_name> is the name of the PV that is to be patched.
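
After patching, verify that the PV has returned to the Available status, for example:

oc get pv "<pv_name>"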

Pod in Pending Status


During deployment, a situation can occur where the deployment is stuck and OpenShift cannot create the pod. There are many possible reasons for such issues. For example:

  • The deployment has incorrect selectors for the pod, and the OpenShift scheduler cannot select a node to deploy to
  • The deployment has an incorrect image
  • The PVC is stuck in the Pending status

Check Pod Selectors and Affinity Rules


Since a pod may get stuck in the Pending status for different reasons, there are several ways to solve such issues. View the list of pods using the following command:

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
influxdb-1-1-deploy              0/1       Error       0          3d
influxdb-2-1-deploy              0/1       Error       0          3d
influxdb-backup-daemon-2-16jc7   1/1       Running     0          1d
influxdb-relay-3-7jh47           1/1       Running     0          3d
influxdb-relay-3-cw4cc           1/1       Running     0          3d
influxdb-router-2-32gtt          1/1       Running     0          3d
influxdb-router-2-zvmbz          1/1       Running     0          3d
influxdb-service-1-qdcfs         1/1       Running     0          3d
influxdb-tests-shlnn             0/1       Completed   0          3d

If some pods are not ready or do not have the Running status, you can check the events using the following command:

$ oc get events
LASTSEEN   FIRSTSEEN   COUNT     NAME                  KIND                    SUBOBJECT                     TYPE      REASON              SOURCE                           MESSAGE
22s        22s         1         influxdb-1-3-deploy   Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned influxdb-1-3-deploy to node-1-1
19s        19s         1         influxdb-1-3-deploy   Pod                     spec.containers{deployment}   Normal    Pulled              {kubelet node-1-1}     Container image "openshift/origin-deployer:v1.5.1" already present on machine
19s        19s         1         influxdb-1-3-deploy   Pod                     spec.containers{deployment}   Normal    Created             {kubelet node-1-1}     Created container with docker id b12b43040734; Security:[seccomp=unconfined]
19s        19s         1         influxdb-1-3-deploy   Pod                     spec.containers{deployment}   Normal    Started             {kubelet node-1-1}     Started container with docker id b12b43040734
18s        18s         1         influxdb-1-3-hxj5h    Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned influxdb-1-3-hxj5h to node-1-1
15s        15s         1         influxdb-1-3-hxj5h    Pod                     spec.containers{influxdb-1}   Normal    Pulling             {kubelet node-1-1}     pulling image "influxdb:1.7.9"
15s        15s         1         influxdb-1-3-hxj5h    Pod                     spec.containers{influxdb-1}   Normal    Pulled              {kubelet node-1-1}     Successfully pulled image "influxdb:1.7.9"
15s        15s         1         influxdb-1-3-hxj5h    Pod                     spec.containers{influxdb-1}   Normal    Created             {kubelet node-1-1}     Created container with docker id 720fe51ee505; Security:[seccomp=unconfined]
15s        15s         1         influxdb-1-3-hxj5h    Pod                     spec.containers{influxdb-1}   Normal    Started             {kubelet node-1-1}     Started container with docker id 720fe51ee505
7s         7s          1         influxdb-1-3-hxj5h    Pod                     spec.containers{influxdb-1}   Warning   Unhealthy           {kubelet node-1-1}     Readiness probe failed: Get http://1.2.3.4:8086/ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
18s        18s         1         influxdb-1-3          ReplicationController                                 Normal    SuccessfulCreate    {replication-controller }        Created pod: influxdb-1-3-hxj5h
22s        22s         1         influxdb-1            DeploymentConfig                                      Normal    DeploymentCreated   {deploymentconfig-controller }   Created new replication controller "influxdb-1-3" for version 3
.....
...

Verify that this list of events does not contain the following events:

failed to fit in any node fit failure summary on nodes: CheckServiceAffinity (9), MatchNodeSelector (9), InsufficientMemory (2)

or

"Error scheduling default influxdb-1-deploy: pod (influxdb-1-deploy) failed to fit in any node

If you see similar messages, you need to do the following:

  • Check that OpenShift has the capacity to deploy pods with specified resources.
  • Check that the pod has affinity or anti-affinity rules which allow running new pods in this environment.
    To get the pod configuration and see the affinity or anti-affinity rules, execute the following command:
    $ oc describe pod <pod_name>
    Name:                   influxdb-relay-3-7jh47
    Namespace:              influxdb-cluster
    Security Policy:        restricted
    ...
  • Check that the pod's nodeSelector has the correct labels and that the specified labels are present on the OpenShift nodes.
    To get nodeSelector for the pod, execute the following command:
    $ oc get pod <pod_name> -o jsonpath="{.spec.nodeSelector}{.name}"
    map[region:primary]
    To get node labels, execute the following command:
    $ oc get nodes --show-labels
    NAME                 STATUS    AGE       LABELS
    infra-1    Ready     1y        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=infra-1,node=infra-1,region=infra,role=infra,site=left,type=compute,zone=default
    node-1     Ready     1y        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node-1,node=node-1,region=primary,role=compute,site=right,type=compute,zone=default
    ...
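
If the required labels are not present on any suitable node, you can either correct the nodeSelector in the deployment or add the missing label to a node. A minimal sketch (the node name and label are examples):

$ oc label node node-2 region=primary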

Two Relay Pods with Different Deployments


During or after the installation, you may see deployments for influxdb-relay of the following two types:

  • ReplicationController
  • DeploymentConfig

It means that this environment was upgraded from version 3.x to version 4.x without a clean upgrade.

Redeploy InfluxDB Cluster with Clean Mode


Before installing, you can select the clean install option. The clean mode removes the project and all entities in it, but keeps the Persistence Volumes. It means that all the data is kept.

Remove Obsolete Deployment


If the project already has two pods with different deployments, you can manually remove the obsolete deployment.

To remove the obsolete deployment:

  1. Find the ReplicationController for influxdb-relay.
  2. Scale it down to replicas = 0.
  3. Remove the ReplicationController.

For example:

$ oc get rc
NAME                       DESIRED   CURRENT   READY     AGE
influxdb-relay             0         0         0         3d

$ oc scale --replicas=0  rc influxdb-relay

$ oc delete rc influxdb-relay

InfluxDB Pod is Down


This section describes problems that can occur during InfluxDB Cluster operation.

Pod is Down


There are many reasons why a pod can be down, but usually the cause can be classified by the pod's previous state:

  • The pod failed during deploy/update and OpenShift stopped or removed the pod. For more details, see InfluxDB Installation Problems.
  • The pod failed after continuous restarts. For more details, see Pod Restarting.
  • The pod failed because the OpenShift node failed.

Pod Restarting


InfluxDB is designed to store many metrics with a high write rate. InfluxDB uses a small amount of memory when it has only write traffic (even at a high rate).

InfluxDB uses an index to quickly access metrics. Moderate and complex queries can cause high memory usage, because usually an in-memory index is used.

This means that if you deploy InfluxDB with fewer resources than the expected load requires, or try to store more data than specified in the load profile, you can encounter an Out Of Memory error during restart.

Out Of Memory During Pod Start


During an InfluxDB Cluster update, a problem can occur where an InfluxDB pod continuously restarts with errors such as the following:

Container influxdb-1

State: Waiting (CrashLoopBackOff)
Last State Terminated at Apr 27, 2020 10:27:51 PM with exit code 137 (OOMKiller)
Ready: false
Restart Count: 7

or

Container influxdb-1

State: Waiting (CrashLoopBackOff)
Last State Terminated at Apr 27, 2020 10:27:51 PM with exit code 2 (Error)
Ready: false
Restart Count: 7

The logs may not contain any errors and may just end with a regular message about an opened shard. For example:

ts=2020-04-27T20:43:46.739052Z lvl=info msg="InfluxDB starting" log_id=0MRLxEhl000 version=1.7.9 branch=1.7 commit=23bc63d43a8dc05f53afa46e3526ebb5578f3d88
ts=2020-04-27T20:43:46.811192Z lvl=info msg="Go runtime" log_id=0MRLxEhl000 version=go1.12.6 maxprocs=4
ts=2020-04-27T20:43:46.917793Z lvl=info msg="Using data dir" log_id=0MRLxEhl000 service=store path=/var/lib/influxdb/data
ts=2020-04-27T20:43:46.918880Z lvl=info msg="Compaction settings" log_id=0MRLxEhl000 service=store max_concurrent_compactions=2 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2020-04-27T20:43:46.918928Z lvl=info msg="Open store (start)" log_id=0MRLxEhl000 service=store trace_id=0MRLxFPW000 op_name=tsdb_open op_event=start
ts=2020-04-27T20:43:47.617596Z lvl=info msg="Opened file" log_id=0MRLxEhl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/testdb/default/692313/000000002-000000002.tsm id=0 duration=95.079ms
...
ts=2020-04-27T20:43:49.812364Z lvl=info msg="Opened shard" log_id=0MRLxEhl000 service=store trace_id=0MRLxFPW000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/testdb/default/721947 duration=1098.246ms

Usually it means that the InfluxDB PV stores a lot of data and InfluxDB fails with Out Of Memory while reading this data to create the in-memory index.

There are two ways to solve this issue:

  • Increase the RAM size. The memory size must be selected based on the Load Profile from the Hardware Sizing Guide.
  • Navigate to the PV (you can navigate to the PV on the OpenShift node or use a debug pod, as shown in the sketch after this example), find the biggest TSM files, and remove them.
    For example (note that this removes some historical data!):
    $ pwd
    /var/lib/influxdb
    
    $ ls -la
    total 0
    drwxr-xr-x. 6 root root 76 Apr 23 12:10 data
    drwxr-xr-x. 2 root root 21 Apr 23 12:10 meta
    drwx------. 6 root root 76 Apr 23 12:10 wal
    
    $ du -sh data/*
    23M     data/monitoring_server
    180K    data/prometheus
    12G     data/testdb
    
    $ du -sh data/testdb/default/*
    1,6M    data/testdb/default/664926
    2,4M    data/testdb/default/672950
    2,4M    data/testdb/default/684224
    11,9G   data/testdb/default/692313
    3,0M    data/testdb/default/705657
    2,8M    data/testdb/default/721947
    1,4M    data/testdb/default/738278
    
    $ rm -rf data/testdb/default/692313
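
If the InfluxDB pod cannot stay running long enough to open a terminal in it, you can inspect the same data from a debug pod, which starts a copy of the pod with a shell instead of the influxd process. A minimal sketch, assuming the DeploymentConfig is named influxdb-1:

# start a copy of the pod with a shell instead of influxd
$ oc debug dc/influxdb-1
# then, inside the debug pod:
$ du -sh /var/lib/influxdb/data/*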

Out Of Memory Some Time After Start


This situation is similar to the one where InfluxDB restarts during start. It has similar symptoms, such as OOM errors with continuous restarts. However, unlike the previous situation, the logs contain information about the compaction process. For example:

ts=2020-04-27T21:02:42.932868Z lvl=info msg="InfluxDB starting" log_id=0MRN1_y0000 version=1.7.4 branch=1.7 commit=ef77e72f435b71b1ad6da7d6a6a4c4a262439379
ts=2020-04-27T21:02:42.932910Z lvl=info msg="Go runtime" log_id=0MRN1_y0000 version=go1.11 maxprocs=4
ts=2020-04-27T21:02:43.047179Z lvl=info msg="Using data dir" log_id=0MRN1_y0000 service=store path=/var/lib/influxdb/data
ts=2020-04-27T21:02:43.053729Z lvl=info msg="Compaction settings" log_id=0MRN1_y0000 service=store max_concurrent_compactions=2 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2020-04-27T21:02:43.053776Z lvl=info msg="Open store (start)" log_id=0MRN1_y0000 service=store trace_id=0MRN1aRG000 op_name=tsdb_open op_event=start
...
ts=2020-04-27T21:02:54.218233Z lvl=info msg="Starting monitor service" log_id=0MRN1_y0000 service=monitor
ts=2020-04-27T21:02:54.218238Z lvl=info msg="Registered diagnostics client" log_id=0MRN1_y0000 service=monitor name=build
...
ts=2020-04-27T21:02:54.218368Z lvl=info msg="Storing statistics" log_id=0MRN1_y0000 service=monitor db_instance=monitoring_server db_rp=monitor interval=10s
ts=2020-04-27T21:02:54.218569Z lvl=info msg="Listening on HTTP" log_id=0MRN1_y0000 service=httpd addr=[::]:8086 https=false
ts=2020-04-27T21:02:54.218583Z lvl=info msg="Starting retention policy enforcement service" log_id=0MRN1_y0000 service=retention check_interval=30m
ts=2020-04-27T21:02:54.218932Z lvl=info msg="Started listening on UDP" log_id=0MRN1_y0000 service=udp addr=:8087
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Listening for signals" log_id=0MRN1_y0000
[httpd] 10.130.4.1 - - [27/Apr/2020:21:02:59 +0000] "GET /ping HTTP/1.1" 204 0 "-" "Go-http-client/1.1" 7990f47e-88ca-11ea-8001-dad7becf4de2 105576
...
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Cache snapshot (start)" log_id=09AmobhW000 engine=tsm1 trace_id=09BprT6G000 op_name=tsm1_cache_snapshot op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Snapshot for path written" log_id=09AmobhW000 engine=tsm1 trace_id=09BprT6G000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/telegraf/default/1071 duration=313.801ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Cache snapshot (end)" log_id=09AmobhW000 engine=tsm1 trace_id=09BprT6G000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=313.872ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="TSM compaction (start)" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Beginning compaction" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group tsm1_files=8
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000409-000000001.tsm
...
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacted file" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000002.tsm.tmp
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Finished compacting files" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group groups=8 files=1 duration=1496.442ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="TSM compaction (end)" log_id=09AmobhW000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=09BprX0G000 op_name=tsm1_compact_group op_event=end op_elapsed=1496.456ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="TSM compaction (start)" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group op_event=start
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Beginning compaction" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_files=4
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000392-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000400-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000408-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacting file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000002.tsm
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Compacted file" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/default/1071/000000416-000000003.tsm.tmp
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Finished compacting files" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group groups=4 files=1 duration=4792.512ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="TSM compaction (end)" log_id=09AmobhW000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=09BprdpG000 op_name=tsm1_compact_group op_event=end op_elapsed=4792.526ms
ts=2020-04-27T21:02:54.219093Z lvl=info msg="InfluxDB starting" log_id=09BpsqS0000 version=1.7.4 branch=1.5 commit=ef77e72f435b71b1ad6da7d6a6a4c4a262439379
ts=2020-04-27T21:02:54.219093Z lvl=info msg="Go runtime" log_id=09BpsqS0000 version=go1.11 maxprocs=4

InfluxDB successfully started, loaded all data into the index, and ran the compaction process. The Compactor is responsible for converting less optimized Cache and TSM data into more read-optimized formats. It does this by compressing series, removing deleted data, optimizing indices, and combining smaller files into larger ones.

During the compaction process, InfluxDB can use a lot of memory if it processes a measurement with a lot of values. In this case, it tries to allocate additional memory, which leads to an Out Of Memory error.

To solve this issue, use the following options:

  • Temporarily increase the RAM size, which allows the compaction to complete. After the compaction process completes, you can reduce the RAM size to the old value.
  • Find the biggest files and remove them. For more information, see Out Of Memory During Pod Start.

Out Of Memory During Executing a Query


The major factor that affects RAM is the series cardinality. RAM usage has an exponential relationship to the series cardinality (the series count in the database). A large series count increases the depth of the index and may lead to increased RAM usage.

For example, a complex query is as follows:

SELECT MIN(value),MAX(value),SUM(value),MEAN(value)
FROM /.*/
WHERE "uid" =~ /|14df3879-8198-4840-9bf1-5e3f942ff845/
  AND "indicatorName" !~ /./
  AND time > 1570665518s AND time < 1570665818s
GROUP BY *, time(300s)

which:

  • Uses 4 aggregated functions.
  • Selects from all measurements in the current database.
  • Groups by all available tags in the current database.

When executed on a database with 100,000 series, this complex query can do the following:

  • Execute in ~3 seconds.
  • Use ~200-500 MB of RAM.

But if this complex query is executed on a database with 600,000 series, it may:

  • Execute in more than ~30 seconds.
  • Use more than ~2 GB of RAM.

So in this case, you need to do the following:

  • Increase the pod resources.
  • Reduce the metrics cardinality; for example, review your metrics schema and redesign it to reduce the count of unique tags (see the sketch after this list for how to check the current cardinality).
  • Reduce the storage period (which also reduces the index depth and the metrics cardinality).
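
To estimate the current series cardinality before redesigning the schema, you can query InfluxDB directly. A minimal sketch using the HTTP API through influxdb-router (the database name is an example):

$ curl -G "http://influxdb-router.influxdb-cluster.svc:8086/query" --data-urlencode "q=SHOW SERIES CARDINALITY ON testdb"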

To reduce the storage period, you need to change the Retention Policy (RP), which automatically removes data with timestamps older than the duration specified in the RP.

To change the RP in the InfluxDB Cluster, you can use one of the following two options.

Option 1: Make changes in ConfigMap influxdb-service-config.

Your databases configuration in influxdb-service-config looks like this:

databases:
- database:
  name: test
  retention_policies:
  - retention_policy:
    name: test1
    duration: 20d
    shard: 1d
    default: yes
  - retention_policy:
    name: test2
    duration: 20d
    shard: 1d
    default: yes

where the storage period is set by the duration field.

For additional information about parameters, refer to the following sections in the InfluxDB Cluster Installation Procedure:

  • ADDITIONAL_DATABASES
  • ADDITIONAL_RETENTION_POLICY

Option 2: Execute a query to change RP on each InfluxDB node.

InfluxQL query to change RP:

ALTER RETENTION POLICY "default" ON "<database>" DURATION 20d SHARD DURATION 1d [DEFAULT]

where the storage period is set by DURATION XXd.

Note: This query must be executed on each InfluxDB pod in the project.
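
One way to do this is to execute the query through the influx CLI inside each InfluxDB pod. A minimal sketch, assuming the influx CLI is available in the image and using example pod and database names:

$ oc exec influxdb-1-2-1fnj8 -- influx -execute 'ALTER RETENTION POLICY "default" ON "testdb" DURATION 20d SHARD DURATION 1d'
$ oc exec influxdb-2-1-q6v6g -- influx -execute 'ALTER RETENTION POLICY "default" ON "testdb" DURATION 20d SHARD DURATION 1d'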

Read from InfluxDB Cluster Not Working


To provide a single entry point without loading the common OpenShift router, the InfluxDB Cluster uses HAProxy, which is deployed in a pod named influxdb-router.

The configuration of the HAProxy endpoints is stored in the influxdb-router-config ConfigMap and controlled by the influxdb-service service pod. However, in some cases, when InfluxDB pods are unstable and reboot, some endpoints can be absent from the config file.

It can lead to two possible issues:

  • Data is always read from only one InfluxDB pod (in case two InfluxDB pods are deployed in the InfluxDB Cluster)
  • Any request to /query returns a "503 Unavailable" response

Returns "503 Unavailable" Http Status Code for /query


When Prometheus or another tool makes a request to the /query endpoint and HAProxy has no servers to proxy the request to, it returns a "503 Unavailable" response.

Before sending the request, you need to get the influxdb-router service and use its IP or hostname to send requests. You can list the services using the following command:

$ oc get services
NAME                     CLUSTER-IP       EXTERNAL-IP   PORT(S)                               AGE
...
influxdb-router          172.30.127.232   <none>        8086/TCP,8087/TCP,8088/TCP,1936/TCP   3d
...

In OpenShift, the hostname can be built using the following scheme:

<service_name>.<project_name>.svc

For example, influxdb-router.influxdb-cluster.svc.

To check it manually, use one of the pods that has the cURL tool, for example influxdb-backup-daemon.

For example, a normal response:

$ curl -XGET http://influxdb-router.influxdb-cluster.svc:8086/query?q=show+databases
{"results":[{"statement_id":0,"series":[{"name":"databases","columns":["name"],"values":[["prometheus"],["monitoring_server"],["test"],["test1"],["test2"],["test4"],["test10"],["test11"],["test12"]]}]}]}

But if HAProxy has no server to process the endpoint request, it can return the following:

$ curl -v -XGET http://influxdb-router.influxdb-cluster.svc:8086/query?q=show+databases
* About to connect() to influxdb-router.influxdb-cluster.svc port 8086 (#0)
*   Trying 172.30.127.232...
* Connected to influxdb-router.influxdb-cluster.svc (172.30.127.232) port 8086 (#0)
> GET /query?q=show+databases HTTP/1.1
> User-Agent: curl/7.29.0
> Host: influxdb-router.influxdb-cluster.svc:8086
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 503 Service Unavailable
< Cache-Control: no-cache
< Connection: close
< Content-Type: text/html
<
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
* Closing connection 0

In this case, you need to check the influxdb-router-config ConfigMap using the following command:

$ oc get cm influxdb-router-config -o yaml
apiVersion: v1
data:
  haproxy.cfg: ....

Check whether the influxdb_query backend section contains server entries, for example:

backend influxdb_query
    mode http
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server influxdb-2 172.30.34.194:8086 check cookie influxdb-2 weight 10
    server influxdb-1 172.30.119.114:8086 check cookie influxdb-1 weight 10
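
To verify that these server entries match the InfluxDB pods that actually exist, you can compare them with the current pod IPs. A minimal sketch:

$ oc get pods -o wide | grep influxdb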

This issue can be solved in the following ways:

  • Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
  • Manually include the lost InfluxDB pod.

To include the lost pod manually:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    or
    influxdb-s start-read <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

InfluxDB Pod Absent from List of HAProxy Endpoints


This situation can occur in the following cases:

  • One of the InfluxDB pods is unstable and had some restarts
  • Somebody manually deploys a new InfluxDB pod

In this case, the influxdb-router-config ConfigMap has only one server. In the latest releases of the InfluxDB Cluster, you can run the following command in the influxdb-service pod to show the currently active nodes:

influxdb-s list

If this command is not available, you can view the ConfigMap directly. For example:

backend influxdb_query
    mode http
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server influxdb-1 172.30.119.114:8086 check cookie influxdb-1 weight 10

This is not a good situation because read queries are not load-balanced and there is no high availability.

This issue can be solved in the following ways:

  • Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
  • Manually include the lost InfluxDB pod.

To include the lost pod manually:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    or
    influxdb-s start-read <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

Write to InfluxDB Cluster Not Working


To synchronize written metrics across the InfluxDB pods, influxdb-relay pods are used. They expose the /write endpoint and proxy all traffic from this endpoint to each InfluxDB pod.

This can lead to three possible issues:

  • Data is always written to only one InfluxDB pod (in case two InfluxDB pods are deployed in the InfluxDB Cluster)
  • Any request to /write returns the "Unable to write points" response
  • Any request to /write returns the "503 Unavailable" response

Returns "Unable to write points"


This issue can occur if influxdb-relay has no endpoints to proxy requests to, or if one of the InfluxDB pods cannot process the write request.

In this case, the influxdb-relay-config ConfigMap has only one server. In the latest releases of the InfluxDB Cluster, you can run the following command in the influxdb-service pod to show the currently active nodes:

influxdb-s list

If this command is not available, you can view the ConfigMap directly. For example:

[[http]]
name = "http"
bind-addr = ":9096"

[[http.output]]
name = "influxdb-1"
location = "http://influxdb-1.influxdb-cluster-dev.svc:8086/write"
buffer-size-mb = 100
max-batch-kb = 50
max-delay-interval = "5s"

This is not a good situation because write requests are proxied to only one InfluxDB pod, so the data is not replicated and there is no high availability.

This issue can be solved in the following ways:

  • Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
  • Manually include the lost InfluxDB pod.

To include the lost pod manually:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    or
    influxdb-s start-write <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

Returns "503 Unavailable" Http Status Code for /write


This is a rare case. This response can be returned by influxdb-router (HAProxy), which is used as the entry point.

The write request flow is as follows:

Write metrics -> influxdb-router -> influxdb-relay -> some influxdb pods

During the first deployment, influxdb-router is deployed with an empty ConfigMap without any servers. After the influxdb-service service pod is deployed, it discovers the pods in the project and fills in all configurations.

So, when influxdb-router returns a 503 error for a request to /write, it means that influxdb-service failed during deployment, or that the deploy job did not get to the second step, in which the influxdb-service pod is deployed.

To solve this issue:

  • Open the deploy job log.
  • Investigate why influxdb-service failed or why the job did not get to the second step (see the sketch after this list).
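
You can also check the state of the influxdb-service deployment directly in the project. A minimal sketch:

$ oc get pods | grep influxdb-service
$ oc logs dc/influxdb-service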

InfluxDB Pod Absent from List of Relay Endpoints


This situation can occur in the following cases:

  • One of the InfluxDB pods is unstable and had some restarts
  • Somebody manually deployed a new InfluxDB pod

In this case, the influxdb-relay-config ConfigMap has only one server. In the latest releases of the InfluxDB Cluster, you can run the following command in the influxdb-service pod to show the currently active nodes:

influxdb-s list

If this command is not available, you can view the ConfigMap directly. For example:

[[http]]
name = "http"
bind-addr = ":9096"

[[http.output]]
name = "influxdb-1"
location = "http://influxdb-1.influxdb-cluster-dev.svc:8086/write"
buffer-size-mb = 100
max-batch-kb = 50
max-delay-interval = "5s"

This is not a good situation because metrics are not written to all InfluxDB pods. This leads to data being out of sync between InfluxDB nodes, and potentially some data can be lost during failover.

This issue can be solved in the following ways:

  • Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
  • Manually include the lost InfluxDB pod.

To include the lost pod manually:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    or
    influxdb-s start-write <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

InfluxDB Pod Does Not Return to the Cluster After Failover


This situation can occur in the following cases:

  • One of the InfluxDB pods is unstable and had some restarts
  • Somebody manually deployed a new InfluxDB pod

This issue can be solved in the following ways:

  • Restart the influxdb-service pod, which re-initializes during the restart and includes all running InfluxDB pods.
  • Manually include the lost InfluxDB pod.

To include the lost pod manually:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

No Metrics Available


This situation can occur due to several problems:

  • Some links for collecting metrics can be absent from the telegraf-agent ConfigMap
  • The telegraf-agent ConfigMap has an incorrect configuration and Telegraf cannot parse it
  • Some components do not expose metrics

Analysis of some problems and their solutions can be found below.

Telegraf Has No Settings for Collecting Metrics


Usually this issue is related to a bug or incorrect operation of influxdb-service. This service controls the ConfigMap for telegraf-agent and adds or removes links to InfluxDB pods when these pods are included in or excluded from the cluster.

However, in the following cases, influxdb-service cannot add the necessary records to the ConfigMap:

  • One of the InfluxDB pods is unstable and had some restarts
  • Somebody manually deploys a new InfluxDB pod

To solve this issue, execute the following actions.

Check the telegraf-agent logs and verify that they do not contain error messages saying that the config cannot be parsed. You can print the logs using the following command:

oc logs <pod_name>

Check the telegraf-agent ConfigMap and verify that all InfluxDB pods and other components are present in it. You can print the ConfigMap using the following command:

oc get configmap telegraf-config -o yaml
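
To quickly see which InfluxDB endpoints are listed in the ConfigMap, you can filter the output by the InfluxDB HTTP port. A minimal sketch:

$ oc get configmap telegraf-config -o yaml | grep 8086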

If some InfluxDB pods are absent from the ConfigMap, you can restore them using influxdb-service commands:

  • Navigate to the terminal of the influxdb-service pod:
    oc rsh <pod_name> bash
  • Call the influxdb-s daemon command:
    influxdb-s include <service_name>
    where <service_name> is the service name associated with the lost InfluxDB pod.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

Component Does Not Expose Metrics


Telegraf should collect metrics from the following components:

  • InfluxDB Service
  • InfluxDB pods

If all necessary links are specified in the ConfigMap, you need to check that the components expose the metrics.

To check InfluxDB metrics, use one of the pods that has the cURL tool, for example, influxdb-backup-daemon.

For example, a normal response:

$ curl -XGET http://influxdb-1.influxdb-cluster.svc:8086/debug/vars
{
"system": {"currentTime":"2020-04-29T11:15:38.436999238Z","started":"2020-04-29T10:46:30.53370145Z","uptime":1747},
"cmdline": ["influxd"],
"memstats": {"Alloc":39204680,
...

To check InfluxDB Service metrics, use one of the pods that has the cURL tool, for example, influxdb-backup-daemon.

For example, a normal response:


$ curl -XGET http://influxdb-service.influxdb-cluster.svc:8080/metrics
[{"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-1", "status": 1, "create_db_status": 1, "create_db_time": 0.01268862932920456, "write_status": 1, "write_time": 0.031825680285692215, "query_status": 1, "query_time": 0.012298475950956345, "delete_db_status": 1, "delete_db_time": 0.023207109421491623}, {"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-2", "status": 1, "create_db_status": 1, "create_db_time": 0.008592180907726288, "write_status": 1, "write_time": 0.02796272560954094, "query_status": 1, "query_time": 0.007285472005605698, "delete_db_status": 1, "delete_db_time": 0.012705259025096893}, {"name": "influxdb_cluster_health", "cluster": "influxdb-cluster", "component": "influxdb-relay", "status": 1}, {"name": "influxdb_cluster_overview", "cluster": "influxdb-cluster", "active_replicas_count": 2, "replicas_count": 2}]

If an InfluxDB pod does not respond, you can try to apply the troubleshooting steps described in the following sections:

  • InfluxDB Installation Problems
  • InfluxDB Pod is Down

If the InfluxDB Service pod does not respond, you can try to simply restart it. For example, using the following commands:

oc scale --replicas=0 dc influxdb-service
oc scale --replicas=1 dc influxdb-service

Or just remove the pod using the following command:

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
...
influxdb-service-35-jksc2        1/1       Running     0          17h

$ oc delete pod influxdb-service-35-jksc2

Corrupted TSM Files


There are situations when the OpenShift node on which the InfluxDB pod is running restarts with a Kernel Panic or for some other reason. The InfluxDB process, which stores data on Persistence Volumes (PVs), is affected and corrupts the TSM files. This makes the files unreadable and results in the failure of InfluxDB during its start. The following are the symptoms of this scenario:

  • The pod is constantly restarting.
  • The pod does not exit with code 137 (OOMKiller).
  • There is no valuable information in the logs.

InfluxDB can crash with logs, as shown in the following example:

2020-03-27T09:42:27.303607Z info InfluxDB starting {"log_id": "0Lnqmpul000", "version": "1.7.4", "branch": "1.7", "commit": "ef77e72f435b71b1ad6da7d6a6a4c4a262439379"}
2020-03-27T09:42:27.303734Z info Go runtime {"log_id": "0Lnqmpul000", "version": "go1.11", "maxprocs": 16}
2020-03-27T09:42:27.406310Z info Using data dir {"log_id": "0Lnqmpul000", "service": "store", "path": "/var/lib/influxdb/data"}
2020-03-27T09:42:27.406599Z info Compaction settings {"log_id": "0Lnqmpul000", "service": "store", "max_concurrent_compactions": 8, "throughput_bytes_per_second": 50331648, "throughput_bytes_per_second_burst": 50331648}
2020-03-27T09:42:27.406685Z info Open store (start) {"log_id": "0Lnqmpul000", "service": "store", "trace_id": "0LnqmqJW000", "op_name": "tsdb_open", "op_event": "start"}


The influx_inspect utility shows a lot of issues related to the corrupted files when it is used to check the TSM files.

For example:

$ influx_inspect verify -dir /var/lib/influxdb/
/var/lib/influxdb/data/monitoring_server/monitor/745240/000000001-000000001.tsm: healthy
/var/lib/influxdb/data/prometheus/default/729781/000000001-000000001.tsm: healthy
...
/var/lib/influxdb/data/prometheus/default/745303/000000005-000000001.tsm: healthy
/var/lib/influxdb/data/prometheus/mano/552221/000000002-000000002.tsm: healthy
...
/var/lib/influxdb/data/prometheus/mano/745256/000000001-000000001.tsm: healthy
Broken Blocks: 0 / 64399, in 0.134176837s
$ influx_inspect verify-seriesfile -dir /var/lib/influxdb/
2020-03-27T10:43:03.809148Z     error   Series file does not exist      {"log_id": "0LnuFn00000", "path": "/var/lib/influxdb/data/_series"}
2020-03-27T10:43:03.809425Z     error   Series file does not exist      {"log_id": "0LnuFn00000", "path": "/var/lib/influxdb/lost+found/_series"}
2020-03-27T10:43:03.809545Z     error   Series file does not exist      {"log_id": "0LnuFn00000", "path": "/var/lib/influxdb/meta/_series"}
2020-03-27T10:43:03.809665Z     error   Series file does not exist      {"log_id": "0LnuFn00000", "path": "/var/lib/influxdb/wal/_series"}
$ influx_inspect verify-seriesfile -dir /var/lib/influxdb/data/
2020-03-27T10:43:32.103647Z     error   Error opening segment   {"log_id": "0LnuHWXG000", "path": "/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "00", "segment": "0000", "error": "invalid series segment"}
...
2020-03-27T10:43:32.105459Z     error   Error opening segment   {"log_id": "0LnuHWXG000", "path": "/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "07", "segment": "0000", "error": "invalid series segment"}
2020-03-27T10:43:32.479805Z     error   Index inconsistency     {"log_id": "0LnuHWXG000", "path": "/var/lib/influxdb/data/monitoring_server/_series", "partition": "06", "id": 7, "got_offset": 5, "expected_offset": 0}
2020-03-27T10:43:32.582526Z     error   Index inconsistency     {"log_id": "0LnuHWXG000", "path": "/var/lib/influxdb/data/prometheus/_series", "partition": "01", "id": 410, "got_offset": 6889, "expected_offset": 0}

Verify TSM Files


To execute troubleshooting actions, run the pod in the debug mode in OpenShift. When the pod runs in the debug mode, the container starts with an overridden entrypoint: instead of running the influxdb process, the system runs the /bin/bash process.

To run the pod in the debug mode:

  1. Navigate to the pod. If the pod does not exist, run a new deploy and navigate to the created pod.
  2. See the Status section on the Details tab.
  3. Click Debug in Terminal.
  4. In the window that opens, execute the commands in the pod.

To check the TSM files, execute the following commands:

  • To check TSM files
    influx_inspect verify -dir /var/lib/influxdb/
  • To check TSM series files
    influx_inspect verify-seriesfile -dir /var/lib/influxdb/
    influx_inspect verify-seriesfile -dir /var/lib/influxdb/data/

For example:

$ influx_inspect verify -dir /var/lib/influxdb/
/var/lib/influxdb/data/monitoring_server/monitor/745240/000000001-000000001.tsm: healthy
...
Broken Blocks: 0 / 64399, in 0.134176837s
$ influx_inspect verify-seriesfile -dir /var/lib/influxdb/
2020-03-27T10:43:03.809148Z     error   Series file does not exist      {"log_id": "0LnuFn00000", "path": "/var/lib/influxdb/data/_series"}
...
$ influx_inspect verify-seriesfile -dir /var/lib/influxdb/data/
2020-03-27T10:43:32.103647Z     error   Error opening segment   {"log_id": "0LnuHWXG000", "path": "/var/lib/influxdb/data/idb_service_check_db/_series", "partition": "00", "segment": "0000", "error": "invalid series segment"}
...

Fix Corrupted TSM Files


To solve the corrupted TSM files issue, remove the data and restore it from another InfluxDB pod.

To fix the issue:

  1. Run the pod in the debug mode. Refer to Verify TSM Files for details.
  2. Remove all files from the PV using the following command:
    rm -rf /var/lib/influxdb/data/*
  3. Restart the pod.
  4. Restart the influxdb-service.
  5. Manually run the restore process by navigating to the influxdb-service pod and calling the following influxdb-s daemon command:
    influxdb-s restore <service_name>
    where <service_name> is the service name of the InfluxDB pod on which the data is to be restored.

For additional information about all available manual actions, refer to Manual Actions With InfluxDB Cluster in the Cloud Platform Maintenance Guide.

Alarms


InfluxDB Component is Down



Problem                                              Severity
InfluxDB component is down in {$NAMESPACE} project   Disaster

This trigger checks the count of currently running components in the InfluxDB Cluster. The metrics contain the statuses of the following components:

  • InfluxDB pods
  • InfluxDB relay

The alarm is raised when the count of currently active components equals 0.

Possible Component Down Problems


Usually this issue occurs due to the following reasons:

  • The project with the InfluxDB Cluster was removed
  • The influxdb-service component is down or cannot expose metrics
  • The telegraf-agent pod is down, has incorrect settings, or does not collect metrics
  • InfluxDB on the monitoring VM is down

Component Down Solution


First, check that all pods of the InfluxDB Cluster are alive in the project. You can check this using the following command:

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
influxdb-1-2-1fnj8               1/1       Running     0          12h
influxdb-2-1-q6v6g               1/1       Running     0          13d
influxdb-backup-daemon-2-6qmbs   1/1       Running     0          2d
influxdb-relay-t3j55             1/1       Running     0          12h
influxdb-router-1-8blvg          1/1       Running     0          12h
influxdb-service-35-jksc2        1/1       Running     0          17h
influxdb-tests-9k3c1             0/1       Completed   0          13d
telegraf-5-ts2f3                 1/1       Running     0          17h

If some pods are down, you can check the troubleshooting steps in InfluxDB Installation Problems or InfluxDB Pod is Down.

CPU Usage by InfluxDB Component is Above {any}%



Problem                                                                                               Severity
CPU usage by InfluxDB nodes is above {$CPU_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project    High
CPU usage by InfluxDB relay is above {$CPU_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project    High

The trigger uses metrics about pods and nodes that the Heapster agent collects from OpenShift.

The alarms are raised when one of the InfluxDB pods or the InfluxDB relay uses a lot of CPU resources. The default thresholds are as follows:

  • High
    • Raise = >= 90%
    • Clear = < 90%

For example, suppose influxdb-relay specifies the resources cpu.requests = 100m and cpu.limits = 200m. When the influxdb-relay pod starts using more than 180m (90% of the limit), an alarm is raised.

Possible CPU Usage Problems


Alarms about problems with the increased CPU usage can be raised in the following cases:

  • The stored data has a high cardinality and requires a lot of CPU to process selections
  • Some components execute very large queries for which InfluxDB has to scan a lot of data and use a lot of resources

CPU Usage Solution


To solve issues related to increased CPU usage, refer to the Pod Restarting section.

RAM Usage by InfluxDB Component is Above {any}%



Problem                                                                                               Severity
RAM usage by InfluxDB nodes is above {$RAM_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project    High
RAM usage by InfluxDB relay is above {$RAM_HIGH_THRESHOLD} for {$INTERVAL} in {$NAMESPACE} project    High

The trigger uses metrics about pods and nodes that the Heapster agent collects from OpenShift.

The alarms are raised when one of the InfluxDB pods or the InfluxDB relay uses a lot of RAM. The default thresholds are as follows:

  • High
    • Raise = >= 90%
    • Clear = < 90%

For example, suppose influxdb-relay specifies the resources memory.requests = 100Mi and memory.limits = 200Mi. When the influxdb-relay pod starts using more than 180Mi (90% of the limit), an alarm is raised.

Possible RAM Usage Problems


Alarms about problems with the increased memory usage can be raised in the following cases:

  • A lot of data is stored and the allocated resources are too small for this data amount
  • The stored data has a high cardinality and requires a lot of memory to select data
  • Some components execute very large queries for which InfluxDB has to scan a lot of data and use a lot of resources

RAM Usage Solution


To solve issues related to increased memory usage, refer to the Pod Restarting section.

Smoke Test Failed for InfluxDB Cluster



Problem                                                                       Severity
Create test DB fail for InfluxDB Cluster in {$NAMESPACE} project              Average
Delete test DB fail for InfluxDB cluster in {$NAMESPACE} project              Average
Query data in testing DB fail for InfluxDB cluster in {$NAMESPACE} project    Average
Write data in testing DB fail for InfluxDB cluster in {$NAMESPACE} project    Average

These triggers check the results of the smoke tests that the service pod executes on each InfluxDB pod.

Each smoke test run executes the following logic:

  • Create a test database with the name service_check_db
  • Write some test measurements with some test data
  • Read the test data that was written in the previous step
  • Drop the test database

The alarms are raised when one of these checks fails and the metric contains the result 0.
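
You can also reproduce these checks manually against the cluster entry point using the InfluxDB HTTP API. A minimal sketch through influxdb-router (the database and measurement names are examples):

$ curl -XPOST "http://influxdb-router.influxdb-cluster.svc:8086/query" --data-urlencode "q=CREATE DATABASE manual_check_db"
$ curl -XPOST "http://influxdb-router.influxdb-cluster.svc:8086/write?db=manual_check_db" --data-binary 'manual_check value=1'
$ curl -G "http://influxdb-router.influxdb-cluster.svc:8086/query?db=manual_check_db" --data-urlencode "q=SELECT * FROM manual_check"
$ curl -XPOST "http://influxdb-router.influxdb-cluster.svc:8086/query" --data-urlencode "q=DROP DATABASE manual_check_db"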

Possible Smoke Test Failed Problems


Alarms about problems with the smoke test can point to issues with CPU or memory usage.

So they can be raised in the following cases:

  • A lot of data is stored and the allocated resources are too small for this data amount
  • The stored data has a high cardinality and requires a lot of memory to select data
  • Some components execute very large queries for which InfluxDB has to scan a lot of data and use a lot of resources
  • Problems with network latency
  • Problems with network connections
  • Problems with metrics collection

Smoke Test Failed Solution


To solve issues related to the smoke test, refer to the following sections:

  • Pod Restarting
  • No Metrics Available
