Self Recovery
If the previous server maintenance or downtime lasted an extended period of time, the apid service can take longer to recover. Check the ready_timeout parameter for apid in orchestrationd.yml. The parameter unit is seconds, with a default of 5 minutes (300 seconds). Depending on the data volume, you may want to increase the timeout to 1 hour or more.
Check the apid log files to confirm whether startup succeeded. Issues are most likely to arise from a slow or heavily contended disk, which may require multiple restart attempts before apid resumes. The following messages indicate symptoms of such conditions:
WARN [Thread-43] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [column_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [Thread-44] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [column_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [main] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [column_family.cc:872] [default] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216
WARN [main] [RocksDBCache/hub-keyValueStore-entitySnapshots] - [column_family.cc:836] [default] Stopping writes because we have 16 immutable memtables (waiting for flush), max_write_buffer_number is set to 16
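As a sketch of the ready_timeout change described above — the YAML nesting shown here is an assumption, and only the parameter name, its unit (seconds), and the suggested values come from this article:

```yaml
# Hypothetical fragment of orchestrationd.yml; verify the actual nesting
# in your installation before editing.
apid:
  ready_timeout: 3600   # raise from the 300-second (5-minute) default to 1 hour
```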
Manual Recovery
If apid is unable to recover by itself after the timeout is increased, you may consider deleting and recreating the entity snapshots Kafka topic. This topic only holds the latest value for each metric, so it will be re-populated over time as new data is consumed.
Please follow the instructions below:
Step 0: Environment preparation
On all nodes (adjust the commands below to match your actual installation paths):
export HUB_HOME=/opt/hub/hub-2.5.1
export KAFKA_HOME=$HUB_HOME/services/kafka-2.12-2.8.1-log4j-patched
Step 1: mask the systemd service and manually start the Hub in run-level 2
On all nodes:
sudo systemctl mask hub-orchestration
sudo systemctl stop hub-orchestration
$HUB_HOME/bin/hub.sh start
$HUB_HOME/bin/hub-admin run-level 2
Step 2
On all nodes:
Wait for apid to be stopped (using $HUB_HOME/bin/hub-admin) and delete the directory /opt/hub/tmp/apid-kafka-kvs/hub-keyValueStore-entitySnapshots
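The deletion itself is a single command (the path is the one given above; run it only after confirming apid has fully stopped):

```shell
# Remove the local RocksDB key-value store for entity snapshots.
# Only run this once apid is stopped on the node.
rm -rf /opt/hub/tmp/apid-kafka-kvs/hub-keyValueStore-entitySnapshots
```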
Step 3
On one node only:
Check whether the file $KAFKA_HOME/../conf/server.properties contains delete.topic.enable=true. If not, add it to the end of the file and restart the Hub via $HUB_HOME/bin/hub.sh restart
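If you prefer to script this check, a small sketch — the helper function name is ours; the file path and property come from the step above:

```shell
# Idempotently ensure a key=value line exists in a properties file.
ensure_property() {
  file="$1"; line="$2"
  # -x matches the whole line, -F treats the pattern as a literal string
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# As used for this step (assumes KAFKA_HOME is exported as in Step 0):
# ensure_property "$KAFKA_HOME/../conf/server.properties" 'delete.topic.enable=true'
```

Because the append only happens when the line is absent, the helper is safe to re-run on nodes where the property is already set.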
Step 4
Ensure that the Kafka client can access details about topic hub-keyValueStore-entitySnapshots
$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --describe
Step 5
Delete topic hub-keyValueStore-entitySnapshots
$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --delete
Step 6
Wait for topic hub-keyValueStore-entitySnapshots to be gone
$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --list
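Rather than re-running the list command by hand, the wait can be scripted. This is a sketch — the function name, retry count, and sleep interval are ours — that polls any listing command until the topic name disappears from its output:

```shell
# Poll a topic-listing command until the given topic no longer appears.
# Usage: wait_for_topic_gone TOPIC LIST_COMMAND [ARGS...]
wait_for_topic_gone() {
  topic="$1"; shift
  attempts=0
  # Re-run the listing command; grep -x/-F matches the topic name exactly.
  while "$@" | grep -qxF "$topic"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 60 ]; then
      echo "timed out waiting for topic $topic to be deleted" >&2
      return 1
    fi
    sleep 10
  done
  echo "topic $topic is gone"
}
```

For this step you would pass the article's own list command, e.g. wait_for_topic_gone hub-keyValueStore-entitySnapshots $KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --list — the same pattern also works for Step 8 by inverting the check.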
Step 7
Re-create topic hub-keyValueStore-entitySnapshots
$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --create --config cleanup.policy=compact --config min.insync.replicas=1 --config segment.bytes=536870912 --partitions 10 --replication-factor 1
Step 8
Wait until topic hub-keyValueStore-entitySnapshots becomes available
$KAFKA_HOME/bin/kafka-topics.sh --command-config $KAFKA_HOME/../conf/client.properties --zookeeper localhost:5181 --topic hub-keyValueStore-entitySnapshots --describe
Step 9
On all nodes:
Go back to run-level 4.
$HUB_HOME/bin/hub-admin run-level 4
Step 10: final checks
Check that everything is working as expected. In the Web Console, some Dataview cells will take time to become available, because the data snapshot is being rebuilt from incoming data.
Step 11: unmask the systemd service
On all nodes:
$HUB_HOME/bin/hub.sh stop
sudo systemctl unmask hub-orchestration
sudo systemctl start hub-orchestration