Type: Bug
Resolution: Fixed
Priority: High
Affects Version/s: 9.0.0, 8.19.3
Votes: 2
Support Severity: Severity 3 - Minor
Watchers: 19
Issue Summary
When the Mesh sidecar's database is corrupted, a node's /status endpoint alternates between RUNNING and ERROR instead of settling on ERROR, which can mislead load balancer health checks into treating the node as healthy.
This is reproducible on Data Center: yes
Steps to Reproduce
- Create a Bitbucket Data Center cluster with two nodes and make sure the sidecar process is running on both nodes.
- Shut down the Bitbucket application in one of the nodes.
- Take a backup of the Mesh database and then corrupt it (a consolidated sketch follows these steps):
mv mesh.mv.db mesh-backup.mv.db; >mesh.mv.db
- Start up the instance and monitor the /status endpoint of the node that has the corrupted mesh database.
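A minimal sketch of the backup-and-corrupt step, assuming a typical BITBUCKET_HOME location (the path and the discovery command are assumptions; locate mesh.mv.db on the stopped node first):

BITBUCKET_HOME="/var/atlassian/application-data/bitbucket"   # assumption: adjust to your install
cd "$(dirname "$(find "$BITBUCKET_HOME" -name mesh.mv.db | head -n 1)")"
mv mesh.mv.db mesh-backup.mv.db   # keep the original aside as a backup
: > mesh.mv.db                    # leave an empty, hence corrupt, database file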
Expected Results
The Mesh sidecar will not start up, and the /status endpoint will consistently return
{"state":"ERROR"}
Actual Results
The /status endpoint alternates between RUNNING and ERROR:
while true; do curl -X GET http://xx.xx.xx.xx:7990/status && echo ""; sleep 1; done
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
Bitbucket repeatedly tries to restart the sidecar, but every attempt fails (and will keep failing until the database is fixed manually). There is no limit on the number of restart attempts, so it continues indefinitely.
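From the outside, the loop shows up as periodic state changes. One hedged way to confirm it (the node address is a placeholder) is to count transitions over a minute, since a healthy node should report none:

NODE="http://xx.xx.xx.xx:7990"          # placeholder: the affected node
prev=""; flips=0
for i in $(seq 1 60); do                # sample once per second for a minute
  cur=$(curl -s "$NODE/status")
  if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then flips=$((flips + 1)); fi
  prev="$cur"
  sleep 1
done
echo "state changes in 60s: $flips"     # roughly 6 with a ~10s cycle; 0 when healthy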
If an admin has configured their load balancer to check the health of the node every 10 seconds, the check may hit the node after Bitbucket has asked the sidecar to start but before the sidecar has failed (i.e. while /status reports "RUNNING").
The default time the app waits before restarting the sidecar is 5s, and it takes about 4-5s for the sidecar to fail, so a full RUNNING -> ERROR -> RUNNING cycle takes ~10s. Any check interval that is a multiple of that ~10s cycle (10s, 30s, and so on) can stay in phase with it, so the LB might only ever see "RUNNING" and thus conclude the node is healthy.
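To make the phase alignment concrete, here is a small timeline model (an illustrative sketch under the ~5s RUNNING / ~5s ERROR assumption above, not Bitbucket code):

for interval in 10 5; do                  # a 10s probe aliases; a 5s probe does not
  echo "probe every ${interval}s:"
  for t in $(seq 0 "$interval" 30); do
    phase=$(( t % 10 ))                   # position within the ~10s flap cycle
    if [ "$phase" -lt 5 ]; then state=RUNNING; else state=ERROR; fi
    echo "  t=${t}s -> $state"
  done
done

With a 10s interval every probe lands at the same phase and prints RUNNING; with a 5s interval the probes alternate phases, so the ERROR state is observed.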
Please see the illustration of the scenario in the screenshot.
Workaround
Currently there is no known workaround for this behavior on the Bitbucket side. As a temporary mitigation, admins can configure the load balancer to check node health at a 5-second interval; because 5s is out of phase with the ~10s flapping cycle, the check will also observe the ERROR state.
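As one hypothetical illustration of that mitigation (server names and addresses are placeholders, not a recommendation from this ticket), an HAProxy backend could probe /status every 5 seconds and require the RUNNING body:

backend bitbucket
    option httpchk GET /status
    http-check expect string RUNNING       # response body must contain RUNNING
    default-server inter 5s fall 2 rise 2  # probe every 5s; two failures mark a node down
    server bbs-node1 10.0.0.1:7990 check   # placeholder node addresses
    server bbs-node2 10.0.0.2:7990 check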