Type: Bug
Resolution: Fixed
Priority: High
Affects Version/s: 9.0.0, 8.19.3
Votes: 2
Support Severity: Severity 3 - Minor
Watchers: 19
Issue Summary
When the Mesh sidecar's database is corrupted, a node's /status endpoint alternates between RUNNING and ERROR instead of settling on ERROR, which can mislead load balancer health checks into treating the node as healthy.
This is reproducible on Data Center: yes
Steps to Reproduce
- Create a Bitbucket Data Center cluster with two nodes and make sure the sidecar process is running on both nodes.
- Shut down the Bitbucket application in one of the nodes.
- Take a backup of the Mesh database and then corrupt it (a consolidated sketch follows these steps):
mv mesh.mv.db mesh-backup.mv.db; >mesh.mv.db
- Start up the instance and monitor the /status endpoint of the node that has the corrupted mesh database.
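A minimal sketch of the backup-and-corrupt step, assuming a typical BITBUCKET_HOME location (the path and the discovery command are assumptions; locate mesh.mv.db on the stopped node first):

BITBUCKET_HOME="/var/atlassian/application-data/bitbucket"   # assumption: adjust to your install
cd "$(dirname "$(find "$BITBUCKET_HOME" -name mesh.mv.db | head -n 1)")"
mv mesh.mv.db mesh-backup.mv.db   # keep the original aside as a backup
: > mesh.mv.db                    # leave an empty, hence corrupt, database file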
Expected Results
The Mesh sidecar will not start up, and the /status endpoint will consistently return
{"state":"ERROR"}
Actual Results
The /status endpoint alternates between RUNNING and ERROR:
while true; do curl -X GET http://xx.xx.xx.xx:7990/status && echo ""; sleep 1; done
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"RUNNING"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
{"state":"ERROR"}
Bitbucket repeatedly tries to restart the sidecar, but every attempt fails (and will keep failing until the database is fixed manually). There is no limit on the number of restart attempts, so it continues indefinitely.
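From the outside, the loop shows up as periodic state changes. One hedged way to confirm it (the node address is a placeholder) is to count transitions over a minute, since a healthy node should report none:

NODE="http://xx.xx.xx.xx:7990"          # placeholder: the affected node
prev=""; flips=0
for i in $(seq 1 60); do                # sample once per second for a minute
  cur=$(curl -s "$NODE/status")
  if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then flips=$((flips + 1)); fi
  prev="$cur"
  sleep 1
done
echo "state changes in 60s: $flips"     # roughly 6 with a ~10s cycle; 0 when healthy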
If an admin has configured their load balancer to check the health of the node every 10 seconds, the check may hit the node after Bitbucket has asked the sidecar to start but before the sidecar has failed (i.e. while /status reports "RUNNING").
The default time the app waits before restarting the sidecar is 5s, and it takes about 4-5s for the sidecar to fail, so a full RUNNING -> ERROR -> RUNNING cycle takes ~10s. Any check interval that is a multiple of that ~10s cycle (10s, 30s, and so on) can stay in phase with it, so the LB might only ever see "RUNNING" and thus conclude the node is healthy.
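To make the phase alignment concrete, here is a small timeline model (an illustrative sketch under the ~5s RUNNING / ~5s ERROR assumption above, not Bitbucket code):

for interval in 10 5; do                  # a 10s probe aliases; a 5s probe does not
  echo "probe every ${interval}s:"
  for t in $(seq 0 "$interval" 30); do
    phase=$(( t % 10 ))                   # position within the ~10s flap cycle
    if [ "$phase" -lt 5 ]; then state=RUNNING; else state=ERROR; fi
    echo "  t=${t}s -> $state"
  done
done

With a 10s interval every probe lands at the same phase and prints RUNNING; with a 5s interval the probes alternate phases, so the ERROR state is observed.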
Please see the illustration of the scenario in the screenshot.
Workaround
Currently there is no known workaround for this behavior on the Bitbucket side. As a temporary mitigation, admins can configure the load balancer to check node health at a 5-second interval; because 5s is out of phase with the ~10s flapping cycle, the check will also observe the ERROR state.
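As one hypothetical illustration of that mitigation (server names and addresses are placeholders, not a recommendation from this ticket), an HAProxy backend could probe /status every 5 seconds and require the RUNNING body:

backend bitbucket
    option httpchk GET /status
    http-check expect string RUNNING       # response body must contain RUNNING
    default-server inter 5s fall 2 rise 2  # probe every 5s; two failures mark a node down
    server bbs-node1 10.0.0.1:7990 check   # placeholder node addresses
    server bbs-node2 10.0.0.2:7990 check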