Troubleshooting
Stuck Machines
Machines can get stuck if Stretchy cannot stop or decommission them. If a machine gets stuck, human intervention might be necessary.
If Stretchy cannot stop a machine itself:
- Stop the virtual machine or container manually if it is still running.
- Remove the virtual machine or container manually from the node.
After a couple of minutes, Stretchy should recognize that the machine no longer exists and proceed as normal.
If Stretchy cannot decommission a machine itself:
- Check if the job source is available. If it is not, wait until it comes back online.
- Manually remove the machine from the job source, if applicable.
Stretchy should then be able to complete the remaining decommissioning tasks on its own.
| It might take Stretchy some time to realize that something has happened because many operations use exponential back-off. The upper back-off limit is usually less than 15 minutes. Stretchy prints back-off intervals to the log, but it might be necessary to enable debug logging to see them. |
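To give a feel for how long such a delay can become, the following is a minimal sketch of capped exponential back-off. The base delay, growth factor, and number of attempts are assumptions chosen for illustration, and the function name is hypothetical. Only the 15-minute ceiling comes from the note above, and Stretchy's actual values may differ per operation.

```python
def backoff_intervals(base=1.0, factor=2.0, cap=900.0, attempts=12):
    """Return the wait time in seconds before each retry attempt.

    base, factor, and attempts are illustrative assumptions.
    cap=900.0 reflects the 15-minute upper limit mentioned above.
    """
    intervals = []
    delay = base
    for _ in range(attempts):
        # Never wait longer than the cap, however many retries have failed.
        intervals.append(min(delay, cap))
        delay *= factor
    return intervals


if __name__ == "__main__":
    for attempt, seconds in enumerate(backoff_intervals(), start=1):
        print(f"retry {attempt}: wait {seconds:.0f} s")
```

Under these assumed values, the wait doubles with every failed attempt until it hits the cap, so a fix applied right after a retry can go unnoticed for up to the full cap.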
Machine Waiting for a Node
Machines have to wait for a node to become available if there is no operational node with the required image on it and the necessary free resources (CPU, RAM) to accommodate it. Another possible reason is that a project or organization has reached its concurrency limits. That means that a machine has to wait until another one belonging to the same project or organization has completed.
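The conditions above can be sketched as a small predicate. This is not Stretchy's scheduler. The `Node` and `Request` types, their fields, and the single concurrency counter are hypothetical, chosen only to make the two reasons for waiting explicit.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a node; field names are assumptions.
    operational: bool
    images: set
    free_cpu: int
    free_ram_mb: int


@dataclass
class Request:
    # Hypothetical model of a machine request.
    image: str
    cpu: int
    ram_mb: int


def can_host(node, request):
    """A node qualifies only if it is operational, has the image,
    and has enough free CPU and RAM."""
    return (
        node.operational
        and request.image in node.images
        and node.free_cpu >= request.cpu
        and node.free_ram_mb >= request.ram_mb
    )


def must_wait(request, nodes, running, limit):
    """Return True if the machine has to wait for a node."""
    # Reason 1: the project or organization is at its concurrency limit.
    if running >= limit:
        return True
    # Reason 2: no node currently satisfies all requirements.
    return not any(can_host(node, request) for node in nodes)
```

Keeping the two reasons separate mirrors the remedies: the first is addressed by raising limits, the second by fixing or adding nodes.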
If too many machines are waiting for too long, there are a number of possible remedies:
- Unstick machines that are stuck and still assigned to a node.
- Bring nodes online that are missing.
- Check for mismatched labels (see below).
- Increase the concurrency limits of the respective project or organization.
- Add more nodes or increase the resources of existing nodes.
Mismatched labels can occur if an image has multiple labels attached to it but not all of those labels are present on a node with the image on it. The remedy is to add the missing labels to each node with the image on it.
For example, consider an image Windows 11 with the labels windows-11 and windows that is installed on a node with only the label windows-11 (windows is missing). A request for a machine with the label windows will be correctly resolved and assigned the image Windows 11. However, because there is no node with the label windows, the machine cannot be run.
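A check for mismatched labels boils down to a set difference between the labels of each image and the labels of each node that has that image. The sketch below assumes a simple in-memory representation. The dictionary layout, the node name `node-1`, and the function name are hypothetical and do not correspond to a Stretchy API.

```python
def find_label_mismatches(images, nodes):
    """Return {(node name, image name): missing labels}.

    images: dict mapping an image name to its set of labels.
    nodes:  dict mapping a node name to {"labels": set, "images": set}.
    Both structures are assumptions made for this illustration.
    """
    mismatches = {}
    for node_name, node in nodes.items():
        for image_name in node["images"]:
            # Labels the image carries but the node does not.
            missing = images[image_name] - node["labels"]
            if missing:
                mismatches[(node_name, image_name)] = missing
    return mismatches


if __name__ == "__main__":
    images = {"Windows 11": {"windows-11", "windows"}}
    nodes = {"node-1": {"labels": {"windows-11"}, "images": {"Windows 11"}}}
    print(find_label_mismatches(images, nodes))
    # prints {('node-1', 'Windows 11'): {'windows'}}
```

Each reported entry names the labels that would have to be added to the node for the image to be schedulable under every one of its labels.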
Missing Nodes
Stretchy Orchestrator only considers a node to be missing if it is enabled but not connected.
In such a case, the best course of action is to look into the log written by stretchy-agent. It usually reports the problem it encountered and what it is doing to remedy the situation. That should provide enough clues about what intervention is necessary, if any.
| Stretchy Agent communicates with Stretchy Orchestrator using WebSocket. A connection is always initiated by the agent. |
In rare cases, Stretchy Agent might not recognize that a WebSocket connection has become unhealthy. This manifests as the agent simply sitting idle and can be fixed by restarting stretchy-agent.