AUTOSCALING IN ALCHEMISCALE TECHNICAL DESIGN DOCUMENT
The autoscaling API allows custom, platform-specific manager services to perform automatic horizontal resource allocation.
Background
Currently, the number of alchemiscale compute services needs to be manually managed by administrators. This is tedious and error-prone, especially when balancing across a heterogeneous set of compute platforms.
Objectives
To reduce the need for administrators to micromanage compute services, we aim to define a protocol that enables the implementation of a "compute manager" application.
These compute managers will allow compute services to be horizontally scaled based on demand, as determined by the number of claimable tasks in the statestore.
Requirements
- Compute service growth can be made automatic, dictated by a compute manager's settings and the contents of the statestore.
- Compute managers can be signaled to stop creating new services.
- Scaling decisions will be based on throughput estimates and the current number of compute services registered.
- A maximum number of compute service instances will be enforced to prevent resource exhaustion (see the settings sketch below).
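As a rough illustration of the knobs these requirements imply, a compute manager's settings could be captured in something like the following Python sketch. The class and field names here are hypothetical, not an existing alchemiscale API.

    from dataclasses import dataclass

    @dataclass
    class ComputeManagerSettings:
        """Hypothetical settings for a compute manager; all names are illustrative only."""

        # seconds between statestore queries
        query_interval: int = 60

        # hard ceiling on registered compute service instances, to prevent
        # resource exhaustion
        max_instances: int = 10

        # rough per-service throughput estimate (tasks/hour) used when weighing
        # the claimable-task backlog against currently registered services
        tasks_per_hour_per_service: float = 4.0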
Design
- Compute managers can query the ComputeServiceRegistration nodes they were responsible for creating.
- Compute managers will be given an affirmative signal when they may grow resources; otherwise they must wait until the next query interval (see the loop sketch following this list).
- Queries are executed on a set interval specified in the compute manager settings.
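A minimal sketch of the resulting query loop, assuming the hypothetical settings above and the ComputeManager methods introduced under Implementation. The semantics assumed here (request_claimable returning the affirmative signal, check_managed_compute_services returning a count of live registrations) are illustrative only.

    import time

    def run(manager, settings):
        """Hypothetical main loop for a compute manager (a sketch, not real alchemiscale code)."""
        while True:
            # count how many of the services this manager created are still registered
            active = manager.check_managed_compute_services()

            # ask for the affirmative signal to grow; assumed here to be True
            # only when claimable tasks in the statestore justify a new service
            if active < settings.max_instances and manager.request_claimable():
                manager.create_compute_service()

            # wait until the next query interval before asking again
            time.sleep(settings.query_interval)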
To keep things general, we assume that a compute manager will not have direct access to the compute services it creates. This mostly accommodates systems running under a queueing system such as Slurm. Because of this, created services must be able to cleanly shut themselves down. Shutdown can be triggered by a time constraint, the number of executed tasks, or the number of empty task claims.
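To illustrate the self-shutdown conditions just described, a compute service's main loop might consult a check like the following. The counters and limit names are hypothetical, not existing alchemiscale code.

    import time

    def should_shut_down(start_time, tasks_executed, empty_claims, limits):
        """Return True when any self-termination condition is met.

        `limits` is a hypothetical object carrying max_time (seconds), max_tasks,
        and max_empty_claims; all names here are illustrative only.
        """
        if time.time() - start_time > limits.max_time:
            return True  # time constraint exceeded
        if tasks_executed >= limits.max_tasks:
            return True  # executed-task quota reached
        if empty_claims >= limits.max_empty_claims:
            return True  # too many claim attempts came back empty
        return False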
Implementation
The implementation requires the definition of a new interface, ComputeManager, which defines the following abstract methods (sketched below):
- register
- request_claimable
- check_managed_compute_services
- create_compute_service
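One way to express this interface is an abstract base class along the following lines. This is a sketch only; the signatures, return types, and docstrings are assumptions rather than the final alchemiscale API.

    from abc import ABC, abstractmethod

    class ComputeManager(ABC):
        """Sketch of the proposed interface; signatures and return types are assumptions."""

        @abstractmethod
        def register(self) -> None:
            """Register this compute manager with the alchemiscale compute API."""
            ...

        @abstractmethod
        def request_claimable(self) -> bool:
            """Ask whether claimable tasks in the statestore justify growing resources."""
            ...

        @abstractmethod
        def check_managed_compute_services(self) -> int:
            """Query the ComputeServiceRegistration nodes this manager created; return how many remain."""
            ...

        @abstractmethod
        def create_compute_service(self) -> None:
            """Launch a new compute service on the target platform (e.g. submit a Slurm job)."""
            ...

A platform-specific manager would subclass this and supply the mechanics of launching services, for example by submitting a batch job that starts a compute service process.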
Testing and validation
Since this feature defines the interface through which a specific compute manager interacts with the database and creates compute services, the majority of testing is left to those specific implementations.
Risks
Since this feature interacts directly with the allocation of high-performance computing resources, implementation errors and deployment misconfigurations can manifest as a very real, potentially ballooning financial cost. Communicating these risks and putting appropriate guardrails in place is paramount.