AUTOSCALING IN ALCHEMISCALE: TECHNICAL DESIGN DOCUMENT

The autoscaling API allows custom, platform-specific manager services to perform automatic horizontal scaling of compute resources.

Background

Currently, the number of alchemiscale compute services must be managed manually by administrators. This is tedious and error-prone, especially when balancing load across a heterogeneous set of compute platforms.

Objectives

To reduce the need for administrators to micromanage compute services, we aim to define a protocol that enables the implementation of a "compute manager" application.

These compute managers will allow compute services to be scaled horizontally based on demand, as determined by the number of claimable tasks in the statestore.

Requirements

Design

To keep things general, we assume that a compute manager will not have direct access to the compute services it creates. This mostly accommodates deployments that run under a queuing system, such as Slurm. Because of this, created services must be able to shut themselves down cleanly. Shutdown can be triggered by a time limit, the number of executed tasks, or the number of empty task claims.
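
As a rough illustration of these shutdown conditions, the sketch below tracks elapsed wall-clock time, executed tasks, and empty task claims. The parameter names (max_time, max_tasks, max_empty_claims) and the choice to count consecutive empty claims are illustrative assumptions, not settled settings.

  # Sketch of self-shutdown bookkeeping; not the actual alchemiscale service code.
  import time
  from dataclasses import dataclass, field

  @dataclass
  class ShutdownConditions:
      max_time: float        # wall-clock limit in seconds
      max_tasks: int         # stop after this many executed tasks
      max_empty_claims: int  # stop after this many consecutive empty claims
      start: float = field(default_factory=time.monotonic)
      tasks_executed: int = 0
      empty_claims: int = 0

      def record_claim(self, claimed_task: bool) -> None:
          """Update counters after each claim attempt."""
          if claimed_task:
              self.tasks_executed += 1
              self.empty_claims = 0  # treat the empty-claim limit as consecutive
          else:
              self.empty_claims += 1

      def should_stop(self) -> bool:
          """True once any shutdown condition has been met."""
          return (
              time.monotonic() - self.start >= self.max_time
              or self.tasks_executed >= self.max_tasks
              or self.empty_claims >= self.max_empty_claims
          )

A service would call record_claim after every claim attempt and exit its main loop as soon as should_stop returns True, so that jobs submitted through a queuing system release their allocation on their own.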

Implementation

The implementation requires defining a new interface, ComputeManager, which declares the following abstract methods (an illustrative sketch follows the list):

  1. register
  2. request_claimable
  3. check_managed_compute_services
  4. create_compute_service
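
The sketch below shows one possible shape for this interface; the signatures, return types, and docstrings are assumptions about how the methods could look, not the finalized API.

  # Sketch only: signatures and return types are assumptions, not the
  # finalized ComputeManager API.
  from abc import ABC, abstractmethod

  class ComputeManager(ABC):
      """Creates and tracks compute services based on claimable-task demand."""

      @abstractmethod
      def register(self) -> None:
          """Register this manager with the alchemiscale statestore."""
          ...

      @abstractmethod
      def request_claimable(self) -> int:
          """Ask the statestore how many tasks are currently claimable."""
          ...

      @abstractmethod
      def check_managed_compute_services(self) -> int:
          """Return how many services created by this manager are still alive."""
          ...

      @abstractmethod
      def create_compute_service(self) -> None:
          """Launch a new compute service on the target platform, e.g. by submitting a queue job."""
          ...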

Testing and validation

Since this feature defines only the interface through which a specific compute manager interacts with the statestore and creates compute services, the majority of testing is left to those concrete implementations.

Risks

Since this feature interacts directly with the allocation of high-performance computing resources, implementation errors and deployment misconfigurations can manifest as real and potentially ballooning financial costs. Communicating these risks and putting appropriate guardrails in place is paramount.
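
One possible guardrail, sketched below, is a hard cap on the number of compute services a manager may have alive at once, enforced before every creation request. Here max_compute_services is an assumed setting name, not an existing configuration option, and the scaling rule of one service per claimable task is purely illustrative.

  # Illustrative guardrail only; max_compute_services is an assumed setting.
  def scale_up(manager, claimable_tasks: int, max_compute_services: int) -> int:
      """Create at most enough services to cover demand, never exceeding the cap.

      Returns the number of services actually created.
      """
      alive = manager.check_managed_compute_services()
      headroom = max(0, max_compute_services - alive)  # never exceed the hard cap
      to_create = min(claimable_tasks, headroom)
      for _ in range(to_create):
          manager.create_compute_service()
      return to_create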

Conclusion