AUTOSCALING IN ALCHEMISCALE TECHNICAL DESIGN DOCUMENT

The autoscaling API allows custom, platform-specific manager services to perform automatic horizontal resource allocation.

Background

Currently, the number of alchemiscale compute services needs to be manually managed by administrators. This is tedious and error prone, especially when balancing across a heterogeneous set of compute platforms.

Objectives

In order to decrease the need for administrators to micromanage compute services, we aim to define a protocol to enable the implementation of a "compute manager" application.

These compute managers will allow compute services to be horizontally scaled based on demand, as determined by the number of claimable tasks in the statestore.

Requirements

Design

A compute manager is responsible for communicating with the alchemiscale compute API to determine whether to allocate more resources in response to the number of claimable tasks. The compute API will issue a signal (and an optional data payload) indicating how the manager should proceed:

GROW, int, list[str]
The compute manager is allowed to scale up. The number of claimable tasks and a list of compute service IDs are included along with the signal.
SKIP, None
The compute manager should skip this cycle and try again later.
SHUTDOWN, string
The compute manager should shut down. A string is provided to indicate why a shutdown was requested and should ideally be included in the compute manager log before shutting down.

In order to scale up, compute services, which claim tasks and execute their associated protocols, are created programmatically. In addition to the number of claimable tasks, a compute manager can also query the statestore for the compute services it was responsible for creating. The compute services are created by a compute manager from a predefined compute service settings template file.

We assume that a compute manager won't have direct access to the compute services it creates. This mostly accommodates systems running under a queuing system, such as Slurm. Because of this, created services must be able to cleanly shut themselves down. This can be triggered by a time constraint, the number of executed tasks, or the number of empty task claims, all of which are configurable when creating the compute service. Therefore, while the compute manager is responsible for up-scaling, the created services are responsible for down-scaling.
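A minimal sketch of the self-shutdown check a created service might carry, assuming the three configurable triggers above (all names here are illustrative, not the actual compute service settings):

```python
import time
from dataclasses import dataclass


@dataclass
class ShutdownPolicy:
    """Hypothetical down-scaling limits mirroring the three triggers."""

    max_seconds: float     # wall-clock time constraint
    max_tasks: int         # number of executed tasks
    max_empty_claims: int  # consecutive claims that returned no task

    def should_shut_down(self, started: float, tasks_done: int, empty_claims: int) -> bool:
        # Any one trigger firing is enough to begin a clean shutdown.
        return (
            time.monotonic() - started >= self.max_seconds
            or tasks_done >= self.max_tasks
            or empty_claims >= self.max_empty_claims
        )
```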

A note on uniqueness and manager identities

It is possible that an administrator will erroneously deploy duplicate, competing managers. Requiring a unique ID to deploy a compute manager will allow the alchemiscale server to limit communication to one instance of a manager with a given ID. This unique ID will also be tracked by the compute services that result from the compute manager, allowing managed services to re-associate with a restarted compute manager. Should a manager fail unexpectedly, the alchemiscale server will not allow the registration of a new manager with the same ID until the previous registration has been flushed.

In order to effectively differentiate compute managers sharing the same manager ID, a UUID (version 4) is introduced. This will allow new compute managers to take over the responsibilities of compute managers that were unable to shut down cleanly or that cannot communicate with the server due to network issues. Only a compute manager with the correct manager ID and UUID will be given the affirmative signal for compute upscaling. The two identifiers will be combined into a single identifier matching the pattern {manager_id}-{uuid4}.
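Constructing such an identifier is a one-liner; this sketch assumes the manager ID itself contains no hyphens, so the uuid4 suffix can be recovered unambiguously later (the helper name is hypothetical):

```python
import uuid


def new_compute_manager_id(manager_id: str) -> str:
    """Combine a manager ID with a fresh uuid4 into {manager_id}-{uuid4}."""
    # Assumption: manager_id contains no hyphens; the uuid4 suffix does.
    return f"{manager_id}-{uuid.uuid4()}"
```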

Implementation

Introduction of autoscaling will require a new client, an additional model in the statestore, new API endpoints, and an abstract class for implementing new compute managers. Additionally, ComputeServiceRegistration nodes in the statestore will have an additional field referencing the name of a compute manager.

A new client supporting compute managers

While compute managers overlap considerably with compute services, the AlchemiscaleComputeClient provides functionality that shouldn't be exposed to a manager, such as task claiming. We will define a new, minimal client that fulfills the requirements of a compute manager:

class ComputeManagerClient(AlchemiscaleBaseClient):

    def register(self) -> ComputeManagerID:
        ...

    def deregister(self) -> ComputeManagerID:
        ...

    def heartbeat(self) -> ComputeManagerID:
        ...

    def get_instruction(self) -> tuple[ComputeManagerInstruction, dict]:
        ...

where ComputeManagerID is a subclass of str:

class ComputeManagerID(str):

    @property
    def name(self):
        # Split only on the first hyphen: the uuid4 suffix itself
        # contains hyphens, while the manager name must not.
        return self.split('-', 1)[0]

    @property
    def uuid(self):
        return self.split('-', 1)[1]

and the ComputeManagerInstruction is a StrEnum, reflecting the signals specified in the design section:

class ComputeManagerInstruction(StrEnum):
    GROW = "GROW"
    SKIP = "SKIP"
    SHUTDOWN = "SHUTDOWN"

Statestore representation

A compute manager must be registered with the alchemiscale statestore, similar to how compute services are registered.

class ComputeManagerRegistration:

    manager_id: str
    uuid: str
    last_heartbeat: datetime 

API

computemanager_register
With a given ID, register a compute manager with the alchemiscale statestore. Successful registration returns a compute manager ID string.
computemanager_deregister
With a given ID, deregister the compute manager from the alchemiscale statestore.
computemanager_get_instruction
Request a compute manager instruction.
computemanager_heartbeat
Update the heartbeat time and supporting data payload associated with the compute manager.

Compute manager abstract class

The implementation requires the definition of a new interface, ComputeManager, which has the following attributes and defines the following methods:

class ComputeManager:
    compute_manager_id: ComputeManagerID
    heartbeat_interval: int
    sleep_interval: int
    client: ComputeManagerClient
    service_settings_template: bytes
    manager_settings: ComputeManagerSettings

    def register(self):
        ...

    def deregister(self):
        ...

    def request_instruction(self):
        ...

    @abstractmethod
    def create_compute_service(self):
        ...

The register method communicates the unique ID given to the compute manager to the alchemiscale compute API. If a compute manager with that ID already exists in the statestore, None is returned. The previously registered compute manager must be deregistered, either by shutting down the process responsible for creating it or by forcing a deregistration through an HTTP request with the unique ID, as implemented in the deregister method. A successful deregistration through deregister returns the unique compute manager ID; otherwise, None is returned.

Instructions are requested from the alchemiscale compute API with the request_instruction method. This method returns a ComputeManagerInstruction along with a data payload that should inform how the returned instruction is executed.

The protocol for creating compute services must be defined with the create_compute_service method. The implementation of this method is highly dependent on the nature of the computing platform and is out of scope for this document.

It is recommended to create compute service tasks in such a way that shutting down the compute manager does not kill the created compute services.
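As an illustration of create_compute_service against a Slurm-style queue, the following is a hedged sketch: the sbatch script contents, the alchemiscale compute command line, and the injectable submit hook are all hypothetical, not part of the alchemiscale API. Submitting a batch job (rather than spawning a child process) gives the created service a lifetime independent of the manager, as recommended above.

```python
import subprocess
import tempfile

# Illustrative batch script; real resource flags and the service command
# line would come from the compute service settings template.
SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name=alchemiscale-compute
#SBATCH --time=12:00:00
alchemiscale-compute --config {config_path}
"""


def create_compute_service(config_path: str, submit=subprocess.run) -> str:
    """Submit one compute service as an independent Slurm batch job."""
    script = SBATCH_TEMPLATE.format(config_path=config_path)
    # Write the job script to disk so sbatch can read it.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        script_path = f.name
    # The submit hook defaults to subprocess.run but can be swapped in tests.
    result = submit(["sbatch", script_path], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```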

Settings must be provided through the ComputeManagerSettings base model.

class ComputeManagerSettings:
    api_url: str
    name: str
    heartbeat_interval: int
    logfile: Path | None

These settings are also dependent on the compute platform and only a subset of settings are strictly required.

Changes to ComputeServiceRegistration and the compute service register method

ComputeServiceRegistration nodes will need to keep track of the compute managers responsible for their creation. To support this at the statestore level, we need only add another string field, which can optionally store a compute manager name.

Testing and validation

Since this feature defines the interface through which a specific compute manager interacts with the database and creates compute services, the majority of testing is left to those specific implementations.

Risks

Since this feature interacts directly with the allocation of high-performance computing resources, implementation errors and deployment misconfigurations can manifest as very real, potentially ballooning financial costs. Communicating these risks and putting appropriate guardrails in place is paramount.