AUTOSCALING IN ALCHEMISCALE TECHNICAL DESIGN DOCUMENT

The autoscaling API allows custom, platform-specific manager services to perform automatic horizontal scaling of compute resources.

Background

Currently, the number of alchemiscale compute services must be manually managed by administrators. This is tedious and error-prone, especially when balancing across a heterogeneous set of compute platforms.

Objectives

To reduce the need for administrators to micromanage compute services, we aim to define a protocol enabling the implementation of a "compute manager" application.

These compute managers will allow compute services to be horizontally scaled based on demand, as determined by the number of claimable tasks in the statestore.

Requirements

Design

A compute manager is responsible for communicating with the alchemiscale compute API to determine whether or not to allocate more resources in response to the number of claimable tasks. The compute API will issue a signal (and an optional data payload) indicating how the manager should proceed:

OK, int, list[str]
The compute manager is allowed to scale up if its settings permit it to. The number of claimable tasks and a list of compute service IDs are included along with the signal. This is the default signal.
SKIP, None
The compute manager should skip this cycle and try again later. This signal is issued when the alchemiscale server has determined that the compute manager should not be allowed to increase the number of compute services. For example, if many of the managed compute services are consistently failing, the compute manager should not be allowed to continue creating new instances.
SHUTDOWN, str
The compute manager should shut down. A string is provided to indicate why the shutdown was requested and should ideally be written to the compute manager log before shutting down. A compute manager will receive this signal if its registration has been invalidated, either by direct deletion or by being superseded by another compute manager with the same name.
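
As a sketch, a compute manager's main loop might dispatch on these signals as follows, using the request_instruction method and ComputeManagerInstruction enum defined under Implementation below; the maybe_scale_up and log hooks are illustrative, not part of the proposed interface:

import time

def run(manager):
    while True:
        instruction, payload = manager.request_instruction()

        if instruction == ComputeManagerInstruction.OK:
            # payload carries the number of claimable tasks and the IDs
            # of compute services this manager previously created
            manager.maybe_scale_up(payload)
        elif instruction == ComputeManagerInstruction.SKIP:
            # skip this cycle and try again later
            pass
        elif instruction == ComputeManagerInstruction.SHUTDOWN:
            # record the server's reason before exiting
            manager.log(payload)
            break

        time.sleep(manager.sleep_interval)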

To scale up, compute services, which claim tasks and execute their associated protocols, are created programmatically. In addition to the number of claimable tasks, a compute manager can also query the statestore about the compute services it was responsible for creating. Compute services are created by a compute manager from a predefined compute service settings template file.

We assume that a compute manager will not have direct access to the compute services it creates. This primarily accommodates systems running under a queuing system, such as SLURM. Because of this, created services must be able to cleanly shut themselves down. Shutdown can be triggered by a time constraint, the number of executed tasks, or the number of empty task claims, all of which are configurable when creating the compute service, as sketched below. Therefore, while the compute manager is responsible for up-scaling, the created services are responsible for down-scaling.
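
For example, the compute service settings template might carry self-termination limits along these lines; the field names below are hypothetical illustrations of the three triggers, not alchemiscale's actual setting names:

# hypothetical self-termination limits for a created compute service
service_limits = {
    "max_runtime_seconds": 4 * 60 * 60,  # stop after a wall-clock budget
    "max_executed_tasks": 50,            # stop after enough completed work
    "max_empty_claims": 10,              # stop once no tasks remain to claim
}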

A note on uniqueness and manager identities

It is possible that an administrator will erroneously deploy duplicate, competing managers. Requiring a unique ID to deploy a compute manager allows the alchemiscale server to limit communication to one instance of a manager with a given ID. This unique ID will also be tracked by the compute services that result from the compute manager, allowing managed services to re-associate with a restarted compute manager. Should a manager fail unexpectedly, the alchemiscale server will not allow the registration of a new manager until the previous registration has been flushed.

To effectively differentiate compute managers sharing the same manager ID, a UUID (version 4) is introduced. This allows a new compute manager to take over the responsibilities of one that was unable to cleanly shut down or that cannot communicate with the server due to network issues. Only a compute manager with the correct manager ID and UUID will be given the affirmative signal for compute up-scaling. The two identifiers are combined into a single identifier matching the pattern {manager_id}-{uuid4}.

Implementation

Introduction of autoscaling will require a new client, an additional model in the statestore, new API endpoints, and an abstract class for implementing new compute managers. ComputeServiceRegistration nodes in the statestore will also gain a field referencing the name of a compute manager.

A new client supporting compute managers

While compute managers overlap substantially with compute services, the AlchemiscaleComputeClient provides functionality that shouldn't be exposed to a manager, such as task claiming. We will define a new, minimal client that fulfills the requirements of a compute manager:

class ComputeManagerClient(AlchemiscaleBaseClient):

    def register(self) -> ComputeManagerID:
        ...

    def deregister(self) -> ComputeManagerID:
        ...

    def update_status(self) -> ComputeManagerID:
        ...

    def get_instruction(self) -> tuple[ComputeManagerInstruction, dict]:
        ...

where ComputeManagerID is a subclass of str:

class ComputeManagerID(str):

    @property
    def name(self) -> str:
        # the manager name is everything before the trailing UUID4 and its
        # separating hyphen; a canonical UUID4 string is always 36 characters,
        # so a simple split on "-" would instead truncate the UUID
        return self[:-37]

    @property
    def uuid(self) -> str:
        return self[-36:]
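
A fresh identifier can then be minted at manager startup, for example:

from uuid import uuid4

# the manager name may itself contain hyphens; the fixed-length UUID4
# suffix keeps parsing unambiguous
manager_id = ComputeManagerID(f"slurm-cluster-manager-{uuid4()}")
manager_id.name    # 'slurm-cluster-manager'
manager_id.uuid    # e.g. '0f4dd07e-3a9c-4a4e-8f0a-6c0f4cfa2d1b'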

The ComputeManagerInstruction is a StrEnum, reflecting the signals specified in the design section:

from enum import StrEnum

class ComputeManagerInstruction(StrEnum):
    OK = "OK"
    SKIP = "SKIP"
    SHUTDOWN = "SHUTDOWN"

Statestore representation

A compute manager must be registered with the alchemiscale statestore, similarly to how compute services are registered.

from datetime import datetime

class ComputeManagerRegistration:

    manager_id: str
    uuid: str
    last_status_update: datetime
    status: str
    detail: str
    saturation: float

where the status field is a StrEnum with the following definition:

class ComputeManagerStatus(StrEnum):
    OK = "OK"
    STALLED = "STALLED"
    ERRORED = "ERRORED" 

The meaning of these status codes is as follows:

OK
The compute manager is operating as intended.
STALLED
The compute manager attempted to create more compute services but was blocked by the compute platform. This status indicates that the compute manager is still functioning but not running optimally, potentially because its settings conflict with the compute platform environment.
ERRORED
The compute manager is not functioning correctly and needs administrator intervention on the platform in question. A message should be included with this status update and will be available through the detail field.

The saturation of a compute manager reflects the ratio of currently running compute services to the maximum number of compute services defined in the compute manager's settings.
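
Concretely, with illustrative counts:

running_services = 12          # services this manager currently has running
max_compute_services = 16      # from the compute manager's settings
saturation = running_services / max_compute_services    # 0.75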

API

computemanager_register
With a given ID, register a compute manager with the alchemiscale statestore. Successful registration returns a compute manager ID string.
computemanager_deregister
With a given ID, deregister the compute manager from the alchemiscale statestore. For improved visibility, if the compute manager had an ERRORED status, it will not be removed from the statestore.
computemanager_get_instruction
Request a compute manager instruction.
computemanager_update_status
Update the status of the compute manager along with platform saturation or error detail. Compute managers with the ERRORED status should shut down.
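
As a sketch, these endpoints might look like the following on a FastAPI-based compute API; the route paths, handler signatures, and bodies here are illustrative only:

from fastapi import APIRouter

router = APIRouter()

@router.post("/computemanager/register/{compute_manager_id}")
def computemanager_register(compute_manager_id: str) -> str:
    # reject the registration if a live manager with the same name already
    # exists; otherwise create the registration and echo back the ID
    ...

@router.post("/computemanager/deregister/{compute_manager_id}")
def computemanager_deregister(compute_manager_id: str) -> str:
    # remove the registration, unless its status is ERRORED, in which
    # case it is kept for visibility
    ...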

Compute manager abstract class

The implementation requires the definition of a new interface, ComputeManager, which has the following attributes and methods:

from abc import ABC, abstractmethod

class ComputeManager(ABC):
    compute_manager_id: ComputeManagerID
    status_update_interval: int
    sleep_interval: int
    client: ComputeManagerClient
    service_settings_template: bytes
    manager_settings: ComputeManagerSettings

    def _register(self):
        ...

    def _deregister(self):
        ...

    def request_instruction(self):
        ...

    @abstractmethod
    def start(self):
        ...

    @abstractmethod
    def stop(self):
        ...

    @abstractmethod
    def cycle(self):
        ...

    @abstractmethod
    def create_compute_service(self):
        ...

The _register method sends the compute manager's unique ID to the alchemiscale compute API. If a compute manager with that ID already exists in the statestore, None is returned. The previously registered compute manager must be deregistered, either by shutting down the process responsible for creating it or by forcing deregistration through an HTTP request with the unique ID, as implemented in the _deregister method. A successful deregistration through _deregister returns the unique compute manager ID; otherwise None is returned.
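
A concrete start implementation would therefore typically guard against a rejected registration; a minimal sketch, with illustrative error handling:

def start(self):
    # refuse to run if a manager with this ID is already registered
    if self._register() is None:
        raise RuntimeError(
            f"compute manager {self.compute_manager_id} is already registered"
        )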

Instructions are requested from the alchemiscale compute API with the request_instruction method. This method returns a ComputeManagerInstruction along with a data payload that should inform how the returned instruction is executed.

The protocol for creating compute services must be defined with the create_compute_service method. The implementation of this method is highly dependent on the nature of the compute platform and is out of scope for this document.

It is recommended to create compute service tasks in such a way that shutting down the compute manager does not kill the created compute services.
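
For example, on a SLURM platform, create_compute_service might submit each compute service as an independent batch job so that it outlives the manager process; the submission script name below is illustrative:

import subprocess

def create_compute_service(self) -> str:
    # the job is owned by the scheduler rather than the manager process,
    # so it survives a manager shutdown
    result = subprocess.run(
        ["sbatch", "--parsable", "compute-service.sh"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()    # the SLURM job ID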

Settings must be provided through the ComputeManagerSettings base model.

from pathlib import Path

from pydantic import BaseModel

# assuming a pydantic BaseModel, per "base model" above
class ComputeManagerSettings(BaseModel):
    api_url: str
    name: str
    status_update_interval: int
    logfile: Path | None
    max_compute_services: int

These settings are also dependent on the compute platform, and only a subset of the settings is strictly required.
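
For instance, a minimal settings object might look like the following, with illustrative values:

settings = ComputeManagerSettings(
    api_url="https://compute.example.org",
    name="slurm-cluster-manager",
    status_update_interval=60,
    logfile=None,
    max_compute_services=16,
)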

Changes to ComputeServiceRegistration and the compute service register method

ComputeServiceRegistration nodes will need to keep track of the compute manager responsible for their creation. To support this at the statestore level, we need only add another string field, which can optionally store a compute manager name.
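
A sketch of the change, with the new field name illustrative and all existing fields left untouched:

class ComputeServiceRegistration:
    # ...existing fields...

    # name of the compute manager that created this service; None for
    # services started manually by an administrator
    compute_manager_name: str | None = None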

Testing and validation

Since this feature defines the interface through which a specific compute manager interacts with the statestore and creates compute services, the majority of testing is left to those specific implementations.

Risks

Since this feature interacts directly with the allocation of high-performance computing resources, implementation errors and deployment misconfigurations can manifest as very real, potentially ballooning financial costs. Communicating these risks and putting appropriate guardrails in place is paramount.