Requirements for ML systems

Reliability

The system should continue to perform the correct function at the desired level of performance, even in the face of adversity (e.g., faults, human errors).

Correctness can be difficult to determine for ML systems, especially when ground truth labels are not available. As a result, ML systems can fail silently: the system keeps returning predictions, and end users may not recognise that it has failed.
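One way to notice a silent failure without labels is to compare the distribution of live prediction scores against a reference window. The sketch below uses the Population Stability Index (PSI) for that comparison. The function names, the bin edges, and the 0.25 alert threshold are illustrative assumptions (0.25 is a common rule of thumb, not a value from these notes), and PSI is only one of several possible drift measures.

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values falling in each bin, floored at eps to avoid log(0)."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Locate the bin; values outside the edges are clamped to the end bins.
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference: Sequence[float],
                               live: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (live - ref) * ln(live / ref)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical score windows; in practice these would come from logged predictions.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
live_scores = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]

psi = population_stability_index(reference_scores, live_scores, edges)
if psi > 0.25:  # assumed rule-of-thumb threshold
    print("Prediction distribution has shifted; investigate before trusting outputs.")
```

A check like this does not prove the model is wrong. It only flags that the outputs no longer look like they did when the model was last known to be behaving well.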

Scalability

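An ML system can grow in multiple ways: complexity, traffic volume, model count, etc. Whichever way the system grows, there should be a reasonable way to handle that growth, for example by scaling resources up and down with load rather than provisioning permanently for the peak.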

Maintainability

It is important to structure workloads and set up infrastructure so that different contributors (e.g., ML engineers, DevOps engineers, subject matter experts) can work with tools they are comfortable with. Code should be documented; code, data, and artifacts should be versioned; and results should be reproducible, so that other contributors can build on earlier work.
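As a small illustration of versioning and reproducibility, the sketch below records a manifest for a training run: a hash of the configuration, a hash of the training data, and the code version. The function names and manifest fields are hypothetical, chosen for this example only. Real projects often rely on dedicated experiment-tracking or data-versioning tools instead.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Hash a config deterministically, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(config: Dict[str, Any], data_bytes: bytes,
                   code_version: str) -> Dict[str, Any]:
    """Collect what is needed to identify and re-run a training job."""
    return {
        "config": config,
        "config_hash": config_fingerprint(config),
        "data_hash": hashlib.sha256(data_bytes).hexdigest(),
        "code_version": code_version,  # e.g. a commit hash supplied by the caller
    }


# Hypothetical values for illustration.
manifest = build_manifest(
    config={"learning_rate": 0.01, "epochs": 5},
    data_bytes=b"feature_a,feature_b,label\n1,2,0\n",
    code_version="abc123",
)
print(json.dumps(manifest, indent=2))
```

Storing a manifest like this next to each trained model lets a later contributor tell whether two models were trained from the same inputs.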

Adaptability

The ML system should be able both to discover where performance can be improved and to accept updates without service interruption, so that it can adapt to shifting data distributions and business requirements.
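Updating without service interruption can be sketched as swapping the model behind a running predictor. The example below is a minimal in-process version under stated assumptions: `ModelServer`, `swap`, and the smoke-test input are hypothetical names, and a model is any callable from a feature list to a number. Production systems typically achieve the same effect with rolling or blue-green deployments across processes.

```python
import threading
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


class ModelServer:
    """Serves predictions while allowing the model to be replaced at runtime."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, features: Sequence[float]) -> Tuple[float, str]:
        # Read model and version together so they always match.
        with self._lock:
            model, version = self._model, self._version
        # Run inference outside the lock so requests do not block a swap.
        return model(features), version

    def swap(self, new_model: Model, new_version: str,
             smoke_input: Sequence[float]) -> None:
        # Exercise the new model first; if it raises, the old one keeps serving.
        new_model(smoke_input)
        with self._lock:
            self._model, self._version = new_model, new_version


# Hypothetical models for illustration.
server = ModelServer(lambda f: sum(f), "v1")
print(server.predict([1.0, 2.0]))
server.swap(lambda f: 2 * sum(f), "v2", smoke_input=[0.0])
print(server.predict([1.0, 2.0]))
```

Returning the version with each prediction also makes it possible to attribute a later drop in quality to a specific update.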