According to aws.amazon.com, Amazon SageMaker AI has introduced a capacity-aware instance pool feature that automatically handles GPU capacity constraints for inference endpoints. The new capability allows users to define a prioritized list of instance types, with SageMaker AI automatically working through the list when capacity is constrained.
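The prioritized-list idea can be illustrated with a small, self-contained sketch. This is not the SageMaker AI API: the function `pick_instance_type`, the `has_capacity` callback, and the capacity snapshot are hypothetical names invented for illustration, and the instance types are only example values. It shows nothing more than the selection logic of walking a list in priority order and taking the first type that currently has capacity.

```python
from typing import Callable, Optional, Sequence


def pick_instance_type(
    priority_list: Sequence[str],
    has_capacity: Callable[[str], bool],
) -> Optional[str]:
    """Return the first instance type, in priority order, that has capacity.

    Hypothetical helper: it models the fallback behaviour described above,
    not any real SageMaker AI call.
    """
    for instance_type in priority_list:
        if has_capacity(instance_type):
            return instance_type
    return None  # every type in the list is capacity-constrained


# Assumed capacity snapshot: the most preferred type is unavailable.
available = {"ml.g5.2xlarge", "ml.g4dn.2xlarge"}
priority = ["ml.p4d.24xlarge", "ml.g5.2xlarge", "ml.g4dn.2xlarge"]

chosen = pick_instance_type(priority, lambda t: t in available)
print(chosen)  # ml.g5.2xlarge
```

In this sketch the caller never has to edit the configuration and retry; the fallback order is declared once, up front.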
Previously, according to aws.amazon.com, building a real-time inference endpoint on SageMaker AI required committing to a single instance type at creation time. When that instance type had insufficient capacity, the endpoint would fail to reach a running state, forcing users to manually update their configuration and retry with different instance types until provisioning succeeded.
The new feature addresses capacity issues at multiple stages of the endpoint lifecycle. According to aws.amazon.com, it handles endpoint creation failures, autoscaling limitations during scale-out events when traffic increases, and scale-down operations. The capability is available for Single Model Endpoints, Inference Component-based endpoints, and Asynchronous Inference endpoints.
Previously, according to aws.amazon.com, when a scale-out event was triggered and the specified instance type had insufficient capacity, the autoscaler retried the same type indefinitely while traffic continued to increase. The capacity-aware instance pool feature eliminates this issue by automatically provisioning on available infrastructure without manual intervention.
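The contrast between retrying a single type and drawing on a pool can be sketched as a toy simulation. Everything here is an assumption for illustration: the `scale_out` function, the free-capacity numbers, and in particular the greedy policy that spreads new instances across several types. The source does not describe how SageMaker AI actually distributes instances during scale-out.

```python
from typing import Dict, List


def scale_out(needed: int, priority: List[str], free: Dict[str, int]) -> Dict[str, int]:
    """Place `needed` instances greedily, following the priority order.

    Hypothetical model: `free` maps an instance type to the number of
    instances that can currently be provisioned on it.
    """
    placed: Dict[str, int] = {}
    for instance_type in priority:
        if needed == 0:
            break
        take = min(needed, free.get(instance_type, 0))
        if take > 0:
            placed[instance_type] = take
            needed -= take
    return placed


# Assumed capacity: the preferred type is exhausted, the fallback is not.
free_capacity = {"ml.g5.2xlarge": 0, "ml.g4dn.2xlarge": 5}

# Single-type configuration: nothing can be placed, however often it retries.
single = scale_out(3, ["ml.g5.2xlarge"], free_capacity)

# Pool configuration: the same request is satisfied by the fallback type.
pooled = scale_out(3, ["ml.g5.2xlarge", "ml.g4dn.2xlarge"], free_capacity)

print(single)  # {}
print(pooled)  # {'ml.g4dn.2xlarge': 3}
```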