
Reduce inference costs by deploying AI models on distributed GPU infrastructure.
This was a highly experimental project: we explored and implemented efficient model deployment strategies in novel ways, such as running inference on a distributed fleet of gaming PCs and their consumer GPUs to minimize computational expenses, an approach not typically used in our industry for enterprise TTS.
Deploying and managing AI models on distributed gaming GPU infrastructure, alongside other inference approaches, presented challenges around load balancing, resource allocation, and consistent performance.
Relying on standard cloud instances with GPUs was not cost-effective for our use case. We needed a more customized and optimized infrastructure to handle the unique demands of our AI model deployments.
We analyzed the cost and performance trade-offs of using various cloud-based TTS services and GPU instances to identify limitations and potential cost concerns.
To reduce tokenization and inference costs, we theorized that building and managing our own distributed GPU infrastructure would provide greater cost control and optimization potential compared to relying on cloud providers.
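The trade-off analysis ultimately comes down to a per-unit cost comparison between cloud GPU instances and self-hosted gaming GPUs. The sketch below shows one way to frame it; the hourly rates and throughput figures are hypothetical placeholders, not measurements from this project.

```python
# Hypothetical cost comparison: cloud GPU instance vs. self-hosted gaming GPU.
# All prices and throughput numbers are illustrative placeholders.

def cost_per_million_chars(hourly_cost_usd: float, chars_per_hour: float) -> float:
    """Cost to synthesize 1M characters of text at a given TTS throughput."""
    return hourly_cost_usd / chars_per_hour * 1_000_000

# Hypothetical on-demand cloud GPU instance.
cloud = cost_per_million_chars(hourly_cost_usd=2.50, chars_per_hour=1_500_000)

# Hypothetical gaming GPU node: amortized hardware plus power, per hour.
self_hosted = cost_per_million_chars(hourly_cost_usd=0.35, chars_per_hour=1_100_000)

print(f"cloud:       ${cloud:.2f} per 1M characters")
print(f"self-hosted: ${self_hosted:.2f} per 1M characters")
```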
- Design: We used an off-the-shelf solution to build a distributed GPU cluster from various consumer NVIDIA GPUs spread across a network of gaming PCs.
- Technical Work: We built and configured the infrastructure, set up workload distribution mechanisms, and implemented monitoring systems to track performance and resource utilization (see the sketches after this list). We deployed our custom TTS models on the cluster for inference.
- Observations: Initial tests showed significant reductions in tokenization and inference costs compared to cloud-based solutions. However, we faced challenges in ensuring efficient resource allocation and managing inter-GPU communication.
- Analysis & Conclusion: The prototype demonstrated the potential for substantial cost savings with our own distributed GPU infrastructure. However, we identified areas for improvement in cluster management, CI/CD, low latency for inference requests, and consistent performance.
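The workload distribution mechanism can be as simple as least-loaded dispatch over healthy worker nodes. Below is a minimal sketch, assuming each gaming PC exposes an HTTP service with a health endpoint; the node addresses and paths are hypothetical, not the project's actual tooling.

```python
# Hypothetical least-loaded dispatcher over gaming-GPU worker nodes.
# Node URLs and the /health endpoint are assumptions for illustration.
import threading
import urllib.request


class GpuNode:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.in_flight = 0        # requests currently assigned to this node
        self.healthy = True


class Dispatcher:
    def __init__(self, nodes):
        self.nodes = nodes
        self.lock = threading.Lock()

    def check_health(self, timeout: float = 2.0) -> None:
        """Mark nodes unhealthy if their health endpoint does not respond."""
        for node in self.nodes:
            try:
                with urllib.request.urlopen(f"{node.base_url}/health", timeout=timeout):
                    node.healthy = True
            except OSError:
                node.healthy = False

    def pick_node(self) -> GpuNode:
        """Least-connections selection over healthy nodes."""
        with self.lock:
            candidates = [n for n in self.nodes if n.healthy]
            if not candidates:
                raise RuntimeError("no healthy GPU nodes available")
            node = min(candidates, key=lambda n: n.in_flight)
            node.in_flight += 1
            return node

    def release(self, node: GpuNode) -> None:
        with self.lock:
            node.in_flight -= 1


# Usage: pick a node, send the TTS inference request to it, then release it.
dispatcher = Dispatcher([GpuNode("http://10.0.0.11:8000"),
                         GpuNode("http://10.0.0.12:8000")])
```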
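On the monitoring side, per-node GPU utilization can be scraped directly from the driver. A minimal sketch, assuming nvidia-smi is installed on each worker; the project's actual monitoring stack may differ.

```python
# Poll local GPU utilization and memory usage via nvidia-smi.
import subprocess


def gpu_stats():
    """Return (gpu_util_%, mem_used_MiB, mem_total_MiB) for each local GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(", "))
            for line in out.strip().splitlines()]


if __name__ == "__main__":
    for i, (util, used, total) in enumerate(gpu_stats()):
        print(f"GPU {i}: {util}% utilization, {used}/{total} MiB memory")
```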
