To ensure full control over data privacy and compliance, we transitioned from OpenAI’s GPT-4o to a self-hosted, distilled DeepSeek-R1 model deployed on Google Cloud Platform (GCP).
Our team built a robust Kubernetes cluster on a virtual machine equipped with two NVIDIA L4 GPUs and deployed the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B model from Hugging Face. This setup gives us enterprise-grade performance, cost efficiency, and complete data sovereignty.
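The deployment described above can be sketched as a Kubernetes manifest. This is an illustrative fragment only: the names `leadguru-inference` and `vllm-server` are hypothetical, and the image tag and flags are the ones vLLM's public OpenAI-compatible server image commonly uses, not necessarily our exact production values.

```yaml
# Sketch of the inference Deployment, assuming a GPU node with two NVIDIA L4s.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leadguru-inference   # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
            - "--tensor-parallel-size"   # split the model across both L4s
            - "2"
          resources:
            limits:
              nvidia.com/gpu: 2
          ports:
            - containerPort: 8000
```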
To streamline our deployment pipeline and maximize compatibility, we now run our inference server on the vLLM library, which exposes an OpenAI-compatible API. This lets us keep using existing OpenAI client SDKs, such as the official Python library, with minimal code changes, making the migration fast and efficient.
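Because the server speaks the OpenAI wire format, client code only needs its base URL changed. A minimal standard-library sketch of the request shape, assuming a hypothetical in-cluster host `vllm.internal:8000` (the `openai` Python SDK works the same way: just pass `base_url="http://vllm.internal:8000/v1"` when constructing the client):

```python
import json
import urllib.request

# Hypothetical in-cluster endpoint for the vLLM OpenAI-compatible server.
VLLM_BASE_URL = "http://vllm.internal:8000/v1"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request in the OpenAI wire format."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{VLLM_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# The request targets the same route an OpenAI client would use.
req = build_chat_request("Summarize this prospect's last message.")
```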
One of the standout advantages of the vLLM inference server is its built-in support for model-serving benchmarks.
These tools let us precisely measure key performance metrics such as token throughput and request throughput, while vLLM's continuous batching keeps request processing efficient under concurrent load.
With vLLM, we can fine-tune performance, ensure scalability under load, and deliver a faster, more responsive AI experience: a critical capability for high-demand applications like AI-powered social selling.
Mitrix developed Leadguru as an AI-powered sales agent designed to automate and optimize the entire lead generation process. The AI engages potential clients across multiple platforms, identifying high-value prospects and prioritizing outreach efforts. By integrating intelligent automation, the AI agent ensures efficient prospecting, enabling the Mitrix sales team to focus on closing deals and delivering tailored software development solutions.