Job Description
We are building the foundation for the next decade of intelligent computing. As a Lead AI Infrastructure Engineer at Nexus Future Systems, you will architect the high-performance, scalable infrastructure that powers our proprietary generative AI models. You will bridge the gap between cutting-edge machine learning research and robust, production-grade engineering.
In this pivotal role, you will lead a team of engineers in deploying containerized microservices, optimizing large-scale data pipelines, and ensuring our systems are resilient, secure, and ready for the demands of 2026 and beyond. If you thrive in a fast-paced, innovative environment and want to define the future of tech, we want you on our team.
Responsibilities
- Architect & Deploy: Design and implement scalable, cloud-native infrastructure (Kubernetes, AWS, GCP) for high-velocity AI workloads.
- Performance Optimization: Analyze and optimize ML training and inference pipelines to reduce latency and improve cost-efficiency.
- Collaboration: Partner with Data Scientists and ML Engineers to integrate models seamlessly into production environments.
- Security & Compliance: Enforce enterprise-grade security protocols and data governance standards across all infrastructure layers.
- Team Leadership: Mentor junior engineers, conduct code reviews, and drive technical best practices within the engineering organization.
- Disaster Recovery: Develop and maintain robust disaster recovery plans that support a 99.99% uptime target (roughly 52 minutes of downtime per year).
Qualifications
- Experience: 8+ years in software engineering, including at least 3 years in a lead role focused on infrastructure.
- Core Tech: Deep expertise in Python, Go, or Rust, and experience with containerization and orchestration tools (Docker, Kubernetes).
- Cloud Mastery: Proven track record of architecting solutions on AWS, GCP, or Azure with a focus on serverless or managed services.
- ML Knowledge: Strong understanding of machine learning operations (MLOps) and large-scale data processing and orchestration tools (Spark, Kafka, Airflow).
- Problem Solving: Exceptional ability to troubleshoot complex distributed-systems issues in real time.