Senior HPC AI Cluster Engineer

Germany
Job Type: Permanent
Work Pattern: Full-time
Work Location: Remote
Seniority: Senior
Education: Degree
Posted: 22 May 2026

NVIDIA is looking for an experienced HPC-AI Engineer to join the Networking Clusters Solutions Infrastructure team. We are focused on building supercomputers and AI clusters based on groundbreaking technologies. We are looking for an outstanding engineer to be a key player working with the most exciting computing hardware and software and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you will be doing:

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting

  • Manage Linux job/workload schedulers and orchestration tools

  • Develop and maintain continuous integration and delivery pipelines

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources

  • Deploy monitoring solutions for the servers, network and storage

  • Troubleshoot from the bottom up, from bare metal through the operating system and software stack to the application level

  • As a technical resource, develop, redefine, and document standard methodologies to share with internal teams

  • Support Research & Development activities and engage in POCs/POVs for future improvements

What we need to see:

  • A degree in Computer Science, Engineering, or a related field and 8+ years of experience

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software

  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)

  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS

  • Experience with multiple storage solutions such as Lustre, GPFS, Weka.io. Familiarity with newer and emerging storage technologies.

  • Python programming and bash scripting experience.

  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef

  • Deep knowledge of networking protocols such as InfiniBand and Ethernet

  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)

  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Ways to stand out from the crowd:

  • Knowledge of CPU and/or GPU architecture

  • Knowledge of Kubernetes and container-related microservice technologies

  • Experience with GPU-focused hardware/software (DGX, CUDA)

  • Experience with RDMA (InfiniBand or RoCE) fabrics

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
