The rapid advancement of artificial intelligence (AI) is transforming industries, and data centres are among the most visibly affected. Conventional data centres, originally built for web hosting, database administration, and general enterprise computing, are being reconfigured to accommodate the unusual processing requirements of AI workloads. AI is no longer just another application running inside the data centre; its demands are reshaping the facility itself, pushing designs away from the traditional general-purpose model and towards AI-centric infrastructure.
Transitioning From Traditional to AI-Capable Data Centres
Traditional data centres prioritise stability, scalability, and cost-effectiveness for workloads such as web hosting, databases, and corporate applications, which run comfortably on uniform hardware, modest power budgets, and standardised air cooling. AI workloads such as deep learning model training and large-scale inference, by contrast, demand far greater processing power, faster data access, and infrastructure able to deliver and dissipate much more energy per rack.
Compute Architecture: Transition from CPUs to Accelerators
Traditional data centres rely heavily on central processing units (CPUs), which execute a broad range of tasks largely sequentially. Typical deployments use Intel Xeon or AMD EPYC processors with two to four CPU sockets per server, virtualisation-optimised configurations, and a focus on balanced performance across diverse workloads.
AI infrastructure, by contrast, has shifted towards accelerators (a brief distributed-training sketch follows the list below):
- GPUs/TPUs/DPUs: Accelerators such as NVIDIA GPUs and Google’s Tensor Processing Units (TPUs) meet AI’s parallel processing demands by performing matrix operations far more quickly, while Data Processing Units (DPUs) offload data-intensive networking and storage tasks from the CPU and GPU.
- Scalability: AI training requires clusters of accelerators. For example, NVIDIA’s DGX systems integrate multiple GPUs for distributed learning.
- Specialised chips: Custom AI chips (e.g., AWS Inferentia and Cerebras WSE-3) optimise performance per watt for specific tasks, like inferencing.
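To make the scalability point concrete, the following sketch shows the basic shape of multi-GPU, data-parallel training in PyTorch. It is a minimal illustration under stated assumptions, not a production recipe: the model, batch shape, and training loop are placeholders, and it assumes PyTorch with CUDA and the NCCL backend, launched via torchrun.

```python
# Minimal data-parallel training sketch (assumes PyTorch + CUDA + NCCL).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real job would build a large network here
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                 # placeholder training loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()       # dummy objective
        optimiser.zero_grad()
        loss.backward()                     # DDP all-reduces gradients across GPUs here
        optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a multi-GPU node the same script scales simply by changing --nproc_per_node, which is exactly the property that makes accelerator clusters attractive for training.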
Storage Systems: Speed and Scale as Driving Factors
Traditional data centres rely on SAN (storage area network) and NAS (network-attached storage) systems, used individually or in combination, with HDDs dominating alongside some SSD tiers, RAID configurations for reliability, storage networks separated from compute networks, and throughput in the range of 10–50 GB/s.
AI workloads, by contrast, can only flourish when storage delivers performance at scale.
AI-Driven Shift:
- High-Speed Storage: Solid-State Drives (SSDs) and NVMe reduce latency for accessing massive datasets.
- Distributed Architectures: Object storage (e.g., AWS S3) and data lakes manage unstructured data such as images and videos (see the streaming sketch after this list).
- Data Proximity: Storage-computing convergence (e.g., computational storage) minimises data movement and accelerates preprocessing.
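As a concrete illustration of the object-storage pattern, the sketch below streams training files directly from an S3 bucket instead of staging them on local disk first. It is a minimal example built on assumptions: the bucket name and prefix are hypothetical placeholders, and it presumes the boto3 SDK with AWS credentials already configured.

```python
# Stream training data straight from object storage (assumes boto3 + AWS credentials).
# The bucket and prefix below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-training-data"   # hypothetical bucket
PREFIX = "images/train/"           # hypothetical prefix

def iter_keys(bucket, prefix):
    """Yield every object key under a prefix, handling pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

for key in iter_keys(BUCKET, PREFIX):
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
    data = body.read()             # raw bytes; decode and preprocess as needed
    # ... feed `data` into the preprocessing / training pipeline ...
```

In practice a training job would wrap this in a parallel data loader and cache hot objects on local NVMe, which is where the high-speed storage and data-proximity points above come together.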
Networking Infrastructure: Low Latency, High Throughput
In traditional data centres, Ethernet runs at 10–40 Gbps, traffic is dominated by north-south (client-server) patterns, networks are built as multi-tiered topologies, and oversubscription ratios of 4:1 or more are common.
AI-Driven Shift:
- High-Speed Interconnects: InfiniBand and 400 Gbps Ethernet enable microsecond-level latency for GPU-to-GPU communication.
- RDMA: Remote Direct Memory Access moves data between nodes without involving the CPU, which is critical for distributed training (see the collective-communication sketch after this list).
- Scalable Topologies: Fat-tree or dragonfly network designs prevent bottlenecks in multi-node clusters.
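To show why interconnect bandwidth and latency matter so much, the sketch below times the collective operation at the heart of distributed training, an all-reduce, using PyTorch's NCCL backend (NCCL uses RDMA transports such as InfiniBand transparently when they are present). The payload size and launch method are placeholder assumptions.

```python
# Time a gradient-sized all-reduce across GPUs (assumes PyTorch + NCCL).
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 1 GiB of float32 values standing in for a model's gradients
payload = torch.ones(256 * 1024 * 1024, device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(payload, op=dist.ReduceOp.SUM)   # sum across all ranks
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    gib = payload.numel() * payload.element_size() / 2**30
    print(f"all-reduce of {gib:.1f} GiB took {elapsed * 1000:.1f} ms")

dist.destroy_process_group()
```

Every optimiser step in data-parallel training pays roughly this cost, so the difference between an oversubscribed Ethernet fabric and a non-blocking InfiniBand fat tree shows up directly in training throughput.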
Cooling Solutions: Tackling Thermal Density
Traditional air cooling, using CRAC units, typically handles approximately 10 to 20 kW per rack.
AI-Driven Shift:
- Liquid Cooling: Direct-to-chip or immersion cooling manages the 50–100 kW racks produced by dense GPU deployments.
- Rear-Door Heat Exchangers: Augment air cooling for moderate-density AI workloads.
- Sustainability: Renewable-powered cooling (e.g., Google’s seawater cooling) aligns with ESG goals.
Power and Energy Management
Traditional: 5–10 kW per rack; focus on uptime via diesel generators.
AI-Driven Shift:
- High-Density Power: 30–50 kW per rack requires advanced PDUs and grid redundancy.
- Efficiency Metrics: Lowering PUE (Power Usage Effectiveness) via modular UPS systems and renewable integration (a quick PUE calculation follows this list).
- Edge Synergy: Offloading inference to edge nodes reduces the central data centre’s load.
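For context, PUE is simply the total energy a facility draws divided by the energy that reaches the IT equipment, so a value near 1.0 means almost no overhead. The snippet below works through the arithmetic with illustrative numbers; the figures are hypothetical, not measurements from any particular facility.

```python
# PUE = total facility power / IT equipment power (illustrative numbers only)
it_load_kw = 1_000        # servers, storage, and network gear
cooling_kw = 350          # chillers, CRAC/CRAH units, pumps
other_overhead_kw = 80    # UPS losses, PDUs, lighting

total_facility_kw = it_load_kw + cooling_kw + other_overhead_kw
pue = total_facility_kw / it_load_kw
print(f"PUE = {pue:.2f}")  # 1.43 here; lower is better, and leading facilities report close to 1.1
```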
Software and Orchestration
Traditional: virtualisation (VMware) and monolithic apps.
AI-Driven Shift:
- Kubernetes & ML Orchestration: Kubeflow, Ray, and Slurm manage distributed training.
- AI Frameworks: TensorFlow, PyTorch, and Hugging Face libraries optimise GPU utilisation.
- MLOps Pipelines: Tools like MLflow automate model deployment and monitoring (a short tracking sketch follows this list).
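As a small example of the MLOps layer, the sketch below records parameters and metrics for a training run with MLflow's tracking API. It is a minimal illustration: the experiment name and logged values are placeholders, and it assumes the mlflow package with its default local tracking store.

```python
# Minimal MLflow tracking sketch (assumes `pip install mlflow`; writes to the
# default local ./mlruns store unless a tracking server is configured).
import mlflow

mlflow.set_experiment("demo-training-run")   # hypothetical experiment name

with mlflow.start_run():
    # Hyperparameters for this run
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 32)

    # In a real pipeline these values would come from the training loop
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):
        mlflow.log_metric("loss", loss, step=epoch)
```

The same run log can then drive deployment gates and monitoring dashboards, which is the automation the MLOps bullet refers to.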
Conclusion
The shift from traditional to AI-capable data centres is a significant evolution in data centre design, reimagining every aspect to support the unique requirements of AI workloads. As AI advances, infrastructure will be optimised for specific AI workloads, such as training versus inference, computer vision versus natural language processing, and other specialised applications. The future data centre will likely be a highly heterogeneous environment, with zones and systems tailored to the specific requirements of the workloads they support. Organisations embarking on significant AI initiatives must evaluate their existing infrastructure against these new requirements and develop comprehensive strategies for building or accessing specialised facilities. Success will give them a competitive advantage in the AI-driven economy.