Sep 28 2022
Data Center

How to Prepare a Federal Network for AI

Artificial intelligence requires resources that have a large impact on existing IT infrastructure.

Artificial intelligence, no longer confined to small research projects and blue-sky thinking, has established a solid and valuable presence in government IT portfolios. Federal AI teams are working toward improved detection of stock market misconduct, better intelligence interpretation and more accurate weather predictions.

IT managers may think of new AI-based applications as just another app, but that could be dangerous. Using AI means using machine learning and neural networks, and these technologies can have a huge impact on both on-premises and cloud-based resources.

Let’s look at how AI can affect the main components in data centers: storage, network and compute.


Agencies Recognize the Need for Extra Storage Space

AI, machine learning and neural networks eat storage like crazy. Consider some of the big open-source data sets being used for ML: YouTube-8M, which has 350,000 hours of video; Google’s Open Images, with 9 million images; and ImageNet, with 14 million images. ML tools will stress both the capacity and performance of storage systems.
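
A rough way to see the performance side of this is to estimate how long one full read of a training set takes at a given sustained storage throughput. The sketch below is back-of-the-envelope arithmetic only; the 2 TB data set size and both throughput figures are hypothetical placeholders, not measurements of the data sets named above or of any storage product.

```python
def full_pass_seconds(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Seconds needed to read the whole data set once at a sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_bytes / throughput_bytes_per_s


TB = 10**12  # decimal terabyte
GB = 10**9   # decimal gigabyte

# Hypothetical example values (assumptions, not vendor or benchmark figures).
dataset = 2 * TB
for label, rate in [("0.5 GB/s", 0.5 * GB), ("5 GB/s", 5 * GB)]:
    minutes = full_pass_seconds(dataset, rate) / 60
    print(f"{label}: {minutes:.1f} minutes per full pass")
```

Because training jobs typically read the same data many times, a tenfold difference in throughput compounds quickly across a full training run.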

Data centers based on storage area networks with spinning disks have mostly given way to flash-based solid-state drive arrays, yet that may not be enough performance for demanding ML applications. IT managers looking for serious performance may wish to investigate the new Non-Volatile Memory Express–based storage arrays.

Fortunately, NVMe is now becoming mainstream enough that most popular storage vendors are on top of it, including NetApp, HPE, Dell and IBM.


NVMe can be attached directly to systems and delivers performance by connecting to the PCIe bus. This allows every CPU core to talk directly to the storage system and take advantage of NUMA memory, eliminating the bottleneck of a controller and the single queue that comes with a traditional storage array.

But attaching NVMe directly to a single server depends on the speed of that server, which may simply shift the bottleneck.  

IT managers also would be wise to investigate NVMe over Fabric SANs. These extend the speed of NVMe storage arrays across network fabrics, most commonly Ethernet and Fibre Channel. NVMe over Fabric delivers best when paired with a high-speed backbone, which brings us to the next part of our data center equation: the network.


Why Agencies Are Switching to Spine-and-Leaf Architecture

High-speed data center networking functions are the basis for everything else: intersystem links, storage and reliable connectivity to customers. That means not only high-speed but also low-latency and low-loss networks. To deliver the performance needed for AI, IT managers should think about changes to both architecture and hardware.

IT managers with traditional three-tier core/distribution/edge networks in their data centers should plan to replace all that gear — even without AI in the picture — with spine-and-leaf architecture. Changing to spine-and-leaf ensures every system in a computing pod is no more than two hops from every other system.
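
The two-hop property can be checked with a small model. The following is a minimal sketch, assuming an idealized fabric in which every leaf switch links to every spine switch; the switch names and counts are hypothetical, and hops are counted as switch-to-switch links between leaves.

```python
from collections import deque
from itertools import combinations


def build_spine_leaf(num_spines: int, num_leaves: int) -> dict:
    """Adjacency map in which every leaf switch links to every spine switch."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    graph = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph


def hops(graph: dict, src: str, dst: str) -> int:
    """Breadth-first search: number of links on the shortest path."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no path between switches")


def max_leaf_to_leaf_hops(graph: dict) -> int:
    """Worst-case link count between any pair of leaf switches."""
    leaves = sorted(n for n in graph if n.startswith("leaf"))
    return max(hops(graph, a, b) for a, b in combinations(leaves, 2))


# Hypothetical pod: 2 spines, 4 leaves.
fabric = build_spine_leaf(num_spines=2, num_leaves=4)
print(max_leaf_to_leaf_hops(fabric))
```

Adding more leaves to the model does not change the worst case, which is the point of the design: the pod grows wider, not deeper.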

Selecting 40-gigabit-per-second or 100Gbps links between leaf switches and the network spine helps reduce the impact of oversubscription when servers are commonly connected at 10Gbps to the network leaf switches.
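
Oversubscription on a leaf switch is the ratio of total server-facing bandwidth to total uplink bandwidth. The sketch below works through that arithmetic; the port counts (48 server ports, 4 uplinks) are hypothetical, chosen only to illustrate how faster uplinks lower the ratio.

```python
def oversubscription_ratio(server_ports: int, server_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by spine-facing bandwidth on one leaf."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("uplink count and speed must be positive")
    downstream = server_ports * server_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


# Hypothetical leaf switch: 48 servers at 10Gbps, 4 uplinks to the spine.
for uplink_speed in (40, 100):
    ratio = oversubscription_ratio(48, 10, 4, uplink_speed)
    print(f"{uplink_speed}Gbps uplinks: {ratio:.1f}:1")
```

A ratio of 1:1 means the uplinks can carry everything the servers can send at once; higher ratios rely on servers not all bursting at the same time.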

To really be on the cutting edge of performance, IT managers can aim for a 100Gbps fabric end to end, although some find that 10Gbps server connections occupy a price-performance sweet spot.


When a network supports high-speed NVMe over Fabric storage, IT managers have another option for notching up speeds to match the demands being made by ML models: remote direct memory access (RDMA) combined with lossless Ethernet.    

NVMe over Fabric can run over standard Ethernet, using Transmission Control Protocol to encapsulate traffic. However, NVMe over Fabric storage delivers even lower latency when server network interface controllers (NICs) are replaced with RDMA NICs (RNICs).

Offloading this work from the CPU and bypassing the OS kernel, network stack and disk drivers boosts performance well beyond what traditional architectures deliver.

The lossless Ethernet side of the equation is provided by modern high-performance network switches that can compensate for oversubscription, prioritize RDMA traffic and manage congestion end to end within the data center.

$1.7 billion

The amount requested in the Biden administration’s fiscal 2022 budget for artificial intelligence research and development

Source: The Networking and Information Technology Research and Development Program and the National Artificial Intelligence Initiative Office, Supplement to the President’s FY2022 Budget, December 2021

IT Managers Must Consider GPUs Carefully

With high-speed networking in place and high-speed storage systems ready to roll, IT managers are poised for the last part of the equation: computing power.

Start researching AI and ML, and you may discover that your old servers are not powerful enough; you may need to immediately invest in graphics processing units to handle the load.

In truth, moving to GPUs will give the best results in many cases, but not all the time. For IT managers with extensive experience in traditional servers who have large server farms already deployed, adding GPUs can be an expensive choice.

The key point here is parallelism: the requirement to run multiple streams at the same time, combined with memory use. GPUs are great at parallel operations, and mainstream ML tools are especially efficient and high-performing when they can run on these GPUs.


That said, all this performance comes at a cost, and GPU upgrades accomplish nothing unless your developers and operations teams are running ML workloads heavy enough to dim the lights during the processor-intensive parts of their models.

That’s the big difference between GPUs and storage and network upgrades, which deliver better performance for everything running in the data center, all the time.

IT managers should plan their investments carefully when it comes to GPUs and make sure that workloads are heavy enough to justify investing in this new technology.

It’s also worthwhile to look at the major cloud computing providers, including Amazon, Google and Microsoft. They already have the GPU hardware installed and ready to go, and will be happy to rent it to you through their cloud computing services.
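
One simple way to frame the buy-versus-rent decision is a break-even estimate. The sketch below is illustrative only: the purchase price, hourly cloud rate and three-year service life are hypothetical numbers, and it deliberately ignores power, cooling, staffing and any cloud discounts.

```python
def break_even_hours(purchase_cost: float, cloud_rate_per_hour: float) -> float:
    """GPU-hours of use at which buying costs the same as renting."""
    if cloud_rate_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return purchase_cost / cloud_rate_per_hour


def required_utilization(purchase_cost: float, cloud_rate_per_hour: float,
                         service_life_hours: float) -> float:
    """Fraction of its service life the hardware must be busy to break even."""
    return break_even_hours(purchase_cost, cloud_rate_per_hour) / service_life_hours


# Hypothetical inputs (assumptions, not real prices).
PURCHASE_COST = 30_000.0           # dollars for an on-premises GPU server
CLOUD_RATE = 3.0                   # dollars per hour for a comparable instance
SERVICE_LIFE_HOURS = 3 * 365 * 24  # three years of wall-clock time

hours = break_even_hours(PURCHASE_COST, CLOUD_RATE)
share = required_utilization(PURCHASE_COST, CLOUD_RATE, SERVICE_LIFE_HOURS)
print(f"Break-even after {hours:.0f} GPU-hours ({share:.0%} utilization)")
```

If expected utilization falls well below the break-even share, renting is likely the cheaper path under these assumptions; if workloads keep the hardware busy most of the time, buying starts to look better.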

Illustration by John Lanuza