This article is part of VentureBeat's special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here.
If you were to travel 60 years back in time to Stevenson, Alabama, you would find the Widows Creek Fossil Plant, a 1.6-gigawatt generating station with one of the tallest chimneys in the world. Today, a Google data center stands where the Widows Creek plant once did. Instead of carrying coal-fired power, the old facility's transmission lines now bring in renewable energy to run the company's online services.
That metamorphosis, from a carbon-burning facility to a digital factory, is symbolic of a global shift to digital infrastructure. And thanks to AI factories, the production of information is about to kick into high gear.
These data centers are decision-making engines that combine computing, networking and storage resources to turn information into insight. They are springing up in record time to satisfy the insatiable demand for artificial intelligence.
The infrastructure that supports AI presents many of the same challenges that defined industrial factories, from power to scalability and reliability, and it demands modern solutions to century-old problems.
The new workforce: Computing power
In the age of steam and steel, labor meant thousands of workers operating machines around the clock. In today's AI factories, output is determined by computing power. Training large AI models requires vast processing resources. According to Aparna Ramani, VP of engineering at Meta, the compute needed to train these models is growing by roughly a factor of four per year across the industry.
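To put that growth rate in perspective, here is a back-of-the-envelope sketch of what a sustained fourfold annual increase implies. The year-0 baseline of 1.0 unit of compute is a placeholder, not a measured figure.

```python
# Back-of-the-envelope: compound growth of training compute at ~4x per year,
# the industry-wide rate Ramani cites. The year-0 baseline of 1.0 is a
# placeholder unit, not a measured value.
GROWTH_PER_YEAR = 4.0

compute = 1.0
for year in range(1, 6):
    compute *= GROWTH_PER_YEAR
    print(f"Year {year}: {compute:,.0f}x year-0 training compute")
```

Five years of compounding at that rate is a roughly 1,000-fold increase, which is why supply chains and power budgets come under strain so quickly.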
That level of scaling is on its way to recreating some of the same bottlenecks that constrained the industrial world. There are supply chain constraints, for starters. GPUs – the engines of the AI revolution – come from a handful of manufacturers. They are incredibly complex. They are in high demand. And so it should come as no surprise that they are subject to price volatility.
In an effort to sidestep some of these supply constraints, big names like AWS, Google, IBM, Intel and Meta are designing their own custom silicon. These chips are optimized for power, performance and cost, tailoring them to the specific characteristics of each company's workloads.
This trend isn't just about hardware, though. There is also concern about how AI technologies will affect the labor market. Research published by Columbia Business School studied the investment management industry and found that the adoption of AI leads to a 5% decline in the labor share of income, mirroring the trends seen during the Industrial Revolution.
“AI is likely to be transformative for many, perhaps all, sectors of the economy,” said Professor Laura Veldkamp, one of the authors of the paper. “I am very optimistic that we will find useful employment for many people. But there will be transfer costs.”
Where do we find the energy to scale?
Cost and availability aside, the GPUs that serve as AI factory workers are particularly power hungry. When the xAI team brought its Colossus supercomputer cluster online in September 2024, it reportedly had access to seven to eight megawatts of power from the Tennessee Valley Authority. But the cluster's 100,000 H100 GPUs need much more than that. So xAI brought in VoltaGrid mobile generators to make up the difference in the short term. In early November, Memphis Light, Gas & Water reached a more permanent agreement with the TVA to deliver an additional 150 megawatts of capacity to xAI. But critics argue that the site's consumption strains the city's grid and contributes to poor air quality. And Elon Musk already has plans for another 100,000 H100 / H200 GPUs under the same roof.
According to McKinsey, the power needs of data centers are expected to grow to approximately three times their current capacity by the end of the decade. At the same time, the rate at which processors double their performance efficiency is slowing. That means performance per watt is still improving, but at a slower pace, and certainly not fast enough to keep up with the demand for computing horsepower.
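The mechanism is simple division: the power a data center needs scales with compute demand divided by performance per watt. Here is a minimal sketch, with both growth rates as illustrative assumptions rather than forecasts.

```python
# Minimal sketch: required power = compute demand / performance per watt.
# Both growth rates below are illustrative assumptions, not forecasts.
DEMAND_GROWTH = 4.0       # hypothetical yearly growth in compute demanded
EFFICIENCY_GROWTH = 1.3   # hypothetical yearly gain in performance per watt

power = 1.0  # relative power draw in year 0
for year in range(1, 4):
    power *= DEMAND_GROWTH / EFFICIENCY_GROWTH
    print(f"Year {year}: ~{power:.1f}x year-0 power draw")
```

Whenever demand compounds faster than efficiency, the quotient grows, and the grid has to absorb the difference.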
So, what will it take to match the feverish adoption of AI technologies? A report from Goldman Sachs suggests that US utilities will need to invest about $50 billion in new generation capacity just to support data centers. Analysts also predict that data center power consumption will drive about 3.3 billion cubic feet per day of new natural gas demand by 2030.
Scaling becomes more difficult as AI factories get bigger
Training the models that make AI factories accurate and efficient can take tens of thousands of GPUs, all working simultaneously, for months at a time. If a single GPU fails during training, the run must be halted, rolled back to a recent checkpoint and restarted. And as the complexity of AI factories increases, so does the likelihood of failure. Ramani addressed this issue during a presentation at AI Infra @ Scale.
“Stopping and restarting is very painful. But it's made worse by the fact that, as the number of GPUs increases, so does the likelihood of failure. And at some point, the number of failures could be so high that we lose too much time mitigating them and barely finish a training run.”
According to Ramani, Meta is working on near-term ways to detect failures sooner and to get training runs back up and running more quickly. Beyond that, research into asynchronous training may improve fault tolerance while increasing GPU utilization and spreading a single training run across multiple data centers.
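To make the failure math concrete, here is a minimal sketch of how cluster-level reliability degrades with scale, assuming every GPU fails independently at a constant rate. The per-GPU MTBF and checkpoint cost are hypothetical placeholders, and the checkpoint interval uses the standard Young/Daly approximation rather than anything Meta has described.

```python
import math

# A minimal sketch of why failures dominate at scale, assuming every GPU
# fails independently at a constant rate. The per-GPU MTBF and checkpoint
# write time below are hypothetical placeholders, not measured values.
GPU_MTBF_HOURS = 50_000.0   # assumed mean time between failures per GPU
CHECKPOINT_HOURS = 0.1      # assumed time to write one checkpoint

for n_gpus in (1_000, 10_000, 100_000):
    cluster_mtbf = GPU_MTBF_HOURS / n_gpus  # expected time to first failure anywhere
    # Young/Daly approximation for the checkpoint interval that minimizes
    # total time lost to checkpoint overhead plus post-failure rework.
    interval = math.sqrt(2 * CHECKPOINT_HOURS * cluster_mtbf)
    p_fail_day = 1 - math.exp(-24 / cluster_mtbf)
    print(f"{n_gpus:>7,} GPUs: cluster MTBF {cluster_mtbf:6.1f} h, "
          f"checkpoint every {interval:4.1f} h, "
          f"P(failure in 24 h) = {p_fail_day:.0%}")
```

On these assumptions, a 100,000-GPU cluster sees its first failure within the hour, which is why faster detection and restart, and eventually asynchronous training, matter so much.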
AI will forever change how we do business
Just as factories of the past relied on new technologies and organizational models to scale the production of goods, AI factories consume computing power, network infrastructure and storage to produce tokens – the smallest units of information used by an AI model.
“This AI factory is generating, creating, producing something of great value, a new commodity,” Nvidia CEO Jensen Huang said during his keynote at Computex 2024. “It's valuable in almost every industry. And that's why it's a new Industrial Revolution.”
McKinsey estimates that generative AI has the potential to contribute the equivalent of $2.6 trillion to $4.4 trillion in annual economic benefits across 63 use cases. In each application, whether the AI factory is hosted in the cloud, deployed at the edge or self-managed, the same infrastructure challenges must be overcome, just as they were for industrial factories. According to the same McKinsey report, achieving even a quarter of that growth by the end of the decade will require another 50 to 60 gigawatts of data center capacity, for starters.
The upshot is that this growth is set to change the IT industry for good. Huang explained that AI factories will enable the IT industry to generate intelligence for $100 trillion worth of industry. “This is going to be a manufacturing industry. Not a manufacturing industry of computers, but the use of computers in manufacturing. This has never happened before. It's quite an extraordinary thing.”