How to improve AI energy efficiency with open-source tools: Q&A with Mosharaf Chowdhury

‘Any company could use our tools to measure and optimize their AI models and reduce AI energy use’

The rapid rise of AI has led to concerns about its energy consumption, but a College of Engineering professor and his team have been working on solutions since before “LLM” made its way to mainstream dictionaries. 

Mosharaf Chowdhury, associate professor of computer science and engineering, saw the issue on the horizon in 2020. He noticed that the small AI models of the day were growing rapidly — and so were their data sets and energy use. Over the past five years, Chowdhury’s lab has become a leader in measuring and managing AI energy use for the real world as well as the research space. 

Members of the ML Energy Initiative, from right: Professor Mosharaf Chowdhury and his graduate students Jae-Won Chung, Jeff Ma and Ruofan Wu, who have been assessing the amount of power used by open-source generative AI models. (Photo by Marcin Szczepanski, College of Engineering)

Through the Zeus project, his team has developed open-source tools that measure AI energy use and optimize AI training, with demonstrated reductions in training energy of up to 35% and expected savings of up to 50%. In this Q&A, Chowdhury explains how these tools work and where they could have the most benefit.

What do we know about AI energy use?

U-M, Los Alamos information
  • U-M and Los Alamos National Laboratory are collaborating on a new supercomputing and AI research center to expand computational capacity and accelerate high-impact research for the public good.
    Visit the project page on the Record site for information.

One thing we know is that a lot of people are concerned about it, which is fair. However, many who worry can be overly pessimistic, and those who want more data centers are often overly optimistic. The reality is not black and white, and there’s a lot we don’t know. That’s mainly because the companies deploying the large public-facing models don’t provide that information and we don’t have access to their proprietary AI models. 

Last year, MIT Technology Review used some of our tools to get a sense of it. One of the aspects the article highlighted is that the majority of AI energy use comes from what we call “inference,” which is the process AI uses to respond to queries. Although individual queries don’t use a lot of energy on their own, the total adds up due to the sheer volume. More than half of the people surveyed in a recent Brookings Institution study said they regularly use AI for personal use. The public-facing models do their inference in commercial data centers. In 2024, data centers consumed 4% of US electricity, and that share is expected to more than double by 2030.

Tell us about your tool for measuring AI energy use and AI energy efficiency.

We developed software that runs alongside AI models and takes its measurements directly from the hardware counters. Hardware counters measure events at regular intervals, for example every 100 milliseconds. We measure how much power each GPU draws during each interval, then we put it all together.
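As an illustration of the measurement idea described above, here is a minimal Python sketch. It is not the Zeus code: the function name and the power readings are hypothetical, and it assumes each reading is the average power over one fixed-length interval. Energy for a GPU is then the sum of power times interval length, and the job total is the sum over GPUs.

```python
def energy_joules(power_samples_watts, interval_seconds):
    """Approximate energy as the sum of (power x interval length).

    Assumes each sample is the average power in watts over one
    fixed-length sampling interval.
    """
    return sum(p * interval_seconds for p in power_samples_watts)


# Hypothetical readings, one list per GPU, sampled every 100 milliseconds.
INTERVAL_SECONDS = 0.1
samples_by_gpu = {
    0: [250.0, 300.0, 280.0, 310.0, 260.0],
    1: [200.0, 220.0, 240.0, 210.0, 230.0],
}

# Energy per GPU, then the total for the whole job.
energy_by_gpu = {
    gpu: energy_joules(samples, INTERVAL_SECONDS)
    for gpu, samples in samples_by_gpu.items()
}
total_energy = sum(energy_by_gpu.values())

print(energy_by_gpu, total_energy)
```

A real tool would read the samples from the GPU driver rather than from a hard-coded list; the summation step is the part this sketch is meant to show.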

It’s not an estimate. It’s not an ‘envelope’ calculation, like many of the estimates used before our work. An envelope calculation multiplies the maximum power draw per GPU by the number of GPUs. That only tells you how bad things can be in the worst case.
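To make the contrast concrete, the sketch below computes an envelope figure and a measured figure for the same hypothetical two-GPU run. All numbers, including the 400-watt maximum, are made up for illustration, and the function names are not from any real tool. The envelope multiplies maximum power by the number of GPUs (and by the run time, to get energy), so it can only bound the measured value from above.

```python
def envelope_energy_joules(max_power_watts, num_gpus, duration_seconds):
    """Worst-case estimate: every GPU draws its maximum power the whole time."""
    return max_power_watts * num_gpus * duration_seconds


def measured_energy_joules(samples_per_gpu, interval_seconds):
    """Sum of sampled power x interval length across all GPUs."""
    return sum(
        power * interval_seconds
        for samples in samples_per_gpu
        for power in samples
    )


# Hypothetical values for illustration only.
MAX_POWER_WATTS = 400.0
INTERVAL_SECONDS = 0.1
samples_per_gpu = [
    [250.0, 300.0, 280.0, 310.0],  # GPU 0 power readings in watts
    [200.0, 220.0, 240.0, 210.0],  # GPU 1 power readings in watts
]
duration = len(samples_per_gpu[0]) * INTERVAL_SECONDS

envelope = envelope_energy_joules(MAX_POWER_WATTS, len(samples_per_gpu), duration)
measured = measured_energy_joules(samples_per_gpu, INTERVAL_SECONDS)

print(f"envelope: {envelope:.1f} J, measured: {measured:.1f} J")
```

With these made-up readings the envelope overstates the measured energy by a wide margin, which is the gap the direct measurement approach is meant to close.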

We’ve used our software to measure almost all the top-tier open-source models for a wide range of tasks, including large language models for chatting and coding, and generative AI models for image and video generation. These are open-source models, not the closed-source names the public is most familiar with, but their size and requirements are on par with the state of the art. We’ve put together the ML Energy Leaderboard site, where you can see how much energy they use to respond to various prompts.

You’ve also figured out how to make AI model training consume up to 50% less energy. How does that work and why is it important?

All AI systems have to be trained before they’re used. While inference takes up the lion’s share of AI’s energy footprint, optimizing training could have an outsized impact because a training run is the largest single draw.

During training, large AI models marshal thousands of GPUs to work together. But they’re not all going to finish their workloads at the same time. We recognized that the nodes on track to finish last determine when the overall task is done, so there’s no need for other nodes to expend extra energy to finish more quickly. Our software acts as a conductor and guides individual GPUs to proceed at the right pace, neither too fast nor too slow.
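The pacing idea can be illustrated with a toy model. This is not the team’s algorithm: it assumes a textbook simplification in which a GPU’s dynamic power scales with the cube of its clock speed, that static power is drawn whether the GPU is working or waiting, and that finish times and wattages are the hypothetical values shown. Each GPU is slowed just enough to finish when the slowest one does.

```python
def pace_factors(finish_times):
    """Speed factor (0-1] for each GPU so that all finish at the same time
    as the slowest one. finish_times are full-speed durations in seconds."""
    deadline = max(finish_times)
    return [t / deadline for t in finish_times]


def energy_full_speed(t, deadline, static_w, dynamic_w):
    """Run at full speed for t seconds, then wait (static power only)."""
    return (static_w + dynamic_w) * t + static_w * (deadline - t)


def energy_paced(t, deadline, static_w, dynamic_w):
    """Run at reduced speed for the whole deadline.
    ASSUMPTION: dynamic power scales with speed cubed (toy model)."""
    speed = t / deadline
    return (static_w + dynamic_w * speed ** 3) * deadline


# Hypothetical full-speed finish times (seconds) and power figures (watts).
finish_times = [6.0, 8.0, 10.0]
STATIC_W, DYNAMIC_W = 100.0, 200.0
deadline = max(finish_times)

baseline = sum(energy_full_speed(t, deadline, STATIC_W, DYNAMIC_W) for t in finish_times)
paced = sum(energy_paced(t, deadline, STATIC_W, DYNAMIC_W) for t in finish_times)
savings = 1 - paced / baseline

print(pace_factors(finish_times))
print(f"baseline {baseline:.0f} J, paced {paced:.0f} J, savings {savings:.1%}")
```

In this model the job finishes at the same moment either way, because the slowest GPU still runs at full speed; only the GPUs that would otherwise wait are slowed down.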

For very large model training, across several research projects, we’ve shown reductions of up to 35% in training energy and expect up to 50% savings using existing technologies.

How broadly are your tools being used and by whom?

Computers at the Michigan Academic Computing Center, a 2-megawatt data center where the ML Energy Initiative team has been assessing the amount of power used by open-source generative AI models. (Photo by Marcin Szczepanski, Michigan Engineering)

They’ve been available as open source for years, so we can’t always track everyone who uses them.

We’ve been collaborating with Nvidia for almost a year now on a variety of AI energy optimization projects. They’ve also validated some of our software internally. Google also reached out to us for feedback on their inference energy measurement work. These are only two prominent examples, but we’ve been fortunate to receive much positive feedback from the industry. 

Researchers here at the University of Michigan and elsewhere are measuring AI energy use and improving AI energy efficiency in their respective fields with our Zeus tools. These include applications in natural language processing, databases, and computer architecture.

Everyone should be using them! Any company could use our tools to measure and optimize their AI models and improve AI energy efficiency.

How would U-M’s high-performance computing center advance your work?

We need access to large-scale computing resources to run the AI models we’re measuring or optimizing. 

Right now, our research is extremely constrained because we lack computing resources. We rely on some outdated servers from 2020 for development, and for the actual measurements for the leaderboard, we have to scramble for intermittent access to cloud servers. Other than that, the Michigan Institute for Computational Discovery and Engineering (MICDE) has made a couple of machines available for us to use at times.

To expand our optimization work and really maximize what we can do to reduce AI energy use, we need to be able to run things at a very large scale. Access to a data center or a high-performance computing center would enable that.

Given the growth of AI, I think it’s important for top-tier universities like U-M to have access to a facility like this. Many of my colleagues are using AI as a tool to explore great ideas and that requires substantial computational resources. Many top computer scientists are opting to work in industry instead of academia because only in industry can they get access to enough computational resources for the scale of research they want to do.

I think it’s important that universities have the ability to contribute. It shouldn’t just be private companies. When only industry has the resources to do large-scale AI research, the public loses access to independent, transparent work. Universities are where open-source tools like ours come from, and where the next generation of researchers learn to build AI responsibly.
