How MassMutual and Mass General Brigham turned AI pilot sprawl into production results

Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.

At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.

“We’re always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”

Defining metrics, establishing strong feedback loops

MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas.

Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint.

“We won’t go any further with an idea until we get crystal clear on how we’re going to measure, and how we’re going to define success.”

Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners.

That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”

His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.

Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn’t accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn’t mean starting over.

Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don’t want to set ourselves up to fall behind.”

Credit: Brian Malloy Photo

Weeding instead of letting a thousand flowers bloom

Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first.

Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.

But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn’t have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.

Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments of workflows. They questioned what capabilities they wanted and needed and what investment those required.

Sriraman’s team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out).

As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”

That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’re gonna get six different answers.”

There’s nothing wrong with that, he noted; it’s just that everybody is discovering and experimenting as the landscape keeps shifting.

Instead of a wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use.

They also began “consciously embedding AI champions“ across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.

Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges.

In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There’s always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.

Sriraman was clear: “Thou shall not do this: Don’t show PHI [protected health information] in Perplexity. As simple as that, right?”

And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”

Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the ’90s and 2000s with AI. The same concepts apply.”

Defining metrics, establishing strong feedback loops

Weeding instead of letting a thousand flowers bloom

Leave a Reply Cancel reply

Related News

Wisconsin governor says ‘no’ to age checks for porn

Can Artificial Intelligence be governed—or will it govern us?

Netflix launches a standalone app for kids’ games

You can now search through app reviews on the Google Play Store