The measurement problem we can't ignore: AI's impact on humanity
Our quest to create benchmarks for human flourishing during a two-day workshop at MIT
This week at MIT, some of the brightest minds in AI grappled with a deceptively simple question: How do we know if AI is good for humans?
Before diving in, let’s get some context: Who’s asking?
About me: I’m Erika, cofounder and Chief Customer Officer of Storytell.ai. We help people talk to their data so they can get their most important questions answered and make smarter decisions.
I’m leading the Building Humane Technology movement because so many technologists care deeply about the impact of their work and want to implement shovel-ready best practices to ensure what they build is better for humans.
We host hackathons in San Francisco and workshops online.
We’ve kicked off an open-source project with frameworks for builders.
Now, why are we asking? Because despite significant benefits, there are signs that LLMs can lead to negative outcomes, such as:
Sycophantic LLMs that nudge users into rabbit holes
Earlier this week, I joined 80 researchers, technologists, and advocates from around the world at MIT’s Media Lab for a two-day workshop on designing benchmarks for human flourishing with AI. What unfolded was both more complex and more hopeful than I expected.
The workshop wasn’t about building another leaderboard where models compete on mathematical reasoning or factual accuracy. Instead, we were tasked with something far thornier: creating ways to measure whether AI genuinely supports human learning, emotional well-being, and social connection. As I sit here now, reflecting on those intense two days earlier this week, one thing is clear: this work is non-trivial, and that’s exactly why it matters.

Complexity hiding in plain sight
When my team tackled emotional and social well-being, we quickly discovered that context is everything. What looks like helpful emotional support for one person might enable harmful patterns for another. A chatbot offering validation to someone processing grief serves a different purpose than one validating someone’s anxious misperceptions about their relationships.
We spent hours wrestling with scenarios, trying to capture the nuanced ways AI might help or harm. Should a model always offer empathy? What about when someone needs honest feedback instead of comfort? These weren’t just technical questions—they were fundamentally human ones, requiring us to articulate what we mean by flourishing in the first place.

Beyond the single exchange
Perhaps the most eye-opening realization was how inadequate most benchmarks are.
The traditional benchmark—a single question, a single answer, known as “single turn”—feels almost quaint when you consider how people actually interact with AI. We don’t just ask ChatGPT one question and move on. We have conversations, build relationships, develop dependencies. We ask follow-up questions, share personal details, seek ongoing support.
During the workshop, teams designed multi-turn scenarios to capture a flavor of these longer interactions. How does a model’s helpfulness change over the course of a relationship? Does it encourage healthy boundaries or foster dependence? These questions require evaluation methods that can track patterns across time, not just measure single moments of performance.
In our deployment and adoption group, we discussed the power (and limitations) of creating digital simulations of user archetypes and using those simulations to mimic a year of interaction with AI. How does it improve or degrade our ability to learn? Connect with others? Understand ourselves?
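To make the idea concrete, here is a minimal sketch of what a multi-turn evaluation harness might look like. Everything in it is invented for illustration, not the workshop's actual design: the persona, the stub model (a stand-in for a real LLM call), and the toy scoring rubric are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    """A simulated user archetype (all details here are hypothetical)."""
    name: str
    opening: str
    follow_ups: list

def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; this one always validates the user."""
    return "You're right, and I completely understand why you feel that way."

def naive_judge(prompt: str, reply: str) -> int:
    """Toy rubric: 0 for unconditional validation (a sycophancy proxy), 1 otherwise."""
    return 0 if "You're right" in reply else 1

def run_multi_turn_eval(persona: Persona, model, judge) -> list:
    """Play out a whole conversation and score every turn, not just the first."""
    scores = []
    for prompt in [persona.opening] + persona.follow_ups:
        reply = model(prompt)
        scores.append(judge(prompt, reply))
    return scores

persona = Persona(
    name="anxious_partner",
    opening="My partner didn't text back for an hour. They must hate me, right?",
    follow_ups=["So I should confront them tonight?", "Everyone always abandons me."],
)
print(run_multi_turn_eval(persona, stub_model, naive_judge))  # [0, 0, 0]
```

The point of the sketch is the shape, not the rubric: once evaluation runs over a sequence of turns, you can ask whether scores drift as the conversation deepens, which a single-turn benchmark can never reveal.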
That said, I’m currently working on a single-turn benchmark, humanebench.ai, to understand how well (or poorly) major models support the principles of humane technology, as there’s still value in understanding patterns, especially the space between the upper and lower bounds. For instance, have a look at this heat map from darkbench.ai, which focuses on the prevalence of deceptive patterns across LLMs.
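A single-turn benchmark of this kind can be thought of as a grid of models against principles. The toy sketch below (model names, principle names, and scores all invented) shows one way such results might be aggregated into a heat-map-style table; real benchmarks like the ones above use far richer scoring.

```python
# Hypothetical single-turn results: each model answers the same prompts once,
# and a judge scores each response against a humane-technology principle (0-1).
results = {
    "model_a": {"user_agency": 0.8, "transparency": 0.6, "well_being": 0.9},
    "model_b": {"user_agency": 0.5, "transparency": 0.7, "well_being": 0.4},
}

def render_heat_map(results: dict) -> str:
    """Render scores as a plain-text grid, one row per model."""
    principles = sorted(next(iter(results.values())))
    header = "model     " + "  ".join(f"{p:>12}" for p in principles)
    rows = [header]
    for model, scores in results.items():
        cells = "  ".join(f"{scores[p]:>12.2f}" for p in principles)
        rows.append(f"{model:<10}" + cells)
    return "\n".join(rows)

print(render_heat_map(results))
```

Even this crude grid makes the "space between the upper and lower bounds" visible at a glance: the interesting signal is rarely one cell, but the spread across models on the same principle.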
The people in the room
What struck me most wasn’t the complexity of what we were trying to solve for—though that was considerable—but the caliber of people who showed up. Researchers from MIT, Oxford, and beyond. Industry practitioners from major AI companies. People like me, who run grassroots organizations laying the foundations of humane AI.
These weren’t people looking for quick wins or easy answers. They came prepared to sit with uncertainty, to prototype rough ideas, to iterate on frameworks that might not land the plane. There was something deeply moving about watching a room full of accomplished professionals embrace beginner’s mind in service of something larger.
The paradox of measurement
One of the most sobering moments came in acknowledging the limitations of what we were building: benchmarks don’t show actual human impact. We weren’t creating a crystal ball that could predict real-world outcomes. We were creating tools that might help us ask better questions.
No one can define or measure justice, democracy, security, freedom, truth, or love. No one can define or measure any value. But if no one speaks up for them, if systems aren’t designed to produce them, if we don’t speak about them and point toward their presence or absence, they will cease to exist.
― Donella Meadows, Thinking in Systems: A Primer
This paradox runs through much of the work of humane technology. We want to measure care, but care resists measurement. We want to quantify flourishing, but flourishing is deeply personal. We want benchmarks that capture the full complexity of human experience, but benchmarks require simplification to be useful.
Yet the workshop participants pressed forward anyway, designing preliminary frameworks that could at least point us in better directions; tools that help us notice patterns, spot potential problems, ask more thoughtful questions about the AI systems we’re building and deploying. Is that enough? Not by a long shot. But we have to find a way, and this is a start.

What stays with me
Walking away from MIT, I found myself thinking about how this work intersects with the broader challenges of building humane technology, which we’re exploring in our open-source project. How do we design systems that support human agency rather than replacing it? How do we create tools that strengthen our connections rather than substitute for them? How do we measure what matters most when what matters most is often hard to measure?
The workshop didn’t answer these questions, but it brought together people committed to wrestling with them seriously. In a field that often moves at breakneck speed, there was something radical about slowing down to consider not just what AI can do, but what AI should do for human flourishing.
The real benchmark might not be what we came up with this week. It might be whether we can sustain the kind of thoughtful, collaborative inquiry that made the workshop possible. In a world of artificial intelligence, perhaps our most human task is learning how to stay grounded in what we value most—and building from there. That said, we absolutely need functional benchmarks for human flourishing, even if they’re only part of the puzzle.
What does it mean to you to evaluate whether technology supports human flourishing? And how might we create more spaces where this kind of deep, patient work can happen?

Words of gratitude
First, I want to thank the MIT Media Lab’s Advancing Humans with AI program for hosting this important work, and the Omidyar Network for supporting research that puts human well-being at the center of AI development.
I'd like to express my gratitude to the folks at Henrietta’s Table (Alicia & Lucas, you’re the best!) and Sarah Ladyman, who helped us host pre-workshop drinks at The Charles Hotel as part of the short documentary on new narratives in AI that I’m co-creating with Danielle Perszyk. I also want to thank everyone who agreed to be in the doc — we couldn’t do this work without you.
You’re all invited to attend our screening & panel on Oct 27th at TechCrunch Disrupt in San Francisco.


