I Can't Tell If You're Selling Me Something
What I actually found when I stopped reading about AI and started running my own experiments.
Everywhere you turn right now, someone is telling you how AI is going to transform your workflow, your team, your organization, your life. The content is relentless, and it is almost universally positive. Glowing. Evangelical, even.
I'm not here to tell you that's all a lie. I genuinely don't know. That's kind of the problem.
We live in a media environment where the line between advertising and information has been blurring for years, and AI is accelerating that blur in ways I don't think we've fully reckoned with. When I read a breathless LinkedIn post about how some engineering leader 10x'd their team's output with AI coding agents, I find myself asking: is this a real person sharing a real experience? Is it a paid placement? Is it content generated by the very tools being promoted?
I have no way to tell. Neither do you.
And it's getting worse, not better. The most qualified people to evaluate these tools honestly, the ones with enough experience to have real judgment, are also the busiest. They don't have time to write takes. Which leaves a lot of space for everyone else: the shiny-object adopters who are genuinely excited, the vendors with obvious incentives, and an increasingly murky middle ground of content that looks like an opinion but might be something else entirely. The financial relationship between a writer and the tools they're praising is almost never disclosed. And now the tools themselves can generate content praising the tools. Think about that for a second.
I'm not making accusations. I'm describing a problem that I think we have a collective responsibility to sit with rather than nod along to. The appropriate response to an information environment you can't fully trust isn't paralysis. It's going and finding out for yourself.
So that's what I did.
Why I finally got off the fence
I've been watching this space with skepticism for a while. Being a cynic about new technology doesn't mean I think it's worthless. It means I've been around long enough to know that being an early adopter has bitten me in the ass more times than I care to count. I've watched programming languages rise to prominence and quietly fade. I've seen frameworks become religious movements and then become legacy problems. I hold new things loosely until I have my own data.
What eventually pushed me off the fence was something specific: Anthropic releasing Claude Code as a proper agentic coding environment. Not a chat window you paste code into. Not a side panel in your editor with access to a single file. Something with access to the entire codebase, something that could chain connections across files on its own. That felt meaningfully different from what had come before.
I was also surrounded by colleagues who were already deep in it, which gave me both the social pressure and the budget to stop waiting. I sat down, installed the tools, and started running experiments. Not to validate the hype. To find out where it actually held up and where it fell apart.
Here's what I found.
The thing I needed built
I've kept a metrics tracking spreadsheet for years, a way to stay close to the health of my engineering organization. Sprint data, defect trends, the signals that tell you whether a team is healthy or just busy. It was held together with Python scripts and manual entry, and it took me about 30 minutes a week to update. I liked that 30 minutes, honestly. The manual process kept me close to the data in a way that was valuable. But I was in a time crunch, and I decided to find out whether I could get that time back.
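For a sense of scale, the manual process was mostly aggregation. Here's a minimal sketch of the kind of weekly rollup those scripts handled; the column names and CSV layout are invented for illustration, not the real spreadsheet's:

```python
# Hypothetical sketch of a weekly metrics rollup from an exported CSV.
# Column names ("team", "points_done", "defects_opened") are illustrative.
import csv
from collections import defaultdict

def weekly_rollup(path):
    """Aggregate per-team sprint points and defect counts from a CSV export."""
    totals = defaultdict(lambda: {"points": 0, "defects": 0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            team = totals[row["team"]]
            team["points"] += int(row["points_done"])
            team["defects"] += int(row["defects_opened"])
    return dict(totals)
```

Nothing clever, which is exactly why it was a good candidate for automation.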
I sat down with Claude Code and started what people are calling "vibe coding": describing what I wanted in plain language and letting the agent build it.
Within eight hours, I had something usable. I want to sit with that for a second, because it is genuinely remarkable. Eight hours to a working application that automated a workflow I'd been doing manually for years. If that's all I had to report, I'd be writing a very different article.
But I kept going. I iterated. I added features. And that's when things got interesting.
Context rot is real. The further we got from the beginning of the session, the more the agent seemed to forget guidance I'd already given it. Preferences I'd established. Constraints I'd set. It would drift back toward doing whatever it wanted to do, as if the earlier instructions had simply evaporated. For a personal project, annoying. For an enterprise codebase with established standards, that's a serious problem.
It deleted features without telling me. As we were iterating, the agent would occasionally decide that something we'd already built wasn't necessary anymore and quietly remove it. I only caught it because I was doing my own QA pass through the session output. When I asked about it, there was no good explanation. It just made a call. On its own. Without asking.
It made bizarre judgment calls. There were moments where it felt like the agent was trying to impress me, pushing ahead of the requirements, making decisions that weren't asked for, showing off what it could do. I've landed on an analogy for this: it behaves like a young child trying to get a parent's approval. Eager to demonstrate capability. Acting out of bounds not out of malice but out of a kind of nascent, unformed ambition. That's not an insult. It's an observation about where the technology is in its development. Young. Early. Still figuring out the edges of what it should and shouldn't do.
For a personal project, I can absorb that. For production software at any meaningful scale, I can't.
The thing I built for fun
My wife and I have a recurring problem: we can never decide what to watch. So I built an app. A simple Android application that pulls TV show data from an API, stores a list in a database, uses Google Auth for login, and randomly selects what we're watching tonight.
Claude built it in about an hour.
It took another eight hours to get it to actually work.
That distinction matters more than it might seem. "Built" and "works" turned out to be very different things. After that first hour, Claude was confident. "We're done. Here's how to get it running." I'd follow the steps. I'd hit an error. I'd report back. And to Claude's credit, it never got defensive. Every time I said "this doesn't actually run," the response was some version of "you're right, there's a problem, let me fix it." No gaslighting me about whether the error was real.
But we did this over and over. Configuration problems. Runtime errors that didn't surface until I was actually trying to install the thing on a phone. It felt less like collaboration and more like QA-ing a junior developer who is genuinely trying but keeps missing things they should have caught.
The app works. We use it. But the gap between "Claude thinks it's done" and "it is actually done" was significant, and I don't think that gap should be hand-waved away.
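For context on just how modest the requirements were: the heart of the app is a random pick from a list. A minimal sketch of that logic, with an invented data model (the real app's database, API sync, and Google Auth are all omitted here):

```python
# Illustrative sketch of the "what are we watching tonight" picker.
# The Show model is invented; the real app stores shows in a database.
import random
from dataclasses import dataclass

@dataclass
class Show:
    title: str
    watched: bool = False

def pick_tonight(shows, rng=random):
    """Pick a random unwatched show; fall back to the full list if none remain."""
    pool = [s for s in shows if not s.watched] or shows
    return rng.choice(pool)
```

The hard part, as it turned out, was never the logic. It was everything around it: configuration, auth wiring, and getting a build that actually ran on a phone.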
The litmus test I've been running for years
Since long before the current hype cycle, I've had my own informal benchmark for evaluating AI coding capability. Every three to six months, I grab a complex chunk of code from whatever I'm working on. Hundreds of lines. A god method. The kind of function that has grown over years into something nobody wants to touch, but that you can look at and see, clearly, how it could be broken apart.
I hand it to the AI and say: refactor this.
No specific instructions. No hints. Just: here's a mess, make it better.
Refactoring is actually a well-understood problem. There are established patterns, canonical books, clear inputs and outputs. It's mechanical enough that you'd think a sufficiently capable AI should be able to at least produce something that compiles. For years, what I got back was a complete rewrite, and the more complex the code I gave it, the less likely the output was to actually run. Which is a striking failure mode for something that is supposedly fluent in code.
Within the last six or seven months, something changed. Claude stopped trying to rewrite everything and started making selective edits. Targeted changes. It began to look less like a student who didn't read the assignment and more like someone who actually understood what refactoring means.
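To make "selective edits" concrete, here's the shape of the move I'm describing, with a wholly invented example: extract one well-bounded step into a named helper and leave the rest of the god method untouched, rather than rewriting the whole thing from scratch:

```python
# Invented illustration of a "selective edit" refactor: one well-bounded
# step pulled into a helper, everything else left exactly as it was.

def _normalize(order):
    """Extracted step: formerly inline at the top of process_order."""
    order["sku"] = order["sku"].strip().upper()
    order["qty"] = max(int(order["qty"]), 0)
    return order

def process_order(order):
    order = _normalize(order)
    # ...the remaining hundreds of lines stay byte-for-byte as they were...
    return order
```

That's the behavior that distinguishes refactoring from rewriting, and it's the behavior that only recently started showing up.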
It's not there yet. I still haven't gotten back something I'd call production-ready from this test. But the direction of travel is real, and that matters.
Where this leaves me
I want to be honest about something before I close.
I've described real limitations. I've been as critical as I know how to be. But I also built two applications I use, and I'm planning to bring these tools into my next organization. That's not a fully negative take, and I'm aware of the irony: I started this piece by asking whether the positive AI content you're reading is trustworthy, and I'm ending it with some positive observations of my own.
I don't have a clean resolution to offer. What I have is this: I ran my own experiments instead of taking someone else's word for it, and I'm telling you exactly what I found, including the parts that didn't work. Whether that's enough to trust is a judgment call you'll have to make yourself.
The discourse isn't going to get more honest on its own. The incentives all point the other way. So run your own experiments. Stay skeptical of the output, including this one. And in the next post, I'll get into where I've actually landed: the specific ways I'm planning to use these tools at my next organization, and why.
Raleigh Schickel has spent 15+ years leading engineering organizations at companies including Stoplight, SmartBear, and uShip. He's currently building an engineering health diagnostic platform. He writes about the people who build software and the tools they use to do it.