Published in Forbes Business Council, September 2025
A common misconception in automated software testing is that the Document Object Model (DOM) is still the best way to interact with a web application. That assumption breaks down when most front ends are stitched together from overly complex DOMs and megabytes of runtime logic.
Quality assurance of software, increasingly written by swarms of AI-assisted developers, has become one of the hottest topics in Silicon Valley. It's the nuclear fallout of the LLM boom and the vibe-code epidemic of 2025. What was once seen as arcane wizardry performed by savvy engineers is now daily output from tools like Windsurf, Claude Code and Lovable.
Uncle Ben said it best: "With great power comes great responsibility." That's never been truer than when ambitious developers deploy AI-generated code to users who depend on these platforms for their social, financial and medical needs. AI is helpful, but it doesn't produce perfect code. In fact, LLMs make mistakes at a higher rate than thoughtful human developers. Writing code with AI today is a double-edged sword.
That edge cuts deepest in testing. Vibe-coded apps may appear functional in the browser, but their underlying code is often bloated, disorganized and difficult to maintain. I frequently see DOM trees that contain thousands of deeply nested, nonstandard elements. No documentation, no spec, no clear intention—just opaque markup glued together with bulky scripts. (Cue Homer Simpson holding himself together with clothespins.)
Many "AI-native" browser automation tools still cling to the outdated idea that the DOM is a reliable source of truth. That might have worked for hand-coded web apps from 2010, but today's AI-generated UIs, minified JavaScript runtimes, <canvas> rendering and frameworks like Flutter make it less reliable.
As data flows from the server to the browser, into a context window and finally into an LLM, detail and fidelity are lost at every step: a process I call the loss chain. By the time the agent acts, it's working from abstractions that no longer match what's actually on the screen.
The DOM also doesn't behave like the fully rendered UI. Try finding the right <div> that's supposed to be a login button when its click handler is buried in minified code. That ambiguity makes it hard for automation to know what to click, and when apps rely on graphics or canvas-based interfaces, DOM-driven testing tools degrade into guesswork.
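To make that concrete, here is a minimal sketch in Playwright for Python. The URL, accessible name and class names below are hypothetical stand-ins for minified, vibe-coded output rather than any real application; the point is how quickly a semantic locator gives way to a brittle structural one.

```python
# Hypothetical example: the URL, accessible name and CSS selector are
# illustrative stand-ins, not taken from a real application.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")  # placeholder URL

    try:
        # What a DOM-first test wants: a semantic, accessible login button.
        page.get_by_role("button", name="Log in").click(timeout=3000)
    except PlaywrightTimeout:
        # What vibe-coded markup often provides instead: a styled <div> with an
        # obfuscated class name and a click handler hidden in a minified bundle.
        # This fallback selector is brittle and breaks on every rebuild.
        page.locator("div.x9f2 > div:nth-child(3)").click()

    browser.close()
```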
I've been researching browser agents, testing frameworks and vision models full time since the beginning of this year. This intense focus is no accident—it's the main building block of my startup. Having previously worked as a software engineer on projects aimed at catching regressions, I've seen firsthand the limitations of traditional approaches. I've also had the privilege of meeting fellow founders working in the browser automation space to discuss these exact challenges.
Based on this, I believe the solution is computer vision: equipping automation tools with eyes and hands (screenshots and virtual peripheral devices). This avoids the loss chain entirely by letting the AI act on what is actually rendered in the browser, not on an abstraction.
The AI agent is more effective when it can click the login button it sees, rather than guessing at one buried in a DOM tree that exists solely to appease the renderer. For modern vibe-coded apps, the rendered screen is the source of truth, not the markup.
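Here is a rough sketch of that loop in Playwright for Python. The locate_on_screenshot helper is a hypothetical stand-in for whatever vision model you call; everything else is ordinary screenshot-and-mouse plumbing, with no DOM queries anywhere.

```python
# Sketch of the "eyes and hands" loop. locate_on_screenshot() is hypothetical:
# it stands in for a call to a vision model that maps a screenshot plus a
# natural-language description to pixel coordinates.
from playwright.sync_api import sync_playwright

def locate_on_screenshot(png_bytes: bytes, description: str) -> tuple[int, int]:
    """Hypothetical: return the (x, y) center of the described element."""
    raise NotImplementedError  # stand-in for a real vision-model call

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")  # placeholder URL

    screenshot = page.screenshot()                                # eyes
    x, y = locate_on_screenshot(screenshot, "the login button")   # brain
    page.mouse.click(x, y)                                        # hands

    browser.close()
```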
It's true that vision-based testing is still slower, more expensive and less mature than DOM-based frameworks. But that's changing fast. Vision models are getting faster, inference costs are dropping and open-weight contenders like ByteDance's UI-TARS are already challenging Anthropic's computer-use models.
So, when do these trade-offs matter most?
For projects with read-heavy workflows (like text ingestion, scraping and research), DOM-based frameworks are often still the better choice. Vision models rely on optical character recognition (OCR) and an LLM's reasoning for text comprehension, which can cause them to struggle with small fonts and misidentify placeholders and long, nonsensical strings like URLs or hashes. Without context beyond a screenshot of the viewport, they can also misinterpret icons and buttons that lack accompanying text.
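For contrast, here is what the read-heavy case looks like with a DOM-based framework, again in Playwright for Python; the URL and selectors are hypothetical. The markup hands the test exact strings, so there is no OCR step to garble a URL or a hash.

```python
# Hypothetical read-heavy sketch: pull exact text and attributes straight from
# the DOM, with no rendering, fonts or OCR involved.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/articles")  # placeholder URL

    titles = page.locator("article h2").all_inner_texts()
    links = [a.get_attribute("href") for a in page.locator("article a").all()]

    print(titles)
    print(links)
    browser.close()
```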
On the flip side, vision models excel at write-heavy tasks that require a lot of interaction with a webpage. In my experience, they are great for handling things like canvases, iframes, scrollable containers, drag-and-drop actions and nonstandard input components. If you're testing applications like design tools or games, the visual approach might be worth the extra cost and time.
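A sketch of that write-heavy case: dragging an object across a <canvas> by pixel coordinates, an interaction a DOM locator cannot express because there is nothing inside the canvas to select. The locate_on_screenshot helper is the same hypothetical vision-model stand-in as before.

```python
# Hypothetical write-heavy sketch: a vision-guided drag-and-drop on a canvas.
from playwright.sync_api import sync_playwright

def locate_on_screenshot(png_bytes: bytes, description: str) -> tuple[int, int]:
    """Hypothetical vision-model helper; returns pixel coordinates."""
    raise NotImplementedError

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/design-tool")  # placeholder URL

    shot = page.screenshot()
    src_x, src_y = locate_on_screenshot(shot, "the blue rectangle on the canvas")
    dst_x, dst_y = locate_on_screenshot(shot, "the empty area in the top-right corner")

    # Drag with the virtual mouse: press, move in small steps, release.
    page.mouse.move(src_x, src_y)
    page.mouse.down()
    page.mouse.move(dst_x, dst_y, steps=20)
    page.mouse.up()
    browser.close()
```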
Computer use is inherently visual. We design interfaces for humans to look at and interact with using a mouse and keyboard. If AI aims to be the brain, it needs eyes and hands. In the age of vibe-code, don't lock yourself into a DOM-shaped view of the world. Build tools that see.
© Boris Skurikhin