We Used 6 AI Tools on Real Client Projects for 6 Months — Here's What Actually Survived

Written by: Sumit Patel
Published: March 10, 2026
Reading level: Advanced Strategy
Reading time: 22 min read
The Brief
ChatGPT, Claude, DeepSeek, Gemini, Lovable, and v0.dev, tested across ERP builds, CRM integrations, and B2B SaaS. An honest breakdown from a two-developer team that ships production code.
This is not a feature comparison.
We are a two-person development team at EdgeNRoots — one frontend, one backend. Between us we shipped a hospitality CRM with live AI features, a luxury rental aggregator with a custom formula engine, and a B2B SaaS for Maruti Suzuki dealerships — all in the last 12 months. These are the tools that were open on our machines while that work was happening. Not tools we benchmarked in a spreadsheet.
Let me tell you about a real production incident first, because it sets up everything that follows. Krishna was leading backend engineering on BanquetFirst, an AI-powered hospitality CRM we were building at EdgeNRoots. One of the features was a WhatsApp drip campaign engine: multi-step, time-delayed message sequences tied to lead lifecycle stages. He had ChatGPT help scaffold the initial webhook handler for incoming Meta Cloud API events. The code looked clean. It passed local testing.
What ChatGPT did not flag: the webhook handler was creating a new message queue worker instance on every incoming event, with no cleanup on connection drop. On a quiet day, that was fine. On the day a venue ran a promotion that spiked incoming leads by 400%, the process table filled up and the server ground to a halt. Two days of debugging. The fix was straightforward once found. The issue was that ChatGPT generated confident, working-looking code without flagging the lifecycle risk.
When Krishna ran the same prompt through Claude after the incident, Claude's response included: 'Note: if this webhook fires at high frequency, spawning a new worker per event without a concurrency limit or cleanup mechanism will create resource exhaustion. Consider a queue-based pattern with a fixed worker pool.'
That difference, between confident generation and honest flagging, is the throughline of this entire article. This is a six-month look at the tools our team actually used, on real shipped products. No benchmarks. No demo projects. Just what was open on our screens.
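To make the "queue-based pattern with a fixed worker pool" concrete, here is a minimal sketch in TypeScript. It is not the BanquetFirst fix, only an illustration of the idea. The class name `FixedWorkerPool`, the `Job` type, and the `concurrency` and `maxQueued` parameters are all assumptions made for this example.

```typescript
// Hypothetical job type: any async unit of work, e.g. handling one webhook event.
type Job = () => Promise<void>;

class FixedWorkerPool {
  private queue: Job[] = [];
  private active = 0;
  private concurrency: number;
  private maxQueued: number;

  constructor(concurrency: number, maxQueued: number) {
    this.concurrency = concurrency;
    this.maxQueued = maxQueued;
  }

  // Returns false when the backlog is full, so the caller can shed load
  // instead of letting work pile up without bound.
  enqueue(job: Job): boolean {
    if (this.queue.length >= this.maxQueued) return false;
    this.queue.push(job);
    this.drain();
    return true;
  }

  stats(): { active: number; queued: number } {
    return { active: this.active, queued: this.queue.length };
  }

  // Starts queued jobs until the concurrency limit is reached.
  private drain(): void {
    while (this.active < this.concurrency) {
      const job = this.queue.shift();
      if (!job) return;
      this.active += 1;
      // Free the slot whether the job succeeds or fails, then pull the next job.
      const release = (): void => {
        this.active -= 1;
        this.drain();
      };
      // Promise.resolve().then(job) also catches jobs that throw synchronously.
      Promise.resolve().then(job).then(release, release);
    }
  }
}
```

In a webhook handler, the idea would be to call `pool.enqueue(() => handleEvent(payload))` (where `handleEvent` is a placeholder) and respond based on the boolean it returns. However many events arrive, at most `concurrency` jobs run at once and at most `maxQueued` wait behind them. What to do with a rejected event is a separate decision this sketch does not make.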
Why Most AI Tool Comparisons Tell You Nothing Useful
Every comparison article runs the same test: give each AI a coding prompt, compare the outputs, declare a winner. That test is almost meaningless for production development.
The questions that actually matter are different. Does the model flag its own uncertainty, or does it generate confidently wrong code? How does it handle a 200-line codebase pasted for context? Does it maintain the constraints you gave it three exchanges ago, or does it quietly drift? What happens when you push it on edge cases it is not confident about?
We did not run structured benchmarks. We used these tools daily across real projects and noticed patterns. What follows is those patterns.
- Single-prompt benchmarks do not predict multi-turn reliability. The model that wins a coding challenge may fall apart across a 20-message debugging session.
- Speed matters but compounds weirdly. A 3-second response keeps your flow. A 45-second response — even a better one — breaks it. This affects which tool you actually reach for under deadline.
- Context window size is misleading. Most models degrade noticeably when you push past roughly 30-40% of their advertised limit. The number in the spec sheet is not the number you can rely on.
- Switching between tools mid-task costs more than most developers account for. Re-establishing project context after switching is typically a 5-10 minute loss. We stopped doing it.
Claude: The Only Tool We Trust When Something Goes to Production
We use Claude more than any other AI on the team, and the reason is not capability — it is honesty.
Krishna built an invoice OCR extraction pipeline for ScrapCity, a B2B SaaS serving Maruti Suzuki dealerships. The pipeline needed to extract part names, quantities, and vehicle model data from scanned invoices, then cross-reference against an inventory system. The edge cases were nasty: handwritten corrections on printed forms, inconsistent part naming across dealerships, invoices where the same part appeared under three different line items.
When he worked through the architecture with Claude, the responses consistently included what the model was uncertain about. Not hedging for the sake of it — specific flags. 'This approach assumes the OCR output preserves table structure. If your scanner flattens the layout, you will need a different extraction strategy.' That kind of heads-up, given before you have committed two days to an approach, is worth more than fast code generation.
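One of the edge cases above, the same part showing up under several line items with inconsistent naming, can be illustrated with a small sketch. This is not the ScrapCity pipeline. The `LineItem` shape and the normalization rule (lowercase, strip punctuation, collapse whitespace) are assumptions for the example, and real part catalogs would likely need a proper alias table.

```typescript
// Hypothetical shape for one extracted invoice row.
interface LineItem {
  partName: string;
  quantity: number;
}

// Assumed normalization: lowercase, turn punctuation runs into spaces, trim.
function normalizePartName(raw: string): string {
  return raw.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Merges rows whose normalized names match, summing their quantities.
function mergeLineItems(items: LineItem[]): LineItem[] {
  const merged = new Map<string, LineItem>();
  for (const item of items) {
    const key = normalizePartName(item.partName);
    const existing = merged.get(key);
    if (existing) {
      existing.quantity += item.quantity;
    } else {
      merged.set(key, { partName: key, quantity: item.quantity });
    }
  }
  return Array.from(merged.values());
}
```

Note that this only works if the OCR step has already produced clean rows, which is exactly the assumption Claude flagged: if the scanner flattens the table layout, there are no reliable rows to merge.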
On the frontend side, I use Claude for anything touching production Redux logic or complex TypeScript generics. Not because it is faster than ChatGPT — it is not, especially during peak hours. But on tasks where a wrong answer has a real cost, I want the model that says 'I am not confident about this' rather than the model that sounds confident whether it is right or wrong.
The weak spots are real: slower response times, smaller plugin ecosystem, no image generation worth using. For rapid brainstorming or first-draft generation, ChatGPT is still the faster call.
- Best for: Production backend logic, complex TypeScript, long document analysis, code review where accuracy matters more than speed.
- Pricing: Free tier hits rate limits fast for daily use. Pro at $20/month is the practical floor. Heavy team use needs the Team plan.
- Real strength: It flags uncertainty instead of papering over it. On production code, that is the most valuable feature any AI tool can have.
- Real weakness: Noticeably slower than ChatGPT at peak times. If your workflow is high-volume brainstorming or rapid drafting, you will feel the difference.
ChatGPT: Fastest to a Working Draft, Least Likely to Tell You What It Missed
ChatGPT is still the tool both of us open first for general tasks, and it earns that position. For writing a first-draft API route, generating test data structures, or brainstorming approaches to a new feature, it is faster and more versatile than anything else in our stack.
The limitation is the same one that caused the BanquetFirst webhook incident: ChatGPT generates confident output regardless of how confident it should actually be. Across six months, we noticed this pattern consistently — it produces code that looks correct and runs correctly in isolation, but misses systemic issues that only surface under load or in edge conditions.
We have not stopped using it. We have just stopped treating its output as production-ready without a review pass. The Custom GPTs feature is genuinely useful — I have one configured for writing StackNova blog briefs and another that generates Redux Toolkit boilerplate in our team's specific patterns. For repeatable, structured tasks, those are real time-savers.
For anything touching client-facing systems or database logic, ChatGPT output goes through a Claude review pass before it ships. That is the workflow we landed on after a few expensive lessons.
- Best for: First drafts, general coding tasks, brainstorming, structured content generation, repetitive workflow tasks via Custom GPTs.
- Pricing: Free tier is usable for light use. Plus at $20/month is worth it if you hit rate limits, which daily users will within two weeks.
- Real strength: Fastest to a working first draft across nearly every category. The Custom GPT ecosystem is a genuine productivity multiplier for repeatable tasks.
- Real weakness: Generates confident output regardless of actual confidence. Do not ship production code from ChatGPT without a review step.
v0.dev vs Lovable: Two Different Tools People Keep Conflating
Both tools generate frontend code from prompts. That is where the similarity ends, and confusing them leads to the wrong choice.
I have used both. My honest take: v0.dev produces better UI. Lovable scaffolds a full app faster. Those are different jobs.
v0.dev (Vercel) generates React component code with noticeably better design sensibility. The output looks like something a frontend developer with real design taste produced. Layouts are tighter, spacing is considered, component structure is cleaner. For building UI that needs to look good — landing pages, dashboards, client-facing product screens — v0 consistently produces output closer to what you actually want to ship. I reach for v0 when I care about the visual result.
Lovable scaffolds entire application structures faster. It is stronger at wiring together full pages with routing, form handling, and data flow — not just individual components. Krishna used Lovable for ScrapCity specifically because he needed a full-app scaffold, not a single polished component. As a backend developer who needed a functional frontend he could wire to his Node.js APIs, Lovable gave him a working structure to iterate on. He was not optimizing for visual polish — he was optimizing for speed to something functional.
The honest ceiling on both: complex state management, deep business logic, and domain-specific validation need developer judgment on top of whatever they generate. Neither tool removes the need for the developer — they just collapse the time between blank canvas and reviewable first draft.
If your question is 'which one should I use' — ask what you actually need. Better looking UI with cleaner component code: v0.dev. Full app scaffold for a solo full-stack project where you are the backend developer crossing over: Lovable.
- v0.dev: stronger for UI quality and component design. Output looks more intentional. Better when visual result matters.
- Lovable: stronger for full-app scaffolding speed. Better when a backend developer needs a functional frontend structure to wire APIs against.
- Pricing: v0.dev has a free tier; paid starts at $20/month. Lovable has a free tier with generation limits; paid plans run around $25/month.
- Do not conflate them: they solve related but different problems. Using the wrong one for your actual need produces frustrating results.
- Real limitation on both: business logic, complex state, domain validation — still needs you. These tools accelerate to a reviewable first version, not to production-ready.
DeepSeek: The Tool We Reached For When Everything Else Failed
Before Claude became our go-to for production work, there was a period where the workflow was ChatGPT as primary, Gemini as secondary, and DeepSeek as the last resort when both failed to solve something.
That framing — last resort — is actually the most honest description of how DeepSeek entered our stack. There were debugging sessions, particularly on complex recursive logic and data pipeline issues in early EdgeNRoots work, where ChatGPT would give a confident answer that was wrong, Gemini would give a similar wrong answer slightly rephrased, and DeepSeek running locally would actually trace through the logic correctly and identify the problem.
It earns its place specifically on pure code reasoning tasks. Not writing tasks, not architecture discussions, not anything requiring broad knowledge — pure code. Given a specific function, a specific error, and enough context, DeepSeek's reasoning on what is actually happening in the code is genuinely strong. Stronger than its reputation suggests, particularly because most comparisons test it on general tasks where it is clearly weaker.
The other real advantage is local deployment. For EdgeNRoots projects handling booking data, OTA revenue calculations, or insurance industry pipelines — anything where client code leaving your infrastructure is a concern — DeepSeek running locally is the only option on this list that solves that problem. Claude and ChatGPT are cloud tools. DeepSeek does not have to be.
Once Claude arrived in the workflow, it displaced DeepSeek for most debugging and architecture work because Claude flags uncertainty better and handles longer context more reliably. But DeepSeek stayed in the stack for local-deployment scenarios and as a useful secondary opinion on stubborn code problems.
- Honest positioning: it was the fallback when ChatGPT and Gemini failed to fix a problem. That is still the best description of when to reach for it.
- Best for: Pure code reasoning on specific bugs, complex recursive logic, local deployment for client-sensitive codebases.
- Pricing: Free. Open-source. Runs locally — the only tool on this list with zero cloud exposure.
- Displaced by Claude for: most debugging and architecture work once Claude was in the stack. Claude handles uncertainty better across longer context.
- Still worth keeping: for local-deployment scenarios and as a second opinion on code problems Claude or ChatGPT are not solving cleanly.
Gemini: One Specific Use Case, Limited Beyond It
We tested Gemini thoroughly, and our honest conclusion is that it earns its place in exactly one scenario: when you are working inside Google Workspace and need AI assistance that integrates natively with Docs, Sheets, or Drive.
For processing a large set of Google Sheets reports or pulling structured data from Drive into a document draft, Gemini's native integration is genuinely faster than copy-pasting into another tool. The 2-million-token context window is also real, and for processing large document sets in one pass, it has an advantage.
As a standalone reasoning tool or coding assistant, we did not find it competitive with ChatGPT or Claude. Response quality on complex tasks was consistently a tier below both. Its knowledge cutoff and reliance on live search for recent information introduce accuracy inconsistencies that we ran into several times during testing.
We are not on a Gemini subscription. If you are already in Google Workspace all day, the integration case might justify it. For everyone else, the $20/month is better spent on the two tools above.
- Best for: Teams already inside Google Workspace who need AI assistance across Docs, Sheets, and Drive without context-switching.
- Pricing: Free tier available. Gemini Advanced at $20/month unlocks the full context window.
- Real strength: Native Workspace integrations are genuinely faster for document-heavy workflows. Largest available context window.
- Real weakness: As a standalone tool, noticeably behind ChatGPT and Claude on reasoning quality. Heavy reliance on search for recent information creates inconsistency.
The Workflow We Actually Use (Two Developers, Three Active Products)
After six months of iteration, here is what our actual daily AI workflow looks like — not what we planned, but what stabilized through use.
For frontend work (my side): ChatGPT for first drafts of components and utility functions. v0.dev for polished UI components. Claude for anything touching production Redux logic, complex TypeScript, or code that needs to be reviewed before it ships. DeepSeek locally when I am working with client ERP data that should not leave our infrastructure.
For backend work (Krishna's side): Claude as the primary tool for architecture decisions, database schema review, and production API logic. ChatGPT for scaffolding and boilerplate. Lovable for frontend screens on the ScrapCity project where he is solo full-stack. DeepSeek for local testing of sensitive codebase logic.
The rule we follow is not 'use the best tool for each task' — it is 'use the minimum number of tools that cover your actual needs, and learn them deeply.' Switching mid-task costs more than using a slightly inferior tool consistently.
Total monthly AI spend between us: roughly $60-80 across all of our subscriptions. Measured in hours no longer spent on tasks these tools now handle, the return on that spend far outweighs the cost.
- Frontend workflow (Sumit): ChatGPT for first drafts, Claude for production review, v0.dev for polished UI components, DeepSeek locally for sensitive client code.
- Backend workflow (Krishna): Claude for architecture and production logic, ChatGPT for scaffolding, Lovable for full-app frontend scaffold on solo projects.
- Rule: minimum tools, used deeply. Two primary tools outperform five tools switched between constantly.
- Monthly spend: $60-80 for two developers across all subscriptions. DeepSeek and v0 free tiers cover a lot.
What All of These Tools Still Get Wrong
None of the tools above are reliable enough to remove the human review step from production code. That is the most important thing to understand after six months of heavy use, and it is not close to changing.
Hallucination is not solved. Every tool on this list will occasionally generate plausible-sounding but incorrect logic — Claude least, ChatGPT most consistently, but none are immune. The rate is lower than it was in 2024, but it has not gone to zero and it will not in the near term.
Context consistency degrades across long sessions. Past roughly 40% of a model's advertised context window, coherence drops measurably. If you are debugging a complex issue across a 30-message conversation, the model's awareness of earlier constraints will quietly drift. We re-paste key constraints at intervals on long sessions.
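A rough way to turn the "roughly 40%" observation into a reminder is sketched below. The four-characters-per-token estimate is a crude heuristic, not a real tokenizer, and the function names and the `threshold` default are assumptions for the example. The 40% figure itself is our observation, not a vendor-documented limit.

```typescript
// Crude heuristic, not a real tokenizer: about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface BudgetCheck {
  usedTokens: number;
  usedFraction: number;
  shouldRepaste: boolean;
}

// Flags when the conversation has used at least `threshold` of the advertised window.
function checkContextBudget(
  messages: string[],
  advertisedWindow: number,
  threshold = 0.4,
): BudgetCheck {
  const usedTokens = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  const usedFraction = usedTokens / advertisedWindow;
  return { usedTokens, usedFraction, shouldRepaste: usedFraction >= threshold };
}
```

When `shouldRepaste` comes back true, that is the cue to paste the key constraints again or to start a fresh session with a short summary.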
Output quality varies session to session. The same prompt on Tuesday and Thursday can produce meaningfully different quality results. Do not build automated workflows that assume perfectly consistent AI output — they will fail in ways that are hard to debug.
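If an automated workflow does consume model output, one defensive option is to validate every response before anything downstream touches it. The sketch below assumes the model was asked for JSON with two string fields. The `LeadSummary` shape and its field names are hypothetical.

```typescript
// Hypothetical shape the workflow expects back from the model.
interface LeadSummary {
  name: string;
  stage: string;
}

type ParseResult =
  | { ok: true; value: LeadSummary }
  | { ok: false; reason: string };

// Never trusts the raw text: every failure mode becomes an explicit result.
function parseModelOutput(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, reason: "not an object" };
  }
  const record = data as Record<string, unknown>;
  if (typeof record.name !== "string" || typeof record.stage !== "string") {
    return { ok: false, reason: "missing or mistyped fields" };
  }
  return { ok: true, value: { name: record.name, stage: record.stage } };
}
```

A caller can then retry, fall back, or route the item to a human when `ok` is false, rather than failing somewhere far from the cause.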
Data privacy is not uniformly resolved. Cloud AI tools process inputs on external servers. For client-sensitive code, we use DeepSeek locally or, when that is not practical, we redact client-specific details before pasting. Treating cloud AI tools as safe for all code by default is a mistake.
- Hallucination rate is lower but not zero. Production code from any AI tool needs a human review pass before it ships. This is not optional.
- Context drift is real above 40% of the advertised window. Re-paste key constraints during long debugging sessions.
- Session-to-session output variance is significant. Automated workflows assuming consistent AI quality will break unpredictably.
- Cloud AI tools are not appropriate for all client code without redaction. Local deployment is the only option that keeps sensitive data entirely off third-party infrastructure.
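The redaction step mentioned above can be as simple as a find-and-replace pass before pasting. This is a minimal sketch, not our actual tooling: the email pattern is deliberately simple, the `clientTerms` list is something you would maintain by hand, and a pass like this will miss anything it has no rule for.

```typescript
// Deliberately simple email pattern; it is an illustration, not a full validator.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Escapes characters that have special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replaces emails and any listed client terms (case-insensitive) with placeholders.
function redact(source: string, clientTerms: string[]): string {
  let out = source.replace(EMAIL_PATTERN, "<EMAIL>");
  for (const term of clientTerms) {
    out = out.replace(new RegExp(escapeRegExp(term), "gi"), "<CLIENT>");
  }
  return out;
}

// Example with a made-up client name and address.
const snippet = 'const owner = "ops@acme-hotels.com"; // Acme Hotels booking sync';
const safeToPaste = redact(snippet, ["Acme Hotels"]);
```

Treat the result as a first pass that still needs a quick read before pasting. For code that is sensitive throughout, local deployment remains the safer route.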
Which Tool Should You Start With (Based on What Kind of Developer You Are)
If you are a solo developer or freelancer building client projects: start with ChatGPT Plus ($20/month) for speed and versatility. Add Claude Pro ($20/month) as a second tool specifically for production review passes — it is worth the extra $20 if you ship client-facing code. Total spend: $40/month.
If you are a developer building a product solo and need frontend help: look at Lovable. It is the most significant change to the solo developer workflow we have seen in the past year. Use Claude for backend architecture alongside it.
If your work involves sensitive client data: keep DeepSeek in your stack for free, run it locally, and use it specifically for the code that should not go to cloud AI. You do not need to choose between AI assistance and data privacy if you deploy locally.
If you are already embedded in Google Workspace: Gemini may be worth evaluating for the integration. Test it for a month before committing.
If you are tempted to run four or five AI tools simultaneously: do not. The switching cost eats the productivity gain. Two tools used well will outperform five tools used inconsistently, every time.
- Solo freelancer: ChatGPT Plus + Claude Pro = $40/month. Claude is the review layer; ChatGPT is the drafting layer.
- Frontend-heavy solo work: v0.dev (free tier to start) for polished UI components. Better design output than Lovable.
- Solo backend developer going full-stack: Lovable + Claude = what Krishna uses to ship ScrapCity alone. Lovable for app scaffold; v0 for individual UI screens that need to look good.
- Sensitive client code: DeepSeek locally. Free, private, and the fallback that works when other tools fail on stubborn bugs.
- Google Workspace teams: evaluate Gemini, test for a month before committing.
- Rule: two tools used deeply beat five tools used casually. Pick your stack and commit to it for 90 days.
Side-by-Side: Which Tool Actually Fits Your Workflow
After six months and three shipped products, here is how we honestly position each tool — not by spec sheet, but by what we actually reach for and when.
| tool | best for | fails at | safe for production | price | our verdict |
|---|---|---|---|---|---|
| Claude | Production logic, code review, complex TypeScript, architecture decisions | Speed, image generation, high-volume brainstorming | ✅ Yes — flags uncertainty, catches edge cases | $20/mo (Pro) | Primary tool. The one we trust most when it costs something to be wrong. |
| ChatGPT | First drafts, boilerplate, brainstorming, general tasks, Custom GPTs | Flagging its own errors — confident whether right or wrong | ⚠️ With review — do not ship without a Claude pass on critical code | $20/mo (Plus) | Fastest to a working first draft. Always needs a review layer. |
| v0.dev | UI-quality React components, design-first screens, polished frontend output | Full app scaffolding, complex state, business logic | ⚠️ Starting point, not finished product — but the starting point is good | Free tier / $20/mo paid | Better design output than Lovable. Reach for it when visual quality matters. |
| Lovable | Full-app scaffold for solo full-stack projects, fast functional structure | Visual polish, complex domain logic | ⚠️ Functional scaffold — needs review and iteration | Free tier / ~$25/mo paid | Krishna's pick for ScrapCity solo full-stack. Scaffolds full apps; v0 builds better components. |
| DeepSeek | Stubborn bugs when other tools fail, local deployment for sensitive client code | General writing, non-technical tasks, anything outside pure code | ✅ With review — strong on specific code reasoning tasks | Free (open-source, runs locally) | The last resort that actually works. Stayed in stack for local privacy use after Claude arrived. |
| Gemini | Google Workspace document processing, large context window tasks | Standalone reasoning, general coding, anything outside Google ecosystem | ⚠️ Not our first or second choice for code | $20/mo (Advanced) | One specific use case. Not worth it unless you live in Google Workspace. |
Final Thoughts
Six months, three shipped products, two developers: that is the data set this article is built on. The honest summary: Claude for production review, ChatGPT for speed and first drafts, v0.dev when UI quality matters, Lovable when Krishna needs a full-app scaffold solo. DeepSeek stays in the stack for local-deployment scenarios and as the fallback that actually works when everything else fails on a stubborn bug.
Two things that surprised us: how quickly v0.dev became the go-to for frontend component work once we stopped conflating it with Lovable, and how much of DeepSeek's value showed up specifically in the debugging sessions where the more famous tools confidently gave wrong answers. The thing that did not surprise us: Claude flagging the same webhook concurrency issue that had taken down the BanquetFirst server. Once you have had that experience, you stop questioning whether the extra $20/month is worth it.
Last updated: May 2026. Pricing verified at time of update.
Models tested: Claude Sonnet 4.6, ChatGPT GPT-4o, DeepSeek R2 (local), Gemini 2.5 Pro, v0.dev (current build), Lovable (current build).
All project references are real EdgeNRoots production systems.
Sources & Research
Anthropic Claude — Official Pricing
https://www.anthropic.com/pricing
OpenAI ChatGPT Plans
https://openai.com/chatgpt/pricing/
Lovable — AI Web App Builder
https://lovable.dev
DeepSeek — Open Source Models
https://github.com/deepseek-ai/DeepSeek-V3
formula-interpreter — npm package by Krishna Murti Dubey
https://www.npmjs.com/package/formula-interpreter






