If you can believe it, this post was sparked not by some startup blog or tech conference talk, but by a culinary book 😅. Make the Bread, Buy the Butter by Jennifer Reese. In it, she evaluates “What can I actually make myself, and when is it smarter to just buy the thing?”, based on taste, cost, and hassle.
It’s an opinionated book, sure, but she concludes that you should absolutely make some things from scratch (bread, marshmallows, hot dog buns, guacamole, yogurt) because they’re worth it — but just buy others like butter and ketchup. Basically, if it’s too much work for too little reward, save yourself the headache.
Building a SaaS has the same trade-offs. When you’re bootstrapping, you’re making architectural decisions at 2 AM with a half-empty coffee cup, weighing “what can I stand up in a weekend” against “what’s going to scale when we hit 10,000 users.”
Some of those decisions will age beautifully; others will come back for revenge. You’ll never bat 1.000, but you learn which “recipes” are worth your time and which are better left to someone else’s expertise.
To that end, I wanted to talk about the five big build vs. buy decisions we made while building our multi-tenant SaaS over the past 2.5 years — what we chose, how it played out at scale (27,000+ monthly active users generating ~120K page views a month), and what we’d do differently. Hopefully, it’ll help you skip a few bad “recipes” of your own.
1. Auth & User Management
What we did: NextAuth (now Auth.js) using database sessions stored in Postgres instead of JWTs. Started with simple admin/user roles, evolved into full RBAC with workspace-level permissions and feature flags.
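For the curious, here's a minimal sketch of that setup, assuming NextAuth v4 with the Prisma adapter and a `role` column on the user model. Our real RBAC and workspace permissions live on top of this, and the names here are simplified:

```ts
// auth.ts — minimal sketch of NextAuth with database sessions (not JWTs).
// Assumes NextAuth v4, the Prisma adapter, and a `role` column on the User model.
import NextAuth, { type NextAuthOptions } from "next-auth";
import GoogleProvider from "next-auth/providers/google";
import { PrismaAdapter } from "@next-auth/prisma-adapter";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export const authOptions: NextAuthOptions = {
  adapter: PrismaAdapter(prisma),
  // Sessions live in Postgres: revocable, queryable, and easy to audit.
  session: { strategy: "database", maxAge: 30 * 24 * 60 * 60 },
  providers: [
    GoogleProvider({
      clientId: process.env.GOOGLE_CLIENT_ID!,
      clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
    }),
  ],
  callbacks: {
    // Expose the role (and whatever else your RBAC needs) on the client-side session.
    session({ session, user }) {
      if (session.user) {
        (session.user as any).role = (user as any).role ?? "user";
      }
      return session;
    },
  },
};

export default NextAuth(authOptions);
```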
War Stories: OAuth with Google broke for an entire weekend because ITP (Safari’s Intelligent Tracking Prevention) was silently killing our callback flow. 🙃 No error messages, no obvious failure mode — users would click “Sign in with Google,” go through the OAuth dance, and land back on our login page like nothing happened. The fix itself was easy enough; the real pain was sussing out that the problem wasn’t our code at all. Honestly, this was my bad. Just not enough testing in Safari.
The hidden maintenance cost: Session cleanup was the real gotcha. You can read about it on this GitHub issue, actually. Back then, at least, using NextAuth meant expired sessions piled up in Postgres like digital tumbleweeds — our sessions table just kept growing until queries started slowing down. We had to build a background cleanup job to purge old sessions.
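The cleanup job itself is tiny. A sketch, assuming the Prisma adapter's default Session model; run it from whatever scheduler you already have (cron, a worker, a scheduled function):

```ts
// cleanup-sessions.ts — purge expired NextAuth sessions so the table stops growing.
// Assumes the Prisma adapter's default Session model with an `expires` column.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function purgeExpiredSessions() {
  const { count } = await prisma.session.deleteMany({
    where: { expires: { lt: new Date() } },
  });
  console.log(`Purged ${count} expired sessions`);
  return count;
}

// Run directly, e.g. from a daily cron job.
purgeExpiredSessions()
  .catch((err) => {
    console.error("Session cleanup failed", err);
    process.exitCode = 1;
  })
  .finally(() => prisma.$disconnect());
```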
What we’d do now: Honestly? I’d still pick NextAuth/Auth.js today. Only one of the issues I mention above was actually on NextAuth, and even then it wasn’t a deal breaker.
Plug-n-play solutions like Clerk sound amazing; they make things so much simpler than setting up NextAuth yourself. But guess what happened when Clerk had that major outage in June, the one that took down auth for hundreds of SaaS apps for nearly an hour? That’s right — our platform kept working. Zero issues. Having that control over our auth layer meant we could keep improving our UI/UX and shipping features that bring in revenue, rather than waiting on a vendor to fix critical auth issues.
The lesson: Auth is security-critical infrastructure. You want a battle-tested library handling OAuth quirks and CSRF protection, but you also want control when things break at 2 AM on Sunday.
2. Email Infrastructure
What we did: Tried pretty much all the cheap email providers first. SendGrid, Mailgun, you name it — the templating systems were clunky and we kept hitting spam filters. Eventually settled on Notify.CX: better deliverability, drag-and-drop template builder that didn’t make me question my career choices, and a dead simple API.
They handled the growing pains of domain warming (gradual volume ramp-up, reputation monitoring, throttling) for us — we hit 90% deliverability practically overnight, and 95%+ within weeks.
What we’d do now: Same choice — Notify saved us ~$8,000/year vs competitors and gave our non-technical team members the ability to optimize emails without code.
Before, every email change needed developer PRs and deployments, creating a graveyard of untested ideas. Now, we could actually run experiments and get creative with our engagement.
The lesson: If you want the full breakdown, I wrote about this extensively here. TLDR: Email infra is a full-time job. Let experts handle it while you focus on features users can see.
3. Competitive Intelligence
What we tried: Puppeteer scripts on a $20/month DigitalOcean droplet. Rotating proxy service I tested early on (quality questionable 😅). Daily scrapes of competitor pricing pages, feature announcements, blog posts — all dumped into Postgres tables.
The maintenance nightmare: Worked just fine for the first few months — then we got caught in the weeds, hard, spending more time maintaining scrapers than analyzing data. Sites change CSS selectors constantly — our scrapers would work for 2–3 weeks, then silently start returning empty arrays. Nothing quite like that sinking feeling when you realize your data is weeks stale.
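If you do roll your own scrapers, at least make the "empty array" failure loud instead of silent. A sketch of the kind of guard we eventually bolted on; the URL and selector here are made up:

```ts
// scrape-pricing.ts — Puppeteer scrape with a guard against silently-empty results.
// The URL and CSS selector are illustrative; the point is to treat zero matches as an error.
import puppeteer from "puppeteer";

async function scrapePricing(url: string, selector: string): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30_000 });

    const prices = await page.$$eval(selector, (els) =>
      els.map((el) => el.textContent?.trim() ?? "")
    );

    // A selector that "works" but matches nothing is how weeks-stale data sneaks in.
    if (prices.length === 0) {
      throw new Error(`Selector "${selector}" matched nothing on ${url}; layout probably changed`);
    }
    return prices;
  } finally {
    await browser.close();
  }
}

scrapePricing("https://competitor.example.com/pricing", ".pricing-card .price")
  .then((prices) => console.log(prices))
  .catch((err) => console.error(err)); // wire this into an alert, not just a log line
```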
The breaking point: IP bans were frequent and unpredictable (no surprises there). Got blocked by a major competitor and had to manually rotate through different proxy providers.
The real breaking point was Cloudflare’s bot detection. Now, full disclosure: as an engineer, I absolutely love this CF feature, but for our use case, I loathed it. That one opt-in meant our scraper success rate dropped from 90% to 10% overnight for practically every competitor we were monitoring. Spent a full week trying to work around it before realizing we were in an arms race with security teams way above our pay grade.
Hey, you gotta know when to hold ’em, when to fold ’em, right?
What’s more, we actually gave a damn about scout’s honor when going about this, but that meant that every time we wanted data, we’d spend hours researching robots.txt compliance, GDPR implications (EU specific), data retention requirements, etc. This legal uncertainty was worse than the technical challenges.
What we did differently: Switched to Bright Data’s fully managed data acquisition service and haven’t looked back since. They handle the entire data pipeline — proxy management, CAPTCHA solving, compliance, data validation, and enrichment — for us, with constant communication throughout.
That meant we didn’t have to bootstrap:
- Actual scraping infra at scale: 99.99% uptime, 150M+ real user IPs across 195 countries. None of it our headache.
- Automated validation & enrichment: Cross-referencing, deduplication, quality monitoring
- Compliance handling: GDPR, CCPA, robots.txt — all managed automatically
- Smart reports & dashboards: Real-time tracking with actionable insights, i.e. more than just raw data dumps (saves us a step in analytics later).
What this unlocked: Just like emails, we had a graveyard of “competitive analysis ideas we never executed” because maintaining scrapers consumed so much engineering bandwidth. Now we could actually flex our strength — finding patterns in data:
- Feature rollout timing: Monitor competitor changelog pages, product hunt launches, and blog announcements to see which tools, features, or monetization options are gaining traction vs. getting quietly deprecated (way more reliable than X/Twitter speculation)
- Content strategy intel: Track which types of content perform well across different blogging platforms — are long-form essays trending up on Substack while Twitter-style micro-posts dominate elsewhere? Helps inform our own creator tools.
- Creator retention patterns: Scrape public creator profiles, follower counts, and posting frequency to identify churn signals before they hit our own platform (another boring-as-all-hell task, but hey, if it saves you from losing your top creators to competitors…)
- Market timing stuff: Minor thing, really, but this also helped us track creator economy funding announcements, newsletter acquisitions, and new platform launches to predict where the blogging/creator space is headed 3–6 months out.
The data quality we got back was clean enough to feed directly into our analytics pipeline and trigger automated Slack alerts when competitors made significant moves. No more manual data cleaning or wondering if a feature launch was real or a scraper bug — and trust me, this uncertainty will drive you insane if you let it.
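The alerting part is genuinely simple once the data is clean. A sketch of the idea, with a hypothetical pricing snapshot shape and a standard Slack incoming webhook (only the webhook call is real Slack API):

```ts
// competitor-alert.ts — turn clean competitor data into a Slack alert on price changes.
// `PricingSnapshot` and its fields are hypothetical; Slack incoming webhooks accept { text } JSON.
type PricingSnapshot = { competitor: string; plan: string; monthlyPrice: number };

export async function alertOnPriceChange(
  previous: PricingSnapshot[],
  current: PricingSnapshot[],
  slackWebhookUrl: string
) {
  for (const now of current) {
    const before = previous.find(
      (p) => p.competitor === now.competitor && p.plan === now.plan
    );
    // Only alert when a known plan actually changed price.
    if (!before || before.monthlyPrice === now.monthlyPrice) continue;

    await fetch(slackWebhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: `💸 ${now.competitor} changed "${now.plan}" pricing: $${before.monthlyPrice} → $${now.monthlyPrice}/mo`,
      }),
    });
  }
}
```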
What we’d do now: Skip the DIY scraping entirely. Start with fully managed data acquisition from day one.
The lesson: If your competitive intelligence pipeline requires the same operational overhead as your production application, something is wrong. Data collection looks deceptively simple until you hit scale, compliance requirements, and anti-bot measures. Trust me when I say you do not have time for this. Let the experts handle this arms race.
4. File Storage & CDN
What we tried: Started with raw S3 uploads, then built a custom Node.js service for image resizing that stored multiple variants on upload. Tried S3 + CloudFront but cache invalidation was a constant headache — you know the drill with CDN edge cases.
Do you want to know how bad this baptism by fire was? 😅 Take a look at our progression of hacks:
- Direct S3 uploads: Worked until we needed responsive images and real-time transformations
- Custom resize service: Memory leaks killed it during traffic spikes (I was very high on sharp.js — I still think it’s great — but alas, it is NOT as memory-efficient as you’d hope)
- S3 + CloudFront: Cache invalidation never worked reliably — stale images for hours (the classic CDN nightmare)
- On-demand resize proxy: Built our own transformation API, which worked until we hit disk limits and realized we’d just reinvented Cloudinary: Worse Edition™ (a sketch of roughly what it looked like is below)
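For the curious, that last hack looked roughly like this. A sketch rather than our actual code, using Express and sharp with illustrative route and parameter names:

```ts
// resize-proxy.ts — roughly the shape of our DIY on-demand image proxy (don't do this).
// Express + sharp, no edge caching, no memory backpressure; that's exactly where it fell over.
import express from "express";
import sharp from "sharp";

const app = express();

app.get("/img", async (req, res) => {
  const src = String(req.query.src ?? "");
  const width = Math.min(Number(req.query.w) || 800, 2000);

  try {
    const upstream = await fetch(src); // fetches the original on every request
    if (!upstream.ok) throw new Error(`Upstream returned ${upstream.status}`);
    const input = Buffer.from(await upstream.arrayBuffer());

    // Every transform holds the full decoded image in memory; traffic spikes hurt.
    const output = await sharp(input).resize({ width }).webp({ quality: 75 }).toBuffer();

    res.type("image/webp").send(output);
  } catch (err) {
    res.status(500).send("transform failed");
  }
});

app.listen(3001, () => console.log("resize proxy on :3001"));
```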
War stories: Pick just about any traffic spike from when something went viral. Our DIY custom resize service just… stopped processing. Worse, this was not a hard crash — it kept responding to health checks — but it was returning 500s for actual transformations. Every image on the platform was broken, and users couldn’t upload anything new. (Nothing quite like watching your monitoring show “everything’s peachy” while support tickets flood in.)
What we did differently: Switched to ImageKit. Global edge network, real-time transformations via URL params, auto WebP/AVIF optimization based on browser support.
What this unlocked: We could instead build —
- Intelligent cropping for avatars and thumbnails: I wrote a tiny cropping utility in Next.js that called ImageKit’s focal point detection, but with a fallback for cases where detection fails — we used Tailwind utilities for consistent aspect ratios.
- Device/network-aware quality switching: Hooked into `navigator.connection` in the browser to decide what `q` (quality) param to send to ImageKit — lower quality on 3G, higher on broadband. Optimizes Core Web Vitals without *too* much work. (Sketch after this list.)
- Lazy-loading + intersection-based prefetch: Used Next.js’ `next/image` with Intersection Observer to prefetch the next 2–3 images when a user scrolls past 70% of the current one.
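The quality-switching bit is the only one that needs real code, and it's short. A sketch, assuming ImageKit's standard URL transformation params (`w`, `q`, `f-auto`) and falling back gracefully where `navigator.connection` isn't supported (Safari, for one):

```ts
// imageSrc.ts — build an ImageKit URL whose quality adapts to the user's connection.
// Assumes a standard ImageKit URL endpoint and its `tr` query param (w = width, q = quality, f-auto = best format).
const IMAGEKIT_BASE = "https://ik.imagekit.io/your_imagekit_id"; // illustrative endpoint

function connectionQuality(): number {
  // navigator.connection isn't supported everywhere (notably Safari), so default to "good".
  const conn = (navigator as any).connection;
  switch (conn?.effectiveType) {
    case "slow-2g":
    case "2g":
      return 35;
    case "3g":
      return 55;
    default:
      return 80;
  }
}

export function imageSrc(path: string, width: number): string {
  const q = connectionQuality();
  // f-auto lets ImageKit serve WebP/AVIF based on what the browser accepts.
  return `${IMAGEKIT_BASE}/${path}?tr=w-${width},q-${q},f-auto`;
}

// Usage: <img src={imageSrc("avatars/jane.jpg", 320)} />
```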
What we’d do now: Skip all the DIY heroics. Start with a proper media management service from day one. With the plumbing taken care of, I could work on things that were *so* much more fun and joyful.
The lesson: Global CDN performance and real-time image transformations have been solved problems for years. Do NOT spend weeks reinventing that wheel poorly.
5. Analytics & Monitoring
Getting reliable data about competitors was only half the battle. The other half was understanding what was happening inside our own platform. Which, finally, brings us to analytics.
What we tried: console.log (of course) → CloudWatch → Winston → attempted ELK (way too resource-intensive for our scale at the time) → DataDog + custom Postgres tables for business metrics.
Here was my daily 2–3 hour debugging ritual:
User: “My post disappeared.”
Me: grep CloudWatch for their UUID, check Postgres directly, correlate frontend/backend logs, and finally, check if they were flagged for spam before asking them for a repro. Which they never could.
DataDog was great, but its costs ramped up way faster than we expected, and Google Analytics 4 didn’t give us what we needed. Between sampled data, no easy way to tie events to our internal user IDs, and a steep learning curve for seemingly basic funnels (‘show me users who clicked signup but didn’t convert’), we found ourselves spending more time wrestling with the tool than acting on insights.
The breaking point: a bug affecting 12% of signup email deliveries took five days to isolate. There was no tracing between signup click → email job queued → actual SMTP send. Not an email provider issue: just three disconnected systems and a lot of guessing. This is why distributed tracing exists, folks.
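If you only take one technical thing from this section: carry a trace context across those three systems. A sketch using only `@opentelemetry/api`, assuming the OpenTelemetry SDK (and its W3C propagator) is initialized elsewhere; the queue and mailer calls are stand-ins:

```ts
// signup-tracing.ts — propagate a trace from the signup handler through an email job.
// Uses only @opentelemetry/api; assumes the OpenTelemetry SDK is set up elsewhere.
import { trace, context, propagation } from "@opentelemetry/api";

const tracer = trace.getTracer("signup-flow");

// 1. In the signup handler: start a span and inject its context into the job payload.
export async function handleSignup(email: string, enqueue: (job: unknown) => Promise<void>) {
  await tracer.startActiveSpan("signup", async (span) => {
    const carrier: Record<string, string> = {};
    propagation.inject(context.active(), carrier); // trace context travels with the job
    await enqueue({ type: "welcome-email", email, traceContext: carrier });
    span.end();
  });
}

// 2. In the email worker: extract that context so the SMTP send links back to the signup.
export async function processEmailJob(job: { email: string; traceContext: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), job.traceContext);
  await context.with(parentCtx, () =>
    tracer.startActiveSpan("send-welcome-email", async (span) => {
      // await mailer.send(job.email)  <- your actual SMTP/provider call goes here
      span.end();
    })
  );
}
```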
What we’d do now: For budget-conscious setups, this stack is a no-brainer —
- Better Stack for observability. Far easier than running Prometheus + Grafana yourself, though that combo still makes sense if you have dedicated SRE/DevOps. Datadog is great, but its cost curve is brutal for startups.
- PostHog for product analytics. Self-hostable, privacy-friendly, great for funnels and feature usage (quick sketch below).
- Plausible for marketing analytics. Lightweight, privacy-first, and avoids GA4’s complexity.
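The thing GA4 made painful, tying events to our own user IDs, is a couple of lines in PostHog. A minimal sketch, assuming `posthog-js` and a public project key; the event names are just examples:

```ts
// analytics.ts — wire PostHog to internal user IDs so funnels use *your* identifiers.
// Assumes posthog-js and a public project API key; event names are illustrative.
import posthog from "posthog-js";

export function initAnalytics() {
  posthog.init(process.env.NEXT_PUBLIC_POSTHOG_KEY!, {
    api_host: "https://us.i.posthog.com", // or your self-hosted instance
  });
}

// Call once the user is authenticated, so every event ties back to our own user ID.
export function identifyUser(internalUserId: string, plan: string) {
  posthog.identify(internalUserId, { plan });
}

// Funnels like "clicked signup but didn't convert" become two events:
export function trackSignupClicked(source: string) {
  posthog.capture("signup_clicked", { source });
}

export function trackSignupCompleted() {
  posthog.capture("signup_completed");
}
```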
The lesson: I don’t intend this to come off as some life-changing zen advice, but… when you’re debugging in production, correlation beats causation every time. 😅 Please, y’all. Get yourself a real analytics package.
The *Real* Bitter Lesson
Your takeaway shouldn’t be “never build anything in-house” — it’s recognizing where your competitive advantage actually lies. For us, that’s in our core product features, not in solving/babysitting infra problems that others have already solved better than us.
If you’re building your own multi-tenant SaaS, here’s what I’d buy vs. build:
- Build: Your core product logic, user-facing features, business-specific workflows
- Buy: Data acquisition, auth libraries (but not hosted auth systems), email infra, media processing, observability.
On the buying side, I think there’s a legitimate argument about how much of your stack you actually want to own. Some orgs will want 100% control over every moving part, and that’s fair. For us, given our budget and time constraints (which were minuscule), we just had to be ruthless about this and get creative with what we did have, so we could flex muscles we had actually developed.
That might mean standing on the shoulders of well-chosen libraries and services, mixing managed and self-hosted solutions, or building a “good enough” internal system with a clear plan to replace it when the product can justify it.
The only ‘trick’ that works in the SaaS space is knowing which decisions are reversible, and which ones will cost you a weekend (or three) when they break at scale. ChatGPT won’t help you here. It’ll only come from experience. That’s the real bitter lesson.