Reddit can feel chaotic — confessions, hot takes, overreactions, plot twists, existential spirals, relationship implosions, shower thoughts, tech grievances. A Reddit post like “TIFU by accidentally ruining my marriage” is practically a whole micro-genre of storytelling.
But what if it isn’t chaos at all? What if most Reddit titles follow a small set of archetypes — an underlying grammar that shapes which ones hit the front page, or blow up across the internet?
This weekend, I found myself asking the same question and went: "Hey, I have access to both the Reddit API and the Bright Data SERP API. I can quantify this!" So I sourced ~1,000 Reddit posts, enriched them, rebuilt missing metadata, ran a series of linguistic and embedding-based analyses, and ultimately mapped out eight semantic archetypes behind how viral Reddit titles work (okay fine — seven interesting ones and one boring outlier 😅).
What I found was…actually shockingly systematic and predictable.
Across platforms, Reddit posts split into four distinct virality ecosystems — each with its own emotional signature, rhythm, and storytelling DNA. And beneath those ecosystems, the eight semantic “archetypes” I talked about.
I’m going to present my findings straight up — then get into the methodology at the end. It would be a very boring read otherwise 😅 If you’d like to skip ahead or read at your own pace, here’s the Table of Contents.
Table of Contents
- The Types of Virality
- The Engagement Paradox — Massive Reach vs Deep Engagement
- The Goldilocks Zone — Curiosity vs Complexity
- The Eight Archetypes of Viral Reddit Content
- What Makes Each Archetype Distinct
- What the Archetypes Reveal
- What This Means (and What It Doesn’t)
- The Archetypes That Win
- Methodology — How the Dataset Was Built (and Why It Had to Be Done This Way)
The Types of Virality
Before we dive into the archetypes themselves, we need to understand that not all virality is the same. A post that dominates Reddit doesn’t necessarily travel beyond it — and a post that ranks on Google might have been quietly ignored by Reddit’s own users.
I identified four distinct virality patterns by measuring two independent dimensions: internal performance (Reddit upvotes normalized per subreddit via z-score, with a subreddit-size fallback for small samples) and external ranking (Google SERP position). Here’s what I found:
Double-Viral: The Unicorns
27 posts | Average: 17,941 upvotes
These are the titles that won everywhere. High Reddit engagement + top 10 Google ranking. They didn’t just succeed within their subreddit — they became searchable internet artifacts.
The signature: Longest titles (12.2 words vs. ~10 for others), highest curiosity scores (0.85), most first-person narratives (41%). These posts are structurally complex but emotionally immediate. They create urgency and longevity.
What makes them different: They balance narrative momentum with search intent. The title works as both a Reddit hook and a Google query. Example: “TIFU by accidentally letting my crush know I…” — instantly shareable, eternally searchable.
Google-Viral: The Evergreen Archive
442 posts | Average: 520 upvotes
Strong external ranking but middling Reddit performance. These posts didn’t dominate their subreddit, but Google decided they were valuable enough to surface for years afterward.
The signature: Moderate emotional density (0.12), low sentiment polarity (near-neutral), high ambiguity (1.79). They’re often questions rather than confessions — informational, not emotional.
What makes them different: They optimize for reference value and not urgency. Posts like “What’s the most interesting fact you’ve ever learned?” generate evergreen comment threads that people link to, bookmark, and revisit. They have staying power but not explosive energy.
Reddit-Viral: The Platform Darlings
32 posts | Average: 10,291 upvotes
These posts crushed it on Reddit but didn’t travel. High internal virality, zero external ranking. They succeeded because of Reddit’s community dynamics, not in spite of them.
The signature: Highest sentiment positivity (0.113 compound score), lowest curiosity scores (0.53), most emotional density (0.138). These are feel-good posts, in-jokes, or community-specific content that doesn’t translate externally.
What makes them different: They’re native to the platform. They use Reddit’s language, reference Reddit’s culture, and reward Reddit literacy. They don’t need to be searchable because they’re not trying to leave.
Not-Viral: The Baseline
483 posts | Average: 686 upvotes
Low on both dimensions. These posts exist, but they didn’t catch fire internally or externally. They’re the control group — the posts that didn’t find their audience.
The signature: Similar metrics to Google-viral posts, but slightly lower across the board, just low enough to not land on page 1 of Google search. The difference between “not viral” and “Google-viral” is often just which question got asked or when it was posted.
What they reveal: Virality isn’t random, but it’s not deterministic either. These posts had reasonable structure, but they lacked the spark — whether that’s timing, phrasing, or sheer luck.
The Engagement Paradox
Here’s a twist. The virality bucket with the most upvotes — the unicorn — also has the lowest engagement ratios.
- Double-viral: 0.22 comments per upvote
- Google-viral: 2.28 comments per upvote
- Reddit-viral: 0.30 comments per upvote
- Not-viral: 2.27 comments per upvote
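For reference, those ratios are just comments divided by upvotes, averaged per bucket. A quick sketch (field names assumed to match the enriched dataset described later):

```javascript
// Average engagement ratio (comments per upvote) for each virality bucket.
// Field names (viralType, comments, upvotes) are assumed from the dataset.
function engagementByBucket(posts) {
  const buckets = {};
  for (const p of posts) {
    const b = (buckets[p.viralType] ||= { sum: 0, n: 0 });
    b.sum += p.comments / Math.max(1, p.upvotes);
    b.n++;
  }
  return Object.fromEntries(
    Object.entries(buckets).map(([k, v]) => [k, v.sum / v.n])
  );
}
```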
I think I can explain this: virality creates breadth, not depth. When a post goes viral, it gains the eyeballs of thousands of people who might not even be part of the community — they stumbled in from r/all or Google search. These people mostly upvote without commenting. When a post stays niche, the people who do engage are deeply invested — they’re there to argue, share stories, or ask follow-up questions.
The implication is that if you want reach, optimize for upvotes. If you want community, optimize for conversation. You rarely get both…
…but you do get it, occasionally! Only 27 of the 984 posts I analyzed reach this truly rare tier: the Goldilocks Zone, named after the fairytale of Goldilocks and the Three Bears (Goldilocks rejects porridge that’s too hot and too cold, until finding the bowl that’s “just right”).
The Goldilocks Zone: Where Curiosity Meets Complexity
To understand what separates double-viral posts from everything else, I plotted every post on two dimensions:
Curiosity Score (x-axis) — A composite measure of structural hooks: superlative questions, contrarian framing, vague outcomes, implied stories, and open invitations. Higher = more “you have to click this.”
Complexity Score (y-axis) — A normalized z-score combining three rhythmic and semantic signals:
- Average syllables per word (linguistic density)
- Rhythmic variance (cadence variation across the title)
- Ambiguity score (how much is left unsaid)
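Concretely, the composite works something like this sketch: each signal is z-normalized against the corpus, then the z-scores are averaged. Signal and field names here are illustrative, not the exact pipeline:

```javascript
// z-normalize a value against corpus mean/std (guard against zero std)
function zScore(value, mean, std) {
  return std > 0 ? (value - mean) / std : 0;
}

// Complexity Score sketch: average of three z-normalized signals.
// Signal names (avgSyllables, rhythmVariance, ambiguity) are illustrative.
function complexityScore(post, corpusStats) {
  const signals = ['avgSyllables', 'rhythmVariance', 'ambiguity'];
  const zs = signals.map(key =>
    zScore(post[key], corpusStats[key].mean, corpusStats[key].std)
  );
  return zs.reduce((a, b) => a + b, 0) / zs.length;
}
```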
When you plot the four virality groups on this grid, double-viral posts cluster in the upper-right quadrant: a rare mix of high curiosity and high complexity that manages to be viral everywhere.
These posts are not merely intriguing — they’re also linguistically rich. And they’re not just complex — they’re also structurally compelling.
What this reveals:
- Curiosity alone isn’t enough. Google-viral posts score 0.69 on curiosity, but their complexity is below average (z-score ≈ 0). They ask interesting questions, but they don’t create narrative tension.
- Complexity alone isn’t enough. Reddit-viral posts have moderate complexity (z-score ≈ -0.1), but their curiosity scores are the lowest (0.53). They’re emotionally dense (112% of mean) and highly positive (367% of mean sentiment), but structurally predictable. They win on feeling, not intrigue.
- Double-viral posts sit at the intersection: curiosity score of 0.85 (highest) + complexity z-score of 0.35 (highest). They’re the only group that consistently scores above the median on both axes. They also run emotionally neutral-to-negative (-7.5% sentiment), creating tension rather than warmth.
Here’s the critical takeaway: **Curiosity is necessary, but not sufficient** for double-virality.
Curiosity gets the click. Complexity rewards the read. You need both to create a post that travels — one that’s shareworthy and searchworthy, urgent and evergreen, yet not pushing emotional buttons quite as much.
If that sounds difficult, it is; the Goldilocks Zone isn’t about being only “good”. It’s about being great on two dimensions simultaneously — a feat that only ~2.7% of posts achieve.
The Eight Archetypes of Viral Reddit Content
When I ran K-Means clustering (K=8) on 1,024-dimensional embeddings of nearly 1,000 Reddit titles, eight distinct archetypes — or clusters — emerged from the semantic space. After interpretation and analysis (check Methodology for the deets), here they are with their average upvotes + engagement ratios.
So these aren’t arbitrary categories I invented; they’re natural groupings that the algorithm discovered based on how titles are actually written.
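For intuition, the clustering step itself is standard K-Means: assign each embedding to its nearest centroid, recompute each centroid as the mean of its members, repeat. A toy 2-D sketch (the real run used 1,024-dimensional embeddings and K=8):

```javascript
// Minimal K-Means sketch: assignment + centroid update, naive init.
function kmeans(points, k, iters = 20) {
  let centroids = points.slice(0, k).map(p => [...p]); // first k points as seeds
  let labels = new Array(points.length).fill(0);
  for (let it = 0; it < iters; it++) {
    // Assignment step: nearest centroid by squared Euclidean distance
    labels = points.map(p => {
      let best = 0, bestDist = Infinity;
      centroids.forEach((c, i) => {
        const d = p.reduce((s, v, j) => s + (v - c[j]) ** 2, 0);
        if (d < bestDist) { bestDist = d; best = i; }
      });
      return best;
    });
    // Update step: each centroid becomes the mean of its assigned points
    centroids = centroids.map((c, i) => {
      const members = points.filter((_, idx) => labels[idx] === i);
      if (members.length === 0) return c; // keep empty clusters in place
      return c.map((_, j) =>
        members.reduce((s, p) => s + p[j], 0) / members.length
      );
    });
  }
  return { labels, centroids };
}
```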
Let’s look at these clusters/archetypes in more detail.
Cluster 0: The Existential Deep-Dive
112 posts | 0 double-viral
"Why do people do what they do?”
“Why do we want to be special?”
“Help me understand the ‘why’ behind having conversations/talking to people"
Introspective, philosophical, searching. These posts ask big questions about human nature, purpose, and meaning. They’re not looking for facts — they’re looking for interpretations.
The tone is contemplative, sometimes anxious. There’s an implicit vulnerability: “I don’t understand people — help me decode them.”
Why it doesn’t go viral: Too abstract. These posts perform well on Google (53 Google-viral) because they match search intent for philosophical queries, but they don’t create the visceral emotional hooks that drive Reddit upvotes or social shares.
Emotional profile: Moderate curiosity (0.625), moderate emotional density (0.113), positive sentiment (0.141). These posts are searching, not confessing. They lack the narrative structure and moral tension that makes content shareable. They’re questions without stakes.
Signature n-grams: “why do people”, “what is”, “how do you”, “those of you who”, “help me understand”
Cluster 1: The Emotional Excavation
149 posts | 5 double-viral
“What’s the worst mistake you’ve ever made?”
“What mistake haunts you to this day?”
This cluster trades in heavy emotional recall — the worst, the most painful, the most embarrassing. These aren’t exactly lighthearted questions. They’re invitations to dredge up trauma, regret, or humiliation and display it publicly.
The posts use superlatives as emotional amplifiers: worst, biggest, most humiliating. They create a confessional gravity that pulls stories out of people who normally wouldn’t share.
Why it goes viral: Collective catharsis. When someone admits their darkest moment, it gives everyone else permission to do the same. The comment section becomes a support group.
Emotional profile: Second highest emotional density of all clusters (0.164), but also the most negative sentiment (-0.229). So they’re pain-driven. High curiosity score (1.034, second-highest) because the superlatives (“worst”, “biggest”) create structural hooks even without narrative complexity.
Signature n-grams: “worst mistake”, “most embarrassing”, “biggest regret”, “haunts you”, “never told”, “still regret”
Cluster 2: The Curiosity Engine
132 posts | 2 double-viral
“What’s the most interesting fact you’ve ever learned?”
“What’s the best plot twist you’ve ever seen — and what made it work so well?”
High-energy, low-stakes exploration. These posts ask for novelty, humor, surprise, or fascination — not trauma. They’re designed for storytelling, not therapy.
The tone is playful, the framing is open-ended, and the emotional valence is positive. This is Reddit at its most curious, not its most confessional.
Why it goes viral (externally): These posts generate evergreen content. A thread about surprising historical facts or mind-blowing coincidences gets linked, saved, and revisited. Google loves them because they’re informational and low-drama.
Emotional profile: High curiosity (0.856, third-highest), moderate emotional density (0.136), and the most positive sentiment of any cluster (0.300). These are feel-good posts built for discovery, not tension. They perform well externally (62 Google-viral) but rarely achieve double-virality because they lack moral stakes or narrative urgency.
Signature n-grams: “most interesting”, “most surprising”, “funniest moment”, “plot twist”, “mind-blowing”, “random fact”
Cluster 3: The Accidental Escalation
177 posts | 8 double-viral | Second-highest performer
“Did I overreact or am I right?”
“Things escalated way faster than I expected”
These posts start small and detonate. Someone misreads a situation, reacts emotionally, and suddenly a minor disagreement becomes a friendship-ending event. Often, the narrator is genuinely unsure if they’re the villain or the victim — which is why a lot of r/AITA posts end up here, too.
The core tension is moral ambiguity. The post isn’t just telling a story — it’s asking for a moral verdict. And that invitation to judge creates engagement by design.
Why it goes viral: People love adjudicating other people’s lives. The “Did I overreact?” framing turns readers into jurors, and every comment thread becomes a courtroom drama.
Emotional profile: High emotional density (0.181, highest of all clusters), with a negative sentiment (-0.018). These are charged posts — anger, guilt, confusion, defensiveness. The emotional intensity is what creates the urgency.
Signature n-grams: “Did I overreact?”, “Was I wrong?”, “accidentally”, “ended up”, “never expected”, “justified”
Cluster 4: The Saga Continues
43 posts | 1 double-viral
“FINAL UPDATE : I (26M) found out my fiance (27F) possibly…”
“Update: I think my best friend hates me and I don’t know why”
These are story updates — the second (or third, or final) chapter of an ongoing narrative. They’re not standalone posts. They require context from earlier threads (link almost always provided; often closely related to Cluster 5).
The tone is a mix of closure, unresolved tension, and “I can’t believe this is still happening.” They act like season finales for interpersonal drama.
Why they underperform: They’re sequels without the original. If you didn’t read Part 1, you’re lost. That limits their external reach, even if Reddit readers who followed the saga are deeply invested.
Emotional profile: Extremely low curiosity (0.256, lowest of all clusters), moderate emotional density (0.058), somewhat-positive sentiment (0.054). Updates don’t need to hook you — they’re for people already following the story. But that exclusivity prevents them from traveling beyond the original audience.
Signature n-grams: “final update”, “update”, “things changed”, “fiance”, “advice”, “how should I”
Cluster 5: The Confession Booth
257 posts | 9 double-viral | The highest-performing archetype
“TIFU by opening Reddit in front of my girlfriend and exposing a side of me I’m not proud of”
“AITA for ignoring my friend when she starts acting like a …”
Now this is Reddit’s beating heart, as far as viral content goes, anyway. First-person confessions about spectacular failures, moral gray areas, and the kind of chaos that makes you think “thank god that wasn’t me.”
The structure is always the same: I did something. It spiraled. Here’s what happened. The tone oscillates between self-deprecation and dark humor (with a dash of bias confirmation), and the narrative arc is built for screenshots. TIFU (Today I F***ed Up) and AITA (Am I The A**hole) posts dominate here — formats that turn personal disasters into communal entertainment.
Why it goes viral: Confessions create parasocial intimacy at scale. You’re getting “the tea”, or watching someone unravel in real-time, and either way it’s both…cringe, and cathartic. The emotional honesty is the hook; the narrative momentum is what makes it shareable.
The paradox: This cluster scores below average on curiosity (0.323 vs. mean ~0.68), yet produces more double-viral posts than any other archetype. Why? Because it doesn’t need curiosity-driven hooks — it has narrative structure. The confession format creates urgency through emotional stakes and moral tension, not through clever phrasing. The story sells itself.
Emotional profile: Low emotional density (0.058), near-neutral sentiment (-0.017). Needless to say, these aren’t feel-good posts. The tension and the ambiguity are what make them work.
Signature n-grams: “TIFU by”, “AITA for”, “I was wrong for”, “never told anyone”, “turns out”
Cluster 6: The Opinion Battlefield
81 posts | 2 double-viral
“What’s the most overrated movie everyone loves but you can’t stand?”
“Which DC character is the most misunderstood?”
These posts are the best debate triggers. They ask you to rank, judge, or defend cultural opinions — movies, shows, music, characters. Everyone can join in. The question itself is a provocation.
The magic word is overrated (or underrated, or misunderstood). These terms invite controversy by design. There’s no objectively correct answer, which means every response is an argument waiting to happen.
Why it goes viral: Conflict drives engagement. People will scroll past a neutral question, but they’ll stop to defend their favorite movie or roast someone else’s taste.
The curiosity paradox: This cluster has the highest curiosity score of all archetypes (1.556, more than double the mean), yet it produces only 2 double-viral posts. Why? Because curiosity without narrative structure only creates debate, not stories. These posts generate comment wars, not shareable moments. They’re argumentative, not cathartic.
Emotional profile: High emotional density (0.150, third-highest), but negative sentiment (-0.101). These are intellectualized disagreements, not emotional confessions. The detachment is part of why they don’t travel — there’s no vulnerability to connect with.
Signature n-grams: “most overrated”, “most underrated”, “most misunderstood”, “completely disagree”, “all time”
Cluster 7: The Tech Support Desk
33 posts | 0 double-viral
“PC won’t boot up after installing new GPU, how do I fix this?”
Straightforward troubleshooting. Hardware issues, OS bugs, app crashes. Highly structured, task-oriented, and emotionally flat.
Why it doesn’t go viral: Because it’s not trying to. These posts exist to solve problems, not to entertain. They have utilitarian value — they help someone having the same error/issue as the OP, that’s why they’re discoverable via Google — but zero narrative momentum.
Signature n-grams: “won t”, “update”, “issue”, “fix”, “help”, “doesn t work”, “bug”, “error”
I already gave you the numbers — but if you were curious what the virality distribution looked like for all 8 clusters, visualized, here you go.
The Linguistic Fingerprint: What Makes Each Archetype Distinct
After profiling each archetype individually, patterns emerge when you look at them side-by-side. This heatmap shows the most frequent emotion tokens + bigrams and trigrams across all eight clusters, normalized from 0 (least frequent) to 3 (most frequent):
What jumps out:
- Cluster 5 (Confessions) has a unique n-gram signature: “tifu by”, “aita for”, “am i” — first-person accountability language that doesn’t appear anywhere else
- Cluster 6 (Opinion Battlefield) owns the superlative-comparison space: “most overrated”, “most underrated”, “all time”
- Clusters 0 and 2 share question-based n-grams (“what is”, “how do”), but Cluster 0 skews philosophical while Cluster 2 skews curious
- Cluster 3 (Escalation) is the only one with “did i”, “was i”, “ended up” — the language of moral uncertainty
The heatmap confirms what the virality data already suggested: the highest-performing clusters use the most distinctive language. Confessions and Escalations don’t sound like anything else in the dataset — and that linguistic uniqueness is part of what makes them shareable.
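Pulling those signature n-grams out of a cluster’s titles is simple frequency counting; a minimal sketch (hypothetical helpers, not the exact pipeline):

```javascript
// Split a title into lowercase word n-grams (illustrative helper)
function ngrams(title, n) {
  const words = title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
  const out = [];
  for (let i = 0; i + n <= words.length; i++) {
    out.push(words.slice(i, i + n).join(' '));
  }
  return out;
}

// Top-k most frequent n-grams across a set of titles
function topNgrams(titles, n, k = 5) {
  const counts = new Map();
  for (const t of titles) {
    for (const g of ngrams(t, n)) counts.set(g, (counts.get(g) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, k);
}
```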
What the Archetypes Reveal
The clusters that go double-viral — mostly Confessions (Cluster 5) and Accidental Escalations (Cluster 3) — share three structural traits:
- First-person narratives with emotional stakes
- Ambiguity or moral tension that invites judgment
- A narrative arc that rewards reading to the end
The clusters that stay Reddit-bound — Existential Questions (Cluster 0) and Tech Support (Cluster 7) — lack narrative drive. They’re either too abstract or too transactional.
And the clusters that go Google-viral but not Reddit-viral — Curiosity Engine (Cluster 2) and Emotional Excavation (Cluster 1) — generate reference content. They’re the posts people bookmark, link to, and revisit. They have longevity, but not urgency.
The pattern: Virality isn’t about being interesting. It’s about being narratively urgent and emotionally unresolved.
The posts that travel farthest are the ones that make you think: “I need to know what happens next” or “I need to tell someone about this.”
We have an honest-to-god repeatable structure here, folks. Can’t call it luck anymore.
What This Means (And What It Doesn’t)
So here’s what we learned:
- Virality is structurally predictable. The posts that go double-viral aren’t random flukes, but sit at the intersection of high curiosity, high complexity, and first-person narrative structure. They create tension without resolution, invite judgment without certainty, and balance emotional stakes with linguistic precision.
- Curiosity alone won’t save you. Cluster 6 (Opinion Battlefield) has the highest curiosity score of any archetype (1.556) but produces only 2 double-viral posts. Why? Because debate isn’t narrative. Controversy creates comments, not shares. You need story, not just intrigue.
- Emotion without structure is platform-bound. Reddit-viral posts score highest on emotional density and sentiment positivity, but without a narrative arc or moral ambiguity they don’t travel. They’re native to Reddit’s culture — and don’t translate externally.
- The Goldilocks Zone is real, but narrow. Only 27 posts out of 984 achieved double-virality. That’s 2.7%. These posts are not merely “good”, they’re also structurally precise. They balance multiple dimensions simultaneously: curiosity + complexity + narrative + emotional stakes. Miss any one of those, and you’re only landing among the masses.
- Google wants evergreen, Reddit wants urgent. Google-viral posts (Cluster 2: Curiosity Engine) generate reference content — threads people bookmark and revisit. Reddit-viral posts generate right now energy — confessions, escalations, emotional catharsis. Double-viral posts do both: they create urgency and longevity.
But this isn’t a how-to guide for manufacturing viral posts. You can’t just plug these features into a title generator and expect gold. Timing, subreddit culture, community mood, and sheer randomness still matter.
But what this does show is that virality isn’t pure chaos. There are patterns. There are structures. And those structures are measurable, repeatable, and predictable enough that an algorithm can find them without being told what to look for.
The machine found eight archetypes. The data showed four virality ecosystems. And the Goldilocks Zone revealed that the posts which travel farthest aren’t the loudest, the funniest, or the most emotional — they’re the ones that balance narrative urgency with structural complexity.
And now you know the recipe!
The Archetypes That Win
If you’re trying to write a double-viral unicorn title that travels, from what I could tell, here’s what actually works:
- ✅ First-person confessional structure (TIFU, AITA formats)
- ✅ Moral ambiguity or unresolved tension (“Did I overreact?”)
- ✅ Emotional stakes without emotional overload (neutral-to-negative sentiment, not overly positive)
- ✅ Complexity that rewards the read (rhythmic variance, ambiguity, clause density)
- ✅ Curiosity hooks that imply narrative payoff (not just debate prompts)
The posts that fail are either:
- Too transactional (Cluster 7: Tech Support)
- Too abstract (Cluster 0: Existential Questions)
- Too argumentative (Cluster 6: Opinion Battlefield)
- Too dependent on prior context (Cluster 4: Story Updates)
That’s all the analysis I have. If you’d like to know more about my methodology, read on. Or you can go back to the Table of Contents by clicking here.
Methodology — How the Dataset Was Built (and Why It Had to Be Done This Way)
What I wanted to do was study the anatomy of what makes a Reddit post truly popular — both internally within Reddit and externally via Google — and, in particular, find posts that went truly ‘viral’, i.e. started on Reddit but succeeded beyond it. So I flipped the entire pipeline on its head.
Step 1 — Start with Google, Not Reddit
I actually began by querying Google itself (using Bright Data’s SERP API, with the gl (country) and hl (language) query params set to US and EN respectively) with a broad site:reddit.com query, plus queries scoped to several of Reddit’s most populous subs:
const SUBREDDITS = [
'AskReddit',
'tifu',
'AmItheAsshole',
'unpopularopinion',
'relationship_advice',
'LifeProTips',
'todayilearned',
'Showerthoughts',
'mildlyinteresting'
];
So, site:reddit.com/r/lifeprotips and so on.
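In code, the query list is just the broad query plus one scoped query per subreddit; something like this (abbreviated subreddit array):

```javascript
// Building the SERP query list from the subreddit array (abbreviated)
const SUBREDDITS = ['AskReddit', 'tifu', 'AmItheAsshole'];
const queries = [
  'site:reddit.com',
  ...SUBREDDITS.map(s => `site:reddit.com/r/${s.toLowerCase()}`)
];
```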
If you’re following along, know that you’ll need to sign up here to get customerId, zone, password, host, and port values for the SERP proxy. I’ll include pricing details too, just in case.
// Core SERP fetching - use this with an array of queries like 'site:reddit.com', 'site:reddit.com/r/aita' etc.
// Requires: const fetch = require('node-fetch'); const { HttpsProxyAgent } = require('https-proxy-agent');
async function fetchSearchResults(searchQuery, page = 1) {
// check cache first
if (CONFIG.cacheEnabled) {
const cache = loadCache();
const cacheKey = getCacheKey(searchQuery, page);
const cached = cache[cacheKey];
if (isCacheValid(cached)) {
console.log(` [CACHE] Using cached results for page ${page}`);
return cached.data;
}
}
// fetch from API
console.log(` [FETCH] Requesting page ${page} from API...`);
try {
const proxyUrl = `http://brd-customer-${CONFIG.customerId}-zone-${CONFIG.zone}:${CONFIG.password}@${CONFIG.proxyHost}:${CONFIG.proxyPort}`;
const agent = new HttpsProxyAgent(proxyUrl, { rejectUnauthorized: false });
const start = (page - 1) * CONFIG.resultsPerPage;
const searchUrl = `https://www.google.com/search?q=${encodeURIComponent(searchQuery)}&start=${start}&brd_json=1`;
const response = await fetch(searchUrl, {
method: 'GET',
agent,
headers: {
'User-Agent': 'Mozilla/5.0',
'Accept': 'application/json, text/html, */*'
}
});
const text = await response.text();
let data = JSON.parse(text);
// save to cache
if (CONFIG.cacheEnabled) {
const cache = loadCache();
const cacheKey = getCacheKey(searchQuery, page);
cache[cacheKey] = {
query: searchQuery,
page,
data,
timestamp: Date.now()
};
saveCache(cache);
console.log(` [FETCH] Page ${page} fetched and cached successfully`);
} else {
console.log(` [FETCH] Page ${page} fetched successfully`);
}
return data;
} catch (err) {
console.error(`Search request failed:`, err.message);
throw err;
}
}
Instead of analyzing what Reddit surfaces, I analyzed what Google believes is valuable enough to index and show people. This produced a mix of high-performers, older long-tail posts, niche threads, and cross-subreddit outliers — exactly the kind of variety needed to study what becomes “evergreen” outside Reddit’s own algorithms.
Now, I had an array of these objects. Notice the ‘page’ and ‘rank’ fields — Google’s SERP metrics.
{
"title": "TIFU by not knowing that the guy I like works on the same ...",
"url": "https://www.reddit.com/r/tifu/comments/acyy54/tifu_by_not_knowing_that_the_guy_i_like_works_on/",
"snippet": "TIFU by not knowing that the guy I like works on the same floor as me. M. Happened yesterday and still feel like I wanna die. Btw I'm 19 and ...",
"subreddit": "tifu",
"postId": "acyy54",
"urlSlug": "tifu_by_not_knowing_that_the_guy_i_like_works_on",
"page": 1,
"rank": 2,
"collectedAt": "2025-11-13T20:31:22.279Z"
}
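The subreddit, postId, and urlSlug fields fall straight out of the URL; a sketch of the parsing (hypothetical helper name):

```javascript
// Extract subreddit, postId, and urlSlug from a Reddit comments URL (sketch)
function parseRedditUrl(url) {
  const m = url.match(/reddit\.com\/r\/([^/]+)\/comments\/([^/]+)\/([^/]+)/);
  if (!m) return null;
  return { subreddit: m[1], postId: m[2], urlSlug: m[3] };
}
```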
A natural question that you might ask here is:
Why didn’t you just scrape the biggest subreddits? What about r/pics, r/aww, r/gaming, r/memes, r/space, and r/funny? Wouldn’t that give you the best sample of popular content?
Short answer: I excluded them because those subreddits don’t use language in a way that can teach us anything about the grammar of virality. If I fetched those posts, they’d be little more than metadata (“Found this at the park <image>”). You can’t study linguistic virality where the picture does the work. I needed places where the title is the content. Spaces where people write rich, expressive, patterned text. That required a corpus where people actually… write.😅
Also, meme subreddits have niche formats that don’t transfer. r/ProgrammerHumor uses camelCase titles like whenYourProductManagerDoesThis() — that’s performative syntax for an in-joke, and it would only pollute the dataset. (Though I think a separate deep dive into extremely niche subreddit lingo is 100% a project I’d be interested in.)
Step 2 — Reddit Enrichment & Virality Reconstruction
Each SERP result was then enriched via Reddit’s API to pull full metadata: upvotes, comments, timestamps, subreddit size, and engagement signals. All I needed was the Reddit Post ID.
{
"title": "TIFU by not knowing that the guy I like works on the same floor as me",
"upvotes": 7544,
"comments": 150,
"score": 7544,
"upvoteRatio": 1,
"awards": 0,
"createdUtc": 1546726429,
"ageDays": 2503.956791446758,
"subredditSize": 18831612,
"gilded": 0,
"archived": true,
"locked": false
}
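Mapping the raw API response into that record is mostly field renaming. A sketch (Reddit’s JSON field names like ups and num_comments are real; the helper name is mine):

```javascript
// Map a raw Reddit API post object into the enriched engagement record (sketch)
function toEngagementRecord(d, subredditSize) {
  const ageDays = (Date.now() / 1000 - d.created_utc) / 86400;
  return {
    title: d.title,
    upvotes: d.ups,
    comments: d.num_comments,
    score: d.score,
    upvoteRatio: d.upvote_ratio,
    createdUtc: d.created_utc,
    ageDays,
    subredditSize,
    gilded: d.gilded,
    archived: d.archived,
    locked: d.locked
  };
}
```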
With that and the SERP data I already had, I computed two comparable measures of virality — internal virality (how well a post performed inside its own subreddit, normalized to account for subreddit size) and external virality (based on Google SERP rank).
Subreddit sizes vary massively, so raw upvotes alone don’t mean much — 1,000 upvotes in a small subreddit is huge, while in a mega-sub it can be normal.
My first attempt used straightforward z-score normalization per subreddit, but small subs (or subs with only a handful of posts in my dataset) produced zero standard deviation, which effectively collapsed the metric. I fixed this by switching to a hybrid system:
- Dataset has ≥10 posts from this subreddit? Use z-score normalization for IV
- Dataset has <10 posts from this subreddit? Use fallback:
IV = upvotes / √(subreddit_size)
// 1. Internal Virality (IV) via z-score or fallback
const MIN_POSTS_FOR_ZSCORE = 10; // matches the "at least 10 posts" rule above
for (const [sub, items] of Object.entries(subMap)) {
const scores = items.map(p => p.upvotes || 0);
const useZ = scores.length >= MIN_POSTS_FOR_ZSCORE;
let mean = 0, std = 0;
if (useZ) {
mean = scores.reduce((a, b) => a + b, 0) / scores.length;
const variance =
scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
std = Math.sqrt(variance);
}
for (const p of items) {
if (!p.engagement) p.engagement = {};
const score = p.upvotes || 0;
// Hybrid IV
const iv_z = useZ ? (std > 0 ? (score - mean) / std : 0) : 0;
const iv_fallback = useZ ? null : computeFallbackInternalVirality(p);
const iv_final = useZ ? iv_z : iv_fallback;
p.engagement.internalVirality_z = iv_z;
p.engagement.internalVirality_fallback = iv_fallback;
p.engagement.internalVirality_final = iv_final;
}
}
External virality was simpler — an inverse SERP rank plus a “top 3” flag for the strongest signals:
// 2. External Virality (EV)
const EV_rank = rank > 0 ? 1 / rank : 0;
const EV_top3 = rank <= 3 ? 1 : 0;
// 3. Double-Viral Indicator (DVI)
const iv_threshold = 2; // strong subreddit outperformance
const DVI = (IV >= iv_threshold && EV_top3 === 1) ? 1 : 0;
With this hybrid system, each post gets stable, comparable virality scores. Using IV + SERP rank, I classified posts into four groups: Reddit-only viral, Google-only viral, double-viral, and not viral, with “double-viral” defined as outperforming its subreddit (IV ≥ ~1.8–2) and showing up on Google’s first results page.
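The bucket assignment then reduces to two booleans; a sketch using the thresholds described above (IV ≥ 2, first-page SERP result):

```javascript
// Classify a post into one of the four virality buckets (sketch)
function classifyViralType(iv, serpPage) {
  const redditViral = iv >= 2;        // strong subreddit outperformance
  const googleViral = serpPage === 1; // ranked on Google's first results page
  if (redditViral && googleViral) return 'double_viral';
  if (googleViral) return 'google_viral';
  if (redditViral) return 'reddit_viral';
  return 'not_viral';
}
```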
Anyway, once done, this allowed each item in my dataset to have a complete behavioral signature.
{
  "title": "TIFU by not knowing that the guy I like works on the same floor as me",
  // everything from step 1, plus:
  "engagement": {
    "fullTitle": "TIFU by not knowing that the guy I like works on the same floor as me",
    "upvotes": 7544,
    "comments": 150,
    "score": 7544,
    "upvoteRatio": 1,
    "awards": 0,
    "createdUtc": 1546726429,
    "ageDays": 2503.956791446758,
    "subredditSize": 18831612,
    "gilded": 0,
    "archived": true,
    "locked": false,
    "internalVirality": 1.63,
    "externalVirality_rank": 0.5,
    "externalVirality_top3": 1,
    "doubleViralIndicator": 0,
    "engagementRatio": 0.02,
    "viralType": "google_viral"
  }
}
With that in place, we can move on to computing and attaching structural, emotional, and semantic metrics for each title.
Step 3 — Extract Emotional, Structural, and Semantic Features
Next, I processed each title for linguistic cues using a custom pipeline built with vader-sentiment, compromise (for NLP), and a handmade syllable analyzer for rhythm.
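For flavor, here's roughly what a handmade syllable counter looks like: a vowel-group heuristic with a silent-e correction, not the exact code from my pipeline.

```javascript
// Rough syllable heuristic (a sketch, not the pipeline's exact code):
// count runs of vowels, then drop a silent trailing "e" ("like" -> 1 syllable).
function countSyllables(word) {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!w) return 0;
  let groups = (w.match(/[aeiouy]+/g) || []).length;
  if (w.endsWith("e") && !w.endsWith("le") && groups > 1) groups--;
  return Math.max(1, groups);
}
```

Heuristics like this get most English words right, which is plenty for building per-title rhythm strings.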
The analysis uses my own curated lexicons — power verbs, temporal markers, social-context lists, emotional triggers, twist markers, intensifiers, superlatives, contrarian markers, vague outcomes, implied-story phrases, and more — plus a 10k most-common-English wordlist (I used this one from Google’s Trillion Word Corpus) for rarity scoring.
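Rarity scoring is the simplest of these checks: any token missing from the top-10k list counts as rare. Here's a sketch, with a tiny stand-in Set in place of the real wordlist:

```javascript
// Sketch of rarity scoring. COMMON_WORDS stands in for the real 10k-word
// list from Google's Trillion Word Corpus; any word outside it is "rare".
const COMMON_WORDS = new Set(["the", "by", "not", "that", "guy", "i",
  "like", "works", "on", "same", "floor", "as", "me"]);

function rarityFeatures(title) {
  const words = title.toLowerCase().match(/[a-z']+/g) || [];
  const rare = words.filter(w => !COMMON_WORDS.has(w));
  return {
    rareWordCount: rare.length,
    rareWordRatio: words.length ? rare.length / words.length : 0,
    hasNumbers: /\d/.test(title),
  };
}
```

With the real 10k list, the counts obviously differ from this toy Set.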
Tack that onto our existing data, and each post looked a little something like this:
{
  "title": "TIFU by not knowing that the guy I like works on the same floor as me",
  // everything from the previous step, plus:
  "features": {
    "length": {
      "wordCount": 14,
      "charCount": 74,
      "avgWordLength": 4.357142857142857
    },
    "punctuation": {
      "hasQuestion": false,
      "hasExclamation": false,
      "hasEllipsis": false,
      "hasParenthetical": false,
      "hasBrackets": false,
      "hasColon": false,
      "hasQuotes": false,
      "hasDash": false
    },
    "structure": {
      "nounCount": 5,
      "verbCount": 3,
      "adjCount": 0,
      "nounRatio": 0.35714285714285715,
      "verbRatio": 0.21428571428571427,
      "adjRatio": 0,
      "functionWordRatio": 0.35714285714285715
    },
    "emotion": {
      "sentimentCompound": 0,
      "sentimentPositive": 0,
      "sentimentNegative": 0,
      "sentimentNeutral": 1,
      "emotionalWordCount": 1,
      "emotionalDensity": 0.07142857142857142,
      "emotionalWords": ["tifu"]
    },
    "rhythm": {
      "syllableCount": 20,
      "avgSyllablesPerWord": 1.4285714285714286,
      "hasAlternation": true,
      "alternationRatio": 0.6153846153846154,
      "rhythmicVariance": 0.6226998490772391,
      "rhythm": "2-1-3-1-1-2-1-1-1-1-1-2-1-2",
      "clauseCount": 2,
      "clauseSyllables": [14, 6],
      "avgSyllablesPerClause": 10
    },
    "meta": {
      "hasUpdateTag": false,
      "hasOCTag": false,
      "hasSeriousTag": false,
      "hasFinalTag": false,
      "hasLongTag": false,
      "hasNSFWTag": false
    },
    "rarity": {
      "rareWordCount": 4,
      "rareWordRatio": 0.2857142857142857,
      "hasNumbers": false,
      "hasProperNouns": false
    },
    "narrative": {
      "hasFirstPerson": true,
      "hasSecondPerson": false,
      "hasThirdPerson": false,
      "perspective": "first",
      "tense": "mixed",
      "hasTwistMarker": false,
      "twistWords": [],
      "arcComplexity": "compound"
    },
    "ambiguity": {
      "hasAmbiguousSubject": true,
      "hasVaguePronouns": true,
      "ambiguityScore": 2.5,
      "questionType": "not-question",
      "hasEllipsis": false,
      "hasIncompleteInfo": true
    },
    "semantic": {
      "containsPowerVerb": false,
      "powerVerbs": [],
      "containsTemporalMarker": true,
      "temporalMarkers": ["tifu"],
      "containsSocialContext": false,
      "socialMarkers": []
    },
    "structural": {
      "hasComparison": false,
      "hasEscalation": false,
      "hasMoralJudgment": false,
      "hasUtility": false,
      "hasDuration": false,
      "structuralViralityScore": 0
    },
    "curiosity": {
      "isSuperlativeQuestion": false,
      "isContrarian": false,
      "hasVagueOutcome": false,
      "hasImpliedStory": true,
      "isMeta": false,
      "isOpenInvitation": false,
      "curiosityScore": 1
    }
  }
}
Each post ends up with a compact, multi-dimensional feature fingerprint describing its tone, structure, semantics, and curiosity profile.
Now we can finally get to the real work mentioned earlier — finding every Reddit post archetype and sorting the posts into virality buckets.
Step 4 — Convert Titles into Embeddings
First: archetypes. To capture semantic meaning beyond keywords, I embedded every title using the local bge-m3 model (via Transformers.js and the Xenova/bge-m3 ONNX weights) through the Hugging Face feature-extraction pipeline.
Each title was passed through the model with CLS pooling and L2 normalization, producing a 1024-dimensional vector (default bge-m3 settings) that encodes its meaning, tone, and stylistic shape.
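In Transformers.js this is only a few lines. The pipeline call below is a sketch of that setup, and the l2Normalize helper spells out what the normalize: true flag does:

```javascript
// Embedding sketch (Transformers.js). The pipeline call is shown as a comment
// because it downloads model weights on first run:
//
//   const { pipeline } = await import("@xenova/transformers");
//   const embed = await pipeline("feature-extraction", "Xenova/bge-m3");
//   const out = await embed(title, { pooling: "cls", normalize: true });
//   const vector = Array.from(out.data); // 1024 unit-length floats
//
// The L2 normalization that normalize: true performs, written out explicitly:
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  return norm > 0 ? vec.map(x => x / norm) : vec.slice();
}
```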
We tack that on to the JSON from the previous step.
{
  "title": "TIFU by not knowing that the guy I like works on the same floor as me",
  // everything from the previous step, plus:
  "embedding": [
    -0.04333054646849632,
    -0.013095727190375328,
    -0.02149086631834507,
    0.028069401159882545,
    // 1020 more
  ]
}
These embeddings balloon our data.json to roughly 40 MB, and they form the backbone for downstream clustering and similarity analysis.
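One nice property of those L2-normalized vectors: cosine similarity collapses to a plain dot product, which keeps similarity analysis cheap. A sketch:

```javascript
// For unit-length embeddings, cosine similarity is just the dot product,
// so comparing two titles is a single linear pass, no norms needed.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
// e.g. similarity between two title embeddings: dot(embA, embB), in [-1, 1].
```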
Step 5 — Cluster to Reveal the Hidden Archetypes
Once every title was converted into a 1024-dimensional bge-m3 embedding, I ran K-Means (K = 8) using the ml-kmeans library with k-means++ initialization.
This clustered posts purely by semantic distance in embedding space, producing eight groups of titles that are internally similar in tone, structure, and meaning.
As part of this step, I also generated supporting signals (emotional tokens + top n-grams) that help describe each cluster later.
The result is eight data-driven “archetypes” of Reddit titles — real groupings that emerge from the geometry of the embedding space. Each item in our data.json gets a new “cluster” field, with a value from 0 to 7.
{
  "title": "TIFU by not knowing that the guy I like works on the same floor as me",
  // everything from the previous step, plus:
  "cluster": 5
}
At this point we know (or can easily calculate):
- That every post has a `cluster: <id>` assignment
- How many posts live in each cluster
- Which posts lie closest to each cluster centroid (representative examples)
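Finding those representative examples is a nearest-centroid lookup. Here's a sketch (names are mine), using the same Euclidean distance that K-Means minimizes:

```javascript
// Sketch: rank posts by Euclidean distance to a cluster centroid and keep
// the closest few as that cluster's representative titles.
function euclidean(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return Math.sqrt(s);
}

function representatives(posts, centroid, n = 3) {
  return [...posts].sort((p, q) =>
    euclidean(p.embedding, centroid) - euclidean(q.embedding, centroid)
  ).slice(0, n);
}
```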
What we don’t know yet is what each cluster represents.
Step 6 — Interpret Clusters & Analyze Across 4 Virality Outcomes
Finally, I analyzed how each group of posts behaves across four virality outcomes (double-viral, Google-only, Reddit-only, not viral). For each outcome, I computed statistical profiles across all extracted features and calculated effect-size differences to see which traits distinguish viral from non-viral posts.
I then compared clusters against these outcomes by examining their top emotional words, dominant bigrams/trigrams, and representative titles.
To convert the raw clusters into human-readable archetypes, each cluster’s evidence bundle — centroid examples, n-grams, emotional markers, and virality distribution — was interpreted by an LLM.
Limitations of This Study
- Sample Size + Subreddit Bias: ~1,000 posts drawn primarily from text-heavy, storytelling-focused communities is fine for analyzing the grammar/linguistics of virality, but it is in no way representative of all of Reddit, where image-heavy subs, meme communities, and political/news subreddits dominate.
- Algorithms Changed: My collected posts span multiple years (age range: 0 to 2,500+ days), during which both Reddit’s ranking algorithm and Google’s SERP algorithm probably changed in many ways.
- My “Double-Viral” Threshold Is Empirically Derived: I defined double-viral as Internal Virality ≥ 1.8 plus SERP rank ≤ 10. This threshold was arrived at through trial and error until results “looked right” — it produced a clean split between clearly-viral and clearly-not posts. This was exploratory analysis for a weekend project, not professional predictive modeling.
- BGE-M3 Embeddings: The BAAI/bge-m3 model is trained on general text corpora, not specifically on social-media viral content.
So this project is essentially a descriptive map — but maps are useful even when they don’t explain every road.
And that’s everything!
Hi again 👋 This was the second in a series of data-driven deep dives I’m doing this month — forensic teardowns of things that are interesting, or things that shouldn’t work but do. If you want to see what else I find buried in data, follow along.