How We Fixed a ReDoS Vulnerability That Turned 15-Second Requests into 5-Minute Timeouts at brand.dev

The Problem: When HTML Parsing Goes Catastrophically Wrong

It all started with our monitoring dashboard at brand.dev lighting up with timeout alerts – requests that normally took 15 seconds were suddenly timing out after 300 seconds. At first, we thought it was a scaling issue. Maybe we were getting hammered with traffic? Nope. CPU usage was pinned at 100% on single cores while processing certain domains. Memory usage was fine. Network I/O was minimal. What the hell was going on?

The symptoms were bizarre and inconsistent. Most domains processed fine – we'd scrape their HTML, extract brand colors, logos, and metadata in under 10 seconds. But occasionally, a request would just... hang. And hang. And hang. Until our 5-minute timeout mercifully killed it. The weird part? It wasn't about the size of the HTML. Some tiny sites would trigger it, while massive enterprise pages processed just fine.

We were confused and slightly pissed (again). Our service is the backbone of personalized onboarding for companies like Yuzu, Boonio, and SendX. When brand.dev is slow, their users' first impressions suffer. We needed answers, fast.

The Investigation: Following the CPU Flame

Night after night, we pored over logs, trying to find patterns. Which domains caused timeouts? Was it something about their HTML structure? We started attaching debuggers and profilers to our Node.js processes, and that's when we saw it: a single regex was consuming 99.8% of CPU time.

// The culprit (simplified for clarity)
const extractMetaTags = (html) => {
	const pattern = /<meta\s+(?:[^>]*?\s+)?property=["']([^"']+)["'][^>]*?content=["']([^"']*?)["'][^>]*?>/gi;
	return html.match(pattern);
};

This looked innocent enough, right? We're just trying to extract Open Graph meta tags. But here's where things got interesting – we added some logging and discovered that certain HTML structures were causing this regex to perform millions of backtracking operations.

The breakthrough came when we found a particularly nasty piece of HTML from a client's site:

<meta
	property="og:description"
	content="We provide enterprise solutions for... [imagine 2000 characters here] ... 
and that's why we're the best!!!!!!!!!!!!!!!!!!!!!!!!!"
/>

Notice those exclamation marks at the end? Combined with malformed HTML (missing closing quotes in some cases), our regex engine was going absolutely ballistic trying to figure out where the content attribute ended. It was trying every possible combination of character matches – a textbook case of catastrophic backtracking.

The "Aha!" Moment: Understanding ReDoS

We'd heard of SQL injection, XSS, CSRF... but ReDoS? Regular Expression Denial of Service? That was new to us. After diving deep into the computer science behind it (and trust me, we went deep), we finally understood what was happening.

Our regex pattern had nested quantifiers – optional groups like (?:[^>]*?\s+)? wrap their own * and + quantifiers. When the HTML didn't match our expected structure, the regex engine would backtrack exponentially. For every additional character in certain malformed HTML structures, the processing time would double. As Russ Cox demonstrated, some patterns that take Perl over 60 seconds to process can be handled in 20 microseconds with a proper implementation.

We did the math: if 20 characters took 1 second, 30 characters would take 1,024 seconds. No wonder we were hitting 300-second timeouts! The regex engine was exploring 2^n possible matching paths. It wasn't infinite – it would eventually finish – but by that point, our servers would have melted into slag.
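
To see that doubling for yourself, here's a minimal repro sketch – it uses the classic (a+)+ evil pattern rather than our production regex, purely for illustration:

// Illustrative only: time a backtracking-prone pattern against inputs that can never match.
// On a backtracking engine, each extra character roughly doubles the runtime.
const evil = /^(a+)+$/;

for (const n of [20, 22, 24, 26]) {
	const input = 'a'.repeat(n) + '!'; // the trailing '!' forces a failed match
	const start = Date.now();
	evil.test(input);
	console.log(`length ${n}: ${Date.now() - start}ms`);
}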

Reading through GitHub issues, we discovered we weren't alone. Cloudflare had a 27-minute global outage from a similar issue. Stack Overflow went down for 34 minutes. Even AngularJS had ReDoS vulnerabilities (CVE-2024-21490). We were in "good" company, but that didn't make us feel better.

The Quick Fix: Timeouts and Circuit Breakers

Our first instinct was damage control. We couldn't leave the service vulnerable while we engineered a proper solution. So we implemented a multi-layered defense, following Microsoft's guidance on timeout mechanisms:

// Emergency timeout wrapper
const regexWithTimeout = async (pattern, text, timeoutMs = 1000) => {
	const controller = new AbortController();
	const timeout = setTimeout(() => controller.abort(), timeoutMs);

	try {
		// Run the regex in a worker thread so a runaway match can actually be aborted
		return await runRegexInWorker(pattern, text, controller.signal);
	} catch (err) {
		if (err.name === 'AbortError') {
			console.error('Regex timeout - possible ReDoS');
			return null;
		}
		throw err;
	} finally {
		clearTimeout(timeout);
	}
};

We also added circuit breakers that would skip regex extraction entirely if a domain repeatedly caused timeouts. Better to return partial data than no data at all.
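
The breaker itself was only a few lines. Here's a simplified sketch – the names, threshold, and cool-down are illustrative, not our production values:

// Per-domain circuit breaker: after repeated regex timeouts, skip regex extraction for a while
const regexTimeouts = new Map();
const TRIP_THRESHOLD = 3;
const COOL_DOWN_MS = 10 * 60 * 1000;

const recordRegexTimeout = (domain) => {
	const entry = regexTimeouts.get(domain) || { count: 0, lastTimeout: 0 };
	entry.count += 1;
	entry.lastTimeout = Date.now();
	regexTimeouts.set(domain, entry);
};

const shouldSkipRegex = (domain) => {
	const entry = regexTimeouts.get(domain);
	if (!entry) return false;
	if (Date.now() - entry.lastTimeout > COOL_DOWN_MS) {
		regexTimeouts.delete(domain); // cool-down elapsed, give the domain another chance
		return false;
	}
	return entry.count >= TRIP_THRESHOLD;
};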

This stopped the bleeding, but it wasn't a solution. We were basically putting a bandaid on a severed artery.

The Better Solution: Google RE2 for Safe Regex

While researching ReDoS defenses, we discovered Google's RE2 – a regex engine that guarantees linear time complexity. Unlike traditional backtracking engines, RE2 uses a finite automaton approach that makes ReDoS attacks impossible. The tradeoff? No support for backreferences or lookahead/lookbehind assertions. But for 99% of our use cases, we didn't need those features anyway.

We tested RE2 with our problematic patterns:

// Using RE2 for guaranteed safety
const RE2 = require('re2');
 
const safeExtractMetaTags = (html) => {
	// RE2 guarantees this won't cause exponential backtracking
	const pattern = new RE2(/<meta\s+(?:[^>]*?\s+)?property=["']([^"']+)["'][^>]*?content=["']([^"']*?)["'][^>]*?>/gi);
	return html.match(pattern);
};
 
// Benchmark results:
// Traditional regex with pathological input: 300,000ms (timeout)
// RE2 with same input: 23ms
// That's a 13,000x improvement!

RE2 was a game-changer for patterns we absolutely needed to keep as regex. According to its documentation, RE2 was designed specifically to handle regular expressions from untrusted users without risk, and Google relies on it across its own services for exactly that reason.
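
The tradeoff is also easy to see in practice: RE2 refuses to compile patterns it can't run in linear time, so unsupported constructs fail loudly at startup instead of silently backtracking in production. A quick sketch:

const RE2 = require('re2');

// Backreferences can't be expressed as a finite automaton, so RE2 rejects the pattern outright
try {
	new RE2(/(foo)\1/);
} catch (err) {
	console.error('RE2 rejected the pattern:', err.message);
}

// Supported syntax compiles and behaves like a normal RegExp
const ogProperty = new RE2(/^og:[a-z_:]+$/i);
console.log(ogProperty.test('og:description')); // true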

The Real Solution: Abandoning Regex for HTML Parsing

But here's the thing we should have known from the start (and that every senior developer will tell you): you can't parse HTML with regex. HTML is a context-free language; regular expressions can only describe regular languages. They're fundamentally mismatched. It's like trying to understand French grammar using only English rules – you might get lucky sometimes, but you're guaranteed to fail spectacularly on edge cases.

We'd been trying to be clever, thinking our HTML extraction needs were "simple enough" for regex. We were wrong. Dead wrong. The famous Stack Overflow answer about not parsing HTML with regex? The one that descends into Lovecraftian horror about Tony the Pony? Yeah, we should have listened.

So we rewrote everything using Cheerio (a jQuery-like server-side DOM implementation for Node.js):

// Before (vulnerable to ReDoS)
const extractOpenGraphRegex = (html) => {
	const pattern = /<meta\s+(?:[^>]*?\s+)?property=["']og:([^"']+)["'][^>]*?content=["']([^"']*?)["'][^>]*?>/gi;
	const matches = [];
	let match;
	while ((match = pattern.exec(html)) !== null) {
		matches.push({ property: match[1], content: match[2] });
	}
	return matches;
};
 
// After (ReDoS-proof)
const cheerio = require('cheerio');

const extractOpenGraphCheerio = (html) => {
	const $ = cheerio.load(html);
	const ogTags = [];
 
	$('meta[property^="og:"]').each((i, elem) => {
		const property = $(elem).attr('property').replace('og:', '');
		const content = $(elem).attr('content') || '';
		ogTags.push({ property, content });
	});
 
	return ogTags;
};

The results were immediate and dramatic:

  • Malformed HTML that took 300+ seconds now processed in 50ms
  • CPU usage dropped from 100% to ~5% during parsing
  • We could handle 10x more concurrent requests
  • Memory usage actually decreased (Cheerio is surprisingly efficient)

Going Deeper: Stream Processing for Large Documents

But we weren't done. Some enterprises have MASSIVE HTML pages (we're talking 5MB+ of markup). Loading all that into Cheerio at once was wasteful. So we implemented streaming HTML parsing using htmlparser2:

const htmlparser2 = require('htmlparser2');
const RE2 = require('re2');

const streamExtractBrandData = (htmlStream) => {
	return new Promise((resolve, reject) => {
		const brandData = {
			colors: new Set(),
			metadata: {},
			logos: [],
		};

		const parser = new htmlparser2.Parser(
			{
				onopentag(name, attributes) {
					if (name === 'meta' && attributes.property?.startsWith('og:')) {
						brandData.metadata[attributes.property] = attributes.content;
					}
					if (name === 'link' && attributes.rel === 'icon') {
						brandData.logos.push(attributes.href);
					}
				},
				ontext(text) {
					// Extract colors from CSS - using RE2 for safety!
					const colorPattern = new RE2(/#[0-9a-fA-F]{6}/g);
					for (const match of text.matchAll(colorPattern)) {
						brandData.colors.add(match[0]);
					}
				},
				onerror(error) {
					reject(error);
				},
				onend() {
					resolve(brandData);
				},
			},
			{ decodeEntities: true }
		);

		// htmlparser2's Parser isn't a Node Writable, so feed it chunk by chunk instead of piping
		htmlStream.on('data', (chunk) => parser.write(chunk.toString()));
		htmlStream.on('end', () => parser.end());
		htmlStream.on('error', reject);
	});
};

This approach meant we never held more than a small buffer of HTML in memory at once. We could process gigantic pages without breaking a sweat. As noted in performance benchmarks, streaming parsers can handle documents of any size with constant memory usage.
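
Wiring it up is simple. Here's a usage sketch, assuming Node 18+ where fetch is global and Readable.fromWeb can wrap the response body:

const { Readable } = require('stream');

const run = async () => {
	const response = await fetch('https://example.com');
	// Wrap the web ReadableStream in a Node stream the parser can consume chunk by chunk
	const htmlStream = Readable.fromWeb(response.body);
	const brandData = await streamExtractBrandData(htmlStream);
	console.log([...brandData.colors], brandData.metadata, brandData.logos);
};

run().catch(console.error);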

Detection and Prevention Tools

During our ReDoS journey, we discovered several invaluable tools that every team should use:

  1. recheck - Online tool for checking regex complexity
  2. regexploit - Finds ReDoS vulnerabilities and generates attack strings
  3. redos-detector - ESLint plugin for automatic detection
  4. regex101 - Visualizes regex execution and backtracking

We now run these tools in our CI/CD pipeline. Any regex that scores as "vulnerable" gets flagged for review.
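
For example, recheck also ships as an npm package, so a simplified version of our CI gate looks something like the sketch below (assuming recheck's async check(source, flags) API; our real pipeline has more plumbing around it):

const { check } = require('recheck');

// Patterns to audit - in CI we pull these from the codebase; hard-coded here for illustration
const patternsToAudit = [
	{ source: '^[a-zA-Z0-9]+$', flags: '' },
	{ source: '^(a+)+$', flags: '' },
];

const audit = async () => {
	for (const { source, flags } of patternsToAudit) {
		const diagnostics = await check(source, flags);
		if (diagnostics.status === 'vulnerable') {
			console.error(`Vulnerable regex found: /${source}/${flags}`);
			process.exitCode = 1;
		}
	}
};

audit();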

The Lessons Learned

Looking back, there were so many red flags we ignored, all documented in OWASP's ReDoS guide:

  1. We used unbounded quantifiers (*, +) in patterns processing untrusted input
  2. We had nested quantifiers that could create exponential backtracking
  3. We were trying to parse HTML with regex (seriously, don't do this)
  4. We had no timeouts on regex operations
  5. We hadn't load-tested with malformed HTML

The most frustrating part? The fix was simpler than the original regex code. Cheerio's API is cleaner, more maintainable, and actually designed for parsing HTML. We'd been suffering from a classic case of "when all you have is a hammer" syndrome.

Our New Regex Security Checklist

Based on our experience and security research, we created this checklist for all regex in our codebase:

// ✅ SAFE: Using RE2 for untrusted input
const re2Pattern = new RE2(/^[a-zA-Z0-9]+$/);
 
// ✅ SAFE: Bounded quantifiers
const boundedPattern = /^.{1,100}$/;
 
// ✅ SAFE: No nested quantifiers
const simplePattern = /^[a-z]+_[0-9]+$/;
 
// ❌ DANGEROUS: Nested quantifiers
const evilPattern = /^(a+)+$/;
 
// ❌ DANGEROUS: Alternation with overlap
const badPattern = /(a|a)*/;
 
// ❌ DANGEROUS: Unbounded repetition on complex groups
const terriblePattern = /^(.*)*$/;

Performance Improvements and Monitoring

With our new HTML parsing pipeline, the improvements were staggering:

  • P99 latency dropped from 45 seconds to <1 second for regex parsing
  • Timeout errors went from ~100/day to zero
  • Customer complaints about API timeouts: also zero

We also added comprehensive monitoring based on Imperva's recommendations:

// Track parsing performance
const parseWithMetrics = async (html, domain) => {
	const startTime = performance.now();
	const startMemory = process.memoryUsage();
 
	try {
		const result = await extractBrandData(html);
 
		const duration = performance.now() - startTime;
		const memoryDelta = process.memoryUsage().heapUsed - startMemory.heapUsed;
 
		metrics.record('parse.duration', duration, { domain });
		metrics.record('parse.memory', memoryDelta, { domain });
 
		if (duration > 5000) {
			logger.warn('Slow parse detected', { domain, duration });
		}
 
		return result;
	} catch (error) {
		metrics.increment('parse.errors', { domain, error: error.name });
		throw error;
	}
};

The Community Saves the Day (Again)

Just like with our Sharp memory issues, the solution came from the community. The OWASP documentation on ReDoS, the academics who've studied catastrophic backtracking, the Cloudflare engineers who shared their post-mortem – all of it helped us understand and fix our vulnerability.

Special shout-out to the maintainer of Cheerio who responded to our questions within hours and pointed us toward the streaming parser approach. And to the security researcher who responsibly disclosed a similar vulnerability in a competitor's API, which prompted us to audit our own code in the first place.

The Takeaways

If you're building a service that processes HTML, learn from our mistakes:

  1. Never parse HTML with regex. Just don't. Use a proper parser like Cheerio, jsdom, or htmlparser2.
  2. If you must use regex on untrusted input, use RE2. It guarantees linear time complexity and makes ReDoS impossible.
  3. If you can't use RE2, implement timeouts. Every time. No exceptions. Microsoft's guide shows how.
  4. Understand the complexity of your regex patterns. Tools like regex101 and regexploit can help visualize backtracking.
  5. Load test with malformed input. Your regex might work fine on valid HTML but explode on broken markup.
  6. Monitor parsing performance in production. Set up alerts for slow operations before they become outages.

The irony isn't lost on us. We built brand.dev to help companies create better first impressions, but our own service was creating terrible impressions when it timed out. Now, with proper HTML parsing and ReDoS protection in place, we can actually deliver on that promise.

Oh, and that regex pattern that was causing all our problems? We framed it and hung it on our office wall as a reminder. Sometimes the best code is the code you delete.

P.S. - If you're curious about our Sharp memory saga mentioned earlier, that's a whole other adventure in Node.js performance optimization. Let's just say that memory fragmentation in multi-threaded image processing is... special.

References and Further Reading