Cloudflare Worker Cron Watchdog: SPOF Fix and Automatic DNS Failover for the Vercel Proxy Architecture
This article is the follow-up to Cloudflare for SaaS + Worker: Optimal IP Routing and Edge Cache Proxy for Vercel. The previous guide covered building a routing hub that uses a double CNAME to direct traffic through a third-party optimized node pool. This article addresses the single point of failure (SPOF) that the third-party CNAME pool introduces, and presents an automated DNS failover solution.
In the previous setup, the bridge CNAME record on the auxiliary domain (cnd.proxy.dev) points to a third-party-maintained optimized node pool address (e.g., opt.public-cf-ip.xyz). This dependency is the only link in the entire chain not under your control. Potential failure scenarios include:
Any of these events causes cnd.proxy.dev to stop resolving, breaking the request chain for every business domain simultaneously.
Running monitoring logic on a separate VPS introduces its own stability concerns and operational overhead. This solution uses Cloudflare Worker Cron Triggers to run periodic health check tasks directly on the CF edge network, fully decoupled from the business infrastructure being monitored.
⚠️ Critical pitfall: Using fetch() inside a Worker to directly request the third-party optimized domain, or calling CF's own DoH endpoint (cloudflare-dns.com), can trigger CF's internal loop detection (hairpin routing interception). This may result in 522 errors or empty responses even when the upstream is healthy, causing false-positive alerts that treat a live upstream as down.
Solution: Use Google DoH (dns.google) as the third-party probe. Resolving the target CNAME's A record via DNS over HTTPS provides an objective, external perspective on upstream health, eliminating hairpin routing interference entirely.
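The health decision described above can be sketched as a small pure function. This is a minimal sketch, assuming the JSON shape returned by `https://dns.google/resolve` (a numeric `Status` plus an optional `Answer` array whose entries carry a numeric `type`, where 1 denotes an A record). The name `isUpstreamResolvable` is hypothetical. It is slightly stricter than a bare length check: a response that contains only an intermediate CNAME and no A record is treated as unhealthy.

```javascript
// DNS record type number used for A records in DoH JSON responses.
const DNS_TYPE_A = 1;

// Hypothetical helper: returns true only when the DoH JSON reports
// NOERROR (Status 0) and the Answer section holds at least one A record.
function isUpstreamResolvable(dnsData) {
  if (!dnsData || dnsData.Status !== 0) return false;
  const answers = Array.isArray(dnsData.Answer) ? dnsData.Answer : [];
  return answers.some((record) => record.type === DNS_TYPE_A);
}
```

Keeping this logic separate from the `fetch()` call makes it possible to exercise the decision with canned responses, without touching the network.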
| Trigger Condition | Current DNS State | Action |
|---|---|---|
| Google DoH resolution fails | CNAME (normal) | Degrade: switch to CF official fallback IP |
| Google DoH resolution succeeds | A record (fallback) | Recover: switch back to optimized CNAME |
| States are consistent | Any | Log heartbeat only, no API call |
Fallback IPs are randomly selected from the CF official Anycast IP pool. Random selection distributes risk, reducing the chance of a secondary failure caused by a single IP being regionally blocked.
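The decision table and the random fallback selection can be expressed as two pure functions, which keeps the switching rules easy to reason about on their own. This is a minimal sketch; the names `decideAction` and `pickFallbackIp`, the action labels, and the injectable `random` parameter are illustrative assumptions.

```javascript
// Hypothetical helper mirroring the decision table:
// returns "degrade", "recover", or "noop".
function decideAction(isHealthy, currentRecord, fallbackIps) {
  // The record counts as "fallback" only if it is an A record
  // pointing at one of the known fallback IPs.
  const inFallback =
    currentRecord.type === "A" && fallbackIps.includes(currentRecord.content);
  if (!isHealthy && !inFallback) return "degrade";
  if (isHealthy && inFallback) return "recover";
  return "noop"; // states are consistent: heartbeat only, no API call
}

// Hypothetical helper: picks one fallback IP uniformly at random.
// The random source is injectable so the choice can be made deterministic.
function pickFallbackIp(fallbackIps, random = Math.random) {
  return fallbackIps[Math.floor(random() * fallbackIps.length)];
}
```

With this split, only the "degrade" and "recover" branches need to call the Cloudflare API. The "noop" branch covers both steady states.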
API Token: Create a custom Token in the CF dashboard. Grant only DNS: Edit permission for the target domain — apply the principle of least privilege.
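One way to limit the token's exposure further is to store it as a Worker secret and read it from the `env` argument that Workers pass to handlers, rather than embedding it in source code. The sketch below assumes a secret bound under the hypothetical name `CF_API_TOKEN`; the helper name `buildApiHeaders` is also illustrative.

```javascript
// Hypothetical helper: builds Cloudflare API request headers from a
// secret assumed to be bound on the Worker as CF_API_TOKEN.
function buildApiHeaders(env) {
  if (!env || !env.CF_API_TOKEN) {
    // Fail loudly so a missing secret is not mistaken for an API outage.
    throw new Error("CF_API_TOKEN secret is not configured");
  }
  return {
    "Authorization": `Bearer ${env.CF_API_TOKEN}`,
    "Content-Type": "application/json",
  };
}
```

Inside a `scheduled(event, env, ctx)` handler, the returned object could be passed as the `headers` option of each API `fetch()` call.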
Obtaining the Record ID: The CF dashboard does not display DNS record IDs directly. To retrieve it:
Edit the cnd.proxy.dev CNAME record and save it.
Find the resulting PUT request (for example, in the browser's network inspector). The 32-character string at the end of the URL is the Record ID.
Create a new Worker in the CF dashboard. Only the scheduled handler is needed (it responds to Cron Triggers; no HTTP routing is required):
const API_TOKEN = "YOUR_CF_API_TOKEN";
const ZONE_ID = "YOUR_ZONE_ID";
const RECORD_ID = "YOUR_RECORD_ID";

// The third-party optimized CNAME being monitored
const TARGET_CNAME = "opt.public-cf-ip.xyz";

// Healthy-state target config (CNAME mode)
const OPTIMAL_CONFIG = { type: "CNAME", content: TARGET_CNAME, proxied: false };

// CF official fallback IP pool (distributed selection reduces regional blocking risk)
const FALLBACK_IPS = [
  "1.0.0.1", "1.1.1.1", "162.159.153.4",
  "162.159.36.1", "104.18.2.161", "104.21.23.50"
];

export default {
  async scheduled(event, env, ctx) {
    try {
      // 1. External probe: resolve the target CNAME's A record via Google DoH
      const dohUrl = `https://dns.google/resolve?name=${TARGET_CNAME}&type=A`;
      let isHealthy = false;
      try {
        const dohReq = await fetch(dohUrl);
        const dnsData = await dohReq.json();
        // Status 0 (NOERROR) with at least one Answer means upstream resolves correctly
        if (dnsData.Status === 0 && dnsData.Answer?.length > 0) {
          isHealthy = true;
        }
      } catch (e) {
        console.error("DoH probe request failed:", e);
        isHealthy = false; // Treat probe failure as upstream down
      }

      // 2. Read current DNS record state to avoid unnecessary write operations
      const getReq = await fetch(
        `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`,
        {
          headers: {
            "Authorization": `Bearer ${API_TOKEN}`,
            "Content-Type": "application/json"
          }
        }
      );
      if (!getReq.ok) return console.error("Failed to read current DNS record. Check Token and Record ID.");
      const currentRecord = (await getReq.json()).result;

      // Determine whether currently in degraded (fallback IP) state
      const isCurrentlyFallback = currentRecord.type === "A" && FALLBACK_IPS.includes(currentRecord.content);

      // 3. State machine: switch only when needed, avoid redundant API calls
      if (!isHealthy && !isCurrentlyFallback) {
        // Trigger degradation: randomly select a fallback IP
        const randomIp = FALLBACK_IPS[Math.floor(Math.random() * FALLBACK_IPS.length)];
        console.log(`[ALERT] Upstream optimized pool unreachable. Degrading to fallback IP: ${randomIp}`);
        await updateDnsRecord({ type: "A", content: randomIp, proxied: false });
      } else if (isHealthy && isCurrentlyFallback) {
        // Trigger recovery: switch back to optimized CNAME
        console.log("[RECOVERED] Upstream restored. Switching back to optimized CNAME...");
        await updateDnsRecord(OPTIMAL_CONFIG);
      } else {
        console.log(`[OK] Upstream healthy: ${isHealthy} | Current record: ${currentRecord.content}`);
      }
    } catch (err) {
      console.error("Watchdog fatal error:", err);
    }
  }
};

async function updateDnsRecord(config) {
  const req = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`,
    {
      method: "PUT",
      headers: {
        "Authorization": `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        name: "cnd", // Must exactly match the DNS record name (i.e., "cnd" in cnd.proxy.dev)
        type: config.type,
        content: config.content,
        proxied: config.proxied,
        ttl: 1 // Auto TTL ensures changes propagate quickly after switching
      })
    }
  );
  if (!req.ok) console.error("DNS record update failed:", await req.text());
}
After deploying the Worker, navigate to Triggers → Cron Triggers for that Worker and add a schedule:
| Field | Value | Notes |
|---|---|---|
| Cron Expression | */5 * * * * | Runs a health check every 5 minutes |
Combining the optimized proxy from the previous article with the watchdog from this one, the full high-availability request flow is:
User Request
→ cnd.proxy.dev (DNS-only CNAME, managed by Watchdog)
├─ [Healthy] → opt.public-cf-ip.xyz (third-party optimized node) → SaaS Fallback Origin → Worker → Vercel
└─ [Degraded] → CF Official Anycast IP → SaaS Fallback Origin → Worker → Vercel
The Watchdog probes upstream health every 5 minutes via Google DoH and automatically switches between the optimized CNAME (low-latency priority) and official CF IP (availability fallback) — no manual intervention required.