Cloudflare Worker Cron Watchdog: SPOF Fix and Automatic DNS Failover for the Vercel Proxy Architecture

Prerequisites

This article is the follow-up to "Cloudflare for SaaS + Worker: Optimal IP Routing and Edge Cache Proxy for Vercel". The previous guide covered building a routing hub that uses a double CNAME to direct traffic through a third-party optimized node pool. This article addresses the single point of failure (SPOF) that the third-party CNAME pool introduces, and presents an automated DNS failover solution.

Problem Analysis: SPOF Risk in the Third-Party CNAME Node Pool

In the previous setup, the bridge CNAME record on the auxiliary domain (cnd.proxy.dev) points to a third-party-maintained optimized node pool address (e.g., opt.public-cf-ip.xyz). This dependency is the only link in the entire chain not under your control. Potential failure scenarios include:

The node pool is hit by a DDoS attack or gets DNS-blocked
The domain expires or is placed on hold by the registrar
The maintainer shuts down the service

Any of these events causes cnd.proxy.dev to stop resolving, breaking the request chain for every business domain simultaneously.

Solution Design

Monitoring Infrastructure

Running monitoring logic on a separate VPS introduces its own stability concerns and operational overhead. This solution uses Cloudflare Worker Cron Triggers to run periodic health check tasks directly on the CF edge network, fully decoupled from the business infrastructure being monitored.

Probe Strategy: Avoiding Hairpin Routing

⚠️ Critical pitfall: Using fetch() inside a Worker to directly request the third-party optimized domain, or calling CF's own DoH endpoint (cloudflare-dns.com), can trigger CF's internal loop detection (hairpin routing interception). This may result in 522 errors or empty responses even when the upstream is healthy — causing false-positive alerts that treat a live upstream as down.

Solution: Use Google DoH (dns.google) as the external probe. Resolving the target CNAME's A record via DNS over HTTPS gives an outside view of upstream health that does not depend on fetching a CF-hosted endpoint from inside a Worker, which avoids the hairpin routing problem described above.
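The probe can be sketched as two small functions. This is a minimal sketch, not the Worker's actual code: the names hasARecord and probeUpstream are hypothetical, and it assumes the JSON shape returned by dns.google/resolve (a numeric Status plus an Answer array whose entries carry a numeric type). It treats the upstream as healthy only when the answer contains an A record (type 1).

```javascript
// Sketch only: hasARecord and probeUpstream are illustrative names.

// Numeric DNS record type for A records in the dns.google JSON response.
const DNS_TYPE_A = 1;

// Returns true when the DoH JSON answer is NOERROR (Status 0) and contains
// at least one A record, possibly at the end of a CNAME chain.
function hasARecord(dnsData) {
  if (!dnsData || dnsData.Status !== 0 || !Array.isArray(dnsData.Answer)) {
    return false;
  }
  return dnsData.Answer.some((rr) => rr.type === DNS_TYPE_A);
}

// Queries Google DoH for the hostname. Any HTTP, network, or parse error
// is reported as "not healthy" so the caller decides how to react.
async function probeUpstream(hostname) {
  const url = `https://dns.google/resolve?name=${encodeURIComponent(hostname)}&type=A`;
  try {
    const res = await fetch(url);
    if (!res.ok) return false;
    return hasARecord(await res.json());
  } catch (err) {
    return false;
  }
}
```

Keeping the response check separate from the network call makes it possible to exercise the health rule with canned responses, without touching the network.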

Failover Logic

Trigger Condition	Current DNS State	Action
Google DoH resolution fails	CNAME (normal)	Degrade: switch to CF official fallback IP
Google DoH resolution succeeds	A record (fallback)	Recover: switch back to optimized CNAME
States are consistent	Any	Log heartbeat only, no DNS write

Fallback IPs are randomly selected from the CF official Anycast IP pool. Random selection distributes risk, reducing the chance of a secondary failure caused by a single IP being regionally blocked.
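The decision table and the random selection can be written as two pure functions. The names decideAction and pickFallbackIp are hypothetical and exist only for illustration; the Worker in Step 2 expresses the same logic inline.

```javascript
// Sketch only: decideAction and pickFallbackIp are illustrative names.

// Maps (probe result, current DNS state) to one of three actions,
// mirroring the rows of the failover table.
function decideAction(isHealthy, isCurrentlyFallback) {
  if (!isHealthy && !isCurrentlyFallback) return "degrade"; // CNAME -> fallback A record
  if (isHealthy && isCurrentlyFallback) return "recover";   // fallback A record -> CNAME
  return "noop";                                            // states agree: heartbeat only
}

// Picks one IP uniformly at random from the pool. The random source is
// injectable so the selection can be tested deterministically.
function pickFallbackIp(pool, random = Math.random) {
  return pool[Math.floor(random() * pool.length)];
}
```

Because a write happens only on "degrade" or "recover", a long outage results in a single DNS update rather than one every cron tick.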

Deployment Steps

Step 1: Prepare CF API Credentials and Record ID

API Token: Create a custom token in the CF dashboard. Following the principle of least privilege, grant only the Zone → DNS → Edit permission and scope it to the target zone.

Obtaining the Record ID: The CF dashboard does not display DNS record IDs directly. To retrieve it:

Open browser DevTools (F12) and switch to the Network panel.
In the CF dashboard, edit the cnd.proxy.dev CNAME record and save it.
In the Network panel, find the corresponding PUT request. The 32-character string at the end of the URL is the Record ID.
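As an alternative to DevTools, the Record ID can usually be looked up through the API itself. The sketch below assumes the List DNS Records endpoint accepts a name query filter and returns matches in a result array; the function names are hypothetical, and the token needs read access to the zone's DNS records.

```javascript
// Sketch only: all three function names are illustrative.

// Builds the List DNS Records URL, filtered by the full record name.
function buildRecordLookupUrl(zoneId, recordName) {
  const params = new URLSearchParams({ name: recordName });
  return `https://api.cloudflare.com/client/v4/zones/${zoneId}/dns_records?${params}`;
}

// Extracts the ID of the first matching record, or null if none matched.
function extractRecordId(apiResponse) {
  const records = apiResponse?.result ?? [];
  return records.length > 0 ? records[0].id : null;
}

// Performs the lookup; throws on a non-2xx response so a bad token or
// zone ID is not silently mistaken for "no such record".
async function lookupRecordId(apiToken, zoneId, recordName) {
  const res = await fetch(buildRecordLookupUrl(zoneId, recordName), {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  if (!res.ok) throw new Error(`Record lookup failed with HTTP ${res.status}`);
  return extractRecordId(await res.json());
}
```

For example, lookupRecordId(token, zoneId, "cnd.proxy.dev") would resolve to the ID of the bridge record, which can then be pasted into RECORD_ID.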

Step 2: Deploy the Watchdog Worker

Create a new Worker in the CF dashboard. Only the scheduled handler is needed (responds to Cron Triggers; no HTTP routing required):

const API_TOKEN = "YOUR_CF_API_TOKEN";
const ZONE_ID = "YOUR_ZONE_ID";
const RECORD_ID = "YOUR_RECORD_ID";
 
// The third-party optimized CNAME being monitored
const TARGET_CNAME = "opt.public-cf-ip.xyz";
 
// Healthy-state target config (CNAME mode)
const OPTIMAL_CONFIG = { type: "CNAME", content: TARGET_CNAME, proxied: false };
 
// CF official fallback IP pool (distributed selection reduces regional blocking risk)
const FALLBACK_IPS = [
  "1.0.0.1", "1.1.1.1", "162.159.153.4",
  "162.159.36.1", "104.18.2.161", "104.21.23.50"
];
 
export default {
  async scheduled(event, env, ctx) {
    try {
      // 1. External probe: resolve the target CNAME's A record via Google DoH
      const dohUrl = `https://dns.google/resolve?name=${TARGET_CNAME}&type=A`;
      let isHealthy = false;
 
      try {
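        const dohReq = await fetch(dohUrl);
        if (!dohReq.ok) throw new Error(`DoH returned HTTP ${dohReq.status}`);
        const dnsData = await dohReq.json();
        // Status 0 (NOERROR) with at least one A record (type 1) means upstream resolves correctly.
        // A CNAME-only answer does not count: the chain must end in an address.
        if (dnsData.Status === 0 && dnsData.Answer?.some((rr) => rr.type === 1)) {
          isHealthy = true;
        }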
      } catch (e) {
        console.error("DoH probe request failed:", e);
        isHealthy = false; // Treat probe failure as upstream down
      }
 
      // 2. Read current DNS record state to avoid unnecessary write operations
      const getReq = await fetch(
        `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`,
        {
          headers: {
            "Authorization": `Bearer ${API_TOKEN}`,
            "Content-Type": "application/json"
          }
        }
      );
 
      if (!getReq.ok) return console.error("Failed to read current DNS record. Check Token and Record ID.");
      const currentRecord = (await getReq.json()).result;
 
      // Determine whether currently in degraded (fallback IP) state
      const isCurrentlyFallback = currentRecord.type === "A" && FALLBACK_IPS.includes(currentRecord.content);
 
      // 3. State machine: switch only when needed, avoid redundant API calls
      if (!isHealthy && !isCurrentlyFallback) {
        // Trigger degradation: randomly select a fallback IP
        const randomIp = FALLBACK_IPS[Math.floor(Math.random() * FALLBACK_IPS.length)];
        console.log(`[ALERT] Upstream optimized pool unreachable. Degrading to fallback IP: ${randomIp}`);
        await updateDnsRecord({ type: "A", content: randomIp, proxied: false });
 
      } else if (isHealthy && isCurrentlyFallback) {
        // Trigger recovery: switch back to optimized CNAME
        console.log("[RECOVERED] Upstream restored. Switching back to optimized CNAME...");
        await updateDnsRecord(OPTIMAL_CONFIG);
 
      } else {
        console.log(`[HEARTBEAT] Upstream healthy: ${isHealthy} | Current record: ${currentRecord.type} ${currentRecord.content}`);
      }
 
    } catch (err) {
      console.error("Watchdog fatal error:", err);
    }
  }
};
 
async function updateDnsRecord(config) {
  const req = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`,
    {
      method: "PUT",
      headers: {
        "Authorization": `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        name: "cnd",     // Must exactly match the DNS record name (i.e., "cnd" in cnd.proxy.dev)
        type: config.type,
        content: config.content,
        proxied: config.proxied,
        ttl: 1           // 1 = automatic TTL (300 s for DNS-only records); set 60 for faster switchover
      })
    }
  );
 
  if (!req.ok) console.error("DNS record update failed:", await req.text());
}
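
The Worker above keeps the API token in source constants, which is convenient for a first test but leaves the credential in the code. Workers also pass secrets and variables to the handler through the env argument. The sketch below shows one way to read them; the binding names API_TOKEN, ZONE_ID, and RECORD_ID and the helper readConfig are assumptions made for illustration.

```javascript
// Sketch only: readConfig and the binding names are illustrative.

// Reads the three required values from the Worker's env bindings and
// fails fast with a clear message if any of them is missing.
function readConfig(env) {
  const required = ["API_TOKEN", "ZONE_ID", "RECORD_ID"];
  const missing = required.filter((key) => !env || !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing Worker bindings: ${missing.join(", ")}`);
  }
  return {
    apiToken: env.API_TOKEN,
    zoneId: env.ZONE_ID,
    recordId: env.RECORD_ID,
  };
}
```

Inside scheduled(event, env, ctx), calling readConfig(env) at the top would replace the three constants. The token itself can be stored as a secret, for example with npx wrangler secret put API_TOKEN.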

Step 3: Configure the Cron Trigger

After deploying the Worker, navigate to Triggers → Cron Triggers for that Worker and add a schedule:

Field	Value	Notes
Cron Expression	`*/5 * * * *`	Runs a health check every 5 minutes

Complete Architecture Overview

Combining the optimized proxy from the previous article with the watchdog from this one, the full high-availability request flow is:

User Request
  → cnd.proxy.dev (DNS-only CNAME, managed by Watchdog)
    ├─ [Healthy] → opt.public-cf-ip.xyz (third-party optimized node) → SaaS Fallback Origin → Worker → Vercel
    └─ [Degraded] → CF Official Anycast IP → SaaS Fallback Origin → Worker → Vercel

The Watchdog probes upstream health every 5 minutes via Google DoH and automatically switches between the optimized CNAME (low-latency priority) and official CF IP (availability fallback) — no manual intervention required.