Production monitoring
Production monitoring runs your scenarios on a schedule against your real production URL. Guard uses synthetic users — marked at every layer — to execute full user journeys (signup, checkout, webhooks) without polluting analytics or charging real money.
When a flow breaks in prod, Guard opens incidents, routes alerts to your on-call stack, and hands failures to Mender for triage and fix PRs.
Dashboard: app.molar.it/dashboard/guard → Monitoring · Product page: guard.molar.it
How production monitoring differs from PR gating
| Dimension | PR gating | Production monitoring |
|---|---|---|
| Trigger | pull_request webhook | Cron (scheduled_checks) |
| Target | Preview URL or Clones | Real production URL |
| Users | Clone identities | Synthetic prod users |
| Goal | Block bad merges | Detect regressions live users would hit |
| Run type | pr | schedule |
The same scenario file powers both — no second deployment of tests.
Synthetic user infrastructure
Synthetic monitoring mistakes can contaminate billing and analytics. Guard uses defense in depth so synthetic activity is always identifiable and never triggers real charges.
Database marking
Add an is_synthetic column to your users table (or equivalent):
ALTER TABLE users ADD COLUMN is_synthetic BOOLEAN NOT NULL DEFAULT FALSE;
CREATE INDEX users_is_synthetic_idx ON users (id) WHERE is_synthetic = TRUE;
-- Analytics views must exclude synthetics
CREATE VIEW analytics_users AS SELECT * FROM users WHERE is_synthetic = FALSE;
Guard provisions synthetic users per region and scenario shard. Email convention:
guard-monitor+{region}+{scenarioSlug}+{shard}@{yourDomain}.com
RFC 5233 sub-addressing routes tagged mail to your inbox; in sidecar mode, the Email Clone captures messages without real delivery.
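The address convention above can be built and parsed mechanically. The following is a minimal sketch, assuming that region, scenario slug, and shard never contain `+` or `@`; `buildSyntheticEmail` and `parseSyntheticEmail` are hypothetical helper names, not part of Guard.

```typescript
// Hypothetical helpers illustrating the guard-monitor address convention.
interface SyntheticAddress {
  region: string;
  scenarioSlug: string;
  shard: string;
  domain: string; // "yourDomain" without the ".com" suffix
}

function buildSyntheticEmail(a: SyntheticAddress): string {
  return `guard-monitor+${a.region}+${a.scenarioSlug}+${a.shard}@${a.domain}.com`;
}

// Returns null when the address does not follow the convention.
function parseSyntheticEmail(email: string): SyntheticAddress | null {
  const match = /^guard-monitor\+([^+@]+)\+([^+@]+)\+([^+@]+)@([^@]+)\.com$/.exec(email);
  if (!match) return null;
  const [, region, scenarioSlug, shard, domain] = match;
  return { region, scenarioSlug, shard, domain };
}
```

A parser like this can back an "internal users" filter in analytics tools that only see the email address.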
Request-time signals (three orthogonal markers)
| Signal | Where | Used by |
|---|---|---|
| `X-Synthetic-Source: molar-guard` | All outbound HTTP from Guard worker | Your middleware and observability filters |
| `window.__MOLAR_SYNTHETIC__ = true` | Playwright addInitScript | Client-side analytics filters |
| `is_synthetic = true` | Customer database | Application code paths |
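As an illustration of how server-side middleware or a log filter might read the header signal, here is a minimal sketch. The header name and value come from the table above; `HeaderBag` and `hasSyntheticHeader` are hypothetical names, and the code assumes header names are already lowercased, as Node's HTTP server does.

```typescript
// Shape of Node-style request headers (names already lowercased).
type HeaderBag = Record<string, string | string[] | undefined>;

const SYNTHETIC_HEADER = "x-synthetic-source";
const SYNTHETIC_VALUE = "molar-guard";

// True when any value of the header equals "molar-guard" (case-insensitive).
function hasSyntheticHeader(headers: HeaderBag): boolean {
  const raw = headers[SYNTHETIC_HEADER];
  const values = Array.isArray(raw) ? raw : [raw];
  return values.some((v) => v?.trim().toLowerCase() === SYNTHETIC_VALUE);
}
```

Because any client can send this header, a check like this suits observability filtering; decisions about money flows are safer when they also rely on the `is_synthetic` database flag.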
Money-flow blocking
Guard ships middleware libraries (@molar/synthetic-stripe-middleware, etc.) for Express and Next.js:
| Provider | Protection |
|---|---|
| Stripe | Swap to sk_test_* SDK when user.is_synthetic — live charges impossible |
| Twilio | Route to test credentials (+15005550006) or Twilio Clone |
| Email | Route to Email Clone SMTP sink, no real delivery |
| Clerk | Pre-provisioned users with metadata.is_synthetic = true |
| S3 | Uploads to s3://bucket/_molar_synthetic/ with 7-day lifecycle |
Django and Rails Stripe middleware snippets ship in @molar/synthetic-middleware; first-class Express and Next.js helpers are the most complete today.
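The Stripe row boils down to choosing an API key per user. The sketch below shows that decision in isolation; it is not the API of the Molar middleware packages, and `UserLike`, `StripeKeys`, and `selectStripeKey` are hypothetical names.

```typescript
// Hypothetical types for illustration only.
interface UserLike {
  is_synthetic: boolean;
}

interface StripeKeys {
  live: string; // secret key used for real customers
  test: string; // must be a test-mode key (sk_test_*)
}

// Returns the key to construct the Stripe SDK with for this user.
function selectStripeKey(user: UserLike, keys: StripeKeys): string {
  // Fail fast if the "test" slot was misconfigured with a non-test key.
  if (!keys.test.startsWith("sk_test_")) {
    throw new Error("test key must start with sk_test_");
  }
  return user.is_synthetic ? keys.test : keys.live;
}
```

Validating the prefix before branching means a misconfigured environment fails loudly instead of charging a synthetic user in live mode.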
Nightly cleanup
Guard provides cleanup SQL templates per stack (Postgres, MySQL, MongoDB, DynamoDB), generated by Cartographer from your schema:
DELETE FROM orders
WHERE user_id IN (SELECT id FROM users WHERE is_synthetic = TRUE)
AND created_at < NOW() - INTERVAL '7 days';
Retention is customer-configurable.
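To show how a configurable retention period could feed the Postgres template, here is a sketch that returns a parameterized statement. `buildOrdersCleanup` is a hypothetical helper, not something Guard generates, and the `{ text, values }` shape is an assumption chosen to suit clients that accept `$1`-style parameters.

```typescript
interface ParameterizedQuery {
  text: string;
  values: number[];
}

// Builds the orders cleanup statement with retention passed as a bind parameter.
function buildOrdersCleanup(retentionDays: number): ParameterizedQuery {
  if (!Number.isInteger(retentionDays) || retentionDays < 1) {
    throw new RangeError("retentionDays must be a positive integer");
  }
  return {
    text:
      "DELETE FROM orders " +
      "WHERE user_id IN (SELECT id FROM users WHERE is_synthetic = TRUE) " +
      "AND created_at < NOW() - make_interval(days => $1)",
    values: [retentionDays],
  };
}
```

Passing the retention as a bind parameter rather than interpolating it into the SQL string keeps the statement safe even if the setting comes from user input.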
Analytics onboarding (recommended)
Before enabling production schedules, complete the synthetic safety checklist in dashboard Settings. Guard blocks new schedules until a synthetic preflight audit event is recorded for the scenario.
Per-platform exclusion
Filter synthetic users in whatever analytics stack you use. Common patterns:
| Signal | Typical filter |
|---|---|
| User trait | is_synthetic: true on identify / user properties |
| Event property | $ignore: true or skip track when synthetic |
| Internal users | Mark synthetic emails or user IDs as internal/test |
| Billing | Test-mode customers only for synthetic users |
| APM / errors | Tag or exclude sessions with synthetic: true |
The onboarding wizard links to copy-paste snippets for your stack. Middleware install is verified before schedules go live.
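For client-side analytics, the "skip track when synthetic" pattern can be sketched as a thin wrapper. This is an illustration under assumptions: `TrackFn` and `guardedTrack` are hypothetical names, and the wrapper only relies on the `__MOLAR_SYNTHETIC__` flag that Guard sets on `window`.

```typescript
// Generic shape of an analytics "track" call.
type TrackFn = (event: string, props?: Record<string, unknown>) => void;

// Object carrying the flag; in a browser this would be `window`.
interface SyntheticFlagHost {
  __MOLAR_SYNTHETIC__?: boolean;
}

// Wraps a track function so events are dropped during synthetic sessions.
function guardedTrack(track: TrackFn, host: SyntheticFlagHost): TrackFn {
  return (event, props) => {
    // The flag is read on every call because init scripts may set it late.
    if (host.__MOLAR_SYNTHETIC__ === true) return;
    track(event, props);
  };
}
```

In a browser the host argument would be `window`; passing it explicitly keeps the wrapper testable outside a page.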
Scheduled checks
Production monitors are stored as scheduled_checks rows — one per scenario (or scenario + policy combination).
Default cadence
Every 5 minutes per scenario, configurable from 1 minute to 1 hour per check.
Create a schedule
CLI:
pnpm molar-guard schedule create stripe-subscription-upgrade \
--cron "*/5 * * * *"
Scenario frontmatter:
---
id: stripe-subscription-upgrade
schedule:
  cron: "*/5 * * * *"
  regions: [us-east-1, eu-west-1, ap-south-1]
  shadow_prod: true
---
Dashboard: app.molar.it/dashboard/guard → Monitoring → Schedules (create/edit cron, regions, alert policy, pause).
Scheduler architecture
- BullMQ repeatable jobs, the same pattern as the Molar API
- Each tick fans out by `regions[]` on the check
- Fresh Playwright `browserContext` per run, so no state bleeds between runs
- Per-customer concurrency cap (default: 8 concurrent prod runs) prevents self-DoS
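The fan-out and the concurrency cap can be sketched without any queue library. This is not Guard's scheduler: `fanOut` and `runWithCap` are hypothetical names, and a plain in-process worker pool stands in for BullMQ.

```typescript
interface ScheduledCheck {
  scenarioId: string;
  regions: string[];
}

interface RunJob {
  scenarioId: string;
  region: string;
}

// One job per region for a single scheduler tick.
function fanOut(check: ScheduledCheck): RunJob[] {
  return check.regions.map((region) => ({ scenarioId: check.scenarioId, region }));
}

// Runs tasks with at most `limit` in flight; results keep task order.
async function runWithCap<T>(tasks: Array<() => Promise<T>>, limit = 8): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With the default cap of 8, a tick that fans out to more jobs than that simply queues the remainder until a worker frees up.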
Run isolation
Every scheduled run gets:
- Fresh browser context
- Fresh synthetic-user session (per shard)
- Fresh logical clock
- Optional fresh Clone bundle (shadow-prod path)
Multi-region monitoring
Guard workers run in:
| Region | Availability |
|---|---|
| us-east-1 | All tiers |
| eu-west-1 | All tiers |
| ap-south-1 | All tiers |
| us-west-2, ap-southeast-1, sa-east-1 | Business tier |
Set regions on each scheduled check. The scheduler fans out one run per region per tick.
Regional comparison in dashboard
The Monitoring grid shows scenarios × regions. Cells display current status + sparkline. Highlight patterns:
- Single region red — likely regional outage or CDN edge issue
- 3+ regions fail within 60s: collapsed to one `global_outage` incident
Incidents
When alert thresholds fire, Guard opens a guard_incident — deduplicated per scenario (and per region unless global collapse applies).
Incident lifecycle
threshold breached → incident opened → alerts sent
│
├── ack (human acknowledges)
├── suppress (duration + reason)
└── auto-resolve (2 consecutive successes)
| Status | Meaning |
|---|---|
| open | Active failure |
| acknowledged | Owner assigned, investigating |
| suppressed | Snoozed (maintenance, known issue) |
| resolved | Scenario recovered |
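The lifecycle can be read as a small state machine. The sketch below is one possible model, not Guard's implementation: `Incident`, `IncidentEvent`, and `applyEvent` are hypothetical names, and it assumes auto-resolve applies from any non-resolved status, which the lifecycle above does not spell out.

```typescript
type Status = "open" | "acknowledged" | "suppressed" | "resolved";

type IncidentEvent =
  | { type: "ack" }
  | { type: "suppress" }
  | { type: "run"; passed: boolean };

interface Incident {
  status: Status;
  consecutiveSuccesses: number;
}

// Pure transition function: returns the next incident state.
function applyEvent(incident: Incident, event: IncidentEvent): Incident {
  if (incident.status === "resolved") return incident; // terminal state
  switch (event.type) {
    case "ack":
      return { ...incident, status: "acknowledged" };
    case "suppress":
      return { ...incident, status: "suppressed" };
    case "run": {
      // A failure resets the streak; two passes in a row resolve the incident.
      const streak = event.passed ? incident.consecutiveSuccesses + 1 : 0;
      return streak >= 2
        ? { status: "resolved", consecutiveSuccesses: streak }
        : { ...incident, consecutiveSuccesses: streak };
    }
  }
}
```

A single passing run between failures does not resolve the incident, because any failure resets the streak to zero.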
Smart dedup
- One open incident per scenario × region until resolved
- Same scenario failing in 4 regions = 4 alerts (real regional issue), unless global collapse applies
- Global collapse: 3+ regions fail within 60s on same scenario → single `global_outage` incident
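The dedup rules can be expressed as a grouping function. The sketch below is an interpretation, not Guard's algorithm: it assumes "within 60s" means any 60-second span that contains failures from three or more distinct regions, and `Failure`, `GroupedIncident`, and `groupFailures` are hypothetical names.

```typescript
interface Failure {
  scenarioId: string;
  region: string;
  at: number; // epoch milliseconds
}

interface GroupedIncident {
  scenarioId: string;
  kind: "global_outage" | "per_region";
  regions: string[];
}

function groupFailures(failures: Failure[], windowMs = 60_000, minRegions = 3): GroupedIncident[] {
  const byScenario = new Map<string, Failure[]>();
  for (const f of failures) {
    const list = byScenario.get(f.scenarioId) ?? [];
    list.push(f);
    byScenario.set(f.scenarioId, list);
  }

  const incidents: GroupedIncident[] = [];
  byScenario.forEach((list, scenarioId) => {
    const regions = Array.from(new Set(list.map((f) => f.region)));
    // Global when some failure starts a window containing enough distinct regions.
    const isGlobal = list.some((start) => {
      const inWindow = list.filter((f) => f.at >= start.at && f.at - start.at <= windowMs);
      return new Set(inWindow.map((f) => f.region)).size >= minRegions;
    });
    if (isGlobal) {
      incidents.push({ scenarioId, kind: "global_outage", regions });
    } else {
      for (const region of regions) {
        incidents.push({ scenarioId, kind: "per_region", regions: [region] });
      }
    }
  });
  return incidents;
}
```

Three regions failing 30 seconds apart produce one incident, while two regions failing at the same moment produce two.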
Dashboard
app.molar.it/dashboard/guard → Incidents
- Filter: open / acked / suppressed / resolved
- Types: `consecutive_failures`, `failure_rate`, `latency_p99`, `shadow_diff`, `global_outage`
- Actions: ack, suppress, link to root run artifacts, trigger Mender
Alerting
Configure alert policies per scheduled check (alert_policy JSONB):
| Rule | Default | Description |
|---|---|---|
| Consecutive failures | 2 | N failures in a row |
| Failure rate | 50% over 30 min | M% failure rate in window |
| Shadow-prod diff | on when shadow_prod: true | Third-party model drift |
| Self-healing | informational | Locator heal occurred |
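To make the first two rules concrete, here is a sketch that evaluates them against a run history. The field names in `AlertPolicy` are hypothetical (the actual `alert_policy` JSONB shape is not shown here), and treating the failure-rate threshold as inclusive is an assumption.

```typescript
interface RunResult {
  at: number; // epoch milliseconds
  passed: boolean;
}

// Hypothetical policy shape using the documented defaults.
interface AlertPolicy {
  consecutiveFailures: number;
  failureRatePct: number;
  windowMinutes: number;
}

const DEFAULT_POLICY: AlertPolicy = { consecutiveFailures: 2, failureRatePct: 50, windowMinutes: 30 };

// Returns the names of the rules breached at time `now`.
function breachedRules(runs: RunResult[], now: number, policy: AlertPolicy = DEFAULT_POLICY): string[] {
  const sorted = runs.slice().sort((a, b) => a.at - b.at);
  const breached: string[] = [];

  // Count failures at the tail of the history (most recent runs).
  let streak = 0;
  for (let i = sorted.length - 1; i >= 0 && !sorted[i].passed; i--) streak++;
  if (streak >= policy.consecutiveFailures) breached.push("consecutive_failures");

  // Failure rate over the trailing window.
  const windowStart = now - policy.windowMinutes * 60_000;
  const recent = sorted.filter((r) => r.at >= windowStart && r.at <= now);
  const failed = recent.filter((r) => !r.passed).length;
  if (recent.length > 0 && (failed / recent.length) * 100 >= policy.failureRatePct) {
    breached.push("failure_rate");
  }
  return breached;
}
```

The two rules are independent: alternating pass and fail results can breach the failure rate without ever producing two failures in a row.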
Integrations
| Channel | Status |
|---|---|
| Generic webhook | Shipped |
| Slack | Shipped |
| PagerDuty | Shipped |
| Microsoft Teams | Shipped |
| Opsgenie | Shipped |
| Email (via notification webhook) | Shipped |
| MCP (molar://incidents) | Shipped (standalone Guard MCP) |
Auto-resolve: when a scenario recovers (2 consecutive successes), the incident closes and a "good news" notification is sent.
Production dashboard
Monitoring grid
Route: /monitoring
- Matrix: rows = scenarios, columns = regions
- Cell = current status + pass/fail sparkline
- Drill-down: last N runs, error message, shadow-diff flag
- Actions: pause region, snooze scenario, run check now, open incident
Check run detail
Click any scheduled run for:
- Step timeline with assertion messages
- Screenshot, video (5s pre-failure), HAR, console logs
- Clone state diff on failure
- Mender triage panel with Apply fix (suggestive mode)
- Open trace → Cartographer trace when `trace_id` is present
Status page
Guard exposes public health JSON at GET /v1/status. Hosted status pages are available at guard.molar.it/status.
Enable production monitoring (checklist)
- Complete onboarding — GitHub connected, scenarios imported
- Install synthetic middleware — Express/Next.js (required for money flows)
- Run analytics preflight — confirm synthetic order excluded from exports
- Create scheduled checks — pick scenarios, cron, regions
- Configure alerts — Slack or PagerDuty webhook
- Optional: enable `shadow_prod: true` on checks touching third-party APIs (see Shadow-prod diff)
- Verify: `GET /v1/status` healthy; first green runs in Monitoring grid
Example: full scenario with production schedule
---
id: stripe-subscription-upgrade
description: User on Free plan upgrades to Pro
tags: [billing, critical]
schedule:
  cron: "*/5 * * * *"
  regions: [us-east-1, eu-west-1, ap-south-1]
  shadow_prod: true
mender:
  mode: suggestive
cache: never
---
Stripe subscription upgrade
Steps
- Navigate to `/settings/billing`
- Click "Upgrade to Pro"
- Assert badge shows "Pro"
- Assert webhook `customer.subscription.created` received
---
## API reference (schedules and incidents)
| Method | Path | Purpose |
|--------|------|---------|
| `POST` | `/v1/scheduled_checks` | Create schedule |
| `PUT` | `/v1/scheduled_checks/:id` | Update cron, regions, alert policy |
| `POST` | `/v1/scheduled_checks/:id/pause` | Pause for N hours |
| `POST` | `/v1/runs/manual` | Trigger run now |
| `POST` | `/v1/incidents/:id/ack` | Acknowledge |
| `POST` | `/v1/incidents/:id/suppress` | Suppress with reason |
See [Webhooks & API](/docs/guard/webhooks-api) for auth and payloads.
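As a rough illustration of calling the create endpoint, the sketch below only assembles the request. Everything beyond the method and path is an assumption: the base URL, the bearer-token header, and the body fields `scenario_id`, `cron`, and `regions` are hypothetical, so check the Webhooks & API page for the real auth scheme and payload.

```typescript
// Hypothetical payload shape; the real field names may differ.
interface CreateScheduleBody {
  scenario_id: string;
  cron: string;
  regions: string[];
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assembles a POST /v1/scheduled_checks request without sending it.
function buildCreateScheduleRequest(baseUrl: string, token: string, body: CreateScheduleBody): HttpRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/scheduled_checks`, // strip trailing slashes
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // assumed auth scheme
    },
    body: JSON.stringify(body),
  };
}
```

To send it, destructure the result (`const { url, ...init } = request`) and pass both parts to `fetch(url, init)`.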
---
## Usage limits
Production region count, concurrent prod runs, and minimum schedule interval are enforced per organization. See **Settings → Billing** on [app.molar.it](https://app.molar.it/billing) for your org's limits — tier tables are not published in this documentation.
---
## Next
- [Shadow-prod diff](/docs/guard/shadow-prod-diff) — parallel prod + Clone comparison
- [Mender auto-fix](/docs/guard/mender) — triage and fix PRs from prod failures
- [PR gating](/docs/guard/pr-gating) — pre-merge checks with the same scenarios
- [Configuration](/docs/guard/configuration) — `baseUrl.schedule`, frontmatter, alert JSON
- [Security](/docs/guard/security) — synthetic safety deep dive
- [Troubleshooting](/docs/guard/troubleshooting) — false alerts, middleware gaps