Production monitoring

Production monitoring runs your scenarios on a schedule against your real production URL. Guard uses synthetic users — marked at every layer — to execute full user journeys (signup, checkout, webhooks) without polluting analytics or charging real money.

When a flow breaks in prod, Guard opens incidents, routes alerts to your on-call stack, and hands failures to Mender for triage and fix PRs.

Dashboard: app.molar.it/dashboard/guard → Monitoring · Product page: guard.molar.it

How production monitoring differs from PR gating

Dimension	PR gating	Production monitoring
Trigger	`pull_request` webhook	Cron (`scheduled_checks`)
Target	Preview URL or Clones	Real production URL
Users	Clone identities	Synthetic prod users
Goal	Block bad merges	Detect regressions live users would hit
Run type	`pr`	`schedule`

The same scenario file powers both, so there is no separate test suite to deploy.

Synthetic user infrastructure

Synthetic monitoring mistakes can contaminate billing and analytics. Guard uses defense in depth so synthetic activity is always identifiable and never triggers real charges.

Database marking

Add an is_synthetic column to your users table (or equivalent):

ALTER TABLE users ADD COLUMN is_synthetic BOOLEAN NOT NULL DEFAULT FALSE;
CREATE INDEX users_is_synthetic_idx ON users (id) WHERE is_synthetic = TRUE;

-- Analytics views must exclude synthetics
CREATE VIEW analytics_users AS SELECT * FROM users WHERE is_synthetic = FALSE;

Guard provisions synthetic users per region and scenario shard. Email convention:

guard-monitor+{region}+{scenarioSlug}+{shard}@{yourDomain}.com

Sub-addressing (the `+detail` form described in RFC 5233) lets a mail server that supports it deliver tagged mail to the base `guard-monitor` inbox; in sidecar mode, the Email Clone captures messages without real delivery.
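The email convention above can be sketched as a pair of helpers. This is a minimal illustration, not Guard's own code: the function names are hypothetical, and it assumes you pass the full domain (for example `example.com`) and that no part contains `+` or `@`.

```typescript
// Hypothetical helper names; only the address pattern comes from the convention above.
const LOCAL_PREFIX = "guard-monitor";

interface SyntheticAddress {
  region: string;
  scenarioSlug: string;
  shard: string;
}

// Build guard-monitor+{region}+{scenarioSlug}+{shard}@{domain}.
// Assumption: parts never contain "+", "@", or whitespace, so the address parses back cleanly.
function buildSyntheticEmail(parts: SyntheticAddress, domain: string): string {
  for (const value of [parts.region, parts.scenarioSlug, parts.shard]) {
    if (value === "" || /[+@\s]/.test(value)) {
      throw new Error(`invalid address part: "${value}"`);
    }
  }
  return `${LOCAL_PREFIX}+${parts.region}+${parts.scenarioSlug}+${parts.shard}@${domain}`;
}

// Parse an address back into its parts, or return null if it does not follow the convention.
function parseSyntheticEmail(email: string): SyntheticAddress | null {
  const at = email.lastIndexOf("@");
  if (at < 0) return null;
  const [prefix, region, scenarioSlug, shard, ...rest] = email.slice(0, at).split("+");
  if (prefix !== LOCAL_PREFIX || !region || !scenarioSlug || !shard || rest.length > 0) {
    return null;
  }
  return { region, scenarioSlug, shard };
}
```

A parser like this can back an "internal users" rule in analytics tools that only see the email address and not the `is_synthetic` column.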

Request-time signals (three orthogonal markers)

Signal	Where	Used by
`X-Synthetic-Source: molar-guard`	All outbound HTTP from Guard worker	Your middleware and observability filters
`window.__MOLAR_SYNTHETIC__ = true`	Playwright `addInitScript`	Client-side analytics filters
`is_synthetic = true`	Customer database	Application code paths
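A server-side check for the header marker might look like the sketch below. The header name and value come from the table above; the function name and the `HeaderBag` type are assumptions made so the example does not depend on Express or Next.js typings.

```typescript
// Minimal structural type for incoming headers (Node lowercases header names).
type HeaderBag = Record<string, string | string[] | undefined>;

const SYNTHETIC_HEADER = "x-synthetic-source";
const SYNTHETIC_VALUE = "molar-guard";

// Returns true when the request carries X-Synthetic-Source: molar-guard.
function isSyntheticRequest(headers: HeaderBag): boolean {
  const raw = headers[SYNTHETIC_HEADER];
  const values = Array.isArray(raw) ? raw : [raw];
  return values.some((v) => v?.trim().toLowerCase() === SYNTHETIC_VALUE);
}
```

Because any client can set a request header, a check like this suits observability tagging and log filtering; decisions about money flows are better keyed on the `is_synthetic` database flag.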

Money-flow blocking

Guard ships middleware libraries (@molar/synthetic-stripe-middleware, etc.) for Express and Next.js:

Provider	Protection
Stripe	Swap to `sk_test_*` SDK when `user.is_synthetic` — live charges impossible
Twilio	Route to test credentials (magic number `+15005550006`) or Twilio Clone
Email	Route to Email Clone SMTP sink — no real delivery
Clerk	Pre-provisioned users with `metadata.is_synthetic = true`
S3	Uploads to `s3://bucket/_molar_synthetic/` with 7-day lifecycle

Django and Rails Stripe middleware snippets ship in @molar/synthetic-middleware; first-class Express and Next.js helpers are the most complete today.
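The Stripe row above boils down to choosing the API key from the user's `is_synthetic` flag. The sketch below shows that idea only; it is not the API of `@molar/synthetic-stripe-middleware`, and `stripeKeyFor`, `StripeKeys`, and `AppUser` are hypothetical names.

```typescript
interface StripeKeys {
  live: string; // secret key used for real customers
  test: string; // must be an sk_test_* key
}

interface AppUser {
  id: string;
  is_synthetic: boolean;
}

// Pick the Stripe secret key for a user. Synthetic users always get the test key,
// and a misconfigured test key fails loudly instead of falling through to live.
function stripeKeyFor(user: AppUser, keys: StripeKeys): string {
  if (!keys.test.startsWith("sk_test_")) {
    throw new Error("test key must start with sk_test_");
  }
  return user.is_synthetic ? keys.test : keys.live;
}
```

In an application you would pass the result to the Stripe SDK constructor for the current request. Validating the test key on every call is deliberate: a configuration mistake then surfaces as an error instead of a live charge.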

Nightly cleanup

Guard provides cleanup templates per data store (Postgres, MySQL, MongoDB, DynamoDB), generated by Cartographer from your schema. The Postgres form:

DELETE FROM orders
  WHERE user_id IN (SELECT id FROM users WHERE is_synthetic = TRUE)
    AND created_at < NOW() - INTERVAL '7 days';

Retention is customer-configurable.
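Since retention is configurable, the cleanup statement can take the window as a bound parameter instead of a hard-coded `'7 days'`. The sketch below assumes Postgres 9.4+ (for `make_interval`) and a `$1`-style placeholder client. `buildCleanupStatement` is a hypothetical helper, not a Guard API.

```typescript
interface SqlStatement {
  text: string;
  values: number[];
}

// Build a parameterized cleanup statement for one table.
// The table name is interpolated, so it is restricted to a plain identifier.
function buildCleanupStatement(table: string, retentionDays: number): SqlStatement {
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  if (!Number.isInteger(retentionDays) || retentionDays < 0) {
    throw new Error("retentionDays must be a non-negative integer");
  }
  const text = [
    `DELETE FROM ${table}`,
    "  WHERE user_id IN (SELECT id FROM users WHERE is_synthetic = TRUE)",
    "    AND created_at < NOW() - make_interval(days => $1)",
  ].join("\n");
  return { text, values: [retentionDays] };
}
```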

Analytics onboarding (recommended)

Before enabling production schedules, complete the synthetic safety checklist in dashboard Settings. Guard blocks new schedules until a synthetic preflight audit event is recorded for the scenario.

Per-platform exclusion

Filter synthetic users in whatever analytics stack you use. Common patterns:

Signal	Typical filter
User trait	`is_synthetic: true` on identify / user properties
Event property	`$ignore: true` or skip `track` when synthetic
Internal users	Mark synthetic emails or user IDs as internal/test
Billing	Test-mode customers only for synthetic users
APM / errors	Tag or exclude sessions with `synthetic: true`

The onboarding wizard links to copy-paste snippets for your stack. Middleware install is verified before schedules go live.
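On the client, the "skip `track` when synthetic" pattern can be sketched as a wrapper around whatever tracking function your analytics SDK exposes. The `__MOLAR_SYNTHETIC__` flag is the one Guard sets; `guardedTrack` and the `Track` type are hypothetical names for illustration.

```typescript
type Track = (event: string, props: Record<string, unknown>) => void;

// True when Guard's init script has marked this browser session as synthetic.
function isSyntheticSession(): boolean {
  const flags = globalThis as unknown as { __MOLAR_SYNTHETIC__?: boolean };
  return flags.__MOLAR_SYNTHETIC__ === true;
}

// Wrap an analytics track function so synthetic sessions never emit events.
function guardedTrack(track: Track): Track {
  return (event, props) => {
    if (isSyntheticSession()) return; // drop the event instead of forwarding it
    track(event, props);
  };
}
```

Checking the flag at call time, not once at startup, keeps the wrapper correct even if the analytics module is initialized before the flag is visible.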

Scheduled checks

Production monitors are stored as scheduled_checks rows — one per scenario (or scenario + policy combination).

Default cadence

Every 5 minutes per scenario, configurable from 1 minute to 1 hour per check.
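To turn a cadence into a cron expression, a small helper such as the one below can enforce the 1 to 60 minute range. It is a hypothetical sketch: rejecting intervals that do not divide 60 is a choice of this example, made because a step like `*/7` fires unevenly at the hour boundary. It is not a documented Guard rule.

```typescript
// Convert an interval in minutes (1 to 60) into a five-field cron expression.
function cronForIntervalMinutes(minutes: number): string {
  if (!Number.isInteger(minutes) || minutes < 1 || minutes > 60) {
    throw new RangeError("interval must be a whole number of minutes between 1 and 60");
  }
  if (minutes === 60) return "0 * * * *"; // once per hour, on the hour
  if (60 % minutes !== 0) {
    throw new RangeError("interval must divide 60 evenly to keep runs evenly spaced");
  }
  return minutes === 1 ? "* * * * *" : `*/${minutes} * * * *`;
}
```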

Create a schedule

CLI:

pnpm molar-guard schedule create stripe-subscription-upgrade \
  --cron "*/5 * * * *"

Scenario frontmatter:

---
id: stripe-subscription-upgrade
schedule:
  cron: "*/5 * * * *"
  regions: [us-east-1, eu-west-1, ap-south-1]
  shadow_prod: true
---

Dashboard: app.molar.it/dashboard/guard → Monitoring → Schedules (create/edit cron, regions, alert policy, pause).

Scheduler architecture

BullMQ repeatable jobs — same pattern as the Molar API
Each tick fans out by regions[] on the check
Fresh Playwright browserContext per run — no state bleed
Per-customer concurrency cap (default: 8 concurrent prod runs) prevents self-DoS

Run isolation

Every scheduled run gets:

Fresh browser context
Fresh synthetic-user session (per shard)
Fresh logical clock
Optional fresh Clone bundle (shadow-prod path)

Multi-region monitoring

Guard workers run in:

Region	Availability
`us-east-1`	All tiers
`eu-west-1`	All tiers
`ap-south-1`	All tiers
`us-west-2`, `ap-southeast-1`, `sa-east-1`	Business tier

Set regions on each scheduled check. The scheduler fans out one run per region per tick.
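The per-tick fan-out can be pictured as a pure function from a check to a list of planned runs. The shapes below (`ScheduledCheck`, `PlannedRun`, `fanOut`) are assumptions for illustration; only the one-run-per-region behavior and the `schedule` run type come from this page.

```typescript
interface ScheduledCheck {
  id: string;
  scenarioId: string;
  regions: string[];
}

interface PlannedRun {
  checkId: string;
  scenarioId: string;
  region: string;
  runType: "schedule";
  tickAt: string; // ISO timestamp of the tick that produced this run
}

// Produce one planned run per distinct region for a single scheduler tick.
function fanOut(check: ScheduledCheck, tickAt: Date): PlannedRun[] {
  const regions = Array.from(new Set(check.regions)); // ignore accidental duplicates
  return regions.map((region) => ({
    checkId: check.id,
    scenarioId: check.scenarioId,
    region,
    runType: "schedule" as const,
    tickAt: tickAt.toISOString(),
  }));
}
```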

Regional comparison in dashboard

The Monitoring grid shows scenarios × regions. Cells display current status + sparkline. Highlight patterns:

Single region red — likely regional outage or CDN edge issue
3+ regions fail within 60s — collapsed to one global_outage incident

Incidents

When alert thresholds fire, Guard opens a guard_incident — deduplicated per scenario (and per region unless global collapse applies).

Incident lifecycle

threshold breached → incident opened → alerts sent
        │
        ├── ack (human acknowledges)
        ├── suppress (duration + reason)
        └── auto-resolve (2 consecutive successes)

Status	Meaning
`open`	Active failure
`acknowledged`	Owner assigned, investigating
`suppressed`	Snoozed (maintenance, known issue)
`resolved`	Scenario recovered
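The lifecycle above reads naturally as a small state machine. The sketch below is one way to model it under stated assumptions: the event names and the `Incident` shape are hypothetical, and it assumes auto-resolve applies from any unresolved status, which this page does not spell out.

```typescript
type IncidentStatus = "open" | "acknowledged" | "suppressed" | "resolved";

type IncidentEvent =
  | { type: "ack" }
  | { type: "suppress"; reason: string; durationMinutes: number }
  | { type: "run_passed" }
  | { type: "run_failed" };

interface Incident {
  status: IncidentStatus;
  consecutiveSuccesses: number;
}

const AUTO_RESOLVE_AFTER = 2; // consecutive successes needed to auto-resolve

// Return the next incident state for an event; resolved incidents never change.
function applyEvent(incident: Incident, event: IncidentEvent): Incident {
  if (incident.status === "resolved") return incident;
  switch (event.type) {
    case "ack":
      return { ...incident, status: "acknowledged" };
    case "suppress":
      return { ...incident, status: "suppressed" };
    case "run_failed":
      return { ...incident, consecutiveSuccesses: 0 }; // a failure resets the streak
    case "run_passed": {
      const streak = incident.consecutiveSuccesses + 1;
      if (streak >= AUTO_RESOLVE_AFTER) {
        return { status: "resolved", consecutiveSuccesses: streak };
      }
      return { ...incident, consecutiveSuccesses: streak };
    }
  }
}
```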

Smart dedup

One open incident per scenario × region until resolved
Same scenario failing in 4 regions = 4 alerts (real regional issue), unless global collapse applies
Global collapse: 3+ regions fail within 60s on same scenario → single global_outage incident
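The dedup rules above can be sketched as a function that turns a batch of regional failures into planned incidents. This illustrates the stated rules, not Guard's implementation: the type names, the `per_region` label, and the sliding-window reading of "within 60s" are assumptions.

```typescript
interface RegionFailure {
  scenarioId: string;
  region: string;
  at: number; // epoch milliseconds
}

interface PlannedIncident {
  scenarioId: string;
  region: string | null; // null for a collapsed global incident
  type: "per_region" | "global_outage";
}

const COLLAPSE_WINDOW_MS = 60 * 1000;
const COLLAPSE_MIN_REGIONS = 3;

function distinctRegions(failures: RegionFailure[]): string[] {
  return Array.from(new Set(failures.map((f) => f.region)));
}

// True when some 60-second window contains failures from 3 or more distinct regions.
function isGlobalOutage(failures: RegionFailure[]): boolean {
  const sorted = failures.slice().sort((a, b) => a.at - b.at);
  for (let i = 0; i < sorted.length; i++) {
    const inWindow = sorted.filter(
      (f, j) => j >= i && f.at - sorted[i].at <= COLLAPSE_WINDOW_MS,
    );
    if (distinctRegions(inWindow).length >= COLLAPSE_MIN_REGIONS) return true;
  }
  return false;
}

// One incident per scenario and region, collapsed to one global_outage when the rule fires.
function planIncidents(failures: RegionFailure[]): PlannedIncident[] {
  const byScenario = new Map<string, RegionFailure[]>();
  for (const f of failures) {
    const list = byScenario.get(f.scenarioId) ?? [];
    list.push(f);
    byScenario.set(f.scenarioId, list);
  }
  const incidents: PlannedIncident[] = [];
  byScenario.forEach((list, scenarioId) => {
    if (isGlobalOutage(list)) {
      incidents.push({ scenarioId, region: null, type: "global_outage" });
      return;
    }
    for (const region of distinctRegions(list)) {
      incidents.push({ scenarioId, region, type: "per_region" });
    }
  });
  return incidents;
}
```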

Dashboard

app.molar.it/dashboard/guard → Incidents

Filter: open / acked / suppressed / resolved
Types: consecutive_failures, failure_rate, latency_p99, shadow_diff, global_outage
Actions: ack, suppress, link to root run artifacts, trigger Mender

Alerting

Configure alert policies per scheduled check (alert_policy JSONB):

Rule	Default	Description
Consecutive failures	2	N failures in a row
Failure rate	50% over 30 min	M% failure rate in window
Shadow-prod diff	on when `shadow_prod: true`	Third-party model drift
Self-healing	informational	Locator heal occurred
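As an illustration of how the first two rules could be evaluated, the sketch below checks a run history against a policy using the defaults from the table. The field names inside `AlertPolicy` are hypothetical (the real `alert_policy` JSON schema is not shown here), and treating the thresholds as inclusive is an assumption.

```typescript
interface AlertPolicy {
  consecutiveFailures: number;
  failureRate: { percent: number; windowMinutes: number };
}

interface RunResult {
  at: number; // epoch milliseconds
  passed: boolean;
}

const DEFAULT_POLICY: AlertPolicy = {
  consecutiveFailures: 2,
  failureRate: { percent: 50, windowMinutes: 30 },
};

// Return the names of the rules that fire for this run history at time `now`.
function evaluatePolicy(runs: RunResult[], policy: AlertPolicy, now: number): string[] {
  const sorted = runs.slice().sort((a, b) => a.at - b.at);
  const fired: string[] = [];

  // Count failures at the tail of the history (most recent runs first).
  let streak = 0;
  for (let i = sorted.length - 1; i >= 0 && !sorted[i].passed; i--) streak++;
  if (streak >= policy.consecutiveFailures) fired.push("consecutive_failures");

  // Failure percentage over the trailing window.
  const windowStart = now - policy.failureRate.windowMinutes * 60 * 1000;
  const recent = sorted.filter((r) => r.at >= windowStart && r.at <= now);
  const failed = recent.filter((r) => !r.passed).length;
  if (recent.length > 0 && (failed / recent.length) * 100 >= policy.failureRate.percent) {
    fired.push("failure_rate");
  }
  return fired;
}
```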

Integrations

Channel	Status
Generic webhook	Shipped
Slack	Shipped
PagerDuty	Shipped
Microsoft Teams	Shipped
Opsgenie	Shipped
Email (via notification webhook)	Shipped
MCP (`molar://incidents`)	Shipped (standalone Guard MCP)

Auto-resolve: when a scenario recovers (2 consecutive successes), the incident closes and a "good news" notification is sent.

Production dashboard

Monitoring grid

Route: /monitoring

Matrix: rows = scenarios, columns = regions
Cell = current status + pass/fail sparkline
Drill-down: last N runs, error message, shadow-diff flag
Actions: pause region, snooze scenario, run check now, open incident

Check run detail

Click any scheduled run for:

Step timeline with assertion messages
Screenshot, video (5s pre-failure), HAR, console logs
Clone state diff on failure
Mender triage panel with Apply fix (suggestive mode)
Open trace → Cartographer trace when trace_id present

Status page

Guard exposes public health JSON at GET /v1/status. Hosted status pages are served at guard.molar.it/status.

Enable production monitoring (checklist)

Complete onboarding — GitHub connected, scenarios imported
Install synthetic middleware — Express/Next.js (required for money flows)
Run analytics preflight — confirm synthetic order excluded from exports
Create scheduled checks — pick scenarios, cron, regions
Configure alerts — Slack or PagerDuty webhook
Optional: enable shadow_prod: true on checks touching third-party APIs — see Shadow-prod diff
Verify — GET /v1/status healthy; first green runs in Monitoring grid

Example: full scenario with production schedule

---
id: stripe-subscription-upgrade
description: User on Free plan upgrades to Pro
tags: [billing, critical]
schedule:
  cron: "*/5 * * * *"
  regions: [us-east-1, eu-west-1, ap-south-1]
  shadow_prod: true
mender:
  mode: suggestive
cache: never
---

Stripe subscription upgrade

Steps

Navigate to /settings/billing
Click "Upgrade to Pro"
Assert badge shows "Pro"
Assert webhook customer.subscription.created received


---

## API reference (schedules and incidents)

| Method | Path | Purpose |
|--------|------|---------|
| `POST` | `/v1/scheduled_checks` | Create schedule |
| `PUT` | `/v1/scheduled_checks/:id` | Update cron, regions, alert policy |
| `POST` | `/v1/scheduled_checks/:id/pause` | Pause for N hours |
| `POST` | `/v1/runs/manual` | Trigger run now |
| `POST` | `/v1/incidents/:id/ack` | Acknowledge |
| `POST` | `/v1/incidents/:id/suppress` | Suppress with reason |

See [Webhooks & API](/docs/guard/webhooks-api) for auth and payloads.

---

## Usage limits

Production region count, concurrent prod runs, and minimum schedule interval are enforced per organization. See **Settings → Billing** on [app.molar.it](https://app.molar.it/billing) for your org's limits — tier tables are not published in this documentation.

---

## Next

- [Shadow-prod diff](/docs/guard/shadow-prod-diff) — parallel prod + Clone comparison
- [Mender auto-fix](/docs/guard/mender) — triage and fix PRs from prod failures
- [PR gating](/docs/guard/pr-gating) — pre-merge checks with the same scenarios
- [Configuration](/docs/guard/configuration) — `baseUrl.schedule`, frontmatter, alert JSON
- [Security](/docs/guard/security) — synthetic safety deep dive
- [Troubleshooting](/docs/guard/troubleshooting) — false alerts, middleware gaps
