---
title: Driving MockServer from Chaos Orchestrators
shortTitle: Chaos Orchestrators
description: Drive MockServer's service-scoped chaos from Chaos Toolkit, AWS FIS, Azure Chaos Studio and LitmusChaos using the control-plane endpoint, with a TTL dead-man's switch for safe auto-revert.
layout: page
pageOrder: 3
section: 'Chaos Testing'
subsection: true
sitemap:
  priority: 0.8
  changefreq: 'monthly'
  lastmod: 2026-05-31T08:00:00+00:00
keywords: chaos orchestrator integration, chaos toolkit mockserver, aws fis http, azure chaos studio webhook, litmuschaos byoc, chaos engineering control plane, fault injection api, dead mans switch chaos, ttl auto revert
schema_faq:
  - question: "How do I drive MockServer chaos from an external chaos engineering tool?"
    answer: "Any orchestrator that can make an HTTP call (or run a shell step) can drive MockServer's service-scoped chaos through the control-plane endpoint PUT /mockserver/serviceChaos. Register a chaos profile for an upstream host at the start of the experiment, run your steady-state checks, then clear it at the end. Always include a ttlMillis so the fault auto-reverts even if the orchestrator never sends the clear."
  - question: "What happens if my chaos experiment crashes before it removes the fault?"
    answer: "Add a ttlMillis (time-to-live) to the registration. MockServer auto-reverts the chaos that many milliseconds after it is registered, so the injected fault self-heals even if the orchestrator crashes, times out, or its rollback step never runs. This is the dead-man's-switch pattern and is the recommended safety net for every orchestrator integration."
  - question: "Does AWS FIS or Azure Chaos Studio have a native MockServer fault?"
    answer: "No. Neither has a generic HTTP action, so you wrap a control-plane call in a step they can run — an SSM RunShellScript document for AWS FIS, or an Azure Automation runbook / pipeline step for Azure Chaos Studio. Because those tools cannot guarantee a rollback step runs, set a ttlMillis matched to the experiment duration so MockServer reverts the fault on its own."
---
MockServer is a controllable fault-injection point: its service-scoped chaos control plane lets you break an upstream host with a single HTTP call and revert it with another. That makes it straightforward to drive from any chaos-engineering orchestrator — the orchestrator owns the experiment timeline, MockServer applies the fault.
Every orchestrator integration makes the same three moves against PUT /mockserver/serviceChaos. It injects a chaos profile for the upstream host, verifies steady state while the profile is active, and then reverts it. The inject call always carries a ttlMillis.

```bash
# 1. inject: break payments.svc for the planned 5-minute experiment window
curl -sf -X PUT http://mockserver:1080/mockserver/serviceChaos \
  -H 'Content-Type: application/json' \
  -d '{"host":"payments.svc","chaos":{"errorStatus":503,"errorProbability":0.3,"latency":{"timeUnit":"MILLISECONDS","value":500}},"ttlMillis":300000}'

# 2. verify: your orchestrator runs its steady-state probes against the app here

# 3. revert: clear the chaos (the TTL already guarantees auto-revert as a backstop)
curl -sf -X PUT http://mockserver:1080/mockserver/serviceChaos \
  -H 'Content-Type: application/json' \
  -d '{"host":"payments.svc","remove":true}'
```
Always set ttlMillis when an orchestrator owns the experiment. Orchestrators crash, time out, get cancelled, and sometimes skip their rollback step. The TTL is a dead-man's switch: MockServer reverts the fault on its own after the window elapses, so a failed experiment can never leave a dependency permanently broken. Match the TTL to (or slightly above) the experiment's planned duration. See Service-scoped Chaos.
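The same pattern can also be wrapped in a small reusable bash function. This is a minimal sketch that assumes bash and curl are available and that MOCKSERVER_URL points at MockServer. The variable and the helper names run_experiment and service_chaos_put are hypothetical and are not part of MockServer. A shell trap removes the fault on every exit path the script can observe. The ttlMillis covers the paths it cannot observe, such as a SIGKILL or a lost host.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not a MockServer tool: inject -> verify -> revert with a TTL backstop.
MOCKSERVER_URL="${MOCKSERVER_URL:-http://mockserver:1080}"

# PUT a JSON body to the service-scoped chaos endpoint; non-zero on HTTP errors (-f).
service_chaos_put() {
  curl -sf -X PUT "$MOCKSERVER_URL/mockserver/serviceChaos" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

# run_experiment HOST CHAOS_JSON TTL_MILLIS PROBE_COMMAND [ARGS...]
# The body runs in a subshell "( ... )", so its EXIT trap fires when the experiment
# ends without touching the caller's own traps.
run_experiment() (
  host="$1" chaos="$2" ttl="$3"
  shift 3

  # 1. inject: ttlMillis is the dead-man's switch if nothing below ever runs
  service_chaos_put "{\"host\":\"$host\",\"chaos\":$chaos,\"ttlMillis\":$ttl}" >/dev/null || exit 1

  # 3. revert on normal exit, probe failure, Ctrl-C or SIGTERM (converted to exit);
  #    a SIGKILL or a lost host skips this trap, and the TTL reverts the fault instead
  trap 'service_chaos_put "{\"host\":\"$host\",\"remove\":true}" >/dev/null' EXIT
  trap 'exit 130' INT TERM

  # 2. verify: run the caller's steady-state probe while the fault is active
  "$@"
)

# Example:
#   run_experiment payments.svc '{"errorStatus":503,"errorProbability":0.3}' 300000 \
#     curl -sf http://app:8080/checkout/health
```

The function splices CHAOS_JSON into the request body without validating it, so pass well-formed JSON. The host name must not contain quotes.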
Chaos Toolkit experiments map cleanly onto the inject/verify/revert pattern. The method registers the chaos. The steady-state hypothesis asserts that the application copes; it is checked before the method and again after it, while the fault is still active. The rollbacks section then clears the chaos. The built-in http provider sends the control-plane call directly, so no custom extension is needed.
```json
{
  "title": "Payments API tolerates a flaky upstream",
  "description": "Inject 30% 503s + 500ms latency into payments.svc and assert the checkout still succeeds",
  "steady-state-hypothesis": {
    "title": "Checkout stays healthy",
    "probes": [
      {
        "type": "probe",
        "name": "checkout-returns-2xx",
        "tolerance": 200,
        "provider": { "type": "http", "url": "http://app:8080/checkout/health", "method": "GET" }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "break-payments-upstream",
      "provider": {
        "type": "http",
        "url": "http://mockserver:1080/mockserver/serviceChaos",
        "method": "PUT",
        "headers": { "Content-Type": "application/json" },
        "arguments": {
          "host": "payments.svc",
          "chaos": { "errorStatus": 503, "errorProbability": 0.3, "latency": { "timeUnit": "MILLISECONDS", "value": 500 } },
          "ttlMillis": 300000
        }
      },
      "pauses": { "after": 60 }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "heal-payments-upstream",
      "provider": {
        "type": "http",
        "url": "http://mockserver:1080/mockserver/serviceChaos",
        "method": "PUT",
        "headers": { "Content-Type": "application/json" },
        "arguments": { "host": "payments.svc", "remove": true }
      }
    }
  ]
}
```
The Content-Type: application/json header is what makes the built-in http provider send arguments as the JSON request body (rather than form data) on a PUT — so the chaos profile reaches MockServer as the JSON it expects. The ttlMillis of 300000 means that even if the run is interrupted before rollbacks executes, the chaos reverts after five minutes.
AWS FIS has no generic HTTP action, so wrap the control-plane call in an SSM RunShellScript document run against the host (EC2 instance or container task) that can reach MockServer. FIS stop conditions and timeouts halt the experiment but do not roll back an already-executed SSM command, so they will not undo a registered fault — the ttlMillis (matched to the action duration) is the backstop that guarantees recovery.
```json
{
  "description": "Break payments.svc via MockServer for 5 minutes",
  "actions": {
    "break-payments-upstream": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWS-RunShellScript",
        "duration": "PT5M",
        "documentParameters": "{\"commands\":[\"curl -sf -X PUT http://mockserver:1080/mockserver/serviceChaos -H 'Content-Type: application/json' -d '{\\\"host\\\":\\\"payments.svc\\\",\\\"chaos\\\":{\\\"errorStatus\\\":503,\\\"errorProbability\\\":0.3},\\\"ttlMillis\\\":300000}'\"]}"
      },
      "targets": { "Instances": "mockserver-host" }
    }
  },
  "targets": {
    "mockserver-host": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Name": "mockserver" },
      "selectionMode": "ALL"
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:checkout-error-rate" }
  ],
  "roleArn": "arn:aws:iam::111122223333:role/fis-mockserver-role"
}
```
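The PT5M action duration and the ttlMillis of 300000 are deliberately the same. MockServer therefore reverts the fault on its own when the action's window closes, and no second SSM command is needed. If the CloudWatch stop condition halts the experiment early, the fault stays active until the TTL lapses. Send an explicit remove call if you need it gone sooner.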
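Keeping the action duration and ttlMillis in step by hand is error-prone whenever the duration changes. One option is to derive the TTL from the duration. This is a minimal bash sketch: iso_duration_to_ms is a hypothetical helper. It only handles time-based ISO 8601 durations of the form PT<h>H<m>M<s>S, such as PT5M, and rejects day-based forms such as P1D.

```bash
#!/usr/bin/env bash
# iso_duration_to_ms DURATION
# Convert a time-only ISO 8601 duration (PT1H30M, PT5M, PT90S) to milliseconds.
iso_duration_to_ms() {
  local re='^PT(([0-9]+)H)?(([0-9]+)M)?(([0-9]+)S)?$'
  # reject anything else, including a bare "PT" with no components
  if [[ ! "$1" =~ $re ]] || [ "$1" = "PT" ]; then
    echo "unsupported duration: $1" >&2
    return 1
  fi
  # unmatched groups are empty strings, so default them to 0
  local h="${BASH_REMATCH[2]:-0}" m="${BASH_REMATCH[4]:-0}" s="${BASH_REMATCH[6]:-0}"
  # 10# forces base 10, so values such as "08" are not parsed as octal
  echo $(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * 1000 ))
}

# Example: build the registration body for the PT5M action above
#   ttl="$(iso_duration_to_ms PT5M)"
#   body="{\"host\":\"payments.svc\",\"chaos\":{\"errorStatus\":503,\"errorProbability\":0.3},\"ttlMillis\":$ttl}"
```

For PT5M the helper prints 300000, which is the value used in the template.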
Azure Chaos Studio likewise has no generic HTTP fault, so trigger the control-plane call from a step it can drive — an Azure Automation runbook, a Logic App, or a pipeline task wrapped around the Chaos Studio experiment. As with FIS, set a ttlMillis so the fault reverts even if the runbook or pipeline does not reach its cleanup step.
An Azure Automation PowerShell runbook step:
```powershell
Invoke-RestMethod -Method Put `
  -Uri "http://mockserver:1080/mockserver/serviceChaos" `
  -ContentType "application/json" `
  -Body '{"host":"payments.svc","chaos":{"errorStatus":503,"errorProbability":0.3},"ttlMillis":300000}'
```
Or an Azure Pipelines step around the experiment (the equivalent GitHub Actions step uses run: and name: instead of script: and displayName:):

```yaml
- script: |
    curl -sf -X PUT http://mockserver:1080/mockserver/serviceChaos \
      -H 'Content-Type: application/json' \
      -d '{"host":"payments.svc","chaos":{"errorStatus":503,"errorProbability":0.3},"ttlMillis":300000}'
  displayName: "Inject upstream chaos (auto-reverts in 5m)"
```
LitmusChaos probes can both drive and verify the experiment. A cmdProbe in SOT (Start Of Test) mode registers the chaos, an httpProbe verifies the application still serves traffic, and a second cmdProbe in EOT (End Of Test) mode clears it. The ttlMillis covers the case where the experiment pod is evicted before the EOT probe runs.
```yaml
probe:
  - name: inject-mockserver-chaos
    type: cmdProbe
    mode: SOT
    cmdProbe/inputs:
      # cmdProbe compares the command's output, not its exit code, so print curl's exit status
      command: "curl -sf -o /dev/null -X PUT http://mockserver:1080/mockserver/serviceChaos -H 'Content-Type: application/json' -d '{\"host\":\"payments.svc\",\"chaos\":{\"errorStatus\":503,\"errorProbability\":0.3},\"ttlMillis\":300000}'; echo $?"
      source:
        image: "curlimages/curl:latest"
      comparator:
        type: "int"
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 10s
      interval: 5s
  - name: checkout-stays-healthy
    type: httpProbe
    mode: Continuous
    httpProbe/inputs:
      url: "http://app:8080/checkout/health"
      method:
        get:
          criteria: "=="
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 5s
  - name: clear-mockserver-chaos
    type: cmdProbe
    mode: EOT
    cmdProbe/inputs:
      command: "curl -sf -o /dev/null -X PUT http://mockserver:1080/mockserver/serviceChaos -H 'Content-Type: application/json' -d '{\"host\":\"payments.svc\",\"remove\":true}'; echo $?"
      source:
        image: "curlimages/curl:latest"
      comparator:
        type: "int"
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 10s
      interval: 5s
```
Because the integration point is a single HTTP call with a built-in safety timer, you do not need a dedicated chaos platform at all. A cron job, a CI pipeline, a k6 setup()/teardown(), an AWS Step Functions state, a Gremlin scenario hook, or a load-test harness can each drive a game-day exercise with one control-plane call, for example:

```bash
# a self-contained, self-healing game-day fault: one call, auto-reverts in 10 minutes
curl -sf -X PUT http://mockserver:1080/mockserver/serviceChaos \
  -H 'Content-Type: application/json' \
  -d '{"host":"payments.svc","chaos":{"errorStatus":503,"errorProbability":0.5,"latency":{"timeUnit":"MILLISECONDS","value":1000}},"ttlMillis":600000}'
```
While an experiment runs you can confirm what is active and watch the countdown:
```bash
# shows each registered host, its profile, and the remaining TTL countdown
curl -sf http://mockserver:1080/mockserver/serviceChaos
# -> { "services": { "payments.svc": { ... } }, "ttlRemainingMillis": { "payments.svc": 248123 } }
```
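An orchestrator or CI job sometimes needs to wait until the fault is actually gone, for example before a post-experiment assertion. It can poll the same endpoint. This is a minimal sketch. It assumes bash and curl, uses a hypothetical helper named wait_for_revert, and assumes a registered host name appears quoted in the response body, as in the sample output above.

```bash
#!/usr/bin/env bash
MOCKSERVER_URL="${MOCKSERVER_URL:-http://mockserver:1080}"

# wait_for_revert HOST TIMEOUT_SECONDS [POLL_SECONDS]
# Returns 0 once HOST is no longer listed, 1 on timeout, 2 if MockServer cannot be queried.
wait_for_revert() {
  local host="$1" deadline=$(( SECONDS + $2 )) poll="${3:-5}" body
  while [ "$SECONDS" -lt "$deadline" ]; do
    # -f makes curl fail on HTTP errors, so an unreachable server is never mistaken for "reverted"
    body="$(curl -sf "$MOCKSERVER_URL/mockserver/serviceChaos")" || return 2
    # naive substring check: the quoted host name anywhere in the JSON means chaos is still registered
    case "$body" in
      *"\"$host\""*) sleep "$poll" ;;
      *) return 0 ;;
    esac
  done
  return 1
}

# Example: block for up to six minutes while the 5-minute TTL runs out
#   wait_for_revert payments.svc 360 || echo "payments.svc chaos is still active" >&2
```

A substring check is deliberately crude. If jq is available in your pipeline, parsing the services object is more robust.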
When metrics are enabled, the mock_server_active_service_chaos Prometheus gauge reports how many hosts currently have active service-scoped chaos. Because it drops to 0 as profiles are cleared or their TTLs lapse, mock_server_active_service_chaos > 0 is a natural alert for "a chaos experiment is still live" — a useful guard against an orchestrator that forgot to clean up. See Observability.
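The same gauge also works as a one-shot guard at the end of a CI job or game day. This is a minimal sketch. It assumes bash, curl and awk, and uses a hypothetical helper named assert_no_live_chaos. You pass in the URL where your deployment exposes MockServer's Prometheus metrics; that path is deployment-specific and is not assumed here.

```bash
#!/usr/bin/env bash
# assert_no_live_chaos METRICS_URL
# Returns 0 if mock_server_active_service_chaos is 0, 1 if any host still has chaos,
# and 2 if the metrics cannot be scraped or the gauge is missing.
assert_no_live_chaos() {
  local metrics value
  metrics="$(curl -sf "$1")" || { echo "cannot scrape $1" >&2; return 2; }
  # sample lines look like "mock_server_active_service_chaos 1.0", optionally with {labels};
  # "# HELP" / "# TYPE" lines start with "#" and never match
  value="$(printf '%s\n' "$metrics" \
    | awk '$1 ~ /^mock_server_active_service_chaos([{].*[}])?$/ { print $2; exit }')"
  if [ -z "$value" ]; then
    echo "mock_server_active_service_chaos not found" >&2
    return 2
  fi
  # Prometheus values are floats (e.g. 0.0), so compare numerically in awk
  if awk -v v="$value" 'BEGIN { exit !((v + 0) > 0) }'; then
    echo "service chaos still live on $value host(s)" >&2
    return 1
  fi
}
```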
Any orchestrator that can make an HTTP call (or run a shell step) can drive MockServer's service-scoped chaos through PUT /mockserver/serviceChaos. Register a chaos profile for an upstream host at the start of the experiment, run your steady-state checks, then clear it at the end. Always include a ttlMillis so the fault auto-reverts even if the orchestrator never sends the clear. See the integration pattern at the top of this page.
If an experiment might crash before it removes its fault, add a ttlMillis (time-to-live) to the registration. MockServer auto-reverts the chaos that many milliseconds after it is registered. The injected fault therefore self-heals even if the orchestrator crashes, times out, or never runs its rollback step. This dead-man's switch is the recommended safety net for every orchestrator integration.
Neither AWS FIS nor Azure Chaos Studio has a native MockServer fault or a generic HTTP action. Instead, you wrap a control-plane call in a step they can run: an SSM RunShellScript document for AWS FIS, or an Azure Automation runbook or pipeline step for Azure Chaos Studio. Because those tools cannot guarantee a rollback step runs, set a ttlMillis matched to the experiment duration so MockServer reverts the fault on its own.
The /mockserver/serviceChaos endpoint is protected by control-plane authentication when it is configured (mTLS or a JWT). Supply the same credentials your other control-plane calls use — for a JWT, add an Authorization: Bearer <token> header to the curl / HTTP request. See Configuration Properties.
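When several scripts call the control plane, one wrapper keeps the credentials in a single place. This is a minimal sketch that assumes bash and curl. The function chaos_put and the variables MOCKSERVER_URL, MOCKSERVER_TOKEN, MOCKSERVER_CLIENT_CERT, MOCKSERVER_CLIENT_KEY and MOCKSERVER_CA_CERT are hypothetical names. The -H, --cert, --key and --cacert flags are standard curl options. For mTLS, point MOCKSERVER_URL at the https address.

```bash
#!/usr/bin/env bash
# chaos_put JSON_BODY
# Hypothetical wrapper: adds a JWT and/or mTLS material only when the variables are set.
chaos_put() {
  local extra=()
  # JWT: Authorization: Bearer <token>
  if [ -n "${MOCKSERVER_TOKEN:-}" ]; then
    extra+=(-H "Authorization: Bearer $MOCKSERVER_TOKEN")
  fi
  # mTLS: client certificate and private key
  if [ -n "${MOCKSERVER_CLIENT_CERT:-}" ]; then
    extra+=(--cert "$MOCKSERVER_CLIENT_CERT")
  fi
  if [ -n "${MOCKSERVER_CLIENT_KEY:-}" ]; then
    extra+=(--key "$MOCKSERVER_CLIENT_KEY")
  fi
  # CA bundle used to verify MockServer's own certificate
  if [ -n "${MOCKSERVER_CA_CERT:-}" ]; then
    extra+=(--cacert "$MOCKSERVER_CA_CERT")
  fi
  # ${extra[@]+...} expands to nothing for an empty array, even under set -u on older bash
  curl -sf -X PUT "${MOCKSERVER_URL:-http://mockserver:1080}/mockserver/serviceChaos" \
    -H 'Content-Type: application/json' \
    ${extra[@]+"${extra[@]}"} \
    -d "$1"
}

# Example:
#   MOCKSERVER_TOKEN="$JWT" chaos_put '{"host":"payments.svc","remove":true}'
```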