Related Jira ticket: SYSM-358
This page focuses on three Fabric capacity options that can improve operational resilience:
- Surge Protection
- Notifications / Alerts
- Capacity Overage
For our platform, these options should be treated as operational controls, not as substitutes for proper capacity isolation and sizing. Since throttling is applied at the capacity level, the first line of protection for the Data Platform Core workspace (production capacity only) remains capacity separation from Domain workloads.
| Version | Date | Description | Contributor |
|---|---|---|---|
| V0.1 | | Initial document | COLOMBANI Théo |
3. Surge Protection
What it is
Surge Protection helps limit overuse of a capacity by controlling background compute consumption. When it becomes active at the capacity level, new background jobs are rejected. Microsoft also recommends using the Capacity Metrics app to tune thresholds, and explicitly states that critical solutions should be isolated on a dedicated capacity for full protection.
What matters for us
Surge Protection is useful when interactive or user-facing workloads share capacity with background operations such as refreshes, AI jobs, or other heavy compute activity. Microsoft’s planning guidance recommends it in exactly that situation.
Important limitations
Surge Protection does not guarantee that interactive requests will never be delayed or rejected. It does not stop jobs already in progress. Some Fabric UI actions are treated as background operations and can also be rejected. Certain OneLake activities remain unaffected.
Workspace-level control
Workspace-level Surge Protection adds a second layer: it can enforce per-workspace CU limits, automatically detect and block noisy workspaces, and mark a workspace as Mission Critical or Blocked. A Mission Critical workspace ignores workspace-level blocking rules, while a Blocked workspace rejects all requests during the block period.
What we should put in place
For Data Platform Core capacity
- do not rely on Surge Protection as the main protection
- protect this workspace first through dedicated capacity
- optionally keep Surge Protection available as a secondary control, but only after observing real usage patterns in the Capacity Metrics app.
For Domain production capacity
- enable capacity-level Surge Protection
- enable workspace-level Surge Protection
- define a rule that any Domain workspace showing repeated abnormal consumption can be temporarily blocked
- reserve Mission Critical only for a very small number of justified workspaces.
Suggested policy text
Surge Protection should be enabled primarily on shared Domain capacities to reduce the impact of bursty background workloads. It must not replace dedicated capacity for the Data Platform Core workspace. Thresholds must be based on observed capacity metrics and reviewed periodically.
4. Notifications and Alerts
What it is
Fabric provides two practical approaches for capacity alerting:
- classic capacity notification emails configured by capacity admins
- newer Real-Time Hub / Capacity Overview Events that can trigger alerts when capacity thresholds are crossed

Capacity email notifications are checked every 15 minutes based on recent capacity activity, while Real-Time Hub supports near-real-time monitoring and threshold-based workflows.
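The baseline option amounts to a periodic threshold check. A minimal sketch of that logic, assuming utilization is read as a percentage from whatever source you use (Capacity Metrics app exports, admin APIs); the threshold value is an assumption to be tuned from observed metrics:

```python
# Classic capacity notifications are evaluated roughly every 15 minutes;
# this constant documents that cadence for anyone replicating the check.
POLL_INTERVAL_SECONDS = 15 * 60

ALERT_THRESHOLD_PCT = 80.0  # assumed threshold, tune from observed capacity metrics

def check_once(utilization_pct: float, threshold: float = ALERT_THRESHOLD_PCT) -> bool:
    """Return True when a capacity alert should fire."""
    return utilization_pct >= threshold
```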
What matters for us
For a governed platform, notifications should not be treated as optional. They are the minimum control that turns capacity health into an operational process rather than a reactive troubleshooting exercise. Capacity Overview Events are specifically intended to monitor capacity health and create automated alerts.
What we should put in place
Minimum baseline for every production capacity
- one named operational owner / one shared distribution list or team channel
- alerting for capacity health degradation
For Data Platform Core capacity
- mandatory alerting for approach to throttling
- alerts routed to central platform operations
For Domain production capacity
- mandatory alerting as well
- alerts should trigger investigation of the top consumer workspace or item
- repeated alerts should lead to either threshold tuning, workload optimization, or workspace isolation.
Recommended implementation path
Option 1 — Simple baseline
Use capacity email notifications for a first layer of monitoring. These are configured by a capacity admin in capacity settings.
Option 2 — Recommended target
Use Fabric Capacity Overview Events in Real-Time Hub and create threshold-based alerts. Microsoft’s guidance for alert setup uses the event type Microsoft.Fabric.Capacity.Summary, selects monitoring by capacity, and recommends using numeric threshold conditions rather than connection-time filters.
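The shape of that alert logic can be sketched as follows: select `Microsoft.Fabric.Capacity.Summary` events for one capacity, then apply a numeric threshold condition rather than filtering at connection time. The event field names (`capacityId`, `utilizationPct`) are assumptions for illustration, not the documented event schema.

```python
def should_alert(event: dict, capacity_id: str, threshold_pct: float = 80.0) -> bool:
    """Illustrative threshold condition over a capacity summary event."""
    if event.get("eventType") != "Microsoft.Fabric.Capacity.Summary":
        return False
    if event.get("capacityId") != capacity_id:
        return False  # monitoring is scoped to a single capacity
    # Numeric threshold condition, applied per event rather than at connection time
    return float(event.get("utilizationPct", 0)) >= threshold_pct
```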
Suggested policy text
Every production Fabric capacity must have an assigned operational owner, active alerting, and a documented escalation path. Real-Time Hub alerts should be the preferred target implementation for capacity health monitoring.
5. Capacity Overage
What it is
Capacity Overage allows Fabric to use extra compute beyond the purchased capacity limit to prevent throttling. It is available only for F SKUs, requires capacity admin permissions, and requires sufficient quota or Fabric capacity units to support the configured overage limit. Microsoft says it is turned off by default for existing capacities.
What it does well
Microsoft positions Capacity Overage as a way to handle rare unexpected spikes or small regular spikes where scaling up is not otherwise required. It helps prevent throttling and allows new jobs to run, reducing downstream user impact.
Important limitations
Capacity Overage does not improve performance. It mainly prevents throttling. It can also admit new large jobs, so it does not remove the need for governance. Microsoft also warns to use caution when scaling down a capacity with overage enabled, because automatic charges can become significant.
What we should put in place
For Data Platform Core capacity
- consider enabling it only if uptime is critical
- define a capped overage limit
- require explicit approval from both platform owner and budget owner
- review every overage event as an operational signal, not as normal behavior.
For Domain production capacity
- use only if the business accepts the cost model
- keep limits lower than on the Core capacity
- do not use overage to hide repeated poor sizing or uncontrolled workloads.
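The rule that overage must not hide repeated poor sizing implies tracking overage events over time. A minimal sketch, where the 3-events-in-30-days policy is an assumed internal rule (not Microsoft guidance) to be set by the platform owner:

```python
from collections import deque
from datetime import datetime, timedelta

class OverageTracker:
    """Flags recurring overage as structural rather than occasional."""

    def __init__(self, max_events: int = 3, window_days: int = 30):
        self.max_events = max_events
        self.window = timedelta(days=window_days)
        self.events: deque = deque()

    def record(self, when: datetime) -> str:
        self.events.append(when)
        # Drop events that fell out of the rolling window
        while self.events and when - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.max_events:
            return "structural-review"  # trigger resize / optimize / isolate review
        return "occasional"
```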
Clear decision rule
Enable Capacity Overage when:
- the capacity is business-critical
- overload is occasional, not structural
- the financial model is accepted
- there is active monitoring behind it.
Do not enable it as the default response to recurring saturation. In that case, the right answer is usually resize, optimize, or isolate workloads.
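The decision rule above is a strict conjunction and can be written as a checklist function. All four inputs are judgment calls made by the capacity and budget owners, not values read from any API:

```python
def overage_allowed(business_critical: bool,
                    overload_is_occasional: bool,
                    cost_model_accepted: bool,
                    actively_monitored: bool) -> bool:
    """Enable Capacity Overage only when every condition holds."""
    return all([business_critical,
                overload_is_occasional,
                cost_model_accepted,
                actively_monitored])
```

If any condition is false, the default answer stays no, and recurring saturation routes to resize, optimize, or isolate instead.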
Suggested policy text
Capacity Overage may be enabled on critical Fabric capacities as a controlled resilience mechanism. It must remain capped, monitored, and financially approved. It must not replace correct sizing or workload isolation.
6. Recommended Actions for Our Platform
| Feature | Data Platform Core capacity | Domain production capacity | What to implement |
|---|---|---|---|
| Surge Protection | Secondary control only | Yes | Enable mainly on Domain capacity, tune with Metrics App, use workspace-level controls |
| Notifications / Alerts | Mandatory | Mandatory | Define owner, recipients, thresholds, escalation path |
| Capacity Overage | Optional, capped | Optional, tightly capped | Use only for rare peaks and with budget approval |
This operating model aligns with Microsoft’s guidance: isolate critical workloads first, use Surge Protection to protect shared interactive workloads, use alerts for active monitoring, and use Capacity Overage as a safety net rather than a normal operating mode.
7. Checklist
- Is the Data Platform Core workspace on a dedicated capacity?
- Is Surge Protection enabled and tuned on Domain production capacity?
- Is workspace-level Surge Protection enabled for Domain workspaces?
- Are Mission Critical workspaces explicitly limited and documented?
- Is there a named operational owner for each production capacity?
- Are capacity alerts configured for both Core and Domain capacities?
- Are Real-Time Hub alerts planned or implemented?
- Is Capacity Overage enabled only where uptime justifies it?
- Is every overage limit capped and financially approved?
- Is each alert or overage event reviewed as part of run operations?
8. Final Recommendation
- Data Platform Core: dedicated capacity, mandatory alerts, optional capped overage, no dependency on Surge Protection as the main guardrail
- Domain production: shared capacity allowed, Surge Protection enabled, workspace-level controls enabled, mandatory alerts, optional tightly capped overage
- Governance rule: any repeated alert or repeated overage should trigger a review of sizing, workload optimization, or workspace isolation.