SYSM-358

This page focuses on three Fabric capacity options that can improve operational resilience:

  • Surge Protection
  • Notifications / Alerts
  • Capacity Overage

For our platform, these options should be treated as operational controls, not as substitutes for proper capacity isolation and sizing. Because throttling is applied at the capacity level, the first line of protection for the Data Platform Core workspace (this page covers production capacity only) remains capacity separation from Domain workloads.


| Version | Date | Description | Contributor |
|---|---|---|---|
| V0.1 | | Initial document | COLOMBANI Théo |
| V0.1 | | Update document | COLOMBANI Théo |



Key message

These three options should be used as operational safeguards, not as substitutes for proper capacity design.

For our platform, the priority remains:

  • protect the Data Platform Core workspace through capacity isolation

  • use Surge Protection mainly to control variable Domain workloads

  • make Notifications mandatory for production operations

  • use Capacity Overage only as a controlled safety net for rare peaks. (learn.microsoft.com)


What we should implement

| Feature | Data Platform Core capacity | Domain production capacity | Clear recommendation |
|---|---|---|---|
| Surge Protection | Secondary control only | Yes | Enable mainly on Domain capacity |
| Notifications / Alerts | Mandatory | Mandatory | Define owner, recipients, thresholds, escalation |
| Capacity Overage | Optional, capped | Optional, tightly capped | Use only for rare peaks and with budget approval |

Recommended setup


For Data Platform Core

  • keep it on a dedicated capacity

  • make alerts mandatory

  • use Capacity Overage only if uptime is critical and cost is approved

  • do not rely on Surge Protection as the main protection layer. (learn.microsoft.com)

For Domain production

  • allow shared capacity if needed

  • enable Surge Protection

  • enable workspace-level controls

  • make alerts mandatory

  • use Capacity Overage only with strict limits. (learn.microsoft.com)

Decision guide

Enable Surge Protection when

  • the capacity is shared

  • Domain workloads are variable

  • noisy workspaces need runtime control. (learn.microsoft.com)

Make Notifications mandatory when

  • the capacity is production

  • the platform team is expected to operate it properly

That means: always for production capacities. (learn.microsoft.com)

Enable Capacity Overage when

  • the capacity is critical

  • peaks are occasional, not structural

  • the financial model is accepted

  • monitoring is already in place. (learn.microsoft.com)
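The decision guide can be read as three independent rules. The sketch below is one possible encoding, not an official policy engine: the `CapacityProfile` fields and the `recommended_controls` function are hypothetical names, and combining the Surge Protection conditions as "shared and (variable or noisy)" is an assumption about how the bullets are meant to be read.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CapacityProfile:
    """Hypothetical description of a capacity; field names are illustrative."""
    name: str
    is_production: bool
    is_shared: bool
    variable_domain_workloads: bool
    noisy_workspaces: bool
    is_critical: bool
    peaks_are_occasional: bool
    cost_model_accepted: bool
    monitoring_in_place: bool


def recommended_controls(p: CapacityProfile) -> Dict[str, bool]:
    """Apply the three decision rules to one capacity profile."""
    return {
        # Assumption: shared capacity plus at least one source of variability.
        "surge_protection": p.is_shared
        and (p.variable_domain_workloads or p.noisy_workspaces),
        # Always for production capacities.
        "notifications": p.is_production,
        # Assumption: every Capacity Overage condition must hold.
        "capacity_overage": p.is_critical
        and p.peaks_are_occasional
        and p.cost_model_accepted
        and p.monitoring_in_place,
    }


domain_prod = CapacityProfile(
    name="domain-prod", is_production=True, is_shared=True,
    variable_domain_workloads=True, noisy_workspaces=False,
    is_critical=False, peaks_are_occasional=True,
    cost_model_accepted=False, monitoring_in_place=True,
)
print(recommended_controls(domain_prod))
```

With this profile the sketch recommends Surge Protection and Notifications but not Capacity Overage, because the capacity is not critical and the cost model is not accepted.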

Quick checklist

  • Is Surge Protection enabled on Domain production capacity?

  • Is workspace-level Surge Protection enabled for Domain workspaces?

  • Are Mission Critical workspaces explicitly limited and documented?

  • Does each production capacity have an owner?

  • Are alerts configured for each production capacity?

  • Is Real-Time Hub alerting planned or implemented?

  • Is Capacity Overage enabled only where justified?

  • Is every overage limit capped and approved?

  • Do repeated alerts or overage events trigger a review of sizing or isolation?
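The checklist above could also be run as a periodic audit. The following is a minimal sketch under stated assumptions: `CapacityState` is a hypothetical, manually maintained record (it does not call any Fabric API), and `MAX_MISSION_CRITICAL` is an assumed team convention rather than a Fabric limit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CapacityState:
    """Hypothetical snapshot of one production capacity (illustrative fields)."""
    name: str
    is_domain: bool
    owner: Optional[str] = None
    alerts_configured: bool = False
    surge_protection: bool = False
    workspace_surge_protection: bool = False
    mission_critical_workspaces: List[str] = field(default_factory=list)
    overage_enabled: bool = False
    overage_capped_and_approved: bool = False


# Assumed team convention, not a Fabric limit.
MAX_MISSION_CRITICAL = 2


def checklist_findings(cap: CapacityState) -> List[str]:
    """Return the checklist items that the capacity currently fails."""
    findings = []
    if cap.owner is None:
        findings.append("no owner")
    if not cap.alerts_configured:
        findings.append("alerts not configured")
    if cap.is_domain and not cap.surge_protection:
        findings.append("Surge Protection disabled on Domain capacity")
    if cap.is_domain and not cap.workspace_surge_protection:
        findings.append("workspace-level Surge Protection disabled")
    if len(cap.mission_critical_workspaces) > MAX_MISSION_CRITICAL:
        findings.append("too many Mission Critical workspaces")
    if cap.overage_enabled and not cap.overage_capped_and_approved:
        findings.append("overage enabled without an approved cap")
    return findings
```

An empty result means the capacity passes the items that can be checked mechanically. The last checklist item (reviewing sizing or isolation after repeated events) remains a human decision.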


1. Surge Protection

What to understand

Surge Protection helps reduce the impact of heavy background activity on a capacity. It is especially useful when interactive or user-facing workloads share capacity with more variable background jobs. Microsoft recommends it for shared capacities, but also states that critical solutions should still be isolated on a dedicated capacity. (learn.microsoft.com)


What it is good for

  • protecting shared capacities from bursty workloads

  • reducing the impact of noisy workspaces

  • limiting background pressure on user-facing workloads. (learn.microsoft.com)

What it is not

  • not a replacement for dedicated capacity

  • not a guarantee that all interactive requests will always succeed

  • not a fix for structural under-sizing. (learn.microsoft.com)


What we should put in place

  • enable it mainly on Domain production capacity

  • add workspace-level Surge Protection

  • define a rule for handling repeated noisy workspaces

  • keep Mission Critical status limited to a very small number of justified cases

  • tune thresholds using the Capacity Metrics App, not guesswork. (learn.microsoft.com)


Key message

Use Surge Protection to control shared Domain workloads, not as a replacement for isolating the Data Platform Core workspace.


Example
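To build intuition for threshold tuning, the sketch below models a background-job gate with two thresholds: new background jobs are rejected once utilization reaches a rejection threshold and accepted again only after it falls to a lower recovery threshold. This is a simplified illustration, not Fabric's implementation. The function name, the default values (80 and 60) and the sample series are assumptions; real thresholds should come from the Capacity Metrics App.

```python
from typing import Iterable, List


def simulate_background_gate(
    utilization_pct: Iterable[float],
    reject_at: float = 80.0,   # assumed value, for illustration only
    recover_at: float = 60.0,  # assumed value, for illustration only
) -> List[bool]:
    """Return, per sample, whether new background jobs would be accepted."""
    if recover_at >= reject_at:
        raise ValueError("recover_at must be lower than reject_at")
    rejecting = False
    accepted = []
    for u in utilization_pct:
        if not rejecting and u >= reject_at:
            rejecting = True       # surge detected: start rejecting
        elif rejecting and u <= recover_at:
            rejecting = False      # utilization recovered: accept again
        accepted.append(not rejecting)
    return accepted


print(simulate_background_gate([50, 85, 70, 59, 65]))
# [True, False, False, True, True]
```

The gap between the two thresholds prevents the gate from flapping when utilization hovers around a single value. Note that at 70% the gate still rejects, because recovery requires dropping to 60% or below.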


2. Notifications / Alerts

What to understand

Notifications are the minimum control that turns capacity health into an operational process. Fabric supports both capacity notification emails and Real-Time Hub / Capacity Overview Events for monitoring and alerting. (learn.microsoft.com, learn.microsoft.com)


What we should put in place

  • For every production capacity:

    • one named operational owner

    • one shared distribution list or team channel

    • clear alert thresholds

    • one documented escalation path. (learn.microsoft.com)

Suggested implementation path

Minimum setup

  • enable capacity notification emails

Recommended target

  • add Real-Time Hub alerting based on Capacity Overview Events

Key message

Alerts should be mandatory on all production capacities. A production capacity without an owner and alerting is not operationally governed.


Example
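A minimal sketch of what "owner, recipients, thresholds, escalation" could look like as data. Everything here is hypothetical: the `AlertPolicy` fields, the 80% and 95% thresholds and the example.com addresses are placeholders, and the code does not send anything or call Fabric. It only shows how one utilization reading maps to a severity and a recipient list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlertPolicy:
    """Hypothetical alerting policy for one production capacity."""
    capacity: str
    owner: str
    distribution_list: str
    escalation_contact: str
    warn_pct: float = 80.0      # assumed threshold
    critical_pct: float = 95.0  # assumed threshold


def route_alert(policy: AlertPolicy, utilization_pct: float) -> Tuple[str, List[str]]:
    """Map a utilization reading to (severity, recipients)."""
    if utilization_pct >= policy.critical_pct:
        # Critical: also notify the documented escalation contact.
        return "critical", [policy.owner, policy.distribution_list,
                            policy.escalation_contact]
    if utilization_pct >= policy.warn_pct:
        return "warning", [policy.owner, policy.distribution_list]
    return "ok", []


core_policy = AlertPolicy(
    capacity="core-prod",
    owner="owner@example.com",
    distribution_list="platform-ops@example.com",
    escalation_contact="oncall@example.com",
)
print(route_alert(core_policy, 97.0))
```

Keeping the policy as a single record per capacity makes it easy to verify that no production capacity is missing an owner or an escalation path.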


3. Capacity Overage

What to understand

Capacity Overage allows Fabric to use extra compute beyond the purchased limit to avoid throttling. Microsoft positions it as a way to absorb rare unexpected spikes or small regular peaks, not as a substitute for proper sizing. It is available only on F SKUs. (learn.microsoft.com, learn.microsoft.com)


What it is good for

  • reducing the risk of disruption during occasional overload

  • protecting critical production continuity

  • avoiding throttling for short, unexpected peaks. (learn.microsoft.com)

What it is not

  • not a performance booster

  • not a strategy for permanent under-sizing

  • not something to leave effectively unlimited. (learn.microsoft.com)


What we should put in place

For Data Platform Core:

  • enable only if uptime is critical

  • cap the limit

  • require platform and budget owner approval

  • review every overage event

For Domain production:

  • use only if the business accepts the cost model

  • keep tighter limits

  • do not use it to hide repeated saturation. (learn.microsoft.com)


Key message

Use Capacity Overage as a controlled safety net, not as a normal operating model.


Example
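The effect of a cap can be illustrated with a deliberately simplified model. The sketch below does not reproduce Fabric's smoothing, billing or throttling rules. It assumes demand is expressed in capacity units (CU) per interval, that the purchased size is a flat limit, and that the cap is a total overage budget across the series. The function name and all numbers are hypothetical.

```python
from typing import Dict, List


def apply_overage(
    demand_cu: List[float],
    purchased_cu: float,
    overage_cap_cu: float,
) -> List[Dict[str, float]]:
    """Split each interval's excess demand into overage used and unserved CU."""
    remaining = overage_cap_cu
    rows = []
    for demand in demand_cu:
        excess = max(0.0, demand - purchased_cu)
        used = min(excess, remaining)  # never exceed the remaining budget
        remaining -= used
        rows.append({
            "demand": demand,
            "overage": used,
            # Demand above the limit that the cap no longer covers; in this
            # model it would remain exposed to throttling.
            "unserved": excess - used,
        })
    return rows


rows = apply_overage([60, 70, 80, 64], purchased_cu=64, overage_cap_cu=10)
print([r["overage"] for r in rows])   # [0.0, 6, 4, 0.0]
print([r["unserved"] for r in rows])  # [0.0, 0, 12, 0.0]
```

In this run the second peak exhausts the budget, so 12 CU stay uncovered. A result like this is the kind of event that should trigger a sizing or isolation review rather than a higher cap.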
