SYSM-354
This page defines the capacity-level configuration options that must be evaluated for our Microsoft Fabric platform, with a focus on:
- production stability
- workload isolation
- controlled scaling
- governance of shared resources
- independence of the IT production workspace
Our target operating model uses Fabric primarily as a data storage and exposure platform based on Lakehouse and Warehouse, serving both BI consumption and external data exposure.
INFO Fabric capacity configuration is a platform governance topic, not only an infrastructure topic.
RECOMMENDATION The primary control for protecting IT production is capacity isolation.
WARNING Shared capacity between IT production and Domain production creates shared operational risk.
DECISION A dedicated capacity for IT production is the recommended baseline for this architecture.
| Version | Date | Description | Contributor |
|---|---|---|---|
| V0.1 | | Initial document | COLOMBANI Théo |
Fabric Capacity Configuration for Our Data Platform
1. Objective
This page defines the capacity-level configuration options that must be assessed for our Microsoft Fabric platform.
The objective is to ensure:
- production stability
- workload isolation
- controlled scalability
- governance of shared compute resources
- operational independence of the IT production workspace
In our architecture, Microsoft Fabric is primarily used as a storage and exposure platform, based on Lakehouse and Warehouse, for both BI consumption and external data exposure.
2. Platform Context
Our target operating model is structured as follows:
- IT workspace
  - bronze layer
  - silver layer
  - core production data preparation and controlled exposure foundation
- Domain workspaces
  - gold layer
  - business-oriented and BI-ready data products
  - domain-level exposure for reporting and consumption
Key requirement
The IT production workspace must remain operational independently of Domain workspaces, including when Domain workloads generate higher or less predictable compute consumption.
3. Design Principle
Recommendation
Capacity design must be driven by isolation first, then by optimization.
Rationale
In our context, capacity governance is not only about sizing compute correctly. Its primary purpose is to:
- protect critical IT production workloads
- separate critical and non-critical workloads
- reduce cross-workspace contention
- create predictable operating conditions
- support controlled platform growth
Decision statement
For our platform, capacity is an architecture boundary, not only a billing or administration object.
4. Recommended Target Model
Target architecture
Capacity A: IT Production
Used only for:
- IT bronze
- IT silver
- core production ingestion / preparation / exposure foundations
Capacity B: Domain Production
Used for:
- Domain gold workspaces
- business-facing data products
- BI-oriented workloads
- potentially more variable usage patterns
Capacity C: Non-Production
Used for:
- development
- testing
- experimentation
- validation before production promotion
Recommendation
Do not place IT production and Domain production on the same capacity if IT production must remain operational independently.
Why this matters
A shared capacity creates a shared risk envelope. Even if workspaces are logically separated, they draw on the same pool of capacity units, so throttling caused by one workspace affects every workspace on that capacity.
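The isolation rule can be checked mechanically against an inventory of workspace assignments. The following is a minimal sketch, assuming the inventory is available as a simple mapping; the workspace names, capacity names, and tier labels are hypothetical and only the three capacity roles come from this page.

```python
from collections import defaultdict

# Hypothetical inventory: workspace -> (tier, capacity).
ASSIGNMENTS = {
    "ws-it-bronze": ("it_production", "capacity-a"),
    "ws-it-silver": ("it_production", "capacity-a"),
    "ws-sales-gold": ("domain_production", "capacity-b"),
    "ws-finance-gold": ("domain_production", "capacity-b"),
    "ws-sandbox": ("non_production", "capacity-c"),
}


def find_isolation_violations(assignments):
    """Return capacities that host IT production together with any other tier."""
    tiers_by_capacity = defaultdict(set)
    for tier, capacity in assignments.values():
        tiers_by_capacity[capacity].add(tier)
    return {
        capacity: sorted(tiers)
        for capacity, tiers in tiers_by_capacity.items()
        if "it_production" in tiers and len(tiers) > 1
    }


# An empty result means IT production shares its capacity with no other tier.
print(find_isolation_violations(ASSIGNMENTS))
```

A check of this kind could run on a schedule so that an accidental reassignment is detected quickly rather than during an incident.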
5. Capacity-Level Settings to Document
5.1 Workspace-to-capacity assignment
What it is
The assignment of workspaces to specific Fabric capacities.
Why it matters
This is the most important configuration decision in our model because it determines whether workloads share the same compute risk domain.
Recommendation
- assign IT production to a dedicated capacity
- assign Domain production to a separate capacity whenever possible
- isolate non-production from all production capacities
- avoid mixing critical platform workloads with variable business workloads
RECOMMENDATION Workspace assignment is the primary mechanism used to guarantee production isolation and operational independence.
5.2 Capacity administration and reassignment governance
What it is
The set of permissions allowing administrators to manage a capacity and move workspaces into or out of it.
Why it matters
Even with a good target architecture, weak governance can reintroduce risk if workspaces are moved without control.
Recommendation
- restrict capacity admin rights to the central Data Platform or IT team
- restrict workspace reassignment rights on critical capacities
- require formal approval for any workspace added to the IT production capacity
- prevent self-service reassignment into critical production capacity
WARNING A dedicated production capacity loses most of its value if workspace assignment is not tightly governed.
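These rules can be expressed as a small approval gate. The sketch below is illustrative only: the role name, capacity name, and request fields are hypothetical, and the actual enforcement would rely on Fabric capacity permissions plus our change process rather than on this code.

```python
from dataclasses import dataclass

# Hypothetical names used only for illustration.
CRITICAL_CAPACITIES = {"capacity-a"}
CAPACITY_ADMIN_ROLES = {"data-platform-admin"}


@dataclass(frozen=True)
class ReassignmentRequest:
    workspace: str
    target_capacity: str
    requester_role: str
    approved: bool = False  # formal approval is recorded outside this sketch


def is_allowed(request):
    """Allow moves into a critical capacity only for central admins with approval."""
    if request.target_capacity not in CRITICAL_CAPACITIES:
        return True  # non-critical targets are outside the scope of this rule
    return request.requester_role in CAPACITY_ADMIN_ROLES and request.approved


# A self-service move into the IT production capacity is refused.
print(is_allowed(ReassignmentRequest("ws-sales-gold", "capacity-a", "domain-owner")))
```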
5.3 Surge protection
What it is
A protection mechanism used to manage overload situations and reduce the impact of excessive background activity on a capacity.
Why it matters
It can help protect shared capacities, especially where Domain workspaces may generate bursty or uneven usage patterns.
Recommendation
- consider enabling surge protection on shared Domain production capacities
- use it as a protection layer for variable workloads
- do not rely on it as the sole protection for IT production
Position
Surge protection is a supporting control, not a substitute for proper isolation.
RECOMMENDATION Use surge protection on shared capacities. Do not use it as a replacement for dedicated capacity when a workspace is mission-critical.
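To make the behavior easier to reason about, here is a simplified model of threshold-based rejection with a separate recovery threshold. It is a sketch under stated assumptions, not a description of how Fabric implements surge protection internally; the class name and the threshold values are hypothetical.

```python
class SurgeGate:
    """Reject new background jobs above a threshold until usage recovers."""

    def __init__(self, rejection_pct, recovery_pct):
        if recovery_pct >= rejection_pct:
            raise ValueError("recovery threshold must be below rejection threshold")
        self.rejection_pct = rejection_pct
        self.recovery_pct = recovery_pct
        self.rejecting = False

    def admit(self, background_usage_pct):
        """Return True if a new background job is admitted at this usage level."""
        if self.rejecting and background_usage_pct <= self.recovery_pct:
            self.rejecting = False  # usage has recovered, resume admitting jobs
        elif not self.rejecting and background_usage_pct >= self.rejection_pct:
            self.rejecting = True  # usage is too high, start rejecting jobs
        return not self.rejecting


gate = SurgeGate(rejection_pct=80, recovery_pct=60)
print([gate.admit(usage) for usage in (50, 85, 70, 55, 75)])
# [True, False, False, True, True]
```

The gap between the two thresholds avoids flapping between accepting and rejecting jobs, but note that rejected jobs still fail: this is why the mechanism complements isolation rather than replacing it.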
5.4 Capacity sizing and scaling
What it is
The sizing of Fabric capacity and the ability to adjust it as workload volume evolves.
Why it matters
Even a well-isolated architecture can fail operationally if the capacity is persistently undersized.
Recommendation
- size IT production with stability and operational headroom in mind
- review Domain production more frequently, as usage can be less predictable
- use monitoring trends to drive scaling decisions
- avoid reactive resizing without understanding the underlying workload pattern
Practical interpretation
- IT production should be sized for continuity first
- Domain capacities can be managed more elastically
DECISION IT production capacity sizing must prioritize service continuity over cost minimization.
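As a worked example of sizing with headroom, the sketch below picks the smallest F SKU that keeps an observed peak at or below a target utilization. The peak value and the target utilizations are hypothetical inputs chosen for illustration; real values should come from monitoring trends.

```python
# Fabric F SKUs provide a number of capacity units (CUs) equal to the SKU number.
F_SKU_CUS = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def smallest_sku_with_headroom(peak_cu, target_utilization):
    """Return the smallest F SKU whose peak utilization stays at or below the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    required = peak_cu / target_utilization
    for cus in F_SKU_CUS:
        if cus >= required:
            return f"F{cus}"
    raise ValueError("observed peak exceeds the largest SKU in the list")


# Same observed peak, different headroom policies (both targets are assumptions).
print(smallest_sku_with_headroom(peak_cu=40, target_utilization=0.60))  # F128
print(smallest_sku_with_headroom(peak_cu=40, target_utilization=0.85))  # F64
```

A lower target utilization for IT production encodes "continuity first": the same workload lands on a larger SKU than it would under a more elastic Domain policy.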
5.5 Capacity overage
What it is
A mechanism that allows excess usage beyond the purchased capacity threshold, subject to billing and governance.
Why it matters
It can reduce the risk of operational disruption during rare peaks.
Recommendation
- consider enabling overage for IT production only with explicit financial approval
- define a capped and governed usage threshold
- treat overage as a resilience mechanism, not a normal operating model
Position
Overage is a safety net, not a sizing strategy.
WARNING Do not use overage to compensate for structural under-sizing.
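One way to tell a safety net from a sizing strategy is to look at how often overage occurs, not only how much. The following sketch assumes daily overage figures are available as a list; the unit, the cap, and the day limit are hypothetical policy parameters.

```python
def assess_overage(daily_overage_cu_hours, monthly_cap_cu_hours, max_overage_days):
    """Classify a month of overage against a cap and a frequency limit."""
    total = sum(daily_overage_cu_hours)
    days_with_overage = sum(1 for value in daily_overage_cu_hours if value > 0)
    if total > monthly_cap_cu_hours:
        return "cap_exceeded"
    if days_with_overage > max_overage_days:
        # Frequent overage suggests the capacity itself is too small.
        return "structural_undersizing"
    return "within_policy"


rare_peaks = [0] * 28 + [5, 3]
print(assess_overage(rare_peaks, monthly_cap_cu_hours=20, max_overage_days=3))
# within_policy
```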
5.6 Monitoring and operational visibility
What it is
The monitoring of capacity usage, saturation patterns, top consumers, and operational degradation signals.
Why it matters
Capacity governance is only effective if usage and saturation can be observed and acted upon.
Recommendation
For each production capacity, define:
- monitoring owner
- review cadence
- alert thresholds
- escalation path
- expected remediation actions
Minimum baseline
- monitor recurring peaks
- identify top consuming workspaces and items
- review saturation or degradation patterns
- correlate operational issues with refresh, ingestion, or usage spikes
RECOMMENDATION Capacity monitoring must be part of normal run operations, not only incident management.
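Two items of the minimum baseline, top consumers and recurring peaks, can be sketched as simple aggregations. The data shapes below are assumptions about how usage might be exported; the workspace names, item names, and figures are made up for illustration.

```python
from collections import Counter


def top_consumers(samples, n):
    """Sum CU seconds per workspace and return the n largest consumers."""
    totals = Counter()
    for workspace, _item, cu_seconds in samples:
        totals[workspace] += cu_seconds
    return totals.most_common(n)


def recurring_peak_hours(utilization_by_day, threshold_pct, min_days):
    """Return hours of the day that reach the threshold on at least min_days days."""
    hits = Counter()
    for day in utilization_by_day:
        for hour, pct in day.items():
            if pct >= threshold_pct:
                hits[hour] += 1
    return sorted(hour for hour, count in hits.items() if count >= min_days)


# Hypothetical usage samples: (workspace, item, CU seconds).
samples = [
    ("ws-it-silver", "nightly-ingest", 1200),
    ("ws-sales-gold", "sales-model-refresh", 3400),
    ("ws-sales-gold", "adhoc-queries", 900),
    ("ws-finance-gold", "finance-report", 1500),
]
print(top_consumers(samples, 2))

# Hypothetical utilization per day: {hour of day: utilization %}.
days = [{2: 92, 9: 60}, {2: 95, 9: 88}, {2: 90, 14: 91}]
print(recurring_peak_hours(days, threshold_pct=85, min_days=2))
```

A peak that recurs at the same hour usually points at a scheduled refresh or ingestion job, which is where correlation with operational issues should start.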
5.7 Disaster recovery
What it is
The capacity-level disaster recovery posture associated with production data continuity.
Why it matters
The IT production workspace supports bronze and silver foundations, which makes it a core dependency for downstream exposure.
Recommendation
- perform an explicit DR assessment for IT production
- document whether DR is enabled or not
- document expected recovery assumptions and limitations
- ensure this is an explicit architecture decision
Position
For IT production, DR should never be left undocumented.
DECISION Disaster recovery for IT production must be assessed explicitly and recorded as an approved architecture choice.
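The documentation requirement can be captured as a small decision record that reports what is still missing. This is only a sketch of one possible record shape; every field name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DrDecision:
    """Record of an explicit DR decision for one capacity (hypothetical fields)."""
    capacity: str
    dr_enabled: Optional[bool] = None  # None means "not yet decided"
    recovery_assumptions: List[str] = field(default_factory=list)
    known_limitations: List[str] = field(default_factory=list)
    approved_by: str = ""

    def missing_items(self):
        """List the parts of the decision that are still undocumented."""
        missing = []
        if self.dr_enabled is None:
            missing.append("DR enabled or disabled")
        if not self.recovery_assumptions:
            missing.append("recovery assumptions")
        if not self.known_limitations:
            missing.append("limitations")
        if not self.approved_by:
            missing.append("approval")
        return missing


print(DrDecision(capacity="capacity-a").missing_items())
```

Note that `dr_enabled=False` counts as a documented decision: the requirement is that the choice is explicit, not that DR is always switched on.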
5.8 Notifications and alerting
What it is
The definition of who is informed when capacity issues occur and how operational response is triggered.
Why it matters
Without alert ownership, capacity incidents tend to be handled too late or inconsistently.
Recommendation
Define:
- alert recipients
- severity levels
- response expectations
- operational communication path
RECOMMENDATION Every production capacity must have a clearly assigned operational owner and alerting path.
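Severity levels and recipients can be written down as an ordered rule table so that routing is unambiguous. The thresholds and recipient names below are placeholders, not agreed values.

```python
# Hypothetical rules, ordered from highest threshold to lowest:
# (minimum utilization %, severity, recipient).
SEVERITY_RULES = [
    (95, "critical", "on-call platform engineer"),
    (85, "warning", "capacity owner"),
    (70, "info", "platform team channel"),
]


def classify(utilization_pct):
    """Return (severity, recipient) for the first matching rule, or None."""
    for threshold, severity, recipient in SEVERITY_RULES:
        if utilization_pct >= threshold:
            return severity, recipient
    return None  # below every threshold: no alert


print(classify(97))  # ('critical', 'on-call platform engineer')
print(classify(50))  # None
```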
5.9 Data Engineering and Spark-related settings
What it is
Capacity-level settings related to Spark and Data Engineering workloads.
Why it matters
These settings are relevant if Spark-based processing is materially used in the IT workspace.
Recommendation
- keep Spark governance centralized
- avoid uncontrolled compute sprawl
- as long as Spark is not a central workload in the platform, document Spark rules on a separate page rather than in this capacity baseline
Position
This is a secondary topic in our model unless Spark becomes a major production dependency.
6. Recommended Configuration Matrix
| Setting | IT Production | Domain Production | Non-Production | Recommendation |
|---|---|---|---|---|
| Dedicated capacity | Yes | Preferred | Separate | Mandatory for IT production |
| Shared with IT production | No | No | No | Not allowed |
| Workspace reassignment rights | Very restricted | Restricted | Controlled | Govern centrally |
| Surge protection | Optional complement | Recommended | Optional | Primarily for shared/variable workloads |
| Capacity overage | Optional, capped | Optional, capped | Usually not required | Safety net only |
| Monitoring | Mandatory | Mandatory | Recommended | Standard operating baseline |
| DR assessment | Mandatory | Case by case | Not priority | Explicit decision required |
| Spark governance | Case by case | Case by case | Flexible | Only where relevant |
| Scaling review cadence | Regular | More frequent | Periodic | Metrics-driven |
7. Operational Rules
Rule 1
Protect IT production by design.
Critical IT workloads must not share a capacity with variable Domain workloads.
Rule 2
Use isolation before optimization.
Do not try to solve structural contention only with reactive tuning or protection features.
Rule 3
Treat overage as an exception mechanism.
It may improve resilience, but it must not become the default operating mode.
Rule 4
Make monitoring part of standard operations.
Capacity review must be proactive and periodic.
Rule 5
Separate production from experimentation.
Development and testing workloads must not compete with critical production capacity.
8. Proposed Architecture Decision
Recommended decision
The recommended target state for our platform is:
- one dedicated Fabric capacity for IT production
- one separate Fabric capacity for Domain production
- one separate non-production capacity
- centralized control of workspace assignment
- standardized monitoring and alerting
- optional capped overage for resilience
- explicit DR assessment for IT production
Architecture conclusion
This is the most coherent model for a Fabric platform used primarily as a storage and exposure layer, where the IT production workspace must remain stable independently of Domain activity.
9. Configuration Decisions to Validate
Checklist
- Has a dedicated capacity been confirmed for IT production?
- Has Domain production been isolated from IT production?
- Has non-production been separated from production capacities?
- Have capacity admin roles been limited to the central platform team?
- Have workspace reassignment rights been formally governed?
- Has surge protection been evaluated for shared Domain capacities?
- Has capacity overage been evaluated and financially approved where relevant?
- Has a monitoring owner been assigned for each production capacity?
- Have alert thresholds and escalation paths been defined?
- Has disaster recovery been explicitly assessed for IT production?
- Have Spark-related settings been reviewed, if applicable?
- Has the target capacity model been approved as part of platform governance?