
Every data movement follows the same principle: source data is transformed and made available in the target, and along the way some side-effect data is generated as well.

This applies to ETL flows, where a program is invoked at scheduled times, reads the changed data from the source, applies transformations, and stores the results in the target.

But the same can be said about data federation (Fabric Shortcuts), where data is read through the target and the request is dynamically re-routed to the source.

Or real-time replication, where the source pushes changes to a consumer, the transformation is a pure 1:1 mapping, and the data is physically copied into the target.

Note that "source" does not necessarily mean a remote source system. It could also mean that two tables in the Fabric Lakehouse are joined and stored as a fact table in the Fabric Warehouse.

Having said all of that, one design decision is that the target is always a physical persistence, a table, for query performance reasons.
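
To make the Lakehouse example above concrete, here is a minimal PySpark sketch. All names (lakehouse.sales_order_header, lakehouse.sales_order_item, fact_sales_order) are illustrative assumptions, and the actual persistence into the Warehouse would go through whatever load mechanism the chosen method provides.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "Source" here is not a remote system but two tables already sitting in the Lakehouse.
fact_df = spark.sql("""
    SELECT h.order_id,
           h.customer_id,
           h.order_date,
           l.line_number,
           l.material_id,
           l.quantity,
           l.net_amount
    FROM   lakehouse.sales_order_header h
    JOIN   lakehouse.sales_order_item   l
      ON   h.order_id = l.order_id
""")

# The target is always persisted as a physical table, never just a view.
fact_df.write.mode("overwrite").saveAsTable("fact_sales_order")
```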


When we now consider the different tools and methods, this leads to some obvious and less obvious requirements. Each method must be able to ...

  • connect to any kind of source system. For example, StarTek is an on-prem REST API, Labware an on-prem Oracle database, SAP Datasphere a cloud SaaS offering, ... A method that does not allow for all current and future sources cannot be used.
  • support any kind of transformation. SQL-style transformations like joins, projections, filters, aggregations, window functions, but also transformations and functions outside of the SQL realm, like calling a Google Maps API to get back long/lat for a given address, or transformations implemented in 3GL languages. A method that does not support 100% of all current and future transformation requirements cannot be used.
  • load the target tables with a commit timestamp and a soft delete flag. The commit timestamp instead of the change timestamp is important: a change made at 09:00 but committed at 09:15 would not yet be visible to a delta run at 09:10, and the next delta run would read all changes from 09:10 onward and hence skip the row changed at 09:00. The requirement is simply: the sales order header and sales order line item tables have been copied into Fabric, and those two tables are joined to load the sales order fact table. This must be implemented via delta logic (see the sketch after this list).
  • be implemented quickly, for both simple and complex logic. Transformation logic has the same requirements as classic programming languages: modularization, reuse of common code, inline documentation.
  • copy not only the data but also the metadata. For all 1:1 mappings the column metadata is preserved, like the exact datatype (NVARCHAR(1), not just String) and the column comments; for more complex mappings the metadata can be derived (e.g. for "case when ... then col1 else col2 end" the data type and comment can be derived from col1/col2). At table level the metadata includes table comments, partition information, primary keys and foreign keys.
  • generate the impact/lineage information on table and column level. The user must be able to answer questions like "Where is the source table being used?", "If the source modifies the definition of a column, which target tables and columns are impacted?", "The value in the target looks wrong, where does it come from and which transformations have been applied?"
  • support reading and transforming deeply nested structures, e.g. a JSON document with multiple levels (see the flattening sketch below the list).
  • process the data fast; throughput should be in the range of 10k rows per second or better.
  • process the data cheaply, meaning the cost of running the dataflow should be fractions of a cent per million source rows. For example, starting an entire Spark cluster just to check that the source has no new records is quite an overhead.
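
As a sketch of the delta logic mentioned in the list: only rows committed after the previous run's watermark are read, the header/line item join is applied, and the result is merged into a target that carries a commit timestamp and a soft delete flag. All column and table names (_commit_ts, _is_deleted, fact_sales_order, ...) are assumptions for illustration, not an existing schema.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Watermark of the last successful run; a real implementation would persist this.
last_commit_ts = "2024-05-01 09:10:00"

# Read only rows whose *commit* timestamp is newer than the watermark.
# Filtering on the change timestamp instead would lose the row changed at 09:00
# but committed at 09:15, as described above.
changes = spark.sql(f"""
    SELECT h.order_id,
           h.customer_id,
           l.line_number,
           l.material_id,
           l.quantity,
           GREATEST(h._commit_ts, l._commit_ts) AS commit_ts,
           (h._is_deleted OR l._is_deleted)     AS is_deleted
    FROM   lakehouse.sales_order_header h
    JOIN   lakehouse.sales_order_item   l
      ON   h.order_id = l.order_id
    WHERE  h._commit_ts > TIMESTAMP '{last_commit_ts}'
        OR l._commit_ts > TIMESTAMP '{last_commit_ts}'
""")

# Merge into the target; deletes only flip the soft delete flag, rows stay queryable.
target = DeltaTable.forName(spark, "fact_sales_order")
(target.alias("t")
       .merge(changes.alias("s"),
              "t.order_id = s.order_id AND t.line_number = s.line_number")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```

The same pattern can be expressed as a plain SQL MERGE statement; the essential points are the commit-timestamp watermark and the soft delete flag, not the particular engine.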

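And a sketch for the nested-structure requirement: a multi-level JSON document is read and flattened into one row per line item. The file path and field names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Assumed document shape:
# { "order_id": ..., "customer": {"name": ..., "city": ...},
#   "items": [ {"material_id": ..., "quantity": ...}, ... ] }
orders = spark.read.json("Files/landing/orders.json", multiLine=True)

# explode() turns the items array into one row per element;
# dot notation pulls fields out of the nested structs.
flat = (orders
        .select("order_id",
                col("customer.name").alias("customer_name"),
                explode("items").alias("item"))
        .select("order_id",
                "customer_name",
                col("item.material_id").alias("material_id"),
                col("item.quantity").alias("quantity")))

flat.show()
```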
