
Data Science Studio (DSS) from Dataiku is a complete data science software tool that lets data scientists, analysts and machine learning experts perform data analysis and modelling more efficiently. DSS significantly shortens the time spent on data cleaning, model building and other statistical processes.

DSS connects directly and quickly to the most common data sources and offers strong integration capabilities. Analysts can leverage its smart data types to validate and transform the data in an automated way.

They can also perform more routine tasks such as replacing, grouping, splitting and calculating values in the DSS interface, which gives instant visual feedback on every operation.

Some Basic Concepts of Data Science Studio

A few main concepts and terms are important to understand in order to become familiar with DSS.

Projects 

Work in DSS is organized into individual projects, each of which holds its data and the associated tasks. The main dashboard is called the 'Universe', and it lists the available projects.

Flow

A DSS project is structured as a flow, which visually represents the data pipeline and how its datasets are related.


Datasets

The dataset is the core object you will be manipulating in Data Science Studio. A dataset is a series of records with the same schema. It is quite analogous to a table in the SQL world.

Data Science Studio supports various kinds of datasets. For example:

  • A SQL table or custom SQL query
  • A collection in MongoDB
  • A folder with data files on your server
  • A folder with data files on a Hadoop cluster

Recipes

Any pre-processing or data manipulation on the datasets is managed using recipes. Recipes are the building blocks of your data applications. Each time you make a transformation, an aggregation, a join, … with the Data Science Studio, you will be creating a recipe.

Recipes have input datasets and output datasets, and they indicate how to create the output datasets from the input datasets.

Data Science Studio supports various kinds of recipes:

  • Executing a data preparation script defined visually within the Studio
  • Executing a SQL query
  • Executing a Python script (with or without the use of the Pandas library)
  • Executing a Pig script
  • Executing a Hive query
  • Synchronizing the content of input to output datasets
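
To make the input-to-output idea concrete, here is a minimal sketch in plain Python of what a recipe does conceptually: it takes input datasets and describes how to produce an output dataset from them. This is not the DSS API. The lists of dictionaries stand in for datasets, and every name (`customers`, `orders`, `revenue_by_country_recipe`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory "datasets": each is a list of rows sharing one schema.
customers = [
    {"customer_id": 1, "country": "FR"},
    {"customer_id": 2, "country": "US"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 5.0},
    {"order_id": 12, "customer_id": 2, "amount": 7.5},
]

def revenue_by_country_recipe(customers, orders):
    """Join orders to customers, then aggregate the amounts per country."""
    country_of = {c["customer_id"]: c["country"] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        totals[country_of[order["customer_id"]]] += order["amount"]
    # The output dataset is again a list of rows with a single schema.
    return [{"country": k, "revenue": v} for k, v in sorted(totals.items())]

revenue_by_country = revenue_by_country_recipe(customers, orders)
```

The function combines a join and an aggregation, two of the transformations mentioned above, and its result can in turn serve as the input of another recipe.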

Building datasets

Recipes and datasets together create the graph of the relationships between the datasets and how to build them. This graph is called the Flow. It is used by the dependencies management engine to automatically keep your output datasets up to date each time your input datasets or recipes are modified.
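
The following sketch illustrates, under simplifying assumptions, how a dependency graph of this kind can tell which output datasets need rebuilding after a change. It is not how DSS is implemented. The dataset names and the `datasets_to_rebuild` helper are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow: each output dataset maps to the input datasets
# of the recipe that builds it.
flow = {
    "orders_clean": {"orders_raw"},
    "orders_enriched": {"orders_clean", "customers"},
    "revenue_report": {"orders_enriched"},
}

def datasets_to_rebuild(flow, changed):
    """Return, in build order, every dataset downstream of a changed one."""
    stale = set()
    rebuild = []
    # static_order() yields each dataset after all of its inputs.
    for name in TopologicalSorter(flow).static_order():
        inputs = flow.get(name, set())
        if any(i in changed or i in stale for i in inputs):
            stale.add(name)
            rebuild.append(name)
    return rebuild

after_customers_change = datasets_to_rebuild(flow, {"customers"})
after_raw_change = datasets_to_rebuild(flow, {"orders_raw"})
```

In this sketch, a change to `customers` only marks the datasets downstream of it as stale, while `orders_clean` is left alone.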

 
Managed and external datasets

Data Science Studio reads data from the outside world in “external” datasets. On the other hand, when you use the Data Science Studio to create new datasets from recipes, these new datasets are “managed” datasets. This means that Data Science Studio “takes ownership” of these output datasets. For example, if the managed dataset is a SQL dataset, Data Science Studio will be able to drop / create the table, change its schema, ...

Managed datasets are created by Data Science Studio in “managed connections”, which act as data stores. Managed datasets can be created:

  • On the filesystem of the Data Science Studio server
  • On Hadoop HDFS
  • In a SQL database
  • On Amazon S3
  • ...

There are two types of recipes used widely in DSS:

Visual recipes: Provide basic data manipulation functionality such as cleaning, filtering and grouping.

Code recipes: Integrate code written in languages such as R and Python.

Dashboard

The dashboard communicates results and gives insights based on the analysis performed on the datasets.


 

Analysis

An analysis provides a visual exploration of a dataset before any changes are implemented in the flow, which helps you dive deep into the data directly.

Other concepts 

Jobs: Every build of a dataset is recorded as a job, which keeps track of the activity in the flow.

Scenarios: Help automate and schedule the tasks in the flow.


Lab - Notebooks: DSS lets you draft code in an interactive programming environment to make the analysis easy and efficient.

Web Apps: Users with web coding skills can create advanced custom web apps using the dedicated editor and REST API.


For a further introduction to the concepts of DSS, please navigate here.

Partitioning

 


Partitioning refers to the splitting of the dataset along meaningful dimensions. Each partition contains a subset of the dataset.

For example, a dataset representing a database of customers could be partitioned by country.

There are two kinds of partitioning dimensions:

  • “Discrete” partitioning dimension. The dimension has a small number of values. For example: country, business unit.
  • “Time” partitioning dimension. The dataset is divided into fixed periods of time (year, month, day or hour). Time partitioning is the most common pattern when dealing with log files.

A dataset can be partitioned by more than one dimension. For example, a dataset of web logs could be partitioned by day and by the server which generated the log line.
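
As a rough illustration of partitioning along two dimensions, the sketch below splits web log rows by day (a time dimension) and by server (a discrete dimension). It is plain Python, not DSS code. The rows, field names and helper functions are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical web log rows; field names are illustrative only.
log_lines = [
    {"timestamp": "2015-03-01T08:15:00", "server": "web1", "path": "/"},
    {"timestamp": "2015-03-01T09:30:00", "server": "web2", "path": "/cart"},
    {"timestamp": "2015-03-02T10:00:00", "server": "web1", "path": "/"},
    {"timestamp": "2015-03-02T11:45:00", "server": "web1", "path": "/pay"},
]

def partition_key(row):
    """Combine a time dimension (day) with a discrete dimension (server)."""
    day = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m-%d")
    return (day, row["server"])

def split_into_partitions(rows):
    """Group rows so that each partition holds a subset of the dataset."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)

partitions = split_into_partitions(log_lines)
```

Each key of `partitions` identifies one partition, for example `("2015-03-02", "web1")`, and every row belongs to exactly one of them.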

Whenever possible, the Data Science Studio uses underlying native mechanisms of the dataset backend for partitioning. For example, if a SQL dataset is hosted on an RDBMS engine which natively supports partitioning, Data Science Studio will map the partitioning of the dataset to the SQL partitions.

Partitioning serves several purposes in Data Science Studio.

Incrementality

Partitions are the unit of computation and incrementality. When a dataset is partitioned, you don’t build the full dataset, but instead you build it partition by partition.

Partitions are fully recomputed. When we build partition X of a dataset, the previous data for this partition is removed and replaced by the output of the recipe that generates the dataset. Recomputing a partition of a dataset is idempotent: computing it several times won’t create duplicate records.
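
The idempotence described above can be sketched in a few lines of plain Python, assuming a dictionary keyed by partition identifier stands in for a partitioned dataset. The `enrich` recipe and the `build_partition` helper are hypothetical and are not part of DSS.

```python
def enrich(rows):
    """Hypothetical recipe: tag each log row with a derived field."""
    return [dict(row, is_error=row["status"] >= 500) for row in rows]

def build_partition(output, partition_id, recipe, input_rows):
    """Recompute one partition: replace its previous rows with fresh output."""
    output[partition_id] = recipe(input_rows)  # overwrite, never append
    return output

logs = {"2015-03-01": [{"status": 200}, {"status": 503}]}
enriched_logs = {}

# Building the same partition twice leaves exactly one copy of each record.
build_partition(enriched_logs, "2015-03-01", enrich, logs["2015-03-01"])
build_partition(enriched_logs, "2015-03-01", enrich, logs["2015-03-01"])
```

Because the partition is overwritten rather than appended to, re-running a build after a failure or a late data arrival does not duplicate records.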

This is especially important when processing time series data. If you have a day-partitioned log dataset as input, and a day-partitioned enriched log dataset as output, you want to build the partition X of the output dataset each day.

Partition-level dependencies

Partitioning a dataset allows you to manage dependencies at the partition level. Instead of just having the recipe specify that an output dataset depends on an input dataset, you can define which partitions of the input are required to compute a given partition of the output.

Let’s take an example:

  • The “logs” dataset is partitioned by day. Each day, an upstream system adds a new partition with the logs of the day.
  • The “enriched logs” dataset is also partitioned by day. Each day, we need to compute the enriched logs using the “same” partition of the logs.
  • The “sliding report” dataset is also partitioned by day. Each day, we want to compute a report using data from the last 7 days.

To achieve that, we will declare:

  • An “equals” dependency between “logs” and “enriched logs”
  • A “sliding days” dependency between “enriched logs” and “sliding report”


When you ask Data Science Studio to compute the partition X of “sliding report”, it will determine that it needs:

  • The partitions X, X-1, X-2, … X-6 of “enriched logs”
  • The partitions X, X-1, X-2, … X-6 of “logs”

Data Science Studio will then check which partitions are available and up-to-date, and automatically compute all missing partitions. Data Science Studio will automatically parallelize the computation of enriched logs for each missing day, and then compute the sliding report.
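
The resolution steps above can be sketched as follows. This is a simplified model in plain Python, not the DSS dependency engine. The `equals` and `sliding_days` functions, the `DEPENDENCIES` table and both helper functions are hypothetical names chosen to mirror the example, and the 7-day window is taken from it.

```python
from datetime import date, timedelta

def equals(partition):
    """'Equals' dependency: the same partition of the input is required."""
    return [partition]

def sliding_days(partition, window=7):
    """'Sliding days' dependency: partitions X, X-1, ... X-(window-1)."""
    return [partition - timedelta(days=i) for i in range(window)]

# Hypothetical declaration: output dataset -> (input dataset, dependency rule).
DEPENDENCIES = {
    "sliding report": ("enriched logs", sliding_days),
    "enriched logs": ("logs", equals),
}

def required_partitions(dataset, partition):
    """Walk upstream and collect every partition needed for the build."""
    needed = {}
    frontier = [(dataset, partition)]
    while frontier:
        name, part = frontier.pop()
        if name not in DEPENDENCIES:
            continue  # a source dataset has no upstream dependency
        upstream, rule = DEPENDENCIES[name]
        for p in rule(part):
            if p not in needed.setdefault(upstream, set()):
                needed[upstream].add(p)
                frontier.append((upstream, p))
    return {name: sorted(parts) for name, parts in needed.items()}

def missing_partitions(needed, available):
    """Keep only the required partitions that are not already available."""
    return {
        name: [p for p in parts if p not in available.get(name, set())]
        for name, parts in needed.items()
    }

needed = required_partitions("sliding report", date(2015, 3, 10))
```

Given the set of partitions that already exist, `missing_partitions` returns the ones still to be computed. Each missing day of “enriched logs” depends only on its own day of “logs”, which is why these builds can run in parallel before the report itself is computed.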

Performance

Generally speaking, partitioning a dataset can improve query performance on that dataset. This is especially true for SQL datasets when the underlying RDBMS engine natively supports partitioning.