Datasets

The dataset is the core object you will be manipulating in Data Science Studio. A dataset is a series of records with the same schema. It is quite analogous to a table in the SQL world.
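
To make the "records with the same schema" idea concrete, here is a minimal sketch in plain Python. It does not use any Data Science Studio API; the `Schema` alias, the `conforms` helper, and the sample customer records are all hypothetical and exist only for illustration.

```python
from typing import Any

# Hypothetical schema representation: column name -> expected Python type.
Schema = dict[str, type]


def conforms(record: dict[str, Any], schema: Schema) -> bool:
    """Return True if the record has exactly the schema's columns with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


customer_schema: Schema = {"id": int, "name": str, "country": str}

# Made-up records, comparable to rows of a SQL table.
customers = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Linus", "country": "FI"},
]

# A collection of records behaves like a dataset only if every record shares the schema.
is_dataset = all(conforms(record, customer_schema) for record in customers)
```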

Data Science Studio supports various kinds of datasets. For example:

Recipes

Recipes are the building blocks of your data applications. Each time you perform a transformation, an aggregation, a join, or a similar operation in Data Science Studio, you create a recipe.

Recipes have input datasets and output datasets, and they indicate how to create the output datasets from the input datasets.

Data Science Studio supports various kinds of recipes:

Building datasets

Together, recipes and datasets form a graph that describes how the datasets relate to one another and how to build them. This graph is called the Flow. The dependency management engine uses it to keep your output datasets up to date automatically each time your input datasets or recipes are modified.
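
The idea of a dependency graph that drives rebuilds can be sketched in a few lines of Python. This is only an illustration of the concept, not how Data Science Studio implements it: the `Recipe` class, the `rebuild_order` function, and the dataset and recipe names are all hypothetical, and the sketch only reacts to modified datasets (a modified recipe could be handled the same way by marking its outputs as stale).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe: a name plus its input and output dataset names."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def rebuild_order(recipes: list[Recipe], changed: set[str]) -> list[str]:
    """Return the recipes to re-run, in dependency order, after `changed` datasets are modified."""
    # Map each dataset to the recipe that produces it.
    producer = {out: r for r in recipes for out in r.outputs}
    # A recipe depends on the recipes that produce its inputs.
    graph = {
        r.name: {producer[i].name for i in r.inputs if i in producer}
        for r in recipes
    }
    by_name = {r.name: r for r in recipes}
    stale = set(changed)
    order = []
    # static_order() yields each recipe after all the recipes it depends on.
    for name in TopologicalSorter(graph).static_order():
        recipe = by_name[name]
        if stale.intersection(recipe.inputs):
            order.append(name)
            stale.update(recipe.outputs)  # its outputs become stale in turn
    return order


flow = [
    Recipe("join_orders", ("customers", "orders"), ("orders_enriched",)),
    Recipe("clean_customers", ("raw_customers",), ("customers",)),
    Recipe("daily_totals", ("orders_enriched",), ("totals",)),
    Recipe("clean_orders", ("raw_orders",), ("orders",)),
]

print(rebuild_order(flow, {"raw_customers"}))
# prints ['clean_customers', 'join_orders', 'daily_totals']
```

Only the recipes downstream of the modified dataset are re-run; `clean_orders` is skipped because none of its inputs changed.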

 
Managed and external datasets

Data Science Studio reads data from the outside world in “external” datasets. By contrast, when you use Data Science Studio to create new datasets from recipes, these new datasets are “managed” datasets. This means that Data Science Studio “takes ownership” of these output datasets. For example, if the managed dataset is a SQL dataset, Data Science Studio can drop or create the table, change its schema, and so on.
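
The difference in ownership can be illustrated with a small sketch using Python's built-in `sqlite3` module. It is not Data Science Studio code: the table names `raw_customers` and `customers_by_country` and the `rebuild_managed_table` function are hypothetical. The external table is only read, while the managed table is dropped and recreated at will.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "External" table: created and owned outside the tool; the sketch only reads it.
conn.execute("CREATE TABLE raw_customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Linus", "FI"), (3, "Grace", "UK")],
)


def rebuild_managed_table(conn: sqlite3.Connection) -> None:
    """Rebuild a hypothetical managed output table from the external input."""
    # Owning the table means it can be dropped and recreated freely,
    # possibly with a different schema than on the previous build.
    conn.execute("DROP TABLE IF EXISTS customers_by_country")
    conn.execute(
        "CREATE TABLE customers_by_country AS "
        "SELECT country, COUNT(*) AS n FROM raw_customers GROUP BY country"
    )


rebuild_managed_table(conn)
rebuild_managed_table(conn)  # safe to repeat: the table is recreated each time

rows = conn.execute(
    "SELECT country, n FROM customers_by_country ORDER BY country"
).fetchall()
```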

Managed datasets are created by Data Science Studio in “managed connections”, which act as data stores. Managed datasets can be created:

Partitioning

Partitioning refers to splitting a dataset along meaningful dimensions. Each partition contains a subset of the dataset. Partitioning serves several purposes in Data Science Studio.

For example, a dataset representing a database of customers could be partitioned by country.

There are two kinds of partitioning dimensions:

A dataset can be partitioned by more than one dimension. For example, a dataset of web logs could be partitioned by day and by the server which generated the log line.
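
As a rough illustration of partitioning along more than one dimension, the sketch below groups web log records by day and by server. It is plain Python rather than a Data Science Studio API; the `partition` function, the field names, and the sample log lines are made up for the example.

```python
from collections import defaultdict
from typing import Any

# Made-up web log records.
log_lines = [
    {"day": "2014-01-01", "server": "web1", "path": "/"},
    {"day": "2014-01-01", "server": "web2", "path": "/login"},
    {"day": "2014-01-01", "server": "web1", "path": "/cart"},
    {"day": "2014-01-02", "server": "web1", "path": "/"},
]


def partition(
    records: list[dict[str, Any]], dimensions: tuple[str, ...]
) -> dict[tuple, list[dict[str, Any]]]:
    """Split records into partitions keyed by their values on each dimension."""
    parts = defaultdict(list)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        parts[key].append(record)
    return dict(parts)


# Each partition is identified by one value per dimension, e.g. ("2014-01-01", "web1").
partitions = partition(log_lines, ("day", "server"))
```

Every record lands in exactly one partition, so the partitions together cover the whole dataset without overlap.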