Note: The .NET Jupyter notebook for this blog post can be found here.
The Structure of Daany.DataFrame
The main part of the Daany project is Daany.DataFrame, a C# implementation of a data frame. A data frame is a software component used for handling tabular data, especially for data preparation, feature engineering, and analysis during the development of machine learning models. The Daany.DataFrame implementation is built around simplicity and .NET coding standards. It represents tabular data consisting of columns and rows. Each column has a name and a type, and each row has an index and a label.
By convention, rows run along axis zero, while columns run along axis one.
The following image shows a data frame structure:
The basic components of the data frame are:
header – list of column names,
index – list of objects representing each row,
data – list of values in the data frame,
missing value – data with no value in the data frame.
The image above shows the data frame components visually, and how they are positioned in the data frame.
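To make the four components concrete, here is a minimal sketch in plain C#. It does not use Daany and is not a description of how Daany stores data internally: the variable names, the flat row-by-row list, the helper function Cell, and the use of null to mark a missing value are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

//header: list of column names (axis one)
var header = new List<string> { "ID", "City", "Values" };

//index: one label per row (axis zero)
var index = new List<object> { "row0", "row1", "row2" };

//data: all values in one flat list, stored row by row (an assumption of this sketch)
var data = new List<object>
{
    1, "Sarajevo", 3.14,
    2, "Seattle",  null,   //missing value, marked here with null
    3, "Berlin",   4.55
};

//hypothetical helper: look up a single cell by row position and column name
object Cell(int row, string column)
{
    int col = header.IndexOf(column);
    if (col < 0) throw new ArgumentException($"Unknown column '{column}'.");
    return data[row * header.Count + col];
}

//count the missing values in the whole frame
int missing = 0;
foreach (var value in data)
    if (value is null) missing++;

Console.WriteLine($"{index[1]}: City = {Cell(1, "City")}");
Console.WriteLine($"Missing values: {missing}");
```

With this layout, the cell at row r and column c sits at position r * (number of columns) + c in the flat list, which is why the header length is needed for every lookup.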
Create a Data Frame from a Text-Based File
The data we use are stored in files, and they must be loaded into application memory in order to be analyzed and transformed. Loading data from files with Daany.DataFrame is as easy as calling one method.
By using the static method DataFrame.FromCsv, a user can create a data frame object from a CSV file. Conversely, a data frame can be persisted on disk by calling the static method DataFrame.ToCsv.
The following code shows how to use the static methods ToCsv and FromCsv to persist a data frame and load it back:
using System;
using System.Collections.Generic;
using Daany;

string filename = "df_file.txt";
//define a dictionary of data: column name -> list of column values
var dict = new Dictionary<string, List<object>>
{
    { "ID", new List<object>() { 1, 2, 3 } },
    { "City", new List<object>() { "Sarajevo", "Seattle", "Berlin" } },
    { "Zip Code", new List<object>() { 71000, 98101, 10115 } },
    { "State", new List<object>() { "BiH", "USA", "GER" } },
    { "IsHome", new List<object>() { true, false, false } },
    { "Values", new List<object>() { 3.14, 3.21, 4.55 } },
    { "Date", new List<object>() { DateTime.Now.AddDays(-20), DateTime.Now.AddDays(-10), DateTime.Now.AddDays(-5) } },
};
//create data frame with 3 rows and 7 columns
var df = new DataFrame(dict);
//first save the data frame on disk
DataFrame.ToCsv(filename, df);
//then load it back from the file (again 3 rows and 7 columns)
var dfFromFile = DataFrame.FromCsv(filename, sep:',');
//show dataframe
dfFromFile
First, we create a data frame from the dictionary collection. Then we store the data frame to a file. After saving successfully, we load the same data frame from the CSV file. The last line of the snippet displays the loaded data frame, so we can check that it matches the original. The output of the code cell is:
If performance is important, you should pass the column types to the FromCsv method, which can cut loading time by up to 50%.
For example, the following code loads the data from the file by passing predefined column types:
//define the column types, one per column in file order
var colTypes1 = new ColType[] { ColType.I32, ColType.IN, ColType.I32, ColType.STR, ColType.I2, ColType.F32, ColType.DT };
//create data frame with 3 rows and 7 columns
var dfFromFile = DataFrame.FromCsv(filename, sep: ',', colTypes: colTypes1);
And we get the same result:
Loading Real Data from the Web
Data can be loaded directly from web storage by using the static FromWeb method. The following code shows how to load the Concrete Slump Test data from the web. The data set includes 103 data points. There are 7 input variables and 3 output variables in the data set: Cement, Slag, Fly ash, Water, SP, Coarse Aggr., Fine Aggr., SLUMP (cm), FLOW (cm), Strength (Mpa).
The following code loads the Concrete Slump Test data set into a Daany DataFrame:
//define web url where the data is stored
var url = "https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/slump/slump_test.data";
//load the data set into a data frame and show the first 5 rows
var df = DataFrame.FromWeb(url);
df.Head(5)
Once we have the data in application memory, we can perform some statistical calculations. First, let’s see the structure of the data by calling the Describe method:
df.Describe(false)
Now we see that we have a data frame with 103 rows and that all columns are of numerical type. The frequency of the data indicates that values are mostly not repeated. From the maximum and minimum values, the data appear to have no outliers, since the values tend to be normally distributed.
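For intuition about what such a summary contains, the sketch below computes the count, mean, standard deviation, minimum, and maximum of one numeric column with plain LINQ. It is an illustration only and not Daany's implementation of Describe: the sample values are made up, and the choice of the sample standard deviation (dividing by n - 1) is an assumption.

```csharp
using System;
using System.Linq;

//made-up sample values standing in for one numeric column
double[] column = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };

int count = column.Length;
double mean = column.Average();
double min = column.Min();
double max = column.Max();

//sample standard deviation: divide the summed squared deviations by (n - 1)
double sumSq = column.Sum(v => (v - mean) * (v - mean));
double std = Math.Sqrt(sumSq / (count - 1));

Console.WriteLine($"count={count}, mean={mean}, std={std:F3}, min={min}, max={max}");
```

Comparing the minimum and maximum against the mean and standard deviation is a quick first check for extreme values, although it does not replace a proper outlier test.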
Data Visualization
Let’s create some charts to see what the data look like. First, let’s see how Slump is distributed with respect to SP and Fly ash:
var chart = Chart.Plot(
    new Graph.Scatter()
    {
        x = df["SP"],
        y = df["Fly ash"],
        mode = "markers",
        marker = new Graph.Marker()
        {
            //color each point by its Slump value
            color = df["SLUMP(cm)"].Select(x=>x),
            colorscale = "Jet"
        }
    }
);
//title and axis labels match the plotted columns
var layout = new Layout.Layout(){title="Slump vs. SP and Fly ash"};
chart.WithLayout(layout);
chart.WithXTitle("SP");
chart.WithYTitle("Fly ash");
display(chart);
From the chart above, we cannot see any relation between those two columns. Let’s see the chart of Slump against Flow:
var chart = Chart.Plot(
    new Graph.Scatter()
    {
        x = df["SLUMP(cm)"],
        y = df["FLOW(cm)"],
        mode = "markers",
    }
);
//title matches the plotted columns
var layout = new Layout.Layout(){title="Slump vs. Flow"};
chart.WithLayout(layout);
chart.WithLegend(true);
chart.WithXTitle("Slump");
chart.WithYTitle("Flow");
display(chart);
We can see a relation in the chart, and the relation is positive. This means that as Slump grows, the Flow value grows as well. If we want to measure the relation between the columns, we can do that with the following code:
var x1= df["SLUMP(cm)"].Select(x=>Convert.ToDouble(x)).ToArray();
var x2= df["FLOW(cm)"].Select(x=>Convert.ToDouble(x)).ToArray();
//calculate the Pearson correlation coefficient
var r=x1.R(x2);
r
The correlation is 0.90, which indicates a strong relationship between those two columns.
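For reference, the Pearson coefficient of two equally long series is their covariance divided by the product of their standard deviations. The sketch below implements that definition in plain C#. The function name Pearson and the sample arrays are hypothetical, and this is not the code behind Daany's R extension method.

```csharp
using System;
using System.Linq;

//Pearson correlation: sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
double Pearson(double[] x, double[] y)
{
    if (x.Length != y.Length || x.Length < 2)
        throw new ArgumentException("Both series need the same length, at least 2.");

    double mx = x.Average();
    double my = y.Average();

    double sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < x.Length; i++)
    {
        double dx = x[i] - mx;
        double dy = y[i] - my;
        sxy += dx * dy;   //co-deviation of x and y
        sxx += dx * dx;   //squared deviation of x
        syy += dy * dy;   //squared deviation of y
    }
    //note: a constant series gives a zero denominator and the result is NaN
    return sxy / Math.Sqrt(sxx * syy);
}

//made-up values: b rises with a, so r should be close to +1
double[] a = { 1, 2, 3, 4, 5 };
double[] b = { 2.1, 3.9, 6.2, 8.0, 9.8 };
Console.WriteLine($"r = {Pearson(a, b):F3}");
```

The result always lies between -1 and +1: values near +1 mean the two series rise together, values near -1 mean one falls as the other rises, and values near 0 mean no linear relation.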
The complete .NET Jupyter Notebook for this blog post can be found here