Variable selection
The "Variable selection" tab lets the user select the variables of the model and the training period (fig 5).
Figure 5: Variable selection tab
Variable selection frame
In the "Variable selection" frame, the list on the left contains all the variables of the imported dataset, while the list on the right contains all the variables of the model to be estimated. To add a variable to the model, click on its name in the left list and then click on the "=>" button.
The variable then appears in the list on the right. To remove a variable from the model, click on its name in the list on the right and then on the "<=" button. To add all the variables to the model at once, click on "- All variables -" in the list on the left and then on the "=>" button.
The "remove all" button is used to remove all the variables that have been added.
Exploratory analysis
The "Plots", "Boxplots", "Correlations" and "PCA" buttons are used to explore the data and help the user select the variables of the model.
The "Plots" button displays the temporal evolution of the variables of the dataset (fig 6). A variable that stays constant over time will not provide much information to the model. The resulting plots are saved in the "./DonneesBrutes/Figures/" directory.
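The remark about constant variables can be checked numerically before looking at the plots. The sketch below is only an illustration and is not part of the software: the function name `near_constant_variables`, the `tolerance` parameter and the example data are assumptions.

```python
import statistics

def near_constant_variables(data, tolerance=1e-8):
    """Return the names of variables whose standard deviation is at most `tolerance`.

    `data` maps a variable name to its list of observations (hypothetical layout);
    missing observations are assumed to be encoded as None and are ignored.
    """
    flagged = []
    for name, values in data.items():
        observed = [v for v in values if v is not None]
        # A variable with fewer than two observations cannot show any variation.
        if len(observed) < 2 or statistics.pstdev(observed) <= tolerance:
            flagged.append(name)
    return flagged

# Made-up example data, for illustration only.
example = {
    "water_temperature": [12.1, 12.4, 13.0, 13.8],
    "sensor_flag": [1, 1, 1, 1],
}
print(near_constant_variables(example))  # ['sensor_flag']
```

A variable flagged this way is a candidate for exclusion, but the final choice remains with the user.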
Figure 6: Temporal evolution of the E__TA variable (water temperature)
Figure 7: Boxplot of the ETCO1 variable (air temperature)
Figure 8: Correlation matrix of the example dataset variables
The "PCA" button displays the correlation circles (fig 9) in the different principal planes. They result from the Principal Component Analysis⁶ performed on the variables of the model. The variables not included in the model are used as supplementary variables: they are represented in blue on the circles, but they have no influence on the calculations used to obtain the axes. The results of the PCA are saved in the "./DonneesBrutes/ACP_ date " file.
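The role of supplementary variables can be illustrated with a small sketch. It assumes a normalized PCA (performed on the correlation matrix), which the documentation does not state explicitly, and it uses plain power iteration with deflation instead of whatever routine the software actually calls. All function names and data are hypothetical.

```python
import math

def standardize(values):
    """Center and scale a column (population standard deviation, must be non-zero)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson(a, b):
    """Pearson correlation coefficient of two columns of equal length."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)

def leading_axis(matrix, iterations=200):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    image = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vec, image))  # Rayleigh quotient
    return eigenvalue, vec

def correlation_circle(active, supplementary, n_axes=2):
    """Coordinates of every variable on the correlation circle.

    Only the `active` variables are used to build the axes; the
    `supplementary` ones are merely correlated with the resulting components.
    """
    names = list(active)
    p = len(names)
    z = [standardize(active[name]) for name in names]
    n = len(z[0])
    corr = [[pearson(active[a], active[b]) for b in names] for a in names]
    scores = []
    for _ in range(n_axes):
        eigenvalue, axis = leading_axis(corr)
        scores.append([sum(axis[k] * z[k][i] for k in range(p)) for i in range(n)])
        # Deflation: remove the axis just found before searching for the next one.
        corr = [[corr[i][j] - eigenvalue * axis[i] * axis[j] for j in range(p)]
                for i in range(p)]
    everything = {**active, **supplementary}
    return {name: tuple(pearson(col, s) for s in scores)
            for name, col in everything.items()}

# Made-up example data, for illustration only.
active = {
    "water_temp": [10, 12, 15, 18, 21, 19, 14, 11],
    "salinity": [34.0, 34.5, 33.0, 32.5, 31.0, 33.5, 35.0, 34.0],
}
supplementary = {"fluorescence": [2.0, 2.5, 4.0, 5.5, 6.0, 5.0, 3.0, 2.2]}
circle = correlation_circle(active, supplementary)
for name, (x, y) in circle.items():
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

With two active variables and two axes, each active variable lies exactly on the unit circle, while a supplementary variable falls on or inside it. The sketch fails on constant columns and on perfectly collinear active variables, which a real implementation would have to handle.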
Figure 9: Correlation circle of the PCA performed on the example dataset
Learning period selection frame
By default, the program considers the whole period for which data are available to detect states. The user can choose to use only a part of the data to perform the classification by editing the "From" (starting date of the training sample) and "To" (ending date of the training sample) fields.
This splits the data into two samples: a training sample⁷ (composed of the oldest available observations) and a validation sample (the most recent available observations).
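The split described above can be sketched as follows. This is an illustration only: the function names, the inclusive handling of the "To" date and the example data are assumptions, not the behaviour of the program.

```python
from datetime import date, timedelta

def split_by_period(observations, start, end):
    """Split (date, value) pairs into a training and a validation sample.

    Training: observations dated from `start` to `end`, both included (assumption).
    Validation: observations dated after `end`.
    """
    training = [(d, v) for d, v in observations if start <= d <= end]
    validation = [(d, v) for d, v in observations if d > end]
    return training, validation

def training_share(observations, training):
    """Fraction of the available observations that fall in the training sample."""
    return len(training) / len(observations)

# Made-up example: ten daily observations.
first_day = date(2020, 1, 1)
observations = [(first_day + timedelta(days=i), float(i)) for i in range(10)]

training, validation = split_by_period(
    observations, start=date(2020, 1, 1), end=date(2020, 1, 7)
)
print(len(training), len(validation), training_share(observations, training))
# 7 3 0.7
```

The share returned by `training_share` can be compared with the recommended minimum for the training sample.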
As previously mentioned, in this documentation we will use the fluorescence variable to validate the model. However, validation of results in the context of unobservable states/classes is often difficult.
Once the variables and the learning period have been chosen, click the "Run" button to go to the "Classification" tab.
1. http://www.seanoe.org/data/00286/39754/: this dataset has been cleaned (temporal alignment, correction or replacement of outliers, replacement of some missing values...).
2. MAREL = Mesures Automatisées en Réseau pour l’Environnement Littoral (Networked Automated Measurements for the Coastal Environment)
3. Fluorescence is indeed a parameter that can be used to judge the quality of the environment under consideration.
4. A boxplot is a diagram showing the main dispersion characteristics of a univariate sample: the 1st and 3rd quartiles (Q1 and Q3), the median, the whiskers (Wmin = max{value_min ; Q1 - 1.5(Q3 - Q1)} and Wmax = min{value_max ; Q3 + 1.5(Q3 - Q1)}) and the extreme values (points outside the whiskers).
5. Two variables are said to be correlated when their correlation coefficient is close to 1 or -1, and uncorrelated when it is close to 0.
6. PCA is a method that aims to project a cloud of points from a high-dimensional space into a lower-dimensional space with as little distortion as possible [6].
7. It is recommended to use a training sample with at least 70% of the available data.
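The whisker definition given in the boxplot footnote can be sketched as follows. The function name `boxplot_summary` and the example values are assumptions. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is also an assumption, since the documentation does not say which one the software uses.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, whiskers and extreme values as defined in the boxplot footnote."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    w_min = max(min(values), q1 - 1.5 * iqr)  # Wmin = max{value_min ; Q1 - 1.5(Q3 - Q1)}
    w_max = min(max(values), q3 + 1.5 * iqr)  # Wmax = min{value_max ; Q3 + 1.5(Q3 - Q1)}
    extremes = [v for v in values if v < w_min or v > w_max]
    return {"q1": q1, "median": median, "q3": q3,
            "w_min": w_min, "w_max": w_max, "extremes": extremes}

# Made-up example values.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary["w_min"], summary["w_max"], summary["extremes"])  # 1 14.5 [30]
```

Other quartile conventions give slightly different values of Q1 and Q3, and therefore slightly different whiskers.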