Classification and modeling

3 Classification

After selecting the variables and the training sample, use the "Classification" tab (fig 10 and 11) to detect states in the data.

Figure 10: Classification tab for the standard user

Figure 11: Classification tab for the expert user

In standard mode, clicking the "Start" button is enough to launch the procedure. The data are then normalized and clustered by spectral classification, and the number of states is selected automatically using the gap criterion ¹.

In expert mode, the user can choose whether to normalize the data, and can select the classification method (K-means ² or spectral ³ ) and its parameters.

Parameters to be selected in expert mode

First of all, the user has to decide whether the variables should be normalized⁴ or not.

When the data are not normalized, variables with a very wide range of values influence the search for states more than the other variables. Normalization removes this scale effect: all variables carry the same weight during the search for states. The risk of using normalized variables is that a variable carrying little information receives as much weight as the others.
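The scale effect described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code of the interface; the two variables and their values are hypothetical, and the population standard deviation is an arbitrary choice (an implementation could equally use the sample standard deviation).

```python
from statistics import fmean, pstdev

def standardize(column):
    """Center a column on its mean and scale it to unit standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant variable has no spread to rescale; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical measurements: two variables on very different scales.
wide = [10.0, 250.0, 990.0, 40.0]
narrow = [0.1, 0.4, 0.2, 0.3]

wide_z = standardize(wide)
narrow_z = standardize(narrow)

# Before normalization the ranges differ by several orders of magnitude;
# afterwards both columns have mean 0 and standard deviation 1.
print(max(wide) - min(wide), max(narrow) - min(narrow))
print(pstdev(wide_z), pstdev(narrow_z))
```

After this transformation, a distance computed between two observations no longer depends mainly on the variable with the widest range.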

The K-means algorithm used by the interface is the Hartigan-Wong algorithm. The user sets either:

the number of states;
or the explained variance (when the number of states is set to 0): this is the stopping criterion of the algorithm, used to compute the optimal number of states.
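As an illustration of how an explained-variance threshold can determine the number of states, here is a minimal Python sketch. It rests on several assumptions that do not come from the interface: it uses plain Lloyd iterations rather than Hartigan-Wong, a deterministic farthest-first initialization, and it defines the explained variance as one minus the ratio of within-cluster to total sum of squares. The function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def farthest_first(points, k):
    """Deterministic initialization: each new center is the point farthest from those chosen."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations (an assumption; NOT the Hartigan-Wong algorithm)."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = mean_point(members)
    return labels, centers

def explained_variance(points, labels, centers):
    """Share of the total sum of squares accounted for by the partition."""
    grand = mean_point(points)
    total = sum(sq_dist(p, grand) for p in points)
    if total == 0:
        return 1.0
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return 1.0 - within / total

def choose_k(points, threshold, k_max=10):
    """Smallest k whose partition reaches the requested explained variance."""
    for k in range(1, k_max + 1):
        labels, centers = kmeans(points, k)
        if explained_variance(points, labels, centers) >= threshold:
            return k, labels
    return k_max, labels

# Hypothetical toy data: three well-separated groups of four points each.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [[x + dx, y + dy] for dx, dy in [(0, 0), (10, 0), (0, 10)] for x, y in square]

k, labels = choose_k(points, threshold=0.9)
print(k)  # 3 for this toy data
```

With this definition, a higher threshold generally leads to more states, since each additional cluster can only reduce the within-cluster sum of squares of the best partition.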

The spectral classification algorithm used by the interface is the Ng-Jordan-Weiss algorithm. The user sets:

the number of states;
or the stopping criterion to use (when the number of states is set to 0): gap criterion or principal eigenvalues;
and, if the number of data points is greater than 2000 (where computing eigenvalues and eigenvectors has a high memory cost), the explained variance used in the data reduction and in the calculation of the optimal number of symbols.
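The gap criterion can be sketched as follows, assuming it selects the number of states at the largest drop between consecutive eigenvalues sorted in decreasing order. This is an illustrative Python sketch, not the code of the interface; the function name and the eigenvalues are hypothetical, and the eigenvalues are taken as given rather than computed from a similarity matrix.

```python
def number_of_states_by_gap(eigenvalues, k_max=None):
    """Return k such that the gap between the k-th and (k+1)-th largest eigenvalues is maximal."""
    vals = sorted(eigenvalues, reverse=True)
    if len(vals) < 2:
        raise ValueError("at least two eigenvalues are needed to compute a gap")
    if k_max is None:
        k_max = len(vals) - 1
    # gaps[i] is the drop between the (i+1)-th and (i+2)-th largest eigenvalues.
    gaps = [vals[i] - vals[i + 1] for i in range(min(k_max, len(vals) - 1))]
    return gaps.index(max(gaps)) + 1

# Hypothetical spectrum: three eigenvalues close to 1, then a clear drop.
spectrum = [1.0, 0.98, 0.97, 0.41, 0.35, 0.30]
print(number_of_states_by_gap(spectrum))  # 3
```

The optional k_max argument restricts the search to the first gaps, which avoids selecting a large number of states because of a late drop in the spectrum.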

Once the method and the parameters have been chosen, clicking the "Start" button launches the calculations. Warning: these calculations can take up to several hours, depending among other things on the size of the data and the capacity of the computer used.

When the calculations are completed, the raw results of the clustering are saved in the "./Classification/FichiersR/Classification_ date " file. Descriptive statistics by state are saved in the "./Classification/Tableaux/summaryTableClassification.xls" file (tab 1). For example, the mean of the variable C_NI1 over the observations of state (cluster) 2 is 12.35.

Variable / state   Min    Q1      Median   Mean    Q3      Max     NA
C_NI1 Cluster_     0.02   14.45   22.98    26.5    35.6    99.54   0
C_NI1 Cluster_     0.15   4.85    9.27     12.35   18.04   54.29   0
C_NI1 Cluster_     0.01   7.13    19.79    20.39   28.21   86.23   0
C_PO1 Cluster_     0      0.42    0.7      1.428   1.08    24.54   0
...                ...    ...     ...      ...     ...     ...     ...

Table 1: Example of descriptive statistics by state
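A table of this kind can be computed as in the following Python sketch. It is an illustration under assumptions, not the code of the interface: the function name and the data are hypothetical, missing values are represented by None, each state is assumed to contain at least two non-missing values, and the quartile interpolation method may differ from the one used by the interface.

```python
from statistics import fmean, median, quantiles

def summarize_by_state(values, states):
    """Descriptive statistics of one variable, computed separately for each state."""
    groups = {}
    for value, state in zip(values, states):
        groups.setdefault(state, []).append(value)
    table = {}
    for state, vals in groups.items():
        present = [v for v in vals if v is not None]
        # Quartiles by linear interpolation between the sorted values.
        q1, _, q3 = quantiles(present, n=4, method="inclusive")
        table[state] = {
            "Min": min(present),
            "Q1": q1,
            "Median": median(present),
            "Mean": fmean(present),
            "Q3": q3,
            "Max": max(present),
            "NA": len(vals) - len(present),  # missing values of this variable
        }
    return table

# Hypothetical variable and state labels.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, None, 20.0]
states = [1, 1, 1, 1, 1, 2, 2, 2]

for state, row in summarize_by_state(values, states).items():
    print(state, row)
```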

For each variable of the dataset, the program generates and saves, in the "./Classification/Figures/" directory, a boxplot by state (fig 12) and a plot of the evolution of the variable over time, in which each observation is colored according to its state (fig 13). A plot showing the sequencing of the states is also saved in this directory (fig 14) and is displayed in the graphic window of the interface.

Note that observations with a missing value for at least one of the variables of the model are classified in a separate state labelled "NA" (for "Not Available").
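The handling of incomplete observations can be sketched as follows. This Python sketch is illustrative only: the function name is hypothetical, missing values are represented by None, and the classifier passed as an argument is a placeholder standing in for the clustering method.

```python
def assign_states(rows, classify):
    """Label each observation, sending incomplete ones to the separate "NA" state."""
    return ["NA" if any(v is None for v in row) else classify(row) for row in rows]

# Hypothetical observations of two variables; None marks a missing value.
rows = [[3.2, 0.4], [None, 0.7], [25.0, 1.1], [18.0, None]]

# Placeholder classifier (an assumption): two states split on the first variable.
print(assign_states(rows, lambda row: 1 if row[0] < 10 else 2))
# [1, 'NA', 2, 'NA']
```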

This sequence is also saved, along with the Dates and Hours variables, in the "./Classification/Tableaux/Classification_sequendement_etat date .xls" file. If the data contain columns named "latitude" and "longitude", these are also saved in the file.

Figure 12: Boxplot by state of the variable E__TA, with highlighting of the states by their colors

Figure 13: Temporal evolution of the variable E__TA, with highlighting of the states by their colors

Figure 14: Sequencing of the states obtained by classification, with highlighting of the states by their colors

4 Time series modeling

The "Time series modeling" tab is used to estimate⁵ the parameters of a hidden Markov model from the states detected by unsupervised clustering.
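To give an idea of what a model parameter is, the following Python sketch derives transition probabilities from a sequence of states by counting consecutive pairs. This counting approach is an assumption made for illustration; the estimation procedure of the interface may differ, and the function name and the sequence are hypothetical.

```python
from collections import Counter

def transition_matrix(sequence):
    """Estimate P(next state | current state) by counting consecutive pairs."""
    states = sorted(set(sequence))
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        # A state seen only at the very end of the sequence has no outgoing transition.
        matrix[a] = {b: (counts[(a, b)] / row_total if row_total else 0.0) for b in states}
    return matrix

# Hypothetical sequence of states obtained by clustering.
sequence = [1, 1, 1, 2, 2, 1, 3, 3, 3, 3]
matrix = transition_matrix(sequence)
print(matrix[1])  # {1: 0.5, 2: 0.25, 3: 0.25}
```

Each row of the resulting matrix sums to 1 whenever the corresponding state has at least one outgoing transition in the sequence.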

Figure 15: Time Series Modeling Tab

The "Import results from a previous classification" button is used to import the results of a clustering. If a clustering has just been performed, its results are already loaded in the current session and it is therefore not necessary to import them again.

However, if the clustering was done in a previous session, you must use this button to load the results. The file to import is the "./Classification/FichiersR/Classification_date" file, which was saved in the directory selected during the previous session. Warning: the imported clustering must have been performed on the dataset imported during the current session.

Once the results are loaded, you just have to click on the "Run" button to start the estimation of the model parameters. When the estimation is finished, tables and plots of the same type as those obtained from the clustering are saved in the "./Modelisation/" sub-directory.

The plot showing the sequencing of the states obtained with the model is also displayed in the graphical window of the interface. This sequencing may be slightly different from the one obtained after clustering.
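The following Python sketch shows why the two sequences can differ. It decodes the most likely sequence of hidden states with the Viterbi algorithm, under assumptions that are not taken from the interface: emissions are discrete and correspond to the labels produced by clustering, and all probabilities are hypothetical values chosen for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely sequence of hidden states, computed in log space."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            best = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][obs])
            pointers[s] = best
        back.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical model with two persistent states and noisy labels.
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"a": 0.8, "b": 0.2}, "B": {"a": 0.2, "b": 0.8}}

clustered = ["a", "a", "a", "b", "a", "a", "a"]
decoded = viterbi(clustered, states, start_p, trans_p, emit_p)
print(decoded)  # ['A', 'A', 'A', 'A', 'A', 'A', 'A']
```

In this example the isolated label "b" is absorbed into state A, because two changes of state are less likely under the model than one unusual observation.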

5 Prediction

This tab is used to predict the states of a new dataset⁶ with an already estimated model.

Figure 16: Prediction tab

Import results from a previous modeling button

This button is used to import the results of a modeling session performed previously. These results were saved in the "./Modelisation/FichiersR/MarkovModelEstimation_ date " file of the directory selected in the previous session.

If the model parameters have been estimated in this session, then the results are already loaded and it is not necessary to use this button.

Import another TXT dataset button

This button is used to import the dataset on which to perform the prediction. Users who go directly from the "Import" tab to the "Prediction" tab (and have therefore imported a previously estimated model) do not need to use this button: the dataset imported in the "Import" tab will be used.

The user can also reuse the file imported for modeling by checking the "Use the same dataset as for previous step" checkbox.

The variables used are automatically selected from those used in the model. If a variable used to build the model is not present in the data file to be predicted, a warning message is displayed and the prediction cannot be performed.
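This check can be sketched in Python as follows. The function name and the column list are hypothetical, and the sketch is not the code of the interface; the variable names are those appearing in the examples of this guide.

```python
def missing_model_variables(model_variables, dataset_columns):
    """Variables required by the model that are absent from the new dataset."""
    available = set(dataset_columns)
    return [v for v in model_variables if v not in available]

# Hypothetical model variables and columns of the dataset to be predicted.
model_variables = ["C_NI1", "C_PO1", "E__TA"]
dataset_columns = ["Dates", "Hours", "C_NI1", "E__TA", "latitude", "longitude"]

missing = missing_model_variables(model_variables, dataset_columns)
if missing:
    print("Prediction not possible, missing variables:", ", ".join(missing))
```

Columns of the new dataset that are not used by the model, such as "latitude" in this example, do not appear in the result.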

The additional variables will not affect the prediction but will help the interpretation via the generated plots.

Prediction period selection frame

This frame is used to select the period on which to perform the prediction. Users who have chosen to split their data into 2 samples (learning/validation) can use it to select the validation sample.
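Selecting a validation period amounts to keeping the observations whose date falls between two bounds, as in this Python sketch. The record layout, the function name and the dates are hypothetical and serve only as an illustration.

```python
from datetime import datetime

def select_period(rows, start, end):
    """Keep the observations whose timestamp lies in [start, end]."""
    return [row for row in rows if start <= row["date"] <= end]

# Hypothetical observations, one per month.
rows = [{"date": datetime(2014, month, 1), "C_NI1": float(month)} for month in range(1, 13)]

# Learning sample: first nine months; validation sample: last three months.
learning = select_period(rows, datetime(2014, 1, 1), datetime(2014, 9, 30))
validation = select_period(rows, datetime(2014, 10, 1), datetime(2014, 12, 31))
print(len(learning), len(validation))  # 9 3
```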

Predict new data states button

Once the estimated model is loaded and the data are selected, the "Predict new data states" button is used to start the prediction. When the prediction is finished, tables and plots of the same type as those obtained from the classification are saved in the "/Prediction/" sub-directory. The plot showing the sequence of predicted states is also displayed in the interface's graphic window.

References

[1] Rousseeuw, K. (2014). Modélisation de signaux temporels hautes fréquences, multicapteurs à valeurs manquantes. Application à la prédiction des efflorescence phytoplanctoniques dans les rivières et les écosystèmes marins côtiers. PhD thesis, defended on 11 December 2014.

[2] Rousseeuw, K., Poisson Caillault, E., Lefebvre, A., & Hamad, D. (2015). Hybrid hidden Markov model for marine environment monitoring. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(1), 204-213. http://dx.doi.org/10.1109/JSTARS.2014.

[3] Lefebvre, A. (2015). MAREL Carnot data and metadata from Coriolis Data Centre. SEANOE. http://doi.org/10.17882/39754.

[4] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100-108.

[5] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2, 849-856.

¹ Gap between eigenvalues.
² Method capable of separating groups of data with non-overlapping globular shapes [4].
³ Method for clustering non-linearly separable data using the spectrum of the similarity matrix between measurements [5].
⁴ Centered and reduced (each variable is shifted to zero mean and scaled to unit standard deviation).
⁵ Using the Viterbi algorithm.
⁶ This dataset must have the same structure as the one used to estimate the model parameters.