Introduction

Segmenting a series in an unsupervised way to extract its important, recurrent or extreme events, and then predicting these events, is a topic of concern in several fields. The desire to characterize the dynamics of a series is motivated by the need to:

increase the understanding of the processes involved in this series;
adapt operational sampling strategies.

The succession of stages leading to a recurrent or extreme event (a peak in metal concentration in a river, or a peak in phytoplankton concentration in fresh or marine waters, for example) can be seen as a path through environmental states guided by both the observations and their sequence (with a high level of dependence in the succession of observations). We can then represent the dynamics of an event with a connected graph in which a node represents an environmental state and an edge represents the possibility of moving from one state to another. Often, environmental states are not directly observable, unlike the physico-chemical and biological parameters.

The use of an ergodic Hidden Markov Model (HMM) seems to be the natural approach to characterizing the dynamics of events when the only observations are the physico-chemical and biological parameters. Building a hidden Markov model requires estimating all of its parameters. The parameters of the HMM to be defined are:

the number of states;
the transition laws between states and the emission laws of these states;
the characterization of these states.
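To make the parameters listed above concrete, here is a minimal sketch in Python with NumPy. It assumes a hypothetical 3-state ergodic HMM with 4 discrete observation symbols; the matrices `A`, `B` and `pi`, their values, and the helper names `is_row_stochastic` and `viterbi` are illustrative assumptions, not taken from the interface described in this documentation. The Viterbi decoder is the standard textbook algorithm, shown only to illustrate how a state path can be recovered from observations once the parameters are known.

```python
import numpy as np

# Hypothetical 3-state ergodic HMM over 4 discrete observation symbols.
# All numbers are illustrative; none are estimated from real data.
n_states = 3

# Transition law: A[i, j] = P(state j at t+1 | state i at t).
# Every entry is positive, so any state can be reached from any other (ergodic).
A = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.25, 0.05, 0.70],
])

# Emission law: B[i, k] = P(symbol k observed | state i).
B = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.30, 0.60],
])

# Initial state distribution (uniform, as an assumption).
pi = np.full(n_states, 1.0 / n_states)


def is_row_stochastic(m, tol=1e-9):
    """Check that every row is a probability distribution."""
    return bool(np.all(m >= 0) and np.allclose(m.sum(axis=1), 1.0, atol=tol))


def viterbi(obs, A, B, pi):
    """Most likely hidden state path for a sequence of symbol indices."""
    obs = np.asarray(obs)
    T, N = len(obs), A.shape[0]
    log_A, log_B = np.log(A), np.log(B)
    # delta[i]: best log-probability of any path ending in state i at time t.
    delta = np.log(pi) + log_B[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, then moving to j.
        scores = delta[:, None] + log_A
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


path = viterbi([0, 0, 1, 1, 3, 3, 3], A, B, pi)
print(path)
```

With these illustrative numbers, the decoded path is 0, 0, 1, 1, 2, 2, 2: the hidden state changes when the observed symbols change. In the unsupervised setting described here, none of these matrices are given in advance; they have to be estimated from the measurements.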

Usually, the parameters of the HMM are learned from a labeled database or fixed from a priori information. Here, we address the issue of predicting extreme events using a hybrid unsupervised hidden Markov model built from a multidimensional database acquired at high frequencies (with respect to the process under study) or at low frequencies (but over a sufficient period of time).

The main idea was to build an automatic system for estimating the characteristic environmental states from measurements acquired at high temporal resolution, which may contain missing or outlying values. No knowledge of the states, their characterization or their sequencing is supplied when the system is built; it has to learn this information automatically from the measurements alone.

The number of states is obtained from a criterion related to the data geometry, using spectral clustering. The states are characterized by vector quantization¹ of the known data, so as to avoid any hypothesis on their distribution. Once an unsupervised HMM (MMC-NS) has been built, it can then be used to predict the states of another series, provided that series has an identical structure (same variables, same types of events).
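The two steps described above can be sketched as follows. This is only an illustration under stated assumptions: the exact geometric criterion used by the interface is not specified here, so the sketch substitutes the eigengap heuristic, a common way to pick a number of clusters in spectral clustering. The function names `estimate_n_clusters` and `quantize`, the Gaussian kernel width `sigma`, the synthetic data and the hand-picked codebook are all hypothetical.

```python
import numpy as np


def estimate_n_clusters(X, sigma=1.0, max_k=10):
    """Eigengap heuristic on the normalized graph Laplacian (illustrative).

    This is a common spectral-clustering criterion, used here as a
    stand-in; it is not necessarily the criterion used by the interface.
    """
    # Pairwise squared Euclidean distances and Gaussian similarity graph.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigvalsh returns eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(L)[:max_k + 1]
    # The largest gap between consecutive eigenvalues suggests k.
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1


def quantize(X, codebook):
    """Vector quantization: index of the nearest prototype for each row."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Synthetic data: three well-separated groups in two dimensions
# (hypothetical, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

k = estimate_n_clusters(X)
# The true centers stand in for a learned codebook of representative states.
symbols = quantize(X, centers)
print(k, symbols[:5], symbols[-5:])
```

On this deliberately easy synthetic data, the heuristic recovers three groups and every observation is mapped to the prototype of its own group. Because quantization only relies on distances to prototypes, it makes no assumption about the distribution of the observations, which is the property the text above highlights. Real measurements with overlapping states, missing values or outliers would need more care, in particular in the choice of `sigma`.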

The purpose of this interface is to allow standard users who are not statisticians to model a complex physical phenomenon as a probabilistic graph with a finite number of states, from multi-parameter temporal observations and without any a priori knowledge. The illustrations presented in this documentation address the problem of modeling the dynamics of phytoplankton blooms in the coastal zone of Boulogne-sur-Mer, without knowledge of the seasonal succession of phytoplankton taxa and biomass in general, from Marel-Carnot data².

¹ Mapping of the attribute space by representative states.
² http://www.seanoe.org/data/00286/39754/: this dataset has been cleaned (temporal alignment, correction of outliers, replacement of some missing values...).