Set your working directory (folder where the raw data is located).

rm(list=ls(all=TRUE))
graphics.off()
repDonnees="~/Protocole_uHMM/MAREL-Carnot"
The "~" symbol refers to the user's home directory under Linux. Under Windows, the path will look like "C:/Program Files/dossier_du_programme/" (in R, use forward slashes or write each backslash as "\\").

The file of interest, located in the working directory, is then imported. The "data" vector lists the contents of the folder that holds the raw data. The second line of code selects the raw data file to be used: it is identified through the "pattern" argument, which is set to part of the name of the desired file.

The file is then read into R. In our case the field separator is a comma and the decimal mark is a point, both of which must be specified. The "header" argument indicates whether the first line contains the parameter names. The resulting table is called df.

data=list.files(path=repDonnees)
files.csv=list.files(path=repDonnees, full.names=TRUE, pattern=c("2216.csv"))
df=read.csv(files.csv, header=TRUE, sep=",", dec=".")
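As a minimal sketch of how the "pattern" argument selects a file, the following builds a throwaway folder and reads the single matching file. The folder, file names and column names are hypothetical and only serve the illustration. Note that "pattern" is interpreted as a regular expression, which is why the dot is escaped here.

```r
# Hypothetical demo folder in the session's temporary directory.
folder <- file.path(tempdir(), "demo_raw_data")
dir.create(folder, showWarnings = FALSE)

# Two files: only the first one should be picked up by the pattern.
write.csv(data.frame(DATE = "2016-03-01T00:10:00Z", TEMP = 8.1),
          file.path(folder, "station_2216.csv"), row.names = FALSE)
writeLines("notes", file.path(folder, "readme.txt"))

# "\\." matches a literal dot and "$" anchors the match at the end of the name.
matches <- list.files(path = folder, full.names = TRUE, pattern = "2216\\.csv$")

# Guard against reading zero or several files by mistake.
stopifnot(length(matches) == 1)
df_demo <- read.csv(matches, header = TRUE, sep = ",", dec = ".")
print(names(df_demo))
```

Checking the number of matches before calling read.csv avoids a confusing error when the pattern is too loose or too strict.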

Keep a subset of the data.

nCol=ncol(df)
nLig=nrow(df)
N=ceiling(0.01*nLig); # takes 1% of the data, to be modulated according to the number of data and the desired alignment.
df=df[1:N,] # data from 1st value to N (1% of total data).

Alignments

Only the column containing the date will be extracted. First, the index of the date column is retrieved; "names(df)" gives the names of all the columns of df. The "pattern" argument must match the name of the date column as it appears in df (grep treats it as a regular expression, so "." matches any character).

names(df) # Visualization of the names of all df columns.
idate=grep(pattern="DATE..yyyy.mm.ddThh.mi.ssZ", names(df),ignore.case=T); idate
The date column is the 2nd column of the table.

The extraction of this column is then done.

date=as.character((df)[,idate]); head(date)

Then it is converted to POSIXt format (YYYY-MM-DD hh:mm:ss). The "format" argument must describe exactly the initial layout of the date column.
%Y is the year
%m is the month
%d is the day
%H is the hour
%M is the minutes
%S is the seconds

df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)
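The conversion can be tried on a few values before applying it to the whole column. This is a small sketch with made-up timestamps; the explicit tz="UTC" is an assumption (the trailing "Z" conventionally denotes UTC, but strptime reads it as a literal character and does not infer the time zone from it).

```r
# Hypothetical sample values in the same layout as the date column.
raw <- c("2016-03-01T00:10:00Z", "2016-03-01T00:30:00Z", "not a date")

# tz = "UTC" is an assumption; without it the local time zone is used.
parsed <- strptime(raw, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Values that do not match the format become NA instead of raising an error,
# so counting the NAs is a quick sanity check of the format string.
n_failed <- sum(is.na(parsed))
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
print(n_failed)
```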

Make the series regular: Several examples illustrated (1 month, 1 week, 1 day, 1 hour, 30 minutes, 20 minutes, 10 minutes)

1 month

The date is extracted (using "%F", which corresponds to YYYY-MM-DD) into a vector named "date", and the month of each acquisition is extracted into another vector named "mois".

date=format(df$temps,"%F"); head(date)
mois=as.numeric(format(df$temps,"%m")); head(mois)

Each acquisition is assigned to the 15th day of its month, and the result is converted back to POSIXt format.

df$temps=strptime(format(df$temps, "%Y-%m-15"),format="%Y-%m-%d"); head(df$temps)

The sequence "a" is created, running from the oldest to the most recent date. The frequency is set by the "by" argument.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date
a=seq(min.date,max.date,by="1 month"); head(a); tail(a)

Creation of the table, including only the vector "a".

df2=data.frame(temps=as.POSIXct(a)); df2[1:3,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]
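The effect of merging with the regular sequence can be seen on a toy example. This is only a sketch with invented values and a hypothetical "value" column: months that have no observation appear in the merged table with NA.

```r
# Regular monthly grid from January to June 2016 (UTC is an assumption).
first <- as.POSIXct("2016-01-15", tz = "UTC")
last  <- as.POSIXct("2016-06-15", tz = "UTC")
grid  <- seq(first, last, by = "1 month")

# Toy observations: only January and April are present.
observed <- data.frame(
  temps = as.POSIXct(c("2016-01-15", "2016-04-15"), tz = "UTC"),
  value = c(1.5, 2.5)
)

# all = TRUE keeps every date of both tables; the common column "temps"
# is used as the merge key.
full <- merge(observed, data.frame(temps = grid), all = TRUE)
print(full)
```

The merged table has one row per month of the grid, which is what makes the series regular.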

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df1.M=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
                FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
                {out=max(x,na.rm=T);}; out}); df1.M[1:7,1:5]
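The aggregation step can be illustrated on a small invented table. The sketch below uses hypothetical column names; the helper returns NA when a group holds no value at all, which is the role of the NA test in the anonymous function above (max with na.rm=TRUE on an all-NA vector would otherwise return -Inf with a warning).

```r
# Toy table: two rows share the January and March dates, February is empty.
toy <- data.frame(
  temps = c("2016-01-15", "2016-01-15", "2016-02-15", "2016-03-15", "2016-03-15"),
  temperature = c(8.1, 9.4, NA, 10.2, NA),
  salinity = c(34.0, NA, NA, 33.5, 33.9)
)

# Maximum of the available values, or NA if every value is missing.
max_or_na <- function(x) {
  if (all(is.na(x))) NA else max(x, na.rm = TRUE)
}

# One row per distinct date; FUN is applied to each selected column.
monthly <- aggregate(toy[, c("temperature", "salinity")],
                     by = list(temps = toy$temps),
                     FUN = max_or_na)
print(monthly)
```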

Removing unnecessary data for further processing.

rm(date); rm(mois); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(numPara); rm(df1.M)

1 week

Extraction of the year for all acquisitions.

annees=format(df$temps,"%Y"); head(annees)

The day of the year is retrieved for each acquisition (for example, February 1 is the 32nd day of the year).

numJour=as.numeric(strftime(df$temps, "%j")); head(numJour)

The first day of each week is defined: one every 7 days, always starting with the first day of the year.

numJourSemaine=seq(1,364,by=7); numJourSemaine

Values recorded between the 1st and the 7th day are considered as acquired on the 1st day, values recorded between the 8th and the 14th day as acquired on the 8th day, and so on. A constraint is added for the last days, since a year is fixed at no more than 52 weeks: the last week starts on day 358, and all values recorded from day 358 to day 366 are assigned to it.

numJourT=sapply(numJour,function(x){index=max(which(x>=numJourSemaine));return(numJourSemaine[index])}); head(numJourT)
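The same assignment can be checked on a few hand-picked days. This sketch uses findInterval as a vectorised alternative to the sapply call; the day numbers are invented for the illustration.

```r
week_starts <- seq(1, 364, by = 7)            # 1, 8, ..., 358 (52 values)
day_of_year <- c(1, 7, 8, 32, 357, 358, 366)  # hypothetical day numbers

# findInterval() returns, for each day, the index of the last week start
# that is less than or equal to that day.
assigned <- week_starts[findInterval(day_of_year, week_starts)]
print(assigned)
```

Days 358 to 366 all fall in the last interval, so they are assigned to the week starting on day 358.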

Creation of a vector, where the year, the month and the day corresponding to the first day of each week will be integrated.

df$temps=strptime(paste(numJourT,annees,sep=" "), "%j %Y"); head(df$temps)
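A quick way to see what "%j %Y" produces is to convert a few day numbers by hand. The values below are illustrative, and tz="UTC" is an assumption made to keep the result independent of the local time zone.

```r
# Day of year followed by the year, as in the paste() call above.
day_year <- paste(c(1, 32, 358), "2016", sep = " ")

# "%j" is the day of the year and "%Y" the four-digit year.
first_days <- strptime(day_year, "%j %Y", tz = "UTC")
print(format(first_days, "%Y-%m-%d"))
```

Because 2016 is a leap year, day 358 falls one calendar day earlier than it would in a non-leap year.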

Retrieve all the years present in the data, keeping a single copy of each.
Then each year is repeated as many times as there are weeks (52). This vector will be our date.

a=as.character(sort(unique(as.numeric(annees)))); head(a)
date=rep(a,each=52); head(date)

We create a vector containing the day of the year on which each week begins, then paste it to the date.

jour=rep(seq(1,364,by=7), length(a)); head(jour)
d=paste(date,jour,sep="-"); head(d)

Creation of the table containing only the dates. The dates created previously contain only the first day of each week, so there are 52 values per year.

df2=data.frame(temps=strptime(d,format="%Y-%j")); df2[1:10,]

Merge the previous table with the initial table.

df1=merge(df,df2, by=intersect(names(df),names(df2)),all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df.W=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
               FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
               {out=max(x,na.rm=T);}; out}); df.W[1:7,1:5]

Remove the objects that are no longer needed.

rm(df1); rm(df2); rm(numPara); rm(df.W); rm(d); rm(jour); rm(a); rm(date)

1 day

Extract the column containing the date and convert it to POSIXt format.

date=as.character((df)[,idate]); head(date)
df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)

The date is extracted again (using "%F", which corresponds to YYYY-MM-DD). The day of the month is also extracted on its own, then the timestamp is put back in POSIXt format with the hours, minutes and seconds dropped.

date=format(df$temps,"%F"); head(date)
jours=as.numeric(format(df$temps,"%d")); head(jours)
df$temps=strptime(format(df$temps, "%F"),format="%Y-%m-%d"); head(df$temps)

A sequence "a" is created that starts from the oldest date and ends at the most recent date.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date
a=seq(min.date,max.date,by="1 day"); head(a); tail(a)

The time of day, which is added automatically, is then removed from this sequence.

d=format(a,"%F"); head(d)

Table creation, including only the previously created sequence (d).

df2=data.frame(temps=strptime(d,format="%Y-%m-%d")); df2[1:10,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df.J=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
                FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
                {out=max(x,na.rm=T);}; out}); df.J[1:7,1:5]

Remove the objects that are no longer needed.

rm(date); rm(jours); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(d); rm(df.J)

1 hour

Extract the column containing the date and convert it to POSIXt format.

date=as.character((df)[,idate]); head(date)
df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)

The date is extracted into a vector named "date", and the hour (HH) into a vector named "heures".

date=format(df$temps,"%F"); head(date)
heures=as.numeric(format(df$temps,"%H")); head(heures)

The timestamp is rebuilt from the date and the hour, with the minutes and seconds set to 00:00.
It is then converted back to POSIXt format.

df$temps=strptime(format(df$temps, "%F %H:00:00"),format="%Y-%m-%d %H:%M:%S"); head(df$temps)
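The rounding down to the hour can be checked on a few invented timestamps. The sketch below also compares the result with trunc(), which is another base R way of dropping minutes and seconds; tz="UTC" is an assumption made to keep the example reproducible.

```r
# Hypothetical acquisition times.
x <- as.POSIXct(c("2016-03-01 10:05:00",
                  "2016-03-01 10:55:00",
                  "2016-03-01 11:00:00"), tz = "UTC")

# Same idea as above: keep date and hour, force minutes and seconds to 00.
hourly <- as.POSIXct(format(x, "%Y-%m-%d %H:00:00"), tz = "UTC")

# trunc() with units = "hours" gives the same instants.
same_as_trunc <- all(as.POSIXct(trunc(x, units = "hours")) == hourly)

print(format(hourly, "%Y-%m-%d %H:%M:%S"))
print(same_as_trunc)
```

The first two timestamps collapse onto the same hour, which is what creates the duplicates that are aggregated later.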

The regular date column, at the desired frequency, will start from the oldest date and end at the most recent date. These two dates are retrieved here.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date

The sequence "a" is created, running from the oldest to the most recent date. The frequency is set by the "by" argument.

a=seq(min.date,max.date,by="1 hour"); head(a); tail(a)

Creation of the table, including only the sequence "a" (date, hour, minutes and seconds).

df2=data.frame(temps=strptime(format(a,"%Y-%m-%d %H:%M:%S"),format="%Y-%m-%d %H:%M:%S")); df2[1:10,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df10=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out}); df10[1:7,1:5]

Remove the objects that are no longer needed.

rm(date); rm(heures); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(numPara); rm(df10)

30 minutes

Extract the column containing the date and convert it to POSIXt format.

date=as.character((df)[,idate]); head(date)
df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)

The date and hour are extracted into a vector named "date".
The "minutes" vector contains the minutes of each acquisition.

date=format(df$temps,"%F %H"); head(date)
minutes=as.numeric(format(df$temps,"%M")); head(minutes)

We consider that [hh:00, hh:30[ = hh:15 and [hh:30, hh:60[ = hh:45.

minute=rep("45",length(minutes));
minute[minutes<30]="15"; head(minute)

The newly created "minute" vector is appended to the "date" vector (which contains the year, month, day and hour extracted before the modification), with the seconds set to "00". The result is then converted to POSIXt format.

d=paste(date,minute,"00",sep=":"); head(d)
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S"); head(df$temps)

The regular date column, at the desired frequency, will start from the oldest date and end at the most recent date. These two dates are retrieved here.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date

The sequence "a" is created, running from the oldest to the most recent date. The frequency is set by the "by" argument.

a=seq(min.date,max.date,by="1 hour"); head(a); tail(a)
Here the sequence is built with a step of one hour (seq also accepts sub-hourly steps such as by="30 min"). Each hour is then split in 2 to obtain the desired 30-minute frequency.

"%F" stands for year, month and day.

date=rep(format(a,"%F %H"),each=2); head(date)

The minutes alternate between 15 and 45 over the whole length of the created sequence.
This minutes vector is pasted to the "date" vector, inserting the separator and adding the seconds (00).

minutes=rep(c("15","45"), length(a)); head(minutes)
d=paste(date,minutes,"00",sep=":"); head(d)
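For comparison, a 30-minute grid can also be generated in a single call, because seq on date-times accepts steps expressed in minutes. The sketch below uses invented start and end times that already sit on the hh:15 / hh:45 marks; tz="UTC" is an assumption.

```r
# Hypothetical first and last acquisition slots.
start <- as.POSIXct("2016-03-01 00:15:00", tz = "UTC")
end   <- as.POSIXct("2016-03-01 02:45:00", tz = "UTC")

# One value every 30 minutes, both ends included.
grid <- seq(start, end, by = "30 min")
print(format(grid, "%H:%M"))
```

With this variant the grid starts exactly at the first slot, whereas the hourly construction above always emits both the hh:15 and the hh:45 slot of the first and last hour.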

Creation of the table, including only the "date" vector (with the newly created hour, minutes and seconds).

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S")); df2[1:10,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df30=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out}); df30[1:7,1:5]

Remove the objects that are no longer needed.

rm(date); rm(minutes); rm(minute); rm(min.date); rm(max.date); rm(a); rm(d); rm(df2); rm(df1); rm(numPara); rm(df30)

20 minutes

Extract the column containing the date and convert it to POSIXt format.

date=as.character((df)[,idate]); head(date)
df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)

The date and hour are extracted into a vector named "date".
The "minutes" vector contains the minutes of each acquisition.
"%F" corresponds to the year, month and day.

date=format(df$temps,"%F %H"); head(date)
minutes=as.numeric(format(df$temps,"%M")); head(minutes)

We consider that [hh:00, hh:20[ = hh:10, [hh:20, hh:40[ = hh:30 and [hh:40, hh:60[ = hh:50.

minute=rep("10",length(minutes))
minute[minutes>=20]="30"
minute[minutes>=40]="50"; head(minute)
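The interval boundaries are the part that is easiest to get wrong, so it is worth testing them on the edge values. This sketch expresses the three half-open intervals with cut(); the minute values are chosen by hand to sit on each side of the boundaries.

```r
# Edge values around the 20 and 40 minute boundaries.
minutes <- c(0, 19, 20, 39, 40, 59)

# right = FALSE makes every interval closed on the left and open on the
# right: [0,20), [20,40), [40,60).
minute <- as.character(cut(minutes,
                           breaks = c(0, 20, 40, 60),
                           labels = c("10", "30", "50"),
                           right = FALSE))
print(minute)
```

Minute 20 must land in the hh:30 slot and minute 40 in the hh:50 slot for the result to agree with the intervals stated above.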

The newly created "minute" vector is appended to the "date" vector (which contains the year, month, day and hour extracted before the modification), with the seconds set to "00". The result is then converted to POSIXt format.

d=paste(date,minute,"00",sep=":"); head(d)
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S"); head(df$temps)

The regular date column, at the desired frequency, will start from the oldest date and end at the most recent date. These two dates are retrieved here.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date

The sequence "a" is created, running from the oldest to the most recent date. The frequency is set by the "by" argument.

a=seq(min.date,max.date,by="1 hour"); head(a); tail(a)
Here the sequence is built with a step of one hour (seq also accepts sub-hourly steps such as by="20 min"). Each hour is then split in 3 to obtain the desired 20-minute frequency.

"%F" stands for year, month and day.

date=rep(format(a,"%F %H"),each=3); head(date)

The minutes alternate between 10, 30 and 50 over the whole length of the created sequence.
This minutes vector is pasted to the "date" vector, inserting the separator and adding the seconds (00).

minutes=rep(c("10","30","50"), length(a)); head(minutes)
d=paste(date,minutes,"00",sep=":"); head(d)

Creation of the table, including only the "date" vector (with the newly created hour, minutes and seconds).

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S")); df2[1:10,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df20=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out}); df20[1:7,1:5]

Remove the objects that are no longer needed.

rm(date); rm(minutes); rm(minute); rm(min.date); rm(max.date); rm(a); rm(d); rm(df2); rm(df1); rm(numPara); rm(df20)

10 minutes

Extract the column containing the date and convert it to POSIXt format.

date=as.character((df)[,idate]); head(date)
df$temps=strptime(date, format="%Y-%m-%dT%H:%M:%SZ"); head(df$temps)

The date and hour are extracted into a vector named "date".
The "minutes" vector contains the minutes of each acquisition.

date=format(df$temps,"%F %H"); head(date)
minutes=as.numeric(format(df$temps,"%M")); head(minutes)

We consider that [hh:00, hh:10[ = hh:05, [hh:10, hh:20[ = hh:15, [hh:20, hh:30[ = hh:25, [hh:30, hh:40[ = hh:35, [hh:40, hh:50[ = hh:45 and [hh:50, hh:60[ = hh:55.

minute=rep("05",length(minutes))
minute[minutes>=10]="15"
minute[minutes>=20]="25"
minute[minutes>=30]="35"
minute[minutes>=40]="45"
minute[minutes>=50]="55"; head(minute)
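The 30-, 20- and 10-minute cases all follow the same rule: each minute is replaced by the centre of the slot it falls in. As a sketch, that rule can be written once with integer division; bin_minutes is a hypothetical helper name and is not part of the original protocol.

```r
# Replace each minute by the centre of its slot of the given width.
# The width must divide 60 evenly (for example 10, 20 or 30).
bin_minutes <- function(minutes, width) {
  stopifnot(60 %% width == 0)
  centre <- (minutes %/% width) * width + width %/% 2
  sprintf("%02d", as.integer(centre))  # two digits, e.g. "05"
}

print(bin_minutes(c(0, 9, 10, 20, 50, 59), 10))
print(bin_minutes(c(0, 19, 20, 40), 20))
print(bin_minutes(c(29, 30), 30))
```

Writing the rule once avoids keeping several chains of comparisons consistent with each other.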

The newly created "minute" vector is appended to the "date" vector (which contains the year, month, day and hour extracted before the modification), with the seconds set to "00". The result is then converted to POSIXt format.

d=paste(date,minute,"00",sep=":"); head(d)
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S"); head(df$temps)

The regular date column, at the desired frequency, will start from the oldest date and end at the most recent date. These two dates are retrieved here.

min.date=min(df$temps,na.rm=T); min.date
max.date=max(df$temps,na.rm=T); max.date

The sequence "a" is created, running from the oldest to the most recent date. The frequency is set by the "by" argument.

a=seq(min.date,max.date,by="1 hour"); head(a); tail(a)
Here the sequence is built with a step of one hour (seq also accepts sub-hourly steps such as by="10 min"). Each hour is then split in 6 to obtain the desired 10-minute frequency.

date=rep(format(a,"%F %H"),each=6); head(date)

The minutes alternate between 05, 15, 25, 35, 45 and 55 over the whole length of the created sequence.
This minutes vector is pasted to the "date" vector, inserting the separator and adding the seconds (00).

minutes=rep(c("05", "15","25", "35", "45", "55"), length(a)); head(minutes)
d=paste(date,minutes,"00",sep=":"); head(d)

Creation of the table, including only the "date" vector (with the newly created hour, minutes and seconds).

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S")); df2[1:10,]

Merge this table with the initial raw data table; the result is df1.

df1=merge(df,df2, all=TRUE); df1[1:7,1:3]

Select the columns containing the data that you want to aggregate. Note that these columns must be numeric.

The class of each parameter can be checked as follows:

classe=sapply(df, FUN=class); head(classe)
numPara=5:26 # selected columns: 5 to 26

From this new table only the desired columns are extracted, and only the maximum value of the created duplicates is kept.

df10=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out}); df10[1:7,1:5]

Creation of the file compatible with uHMM

The uHMM interface imposes an additional condition on the aligned file: it must contain one column holding only the date and another column holding only the time. Two vectors are therefore created, one containing the date and the other containing the time.
These 2 lines of code do this:

colonneDate=as.character(format(df$temps,"%F"))
colonneHeure=as.character(format(df$temps,"%X"))
"df" is the data table that you want the interface to read. "%F" extracts the year, month and day (YYYY-MM-DD) and "%X" extracts the hours, minutes and seconds (hh:mm:ss). Note that "%X" is locale-dependent; "%H:%M:%S" gives the same layout explicitly.

The original time column will then be removed to make room for the 2 columns created previously. Its index is identified first.

iTemps=grep(pattern="temps",names(df))

Then we create a new data table with the date as first variable and the time as second variable. These 2 columns must be named exactly "Dates" and "Hours" (a requirement of the interface). The original time column is removed at the same time.

df.cor=data.frame(Dates=colonneDate,Hours=colonneHeure,(df[,-iTemps]))

All that remains is to transform it into a text format (.txt).

write.table(df.cor,file="uHMMcorrectedQuentin.txt",dec='.',sep="\t", quote=FALSE, row.names=FALSE)
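The whole export step can be tried on a two-row toy table before running it on real data. This is a sketch under stated assumptions: the table, its "temperature" column and the temporary output file are invented, and the time is formatted with "%H:%M:%S" rather than "%X" so that the result does not depend on the locale.

```r
# Toy aligned table with a POSIXct time column named "temps".
toy <- data.frame(
  temps = as.POSIXct(c("2016-03-01 00:05:00", "2016-03-01 00:15:00"), tz = "UTC"),
  temperature = c(8.1, 8.3)
)

# Exact match on the column name, so that a column such as "temps2"
# would not be selected by mistake.
i_temps <- grep("^temps$", names(toy))

# drop = FALSE keeps the remaining columns as a data frame (and keeps
# their names) even when only one column is left.
out <- data.frame(
  Dates = format(toy$temps, "%Y-%m-%d"),
  Hours = format(toy$temps, "%H:%M:%S"),
  toy[, -i_temps, drop = FALSE]
)

# Write to a temporary file and read the raw lines back for inspection.
path <- tempfile(fileext = ".txt")
write.table(out, file = path, dec = ".", sep = "\t",
            quote = FALSE, row.names = FALSE)
first_lines <- readLines(path)
print(first_lines)
```

Reading the file back with readLines makes it easy to confirm that the header starts with "Dates" and "Hours" and that the fields are tab-separated.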