Data formatting protocol

This protocol allows you to format your data in csv format according to the protocol required by the interface. We assume that the time information (dates and times) are in a single cell. The time is not required.

You will need to enter the codes for the 3 steps below in an R script or console: read, align, create. You can use the square on the right of the code ("copy to clipboard"). Step 2 of the alignment depends on your sampling wish, we give you some examples: 10 minutes, 20 minutes, 30 minutes, 1 day, 1 week, 1 month...

For more explanations on the lines of code, see the protocol with the MAREL-Carnot example (MAREL-Carnot example).

File reading

Define its working directory (folder in which the raw data is located) and the information associated with your data, file name, format of the dates in this file, name of the column containing the dates, ...

Please fill in the "to be filled in" fields.

rm(list=ls(all=TRUE))
graphics.off()
repDonnees="to be filled in"  #linux example :~/Protocole_uHMM/MAREL-Carnot"
setwd(repDonnees)
nomfichier="to be filled in" #File name ex MarelCarnot.csv
nomDate="to be filled in" # column name of the time information

# format of your column d=day, m=month, y=year in 2 digits Y for 4
formatDate="to be filled in " # ex %d/%m/%y %H:%M:%S"
separateur="," # to be changed if needed
decimale="." # to be changed if needed

#file reading
df=read.csv(nomfichier,header=T,sep=separateur, dec=decimale)
dim(df) # visualization of the number of observations and columns
names(df) # Visualization of the names of all df columns.

#time formatting in POSIXct machine format
idate=grep(pattern=nomDate, names(df),ignore.case=T)
date=as.character((df)[,idate])
df$temps=strptime(date, format=formatDate)

Note that the date is read and put in the POSIXt format (AAAA-MM-JJ hh:mm:ss). The "format" argument takes exactly the original format of the date column.
%Y corresponds to years described by 4 digits like 2017 otherwise %y for 17
%m is the month
%d is the day
%H is the time
%M corresponds to minutes
%S corresponds to seconds

Alignments

The purpose of this step is to make the series regular. Any sensor will be synchronized to a single step (max rule if the sensor responds twice in the same interval). Moreover, if one (or more) sensor(s) did not respond to some steps, the data for this missing time step will be noted NA (not Available). This will allow to compute missing values rates over a period of time, to better estimate the probabilities of occurrence, ... Several examples of alignment with different time steps are detailed: $(1 month, 1 week, 1 day, 1 hour, 30 minutes, 20 minutes, 10 minutes)$ . Choose according to your original data and the interpretation to be done.

1 month alignment

#a1 step
date=format(df$temps,"%F")
mois=as.numeric(format(df$temps,"%m"))
df$temps=strptime(format(df$temps, "%Y-%m-15"),format="%Y-%m-%d")
#a2 step
min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 month")
#a3 step
df2=data.frame(temps=as.POSIXct(a))
df1=merge(df,df2, all=TRUE)
#a4 step
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") # possible to manually select ex numPara=3:6
df1.M=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
                FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
                {out=max(x,na.rm=T);}; out})
df1.M$temps=strptime(df1.M$temps,format="%Y-%m-%d")
dfa=df1.M
#Deletion of unnecessary data for further processing.
rm(date); rm(mois); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(numPara); rm(df1.M)

Step by step explanations

(a1) Extraction of the date (thanks to the "%F" which corresponds to YYYY-MM-DD) in a vector named date. And extraction of the month of each acquisition in another vector named month.
We consider the moment of the acquisition equal to the 15^th day of each month, it is then put under format POSIXt.

(a2) The creation of the sequence "a", which goes from the oldest to the most recent date. In addition, the frequency is inserted by the "by" argument.

(a3) Creation of the array, including only the vector "a".
Then merge the array with the initial raw data array (df1).

(a4) Selection of the columns containing the data that you want to aggregate. Attention these columns must be of numeric type. Here is how to see the class of parameters. From this new table only the desired columns are extracted, and we keep only the maximum value of the created duplicates.

1 week alignment

Extraction of the year for all acquisitions.
The number of the day is located. That is to say it gives the day in Julian (example: the 1^st of February corresponds to the 32^nd day of the year).
The data acquired every 7 days are entered, always starting with the 1^st day of the year.

annees=format(df$temps,"%Y")
numJour=as.numeric(strftime(df$temps, "%j"))
numJourSemaine=seq(1,364,by=7)
numJourT=sapply(numJour,function(x){index=max(which(x>=numJourSemaine));return(numJourSemaine[index])})
df$temps=strptime(paste(numJourT,annees,sep=" "), "%j %Y")
a=as.character(sort(unique(as.numeric(annees))))
date=rep(a,each=52)
jour=rep(seq(1,364,by=7), length(a))
d=paste(date,jour,sep="-")

df2=data.frame(temps=strptime(d,format="%Y-%j"))
df1=merge(df,df2, by=intersect(names(df),names(df2)),all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric")
df1.W=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
               FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
               {out=max(x,na.rm=T);}; out})
df1.W$temps=strptime(df1.W$temps,format="%Y-%m-%d")
dfa=df1.W
rm(df1); rm(df2); rm(numPara); rm(df1.W); rm(d); rm(jour); rm(a); rm(date)

Detail: Values recorded between the 1^st and 7^th day will be considered earned on the 1^st day. The values recorded between the 8^th and the 15^th day will be considered as acquired on the 8^th day, etc. A constraint is added for the last days, as it is fixed that there can be no more than 52 weeks per year. The values recorded beyond the 51^st week (358 days) will then be assigned to the last week. That is to say the days going from 358 to 366.

1 day alignment

Only the date is also useful it is put in POSIXt format by removing the hours, minutes and seconds if present.

date=format(df$temps,"%F")
jours=as.numeric(format(df$temps,"%d"))
df$temps=strptime(format(df$temps, "%F"),format="%Y-%m-%d")

min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 day")
d=format(a,"%F")

df2=data.frame(temps=strptime(d,format="%Y-%m-%d"))
df1=merge(df,df2, all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") 
df.J=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
                FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
                {out=max(x,na.rm=T);}; out})
df.J$temps=strptime(df.J$temps,format="%Y-%m-%d")
dfa=df.J
rm(date); rm(jours); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(d); rm(df.J)

1 hour alignment

The date and time are formatted so that the minutes and seconds are considered zero HH:00:00. In addition, the POSIXt format is imposed again.

date=format(df$temps,"%F")
heures=as.numeric(format(df$temps,"%H"))
df$temps=strptime(format(df$temps, "%F %H:00:00"),format="%Y-%m-%d %H:%M:%S")

min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 hour")

df2=data.frame(temps=strptime(a,format="%Y-%m-%d %H:%M:%S"))
df1=merge(df,df2, all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") # colonnes séléctionnées : 5 à 26

df.H=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out})
df.H$temps=strptime(df.H$temps,format="%Y-%m-%d %H:%M:%S")
dfa=df.H
rm(date); rm(heures); rm(min.date); rm(max.date); rm(a); rm(df2); rm(df1); rm(numPara); rm(df10)

30 minutes alignment

On considère que [hh:00, hh:29[= hh:15 et [hh:30, hh:59[= hh:45.

date=format(df$temps,"%F %H")
minutes=as.numeric(format(df$temps,"%M"))
minute=rep("45",length(minutes));
minute[minutes<30]="15"
d=paste(date,minute,"00",sep=":")
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S")

min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 hour")
date=rep(format(a,"%F %H"),each=2)
minutes=rep(c("15","45"), length(a))
d=paste(date,minutes,"00",sep=":")

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S"))
df1=merge(df,df2, all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") # colonnes séléctionnées : 5 à 26

df30=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out})
df30$temps=strptime(df30$temps,format="%Y-%m-%d %H:%M:%S")
dfa=df30
rm(date); rm(minutes); rm(minute); rm(min.date); rm(max.date); rm(a); rm(d); rm(df2); rm(df1); rm(numPara); rm(df30)

20 minutes alignment

On considère que [hh:00, hh:20[= hh:10, [hh:20, hh:40[= hh:30 et pour [hh:40, hh:59]= hh:50.

date=format(df$temps,"%F %H")
minutes=as.numeric(format(df$temps,"%M"))
minute=rep("10",length(minutes))
minute[minutes>19]="30"
minute[minutes>40]="50"
d=paste(date,minute,"00",sep=":")
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S")

min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 hour")

date=rep(format(a,"%F %H"),each=3)
minutes=rep(c("10","30","50"), length(a))
d=paste(date,minutes,"00",sep=":")

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S"))
df1=merge(df,df2, all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") # colonnes séléctionnées : 5 à 26

df20=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out})

df20$temps=strptime(df20$temps,format="%Y-%m-%d %H:%M:%S")
dfa=df20
rm(date); rm(minutes); rm(minute); rm(min.date); rm(max.date); rm(a); rm(d); rm(df2); rm(df1); rm(numPara); rm(df20)

10 minutes alignment

On considère que [hh:00, hh:10[= hh:05, [hh:10, hh:20[= hh:15, [hh:20, hh:30[= hh:25, [hh:30, hh:40[= hh:35, [hh:40, hh:50]= hh:45 et pour [hh:50, hh:59]= hh:55.

date=format(df$temps,"%F %H")
minutes=as.numeric(format(df$temps,"%M"))
minute=rep("05",length(minutes))
minute[minutes>10]="15"
minute[minutes>20]="25"
minute[minutes>30]="35"
minute[minutes>40]="45"
minute[minutes>50]="55"

d=paste(date,minute,"00",sep=":")
df$temps=strptime(d,format="%Y-%m-%d %H:%M:%S")

min.date=min(df$temps,na.rm=T)
max.date=max(df$temps,na.rm=T)
a=seq(min.date,max.date,by="1 hour")

date=rep(format(a,"%F %H"),each=6)
minutes=rep(c("05", "15","25", "35", "45", "55"), length(a))
d=paste(date,minutes,"00",sep=":")

df2=data.frame(temps=strptime(d,format="%Y-%m-%d %H:%M:%S"))
df1=merge(df,df2, all=TRUE)
classe=sapply(df1, FUN=class)
numPara=which(classe == "numeric") # colonnes séléctionnées : 5 à 26
dfa=aggregate(df1[,numPara],by=list(temps=as.character(df1$temps)),
             FUN=function(x){ out=NA; if(sum(is.na(x))<length(x))
            {out=max(x,na.rm=T);}; out})
dfa$temps=strptime(dfa$temps,format="%Y-%m-%d %H:%M:%S")

Creation of the uHMM compatible file

The uHMM interface requires some additional condition of the alignment. It is necessary to have a column containing only the date and another column containing the time. For this, the creation of 2 vectors is essential, one containing the date and the other containing the time.
"dfa" being your array of aligned data that you want the interface to read. "%F" removes the years, month and day (YYYY-MM-DD) and "%X" removes the hours, minutes and seconds (hh:mm:ss). It remains then to create the file compatible with the interface in the compatible format.

colonneDate=as.character(format(dfa$temps,"%F"))
colonneHeure=as.character(format(dfa$temps,"%X"))
iTemps=grep(pattern="temps",names(dfa))
df.uHMM=data.frame(Dates=colonneDate,Hours=colonneHeure,(dfa[,-iTemps]))
write.table(df.uHMM,file="uHMMformattedData.txt",dec='.',sep="\t", quote=FALSE, row.names=FALSE)