If your script has performance issues due to the amount of data you are processing, there are some things you can do to decrease its execution time.
The most important thing is to avoid loops.
Why can a loop be a problem?
- accessing single values of a data frame is time-consuming
- a function call in every iteration costs more than a single function call on many values
- extending vectors in every iteration requires reallocation of memory (see the sketch after this list)
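The last point is easy to demonstrate yourself. The following sketch (with a hypothetical workload of squaring numbers) compares growing a vector element by element with writing into a preallocated one:

```r
n <- 100000

# grows the result by one element per iteration:
# every c() call allocates a new, longer vector and copies the old one
grow <- function(n) {
  out <- c()
  for (i in 1:n) {
    out <- c(out, i^2)
  }
  out
}

# allocates the full vector once and fills it in place
prealloc <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    out[i] <- i^2
  }
  out
}

system.time(grow(n))      # much slower
system.time(prealloc(n))  # much faster
```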
The solution is to use vectorized function calls. The packages dplyr and tidyr are specially designed to make this comfortable.
```r
# do not forget to load the necessary packages
library(tidyr)
library(dplyr)
```
There is no recipe for converting an arbitrary loop into a "piped" command, but the following examples should help to get you started.
Example 1
Let's start with an easy loop; it was written as a loop because "if" needs a logical of length 1 as an argument.
```r
for(i in 1:length(x)){
  if(x[i] > 1){
    x[i] <- x[i] + 1
  } else {
    x[i] <- x[i] + 2
  }
}
```
Instead, you can use the "ifelse" command, which takes a vector of logicals.
```r
x <- ifelse(x > 1, x + 1, x + 2)
```
Let's execute both commands 100 times with a vector of length 1,000,000.
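A sketch of how such a comparison could be set up (the test vector x0 and the two wrapper functions are hypothetical; the microbenchmark package is one option for the timing, base R's system.time() works as well):

```r
library(microbenchmark)

x0 <- runif(1e6, min = 0, max = 2)  # hypothetical test vector of length 1,000,000

loop_version <- function(x) {
  for (i in 1:length(x)) {
    if (x[i] > 1) { x[i] <- x[i] + 1 } else { x[i] <- x[i] + 2 }
  }
  x
}

vectorized_version <- function(x) ifelse(x > 1, x + 1, x + 2)

microbenchmark(loop = loop_version(x0),
               vectorized = vectorized_version(x0),
               times = 100)
```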
As you can see, the "ifelse" command is more than twice as fast.
Example 2
The following for loop counts the occurrences of each "Type" in the data frame df for rows where "Criterion" is TRUE.
If a row has "Type" = NA or "Type" = 0, it counts towards the "Type" of the last row with a valid value.
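For concreteness, a small hypothetical data frame with this layout could look like this:

```r
# toy data (hypothetical): "Type" may be NA or 0 and has valid values 1 to 6
df <- data.frame(
  Type      = c(1, 0, NA, 2, 2, 0),
  Criterion = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)
```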
```r
dfnew <- data.frame(Type1 = 0, Type2 = 0, Type3 = 0,
                    Type4 = 0, Type5 = 0, Type6 = 0)
for (row in 1:nrow(df)) {
  if(!is.na(df[row, "Type"]) & df[row, "Type"] != 0){
    lastType <- df[row, "Type"]
  }
  if(df[row, "Criterion"] == TRUE){
    if(lastType == 1){
      dfnew[1, "Type1"] <- dfnew[1, "Type1"] + 1
    } else if(lastType == 2){
      dfnew[1, "Type2"] <- dfnew[1, "Type2"] + 1
    } else if(lastType == 3){
      dfnew[1, "Type3"] <- dfnew[1, "Type3"] + 1
    } else if(lastType == 4){
      dfnew[1, "Type4"] <- dfnew[1, "Type4"] + 1
    } else if(lastType == 5){
      dfnew[1, "Type5"] <- dfnew[1, "Type5"] + 1
    } else if(lastType == 6){
      dfnew[1, "Type6"] <- dfnew[1, "Type6"] + 1
    }
  }
}
```
An alternative could look like this:
```r
df <- df %>%
  mutate(Type = ifelse(Type == 0, NA, Type)) %>%
  fill(Type, .direction = "down") %>%
  filter(Criterion == TRUE) %>%
  filter(!is.na(Type)) %>%
  group_by(Type = factor(Type, levels = c(1, 2, 3, 4, 5, 6)), .drop = FALSE) %>%
  summarize(Count = n())
```
In the mutate call we convert 0s into NAs so that we can use the fill command, which replaces each NA (including the former 0s) with the value of the last non-NA "Type" entry.
Then we keep only the entries with "Criterion == TRUE" and cut off the rows at the top that were not touched by the fill function (and therefore still contain NA values).
By grouping by "Type" as a factor, the levels with a count of 0 do not get dropped in the summarize command (because we chose .drop = FALSE).
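To see the individual steps in action, here is the piped command applied to the toy data frame from above (the per-step comments refer to exactly that input):

```r
library(dplyr)
library(tidyr)

df %>%
  mutate(Type = ifelse(Type == 0, NA, Type)) %>%  # Type becomes 1, NA, NA, 2, 2, NA
  fill(Type, .direction = "down") %>%             # Type becomes 1, 1, 1, 2, 2, 2
  filter(Criterion == TRUE) %>%                   # keeps rows 1, 2, 4 and 6
  filter(!is.na(Type)) %>%
  group_by(Type = factor(Type, levels = 1:6), .drop = FALSE) %>%
  summarize(Count = n())
# Types 1 and 2 each get a Count of 2, Types 3 to 6 keep a Count of 0
```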
Again, we want to see how much faster our alternative command is.
The data frame we tested on had 100,000 entries, and the piped command was more than 70 times faster!
Example 3
In the next loop the vectors "start" and "stop" get extended in every iteration. This is something you should avoid at all costs!
```r
start <- c()
stop <- c()
for(i in 1:nrow(df)) {
  if(df$type[i] == 0){
    start <- c(start, df$timestamp[i])
    stop <- c(stop, df$timestamp[i + 1])
  }
}
```
Alternatively, you can write:
```r
dfnew <- df %>%
  mutate(stop = lead(timestamp)) %>%
  filter(type == 0) %>%
  mutate(start = timestamp)
```
Let's check the performance on a data frame with 1,000,000 entries!
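A sketch of such a check (the test data and the two wrapper functions are hypothetical):

```r
library(dplyr)

n  <- 1e6  # warning: the loop version can take several minutes at this size
df <- data.frame(type      = sample(0:3, n, replace = TRUE),
                 timestamp = seq_len(n))

loop_version <- function(df) {
  start <- c()
  stop  <- c()
  for (i in 1:nrow(df)) {
    if (df$type[i] == 0) {
      start <- c(start, df$timestamp[i])
      stop  <- c(stop, df$timestamp[i + 1])
    }
  }
  data.frame(start = start, stop = stop)
}

piped_version <- function(df) {
  df %>%
    mutate(stop = lead(timestamp)) %>%
    filter(type == 0) %>%
    mutate(start = timestamp)
}

system.time(loop_version(df))
system.time(piped_version(df))
```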
That is why this should be avoided in any case: because the vectors "start" and "stop" change in length, new memory has to be allocated over and over again, which performs poorly, as you can see.
If you want to check the performance yourself, the code to do so is attached at the link below. Note that the exact execution times of course depend on your machine!