Replication of parallel computing in R

Introduction

Today I was studying machine learning. Ng mentioned the use of multiple cores in ML, which drew my attention, because in another course, Bioinformatics Algorithms (Part 1), the clump-finding calculation is complex enough that it can take hours to produce the correct result. So I dug a little into parallel computing in R via Google. One of the more interesting results came from R-bloggers, so I repeated Daniel's code on my machine.

parallel(multicore)

This is one of the most popular packages for parallel computing. Since R 2.14.0 it has been included in base R under the name parallel. Daniel's code still uses the old name, so I had to rename every occurrence of multicore to parallel.
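As a minimal sketch of what the package offers, `mclapply` is a drop-in replacement for `lapply` that forks worker processes (so with `mc.cores > 1` it works on Unix-like systems only, not on Windows):

```r
library(parallel)

## Square each element on 2 forked workers; same call shape as lapply.
## (On Windows, forking is unavailable, so use mc.cores = 1 there.)
squares <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(squares)  # 1 4 9 16
```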

snow

The snow (Simple Network of Workstations) package by Tierney et al. can use PVM, MPI, NWS as well as direct networking sockets. It provides an abstraction layer by hiding the communications details. The snowFT package provides fault-tolerance extensions to snow.
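A minimal socket-cluster sketch with snow, assuming two local workers: `parLapply` keeps the familiar `lapply` interface while the cluster object handles the communication.

```r
library(snow)

cl <- makeCluster(2, type = "SOCK")          # two local socket workers
res <- parLapply(cl, 1:4, function(i) i^2)   # lapply-like interface
stopCluster(cl)                              # always release the workers
unlist(res)  # 1 4 9 16
```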

snowfall

The snowfall package by Knaus provides a more recent alternative to snow. Functions can be used in sequential or parallel mode.
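A minimal snowfall sketch: the same `sfLapply` call runs sequentially or in parallel depending only on the `sfInit` arguments, which is the package's main convenience.

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 2, type = "SOCK")  # parallel = FALSE runs sequentially
res <- sfLapply(1:4, function(i) i^2)
sfStop()
unlist(res)  # 1 4 9 16
```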

lapply (function)

lapply is a commonly used looping function in base R. We'll use it as the sequential baseline in the benchmark.
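For reference, the baseline looks like this: `lapply` applies a function over a list (or vector) and returns a list of the same length.

```r
## Sequential baseline: apply a function element-wise, get a list back.
res <- lapply(1:4, function(i) i^2)
unlist(res)  # 1 4 9 16
```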

Code

Preparation for Packages

# Load each package, installing it first if it is missing.
# (parallel ships with base R since 2.14.0, so it never actually needs installing.)
for (pkg in c("rbenchmark", "parallel", "snow", "snowfall", "compare")) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    require(pkg, character.only = TRUE)
  }
}

Create Benchmark Function

  • r sets the range of group ids sampled (the benchmark's replication count itself is fixed at 10 inside the function).
  • n is the number of rows generated for each calculation.
  • v stands for verification; set it to TRUE to check that all four methods return identical results.
res <- function(n = 1e3, r = 100, v = FALSE) {
  set.seed(51)
  process <- data.frame(id = sample(r, n, rep = TRUE), x = rnorm(n), y = runif(n),
                        z = rpois(n, 1), w = pexp(2, rate = 1/3))
  all <- parallel::detectCores(all.tests = TRUE)

  result <- benchmark(
    replications = 10, order = "user.self",
    lapply = {
      cat('---> distributing lapply...\n')
      lapply_ <- data.frame(lapply(split(process[-1], process[1]), colMeans))
    },
    parallel = {
      cat('---> distributing mclapply...\n')
      parallel_ <- data.frame(mclapply(split(process[-1], process[1]), colMeans, mc.cores = all))
    },
    snow = {
      cat('---> distributing parLapply...\n')
      cl <- makeCluster(all, type = "SOCK")
      snow_ <- data.frame(parLapply(cl, split(process[-1], process[1]), colMeans))
      stopCluster(cl)
    },
    snowfall = {
      cat('---> distributing sfLapply...\n')
      sfInit(parallel = TRUE, cpus = all, type = "SOCK")
      snowfall_ <- data.frame(sfLapply(split(process[-1], process[1]), colMeans))
      sfStop()
    }
  )
  if (v) message(all.equal(lapply_, parallel_), all.equal(lapply_, snow_), all.equal(lapply_, snowfall_))
  result$trial <- as.factor(n)
  result
}

Test Benchmark

  • Here we use only 10 replications, due to the limited time available.
res1 = res(1e3, 10)
res2 = res(1e4, 10)
res3 = res(1e5, 10)
res4 = res(1e6, 10)
res5 = res(1e7, 10)

top1 <- rbind(res1,res2)
top2 <- rbind(res2,res3)
top3 <- rbind(res3,res4)
top4 <- rbind(res4,res5)
top5 <- rbind(res1,res2,res3,res4,res5)

Plot the results

library(ggplot2)
library(gridExtra)
### Full Plot
### Notice that a plot with log scale on y would be better than as is.
b <- theme(panel.background = element_blank(), panel.grid.minor = element_blank(),
           panel.grid.major = element_blank(), plot.background = element_blank())
p <- ggplot(top5, aes(x = trial, y = user.self, group = test, colour = test)) +
  geom_line(size = 3, alpha = I(0.7)) + xlab('Vector Size') + ylab('Time in seconds') + b


p1 <- ggplot(top1, aes(x = trial, y = user.self, group = test, colour = test)) +
  geom_line(size = 3, alpha = I(0.7)) + ggtitle('From 1e3 to 1e4 rows') +
  xlab('Vector Size') + ylab('Time in seconds') + b

p2 <- ggplot(top2, aes(x = trial, y = user.self, group = test, colour = test)) +
  geom_line(size = 3, alpha = I(0.7)) + ggtitle('From 1e4 to 1e5 rows') +
  xlab('Vector Size') + ylab('Time in seconds') + b

p3 <- ggplot(top3, aes(x = trial, y = user.self, group = test, colour = test)) +
  geom_line(size = 3, alpha = I(0.7)) + ggtitle('From 1e5 to 1e6 rows') +
  xlab('Vector Size') + ylab('Time in seconds') + b

p4 <- ggplot(top4, aes(x = trial, y = user.self, group = test, colour = test)) +
  geom_line(size = 3, alpha = I(0.7)) + ggtitle('From 1e6 to 1e7 rows') +
  xlab('Vector Size') + ylab('Time in seconds') + b

png('/Volumes/RamDisk/replication_parallel_R_1.png', width = 800, height = 600, bg = 'transparent')
print(p)  # inside a script, ggplot objects must be printed explicitly
dev.off()

png('/Volumes/RamDisk/replication_parallel_R_2.png', width = 800, height = 600, bg = 'transparent')
grid.arrange(arrangeGrob(p1, p2, p3, p4, widths = unit.c(unit(0.5, "npc"), unit(0.5, "npc")), heights = unit.c(unit(0.5, "npc"), unit(0.5, "npc")), nrow = 2))
dev.off()

Conclusion

According to these results, we should stick with a default function such as lapply until the loop count grows beyond roughly a hundred thousand. As the data size increases, the fixed overhead of parallel computing is amortized, so it becomes comparatively cheaper.