Monday, August 29, 2011

Machine Learning Ex2 - Benchmarks

In my previous post, I implemented linear regression with gradient descent in Scala in two different ways: with standard built-in mathematical operations and with Scalala, a Scala linear algebra library.

Shortly after writing the solution I started wondering whether using Scalala had any impact on the runtime cost of the solution. While Scalala does have the overhead of object creation, it also makes heavy use of specialized classes, which should provide a considerable improvement.

I decided to do some naive benchmarking. These benchmarks are nowhere near scientific, but should provide a general sense of the solution's runtime. Since I was benchmarking the two Scala solutions, I decided to also look at the MATLAB/Octave and R solutions.


All benchmarking was done on the same 64-bit laptop. The time for loading the data from a file/connection was not included in the benchmarks. Averages are based on 100 iterations of each solution. The value of each runtime is placed into a vector/list called runtimes, which is then used to determine the average.
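For the Scala runs, the runtimes vector was filled in roughly this fashion (a simplified sketch; runSolution is a hypothetical stand-in for the actual gradient-descent call, and the real benchmarking code is linked in the update at the end of the post):
 object Benchmark {
   // Hypothetical stand-in for the actual gradient-descent solution
   def runSolution(): Unit = { /* gradient descent would run here */ }

   // Time a single run of a block, in seconds
   def time(block: => Unit): Double = {
     val start = System.nanoTime
     block
     (System.nanoTime - start) / 1e9
   }

   def main(args: Array[String]) {
     // 100 iterations; each runtime goes into the runtimes vector
     val runtimes = Vector.fill(100)(time(runSolution()))
     println("mean = " + (runtimes.sum / runtimes.size))
   }
 }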
 $ octave --version  
 GNU Octave, version 3.4.0  
   
 $ R --version  
 R version 2.13.1 (2011-07-08)  
   
 $ scala -version  
 Scala code runner version 2.9.0.1 -- Copyright 2002-2011, LAMP/EPFL  
First, the original MATLAB solution provided by Andrew Ng. Slight modifications were made to wrap the original code as a function. Octave was used as the interpreter for these tests.
 > mean(runtimes)  
 ans = 0.033590  
 > min(runtimes)  
 ans = 0.032610  
 > max(runtimes)  
 ans = 0.038699  
The MATLAB solution runs fairly consistently with little variation (the standard deviation is 9.5084e-04).

Onto Alexandre Martins' solution in R:
 > summary(runtimes)  
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.   
 0.03900 0.04100 0.04200 0.04254 0.04325 0.05200  
R provides the very useful summary function, which gives basic statistical information about a vector. Both R and MATLAB are interpreted languages, but in this case the R solution is slower: its minimum runtime is larger than MATLAB's maximum.
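There is no direct equivalent of summary in the Scala standard library, but since the Scala timings below rely on the same kind of statistics, a rough version for a Seq[Double] is easy to sketch (a hypothetical helper, not part of the original solutions, and without R's quantile interpolation):
 // Rough equivalent of R's summary() for a Seq[Double] (a sketch, no interpolation)
 def summary(xs: Seq[Double]): String = {
   val s = xs.sorted
   def q(p: Double) = s(((s.size - 1) * p).round.toInt)
   "Min %.4f  1st Qu. %.4f  Median %.4f  Mean %.4f  3rd Qu. %.4f  Max %.4f".format(
     s.head, q(0.25), q(0.5), s.sum / s.size, q(0.75), s.last)
 }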

Next is Scala. We will look first at the iterative solution. For both Scala solutions we will use a simple method to calculate the mean:
 def mean[T](s: Seq[T])(implicit n: Fractional[T]) = n.div(s.sum, n.fromInt(s.size))  
 scala> mean(runtimes)  
 res: Double = 0.02949999999999998  
   
 scala> runtimes.min  
 res: Double = 0.025  
   
 scala> runtimes.max  
 res: Double = 0.159  
The average runtime is faster even than MATLAB's; however, the max runtime is quite large. Setting the standard deviation aside for now, the runtimes for the Scala solution follow a simple pattern: the first pass is the maximum, followed by values that gradually settle toward the overall average:
 scala> runtimes.head  
 res: Double = 0.159  
   
 scala> runtimes.tail.max  
 res: Double = 0.075  
   
 scala> runtimes.take(10)  
 res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.159, 0.075, 0.053, 0.052, 0.033, 0.03, 0.035, 0.031, 0.027, 0.027)  
   
 scala> mean(runtimes.take(10))  
 res: Double = 0.0558
The average runtime for the first 10 iterations is almost double the average over all 100 runtimes. The HotSpot VM is supposed to optimize code over time, and this pattern could be a result of such optimization. Or perhaps there is some caching going on in Scala?
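One way to separate this warm-up effect from the steady state is simply to drop the first few measurements before averaging, for example (a sketch; runtimes and mean are the vector and helper defined above, and 10 warm-up runs is an arbitrary choice):
 // Mean over all runs vs. mean after dropping the first 10 warm-up runs
 val warmUp = 10
 println("all runs:      " + mean(runtimes))
 println("after warm-up: " + mean(runtimes.drop(warmUp)))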

Let us see if the Scalala solution has the same issues:
 scala> mean(runtimes)  
 res: Double = 0.0223  
   
 scala> runtimes.min  
 res: Double = 0.014  
   
 scala> runtimes.max  
 res: Double = 0.424  
The average runtime using Scalala is even better, but once again the max runtime dwarfs the average. Looking at the values, the Scalala solution follows the same pattern:
 scala> runtimes.head  
 res: Double = 0.424  
   
 scala> runtimes.tail.max  
 res: Double = 0.084  
   
 scala> runtimes.take(10)  
 res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.424, 0.084, 0.072, 0.052, 0.044, 0.032, 0.033, 0.043, 0.017, 0.016)  
   
 scala> mean(runtimes.take(10))  
 res: Double = 0.08170000000000002  
Even after the initial performance hit, the runtimes do not settle near the average until the 9th iteration.

Where do most of the inefficiencies lie? What would happen if we increased the number of steps used to find the local minimum from 1500 to 15000?
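For context, the number of steps is just the loop bound of the gradient-descent update. A minimal sketch of an iterative version for single-variable linear regression (plain arrays and hypothetical names, not the exact code from the previous post) looks like this:
 // Minimal batch gradient descent (a sketch); iters was 1500, now 15000
 def gradientDescent(x: Array[Double], y: Array[Double],
                     alpha: Double, iters: Int): (Double, Double) = {
   val m = x.length
   var theta0, theta1 = 0.0
   for (_ <- 1 to iters) {
     // errors of the current hypothesis h(x) = theta0 + theta1 * x
     val errors = (0 until m).map(i => theta0 + theta1 * x(i) - y(i))
     val grad0  = errors.sum / m
     val grad1  = (0 until m).map(i => errors(i) * x(i)).sum / m
     theta0 -= alpha * grad0   // simultaneous update: gradients use the old thetas
     theta1 -= alpha * grad1
   }
   (theta0, theta1)
 }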

MATLAB
 > mean(runtimes)  
 ans = 0.33541  
 > min(runtimes)  
 ans = 0.32408  
 > max(runtimes)  
 ans = 0.39594  
R
 > summary(runtimes)  
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.   
  0.4250 0.4450 0.4525 0.4563 0.4603 0.5310  
Increasing the number of iterations tenfold increases the average runtime by roughly the same factor (once again, these benchmarks are very unscientific).

Scala iterative solution
 scala> mean(runtimes)  
 res: Double = 0.27539999999999987  
   
 scala> runtimes.min  
 res: Double = 0.264  
   
 scala> runtimes.max  
 res: Double = 0.564  
   
 scala> runtimes.head  
 res: Double = 0.564  
   
 scala> runtimes.tail.max  
 res: Double = 0.321  
   
 scala> runtimes.take(10)  
 res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.564, 0.29, 0.267, 0.268, 0.268, 0.276, 0.267, 0.269, 0.268, 0.27)  
   
 scala> mean(runtimes.take(10))  
 res: Double = 0.3007  
Scalala solution
 scala> mean(runtimes)  
 res: Double = 0.16492  
   
 scala> runtimes.min  
 res: Double = 0.151  
   
 scala> runtimes.max  
 res: Double = 0.811  
   
 scala> runtimes.head  
 res: Double = 0.811  
   
 scala> runtimes.tail.max  
 res: Double = 0.188  
   
 scala> runtimes.take(10)  
 res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.811, 0.159, 0.161, 0.155, 0.159, 0.157, 0.156, 0.157, 0.16, 0.155)  
   
 scala> mean(runtimes.take(10))  
 res: Double = 0.223  
The Scala solutions still provide a better average runtime than R or MATLAB over the course of 10+ iterations, but the first pass always has terrible performance.

What do all these numbers ultimately mean? Very little. Nowadays, developer productivity is one of the most important factors to pay attention to. If a developer is more productive creating a solution in R, then perhaps a performance penalty is not as important. With the Scala solutions, using a linear algebra library helped simplify the code and keep the focus on the mathematical concepts involved. Having a better runtime is only an added benefit.

UPDATE: benchmarking code https://gist.github.com/1201581

UPDATE 2:
The folks over at MathWorks were kind enough to do some benchmarking of MATLAB versus Octave. All tests were done on their system, so comparing those numbers with the R/Scala results will not be accurate. Then again, these benchmarks were never comprehensive in the first place.

Octave (3.4.0)
 Mean = 0.029570
 Min =  0.029214
 Max  =  0.030383

MATLAB (R2011b)
 Mean = 0.0126
 Min =  0.0125
 Max =  0.0127

Impressive improvement for this small test.

11 comments:

  1. The first several passes on the JVM will always be slow. IMHO the JVM HotSpot/JIT etc. is optimized for a scenario where the same code runs for a long time.

    So if you don't want to benchmark the startup time, then let your algorithm run for 100 iterations (or even more, I don't know exactly) and then start your benchmark.

  2. If the JVM JIT is similar to the CLR JIT, then you will always pay the bytecode-to-native compilation and optimization cost the first time a routine is executed. The fully optimized code is then used for all further iterations.

  3. Anonymous,

    I suspected that the HotSpot optimization was kicking in, but I have never profiled code at such a micro level, so I had never seen the optimization in action. Interesting stuff. The JVM is one part of why Scala has become my go-to language.

    However, ignoring the startup time might result in an apples-to-oranges comparison. If the code is only meant to be executed once, then startup time matters in side-by-side evaluations. It would be interesting to see if the code can be warmed up by first running it on a small set of fake data with a low number of iterations.

  4. Because profiling in MATLAB is so easy, it might be interesting to profile the routine and see where the hot spots are in the MATLAB version at least. If it's all linear algebra kernels (* or inv or whatever) then that would explain why the floor is where it is.

    >> profile on; run_the_code; profile report

    is all you need.

  5. What are you going to do in that 0.01 second? :)

  6. I was wondering why these comments were appearing days after posting the article. Google Analytics tells me Hacker News is linking to me. Welcome!

    I encourage everyone to read the original post. The goal is to learn the math involved in the algorithms, not to simply create the fastest version. I also encourage everyone to create their own implementation in numpy/scipy/Ruby/Clojure/C++/whatever and share it with the world.

  7. Are you using MATLAB or Octave? At the top you show the Octave version, but then refer to MATLAB throughout the rest of the post. Octave and MATLAB are not the same thing -- MATLAB is likely to be much more optimized than Octave. Please clarify.

  8. a pet peeve but ... if you want to compare things ... a graph or table putting the data side by side does wonders for readability and flow of arguments.

    my 2 cents.

    -- Anonymous Coward

  9. RJN,
    Octave was used in this example. I wish whoever posted the story to Hacker News had not used the title he/she did, since the intent was never to provide an absolute benchmark. I do not have MATLAB, so I cannot benchmark it in any way. Perhaps there is an EC2 instance with MATLAB already installed?

    Anon 3:13,
    Graphs or tables would have only helped the notion that the benchmark was significant. I do not think the resulting numbers can be used to create a definitive benchmark. However, I did find the results of the Scala solutions to be interesting since they highlighted the optimization process of the JVM, and I did want to plot those values.

  10. (I'm the first Anonymous)

    If you have code which has to be benchmarked, then this code will probably run for a long time, and then the warm-up time of the JVM isn't interesting any more.
