## Monday, August 29, 2011

### Machine Learning Ex2 - Benchmarks

In my previous post, I implemented linear regression via gradient descent in Scala in two different ways: with standard built-in mathematical methods, and with Scalala, a Scala linear algebra library.
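For readers who have not seen the previous post, the core of the iterative version can be sketched roughly as follows. This is a minimal illustration of batch gradient descent for a single feature, not the actual code from that post; `alpha = 0.01` and 1500 steps are the exercise's defaults, and the names are illustrative.

```scala
// A minimal sketch of batch gradient descent for single-feature linear
// regression: t0/t1 are the intercept and slope of the fitted line.
def gradientDescent(x: Seq[Double], y: Seq[Double],
                    alpha: Double = 0.01, steps: Int = 1500): (Double, Double) = {
  val m = x.size
  var (t0, t1) = (0.0, 0.0)
  for (_ <- 1 to steps) {
    // Prediction error for each training example under the current theta.
    val errors = x.zip(y).map { case (xi, yi) => (t0 + t1 * xi) - yi }
    val g0 = errors.sum / m
    val g1 = errors.zip(x).map { case (e, xi) => e * xi }.sum / m
    t0 -= alpha * g0
    t1 -= alpha * g1
  }
  (t0, t1)
}
```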

Shortly after writing the solution I started wondering whether using Scalala had any impact on the solution's runtime. While Scalala has the overhead of object creation, it also makes heavy use of specialized classes, which should provide a considerable improvement.

I decided to do some naive benchmarking. These benchmarks are nowhere near scientific, but they should provide a general sense of each solution's runtime. Since I was benchmarking the two Scala solutions, I decided to also look at the MATLAB/Octave and R solutions.

All benchmarking was done on the same 64-bit laptop. The time for loading the data from a file/connection was not included in the benchmarks. Averages are based on 100 iterations of each solution. The value of each runtime is placed into a vector/list called runtimes, which is then used to determine the average.
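The harness for collecting those runtimes can be sketched as follows. This is an assumption about the setup, not the actual benchmarking code (which is linked in an update at the end of the post):

```scala
// Measure the wall-clock time of a block of code, in seconds.
def time[A](body: => A): Double = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1e9
}

// 100 runs, mirroring the `runtimes` vector used throughout the post.
// The body here is a placeholder for the actual solution under test.
val runtimes: Vector[Double] = Vector.fill(100)(time { /* run the solution */ })
```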
`````` $ octave --version
GNU Octave, version 3.4.0

$ R --version
R version 2.13.1 (2011-07-08)

$ scala -version
Scala code runner version 2.9.0.1 -- Copyright 2002-2011, LAMP/EPFL
``````
First, the original MATLAB solution provided by Andrew Ng. Slight modifications were made to wrap the original code as a function. Octave was used as the interpreter for these tests.
`````` > mean(runtimes)
ans = 0.033590
> min(runtimes)
ans = 0.032610
> max(runtimes)
ans = 0.038699
``````
The MATLAB solution runs fairly consistently with little variation (the standard deviation is 9.5084e-04).

Onto Alexandre Martins' solution in R:
`````` > summary(runtimes)
Min. 1st Qu. Median  Mean 3rd Qu.  Max.
0.03900 0.04100 0.04200 0.04254 0.04325 0.05200
``````
R provides the very useful summary function, which gives basic statistical information about a vector. Both R and MATLAB are interpreted languages, but in this case the R solution is slower than the MATLAB one: the minimum runtime in R is larger than the maximum in MATLAB.
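Scala has no built-in counterpart to R's summary, but a rough analogue is easy to write. This is a sketch, not from the original post; it follows R's default (type-7) quantile interpolation:

```scala
// Rough analogue of R's summary(): min, quartiles, median, mean, max.
def summary(s: Seq[Double]): Map[String, Double] = {
  val sorted = s.sorted
  // Linear interpolation between order statistics (R's default method).
  def quantile(q: Double): Double = {
    val idx = q * (sorted.size - 1)
    val lo = sorted(idx.toInt)
    val hi = sorted(math.ceil(idx).toInt)
    lo + (hi - lo) * (idx - idx.toInt)
  }
  Map("Min." -> sorted.head, "1st Qu." -> quantile(0.25),
      "Median" -> quantile(0.5), "Mean" -> s.sum / s.size,
      "3rd Qu." -> quantile(0.75), "Max." -> sorted.last)
}
```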

Next is Scala. We will look first at the iterative solution. For both solutions we will use a simple method to calculate the mean:
`````` def mean[T](s: Seq[T])(implicit n: Fractional[T]) = n.div(s.sum, n.fromInt(s.size))
``````
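Since the standard deviation also comes up in these comparisons, a companion helper can compute it. This is a sketch, not from the original post; it uses the sample standard deviation, matching what Octave's std reports, and a non-generic mean for brevity:

```scala
def mean(s: Seq[Double]): Double = s.sum / s.size

// Sample standard deviation (divides by n - 1, like Octave's std).
def stddev(s: Seq[Double]): Double = {
  val m = mean(s)
  math.sqrt(s.map(x => math.pow(x - m, 2)).sum / (s.size - 1))
}
```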
`````` scala> mean(runtimes)
res: Double = 0.02949999999999998

scala> runtimes.min
res: Double = 0.025

scala> runtimes.max
res: Double = 0.159
``````
The average runtime is faster than even MATLAB's, but the max runtime is quite large. Setting aside the standard deviation for now, the runtimes for the Scala solution follow a simple pattern: the first pass is the slowest by far, followed by values that gradually settle toward the average:
`````` scala> runtimes.head
res: Double = 0.159

scala> runtimes.tail.max
res: Double = 0.075

scala> runtimes.take(10)
res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.159, 0.075, 0.053, 0.052, 0.033, 0.03, 0.035, 0.031, 0.027, 0.027)

scala> mean(runtimes.take(10))
res: Double = 0.0558
``````
The average runtime for the first 10 iterations is almost double the average over all 100. The HotSpot VM is supposed to optimize code over time, and this pattern could be a result of such optimization. Or perhaps there is some caching going on in Scala?
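One way to quantify this warm-up effect (not something the original post does; the cut-off of 10 runs is arbitrary) is to compare the mean with and without the first few measurements:

```scala
def mean(s: Seq[Double]): Double = s.sum / s.size

// Compare the overall mean with the mean after discarding the first
// `warmup` runs, separating JIT warm-up cost from steady-state cost.
def warmVsSteady(runtimes: Seq[Double], warmup: Int = 10): (Double, Double) =
  (mean(runtimes), mean(runtimes.drop(warmup)))
```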

Let us see if the Scalala solution has the same issues:
`````` scala> mean(runtimes)
res: Double = 0.0223

scala> runtimes.min
res: Double = 0.014

scala> runtimes.max
res: Double = 0.424
``````
The average runtime using Scalala is even better, but once again the max runtime is many times the average. Looking at the values, the Scalala solution follows the same pattern:
`````` scala> runtimes.head
res: Double = 0.424

scala> runtimes.tail.max
res: Double = 0.084

scala> runtimes.take(10)
res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.424, 0.084, 0.072, 0.052, 0.044, 0.032, 0.033, 0.043, 0.017, 0.016)

scala> mean(runtimes.take(10))
res: Double = 0.08170000000000002
``````
Even after the initial performance hit, the runtimes do not settle near the minimum until the 9th iteration.

Where do most of the inefficiencies lie? What would happen if we increased the number of steps used to find the local minimum from 1500 to 15000?

MATLAB
`````` > mean(runtimes)
ans = 0.33541
> min(runtimes)
ans = 0.32408
> max(runtimes)
ans = 0.39594
``````
R
`````` > summary(runtimes)
Min. 1st Qu. Median  Mean 3rd Qu.  Max.
0.4250 0.4450 0.4525 0.4563 0.4603 0.5310
``````
Increasing the number of iterations tenfold increases the average runtime by almost the same factor (once again, these benchmarks are very unscientific).

Scala iterative solution
`````` scala> mean(runtimes)
res: Double = 0.27539999999999987

scala> runtimes.min
res: Double = 0.264

scala> runtimes.max
res: Double = 0.564

scala> runtimes.head
res: Double = 0.564

scala> runtimes.tail.max
res: Double = 0.321

scala> runtimes.take(10)
res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.564, 0.29, 0.267, 0.268, 0.268, 0.276, 0.267, 0.269, 0.268, 0.27)

scala> mean(runtimes.take(10))
res: Double = 0.3007
``````
Scalala solution
`````` scala> mean(runtimes)
res: Double = 0.16492

scala> runtimes.min
res: Double = 0.151

scala> runtimes.max
res: Double = 0.811

scala> runtimes.head
res: Double = 0.811

scala> runtimes.tail.max
res: Double = 0.188

scala> runtimes.take(10)
res: scala.collection.immutable.IndexedSeq[Double] = Vector(0.811, 0.159, 0.161, 0.155, 0.159, 0.157, 0.156, 0.157, 0.16, 0.155)

scala> mean(runtimes.take(10))
res: Double = 0.223
``````
The Scala solutions still provide a better average runtime than R or MATLAB over the course of 10+ iterations, but the first pass always has terrible performance.

What do all these numbers ultimately mean? Very little. Nowadays, developer productivity is one of the most important factors to pay attention to. If a developer is more productive creating a solution in R, then perhaps a performance penalty is not as important. With the Scala solutions, using a linear algebra library helped simplify the code and kept the focus on the mathematical concepts involved. Having a better runtime is only an added benefit.

UPDATE: benchmarking code https://gist.github.com/1201581

UPDATE 2:
The folks over at MathWorks were kind enough to do some benchmarking of MATLAB against Octave. All tests were done on their system, so comparing those numbers with the R/Scala results would not be accurate. Then again, these benchmarks were never comprehensive in the first place.

Octave (3.4.0)
`````` Mean = 0.029570
Min =  0.029214
Max  =  0.030383
``````

MATLAB (R2011b)
`````` Mean = 0.0126
Min =  0.0125
Max =  0.0127
``````

Impressive improvement for this small test.

1. The first several passes on the JVM will always be slow. IMHO the JVM HotSpot/JIT is optimized for the scenario where the same code runs for a long time.

2. So if you don't want to benchmark the startup time, let your algorithm run for 100 iterations (or even more, I don't know exactly) and then start your benchmark.

3. If the JVM JIT is similar to the CLR JIT, then you will always pay the bytecode-to-native compilation and optimization cost the first time a routine is executed. The fully optimized code is then used for all further iterations.

4. Anonymous,

I suspected that the HotSpot optimization was kicking in, but I have never profiled code at such a micro level, so I had never seen the optimization in action. Interesting stuff. The JVM is one part of why Scala has become my go-to language.

However, ignoring the startup time might result in an apples-to-oranges comparison. If code is only meant to be executed once, then startup time matters in side-by-side evaluations. It would be interesting to see if the code can be bootstrapped by using a small set of fake data and a low number of iterations.
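That bootstrapping idea might look something like this sketch, where `solve` is a stand-in for the actual gradient descent routine and the warm-up sizes are arbitrary:

```scala
// Sketch of the bootstrapping idea: run the routine on a tiny fake
// dataset first to trigger JIT compilation, then time the real run.
def bootstrappedTime(solve: (Seq[Double], Seq[Double], Int) => Any,
                     x: Seq[Double], y: Seq[Double]): Double = {
  val fakeX = Seq(1.0, 2.0, 3.0)
  val fakeY = Seq(2.0, 4.0, 6.0)
  (1 to 20).foreach(_ => solve(fakeX, fakeY, 10)) // cheap warm-up passes
  val start = System.nanoTime()
  solve(x, y, 1500) // the measured run
  (System.nanoTime() - start) / 1e9
}
```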

5. Because profiling in MATLAB is so easy, it might be interesting to profile the routine and see where the hot spots are, at least in the MATLAB version. If it's all linear algebra kernels (* or inv or whatever), then that would explain why the floor is where it is.

>> profile on; run_the_code; profile report

is all you need.

6. What are you going to do in that 0.01 second? :)

7. I was wondering why these comments were appearing days after posting the article. Google Analytics tells me Hacker News is linking to me. Welcome!

I encourage everyone to read the original post. The goal is to learn the math involved in the algorithms, not simply to create the fastest version. I also encourage everyone to create your own implementation in numpy/scipy/Ruby/Clojure/C++/whatever and share it with the world.

8. Are you using MATLAB or Octave? At the top you show the Octave version, but then refer to MATLAB throughout the rest of the post. Octave and MATLAB are not the same thing -- MATLAB is likely to be much more optimized than Octave. Please clarify.

9. a pet peeve but ... if you want to compare things ... a graph or table putting the data side by side does wonders for readability and flow of arguments.

my 2 cents.

-- Anonymous Coward

10. RJN,
Octave was used in this example. I wish whoever posted the story to Hacker News had not used the title he/she did, since the intent was never to produce an absolute benchmark. I do not have MATLAB, so I cannot benchmark it in any way. Perhaps there is an EC2 instance with MATLAB already installed?

Anon 3:13,
Graphs or tables would only have reinforced the notion that the benchmark was significant. I do not think the resulting numbers can be used to build a definitive benchmark. However, I did find the results of the Scala solutions interesting, since they highlighted the optimization process of the JVM, and I did want to plot those values.

11. (I'm the first Anonymous)

If you have code that has to be benchmarked, then this code will probably run for a long time. And then the warm-up time of the JVM isn't interesting any more.