Monday, 22 January 2018

Java performance tutorial – How fast are the Java 8 streams?



We’re migrating our code base to Java 8. We’ll replace everything by functions. Throw out design patterns. Remove object orientation. Right! Let’s go!

 Performance – you will lose on it

In many cases, performance is not critical, and you shouldn’t do any premature optimisation – so you may claim that this argument is not really an argument per se. But I will counter this attitude in this case, saying that the overhead of Stream.forEach() compared to an ordinary for loop is so significant in general that using it by default will just pile up a lot of useless CPU cycles across all of your application. If we’re talking about 10%-20% more CPU consumption just based on the choice of loop style, then we did something fundamentally wrong. Yes – individual loops don’t matter, but the load on the overall system could have been avoided.
the designers of the Java programming language did not extend Java and its JDK to allow functional programming in Java or to turn Java into a hybrid “object-oriented & functional” programming language. The actual motivation for inventing streams for Java was performance or – more precisely – making parallelism more accessible to software developers (see Brian Goetz, State of the Lambda). This goal makes a lot of sense to me, considering the way in which hardware evolves. Our hardware has dozens of cpu cores today and will probably have hundreds some time in the future.
                           In order to effectively utilize the hardware capabilities and thereby achieve state-of-the-art execution performance we must parallelize. After all – what is the point to running a single thread on a multicore platform? At the same time, multithread programming is considered hard and error-prone, and rightly so. Streams, which come in two flavours (as sequential and parallel streams), are designed to hide the complexity of running multiple threads. Parallel streams make it extremely easy to execute bulk operations in parallel – magically, effortlessly, and in a way that is accessible to every Java developer.
So, let’s talk about performance. How fast are the Java 8 streams? A common expectation is that parallel execution of stream operations is faster than sequential execution with only a single thread. Is it true? Do streams improve performance?
In order to answer questions regarding performance we must measure, that is, run a micro-benchmark. Benchmarking is hard and error-prone, too. You need to perform a proper warm-up, watch out for all kinds of distorting effects from optimizations applied by the virtual machine’s JIT compiler (dead code elimination being a notorious one) up to hardware optimizations (such as increasing one core’s cpu frequency if the other cores are idle). In general, benchmark results must be taken with a grain of salt. Every benchmark is an experiment. Its results are context-dependent. Never trust benchmark figures that you haven’t produced yourself in your context on your hardware. This said, let us experiment.
First, access to array elements is very fast. It is an index-based memory access with no overhead whatsoever. In other words, it is a plain down-to-the-metal memory access. Elements in a collection such as ArrayList on the other hand are accessed via an iterator and the iterator inevitably adds overhead. Plus, there is the overhead of boxing and unboxing collection elements whereas int-arrays use plain primitive type ints. Essentially, the measurements for the ArrayList are dominated by the iteration and boxing overhead whereas the figures for the int-array illustrate the advantage of for-loops.
Secondly, had we seriously expected that streams would be faster than plain for-loops? Compilers have 40+ years of experience optimizing loops and the virtual machine’s JIT compiler is especially apt to optimize for-loops over arrays with an equal stride like the one in our benchmark. Streams on the other hand are a very recent addition to Java and the JIT compiler does not (yet) perform any particularly sophisticated optimizations to them.
Thirdly, we must keep in mind that we are not doing much with the sequence elements once we got hold of them. We spend a lot of effort trying to get access to an element and then we don’t do much with it. We just compare two integers, which after JIT compilation is barely more than one assembly instruction. For this reason, our benchmarks illustrate the cost of element access – which need not necessarily be a typical situation. The performance figures change substantially if the functionality applied to each element in the sequence is cpu intensive. You will find that there is no measurable difference any more between for-loop and sequential stream if the functionality is heavily cpu bound.

No comments:

Post a Comment