Monday, 15 June 2009
Health Center 1.0 released
To celebrate, the whole team baked and brought in carrot cake, brownies, chocolate chip cookies, lemon cake, rocky road squares, and flapjacks. We even had balloons. It was very nice, but I've never eaten so much at work and I think we were all ready to explode by the afternoon - in future we should delegate and have only half the team bake per release.
Tuesday, 12 May 2009
How to interpret a method profile
Once an application has been identified as CPU-bound, either by using the Health Center or by CPU monitoring, the next step is to figure out what is eating the CPU. In a Java application, this will usually be Java code, but it could be native code. Profiling native code usually requires platform-specific tools; on Linux, I use tprof. Profiling Java code is a lot easier, and is more likely to yield big performance improvements, so I usually start with a Java profile and only profile native code if I didn't get anywhere with the Java profile. For Java profiling, I use the Health Center. It has a few advantages: there's no bytecode instrumentation needed, there's no need to restrict profiling to a few packages, and the overhead is very low, so it won't distort the performance characteristics of what you're trying to profile.
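(As an aside, if you want to try this yourself: on the IBM JVMs I've used, the monitoring agent is switched on with a JVM option, along the lines of the command below - yourApp.jar is just a placeholder for your own application. Check the Health Center documentation for the exact option at your JVM level.)

    java -Xhealthcenter -jar yourApp.jar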
So what does a method profile tell you? Simply put, it tells you what your application is spending its time doing. More precisely, it tells you what code your application is spending its time running - it doesn't tell you when your application is waiting on a lock instead of running your code, and it doesn't tell you when the JVM is collecting garbage instead of running your code. Assuming locking and GC aren't the cause of the performance problem (see triaging a performance problem), the method profile will give you the information you need to make your application go faster.
The application is doing too much work, and that's slowing it down. Your aim in performance tuning is to make the application do less work. There are lots of ways to make code more efficient. Sometimes people start performance tuning by code inspection - they read through the code base looking for obvious inefficiencies. I've done this myself lots of times, but it's not a particularly efficient technique. Say I find a method which is pretty carelessly implemented, and I double its speed with a bit of refactoring. Then I triumphantly re-run my application, only to discover nothing's changed. What's going on? The problem is that a big performance improvement on a method which is rarely called isn't going to change much of anything. For example, if I double the speed of a method which uses 0.5% of my CPU time, I've sped my application up by an imperceptible 0.25%. If, on the other hand, I shave 10% of the time of a method which is using 20% of my CPU, my application will go 2% faster. So the first rule of performance tuning is to optimise the methods at the top of the profile and ignore the ones near the bottom.
This is an example from a method profiler, in this case the one in the Health Center. One method is clearly using more CPU than the rest, and so it's coloured red. In this case, 60% of the time the JVM checked what the application was doing, it was executing the FireworkParticle.animate() method. This is what's shown by the left-hand 'Self' column. The 'Tree' column on the right shows how much time the application spent in both the animate() method and its descendants. Some profilers call this column 'Descendants' instead. Usually the Self figures are more useful for optimising an application.

What makes a method appear near the top of a method profile? It's taking up a lot of CPU time, but why? There are two reasons: either the method is being called too often, or the method is doing too much work each time it's called. Sometimes a method really is doing the right amount of work the right number of times, but that's usually only the case after a fair amount of tuning. In their natural state, most programs can - and should - contain inefficiencies. (Remember that premature optimisation is the root of all evil.)
Some profilers can distinguish between a method which is called many times, and one which is called once and then spends a long time executing, but many cannot. The reason is that some profilers operate by tracing - that is, recording every entry and exit of a method. This gives very precise information, but usually carries a fairly heavy performance cost. The IBM JVM can be configured with launch parameters to count or time method executions, but it's only advisable to do this for a restricted subset of methods. The alternative way of collecting profiling information is to sample - that is, to check periodically which method is executing. This is much less expensive, but doesn't give as much detail as a tracing profiler. The Health Center uses method sampling already built into the JVM to allow profiling with extremely low overhead.
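To make the sampling idea concrete, here's a toy sampler built on standard Java APIs. Everything about it - the class name, the 10ms interval, the flat output - is my own invention for illustration, and sampling from Java like this costs far more than the in-JVM sampling the Health Center uses:

    import java.util.HashMap;
    import java.util.Map;

    // A toy sampling profiler: every 10ms, snapshot each thread's stack and
    // count which method was on top. Invented names; real samplers are far
    // cheaper than this.
    public class MiniSampler extends Thread {
        private final Map<String, Integer> samples = new HashMap<String, Integer>();

        public MiniSampler() {
            setDaemon(true); // don't keep the application alive just to profile it
        }

        public void run() {
            try {
                while (true) {
                    for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                        if (stack.length == 0) {
                            continue; // no Java frames, e.g. a native thread
                        }
                        // The top frame is what the thread was executing when sampled
                        String method = stack[0].getClassName() + "." + stack[0].getMethodName();
                        Integer count = samples.get(method);
                        samples.put(method, count == null ? 1 : count + 1);
                    }
                    Thread.sleep(10); // sampling interval: low overhead, coarse detail
                }
            } catch (InterruptedException e) {
                // interrupt() stops the sampler
            }
        }

        // Not synchronised with the sampling loop; good enough for a sketch.
        public Map<String, Integer> getSamples() {
            return new HashMap<String, Integer>(samples);
        }
    }

Start it with new MiniSampler().start() before the workload and print getSamples() afterwards; the biggest counts correspond to the hottest methods.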
Often it will be obvious when inspecting a hot method whether it's being called frequently or is slow to run. Code with loops, particularly nested loops, is probably expensive to run. Code which doesn't seem to do much but which is at the top of a profile is probably being called a lot. This leads neatly to the next steps in optimisation: eliminate loops and do less work inside loops for expensive methods, and call inexpensive methods less frequently.
How do you go about making sure a method is called less? Method profilers which also record stack traces make it pretty easy to work out where to start. For example, this is the output of the Health Center, showing where calls to one of the top methods in the profile have come from:

In this case, 98% of the time the doSomeWork() method was sampled, it was animate() that called it. The other 2% of the time, it was draw(). The next step is to inspect the animate() method and see why it's calling doSomeWork(). Often, at least in the first passes of optimisation, most of the calls to the top method are totally unnecessary and can be trivially eliminated, as in the sketch below.
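To make that concrete, here's a hypothetical sketch reusing the method names from the profile above (the bodies are entirely invented):

    import java.util.List;

    // Hypothetical sketch - the method names match the profile above, but
    // the bodies are invented.
    class Animation {
        private final List<FireworkParticle> particles;

        Animation(List<FireworkParticle> particles) {
            this.particles = particles;
        }

        // Before: doSomeWork() runs once per particle per frame, although
        // its result doesn't depend on the particle at all.
        void animate() {
            for (FireworkParticle p : particles) {
                p.update(doSomeWork()); // loop-invariant call - this is the waste
            }
        }

        // After: the invariant call is hoisted out of the loop and runs
        // once per frame. doSomeWork() drops right down the profile.
        void animateHoisted() {
            double wind = doSomeWork();
            for (FireworkParticle p : particles) {
                p.update(wind);
            }
        }

        private double doSomeWork() {
            // stands in for an expensive, particle-independent calculation
            return Math.sin(System.currentTimeMillis() / 1000.0);
        }
    }

    class FireworkParticle {
        private double x, y;

        void update(double wind) {
            x += wind; // drift with the wind
            y -= 0.1;  // fall under gravity
        }
    }

The before and after versions look almost identical, which is exactly why profiles matter: code inspection rarely spots this sort of thing, but the profile points straight at it.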
Monday, 11 May 2009
How do you solve a performance problem?
This is the methodology I recommend.
All performance problems are caused by a limited resource. Your job as a performance analyst (or developer who's suddenly required to be a performance analyst) is to identify what resource is limited - what's the bottleneck for this application? Often after fixing the first bottleneck, a second bottleneck will become apparent - the process of performance tuning is the process of eliminating bottlenecks, one by one, until the performance is good enough for you and your stakeholders.
Computational resources fall into a few basic categories. Different people count them differently, but I like to think of four types of resource: the CPU, memory, I/O, and locks.
A CPU-bound application can't get enough processor time to complete its work. A memory-bound application needs more memory than is available. An I/O-bound application is trying to do I/O faster than the system can handle. Finally, a lock-bound application is held up by multiple threads contending on the same locks. 'Lock-bound' isn't a terribly common term, but I think it's really important to consider lock contention when analyzing performance problems. As systems become more and more parallel, synchronization is increasingly a limiting factor in their scalability.
So how do you identify which of these resources is the cause of the hold-up? Some heuristics can help as a first step. If the CPU is at or near 100%, the CPU is likely to be the culprit. If the CPU isn't near 100%, it's probably locking or I/O. The rule of thumb becomes a bit muddier when it comes to memory. Sometimes memory problems show up as low CPU usage, because the system is waiting on a heap lock (which is really a locking problem) or on physical memory access - paging, in the worst case (which is technically I/O). However, in a garbage-collected system, excessive memory consumption will often manifest as lots of processing time being spent in the garbage collector (CPU).
Tools can help turn these fairly fuzzy heuristics into a more precise diagnosis. There are lots of tools available, both free and not-free. My favourite tool is IBM's Health Center. This is probably because I'm the technical lead for the Health Center, so I think what it does is pretty sensible and cool. :) The Health Center is free, but it can only monitor IBM JVMs. I do think it's one of the best ways to investigate garbage collection and locking, and to collect method profiles, on an IBM JVM.
The Health Center tries to automate the process of identifying the root cause of a performance problem. The front page shows a dashboard with a bunch of status indicators. If one of them is red or orange, that's a pretty good indicator of where to start tweaking the performance. The one area the Health Center doesn't cover at the moment is I/O. It can identify garbage collection and locking bottlenecks pretty accurately, and the method profiler can identify the root cause of much excessive CPU usage.

So let's say you've used the Health Center (or your tool of choice) and you're seeing some red crosses. What next? In the Health Center, clicking on the link next to a red cross will bring up more information and more detailed recommendations about how to fix the problem. In later posts I'll give a bit more background about how to go about fixing locking issues, memory issues, and CPU issues.
Friday, 8 May 2009
More Health Center features
Ever wondered what the classpath of that strangely behaving application actually is? Or why all the core files are truncated? The environment perspective provides details of the Java version, Java classpath, boot classpath, environment variables, and system properties. This can be really handy for debugging problems, particularly on remote systems or systems where you don't control the configuration. It also shows the ulimit settings, which are a common cause of strange behaviour - truncated core files included - on Linux and UNIX systems. If the Health Center detects a misconfigured application, it will provide recommendations on how to fix the configuration.
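For comparison, here's roughly how you could dig some of this information out by hand with standard Java APIs. The class name is invented, and sun.boot.class.path is a non-standard property, so treat that line as an assumption about your particular JVM:

    import java.util.Map;

    // Prints roughly the same configuration data the environment
    // perspective shows. Invented class name, for illustration only.
    public class EnvDump {
        public static void main(String[] args) {
            System.out.println("Java version:   " + System.getProperty("java.version"));
            System.out.println("Classpath:      " + System.getProperty("java.class.path"));
            // Non-standard property - present on some JVMs, an assumption here
            System.out.println("Boot classpath: " + System.getProperty("sun.boot.class.path"));

            System.out.println("-- Environment variables --");
            for (Map.Entry<String, String> e : System.getenv().entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }

            System.out.println("-- System properties --");
            System.getProperties().list(System.out);
        }
    }

Of course, this needs code changes and a rerun, and it can't see OS-level settings like the ulimit - which is rather the point of having it all in the monitoring tool instead.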

Another new feature is the ability to export and import data. This means one person can collect the data, and if there's a problem, they can send it to someone else for more analysis. For example, if a method is unusually hot in the method profile, a system operator could send the exported data to the developer responsible for that area of the code.

Wednesday, 8 October 2008
Garbage collection flavours
Generational collectors
Generational collectors exploit the observed properties that most objects die young, and that young objects are more likely to reference old objects than the reverse. Together these are known as the weak generational hypothesis. Generational collectors divide the heap up into multiple generations. Young generations can be collected without collecting old generations. These partial collections are quicker than full-heap collections, and are likely to produce a good return of free space relative to the area collected, since most objects die young. The younger generations, at least, tend to employ copying collectors, since copying is very efficient in heaps with high attrition rates.
Objects which survive a given number of collections are tenured and move up to an older generation. So that young objects which are referenced only from older generations are not freed by mistake, a remembered set is maintained of the references from old generations to young generations. A write barrier on reference stores is used to catch changes to these references, as sketched below.
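In Java-flavoured code, the barrier and remembered set fit together roughly like this. All the names are invented; a real VM implements this on raw memory, often with card tables rather than an explicit set:

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of a generational write barrier and remembered set.
    class GenerationalHeap {
        private final Set<HeapObject> rememberedSet = new HashSet<HeapObject>();

        // Every reference store in the application is routed through here.
        void writeField(HeapObject holder, int field, HeapObject target) {
            if (isOld(holder) && isYoung(target)) {
                // An old object now references a young one: remember the
                // holder so a young collection can scan it as an extra root.
                rememberedSet.add(holder);
            }
            holder.fields[field] = target; // the actual store
        }

        private boolean isOld(HeapObject o) {
            return o.generation > 0;
        }

        private boolean isYoung(HeapObject o) {
            return o.generation == 0;
        }
    }

    class HeapObject {
        int generation; // 0 = young; incremented when the object is tenured
        HeapObject[] fields = new HeapObject[8];
    }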
Incremental collection
Incremental collectors allow a single collection to be divided into smaller pieces of work, which limits pause times. Incremental techniques almost always require either a write barrier or a read barrier, to prevent changes to the object graph mid-way through a collection from making the reachability results incorrect.
Concurrent collection
A concurrent collector is one which can execute concurrently with the mutator (application) threads. Even on moderately multi-processor systems, it usually does not make sense to dedicate an entire processor to garbage collection. The typical concurrent collection is therefore perhaps more accurately described as a highly incremental collection: each thread is assigned small units of garbage collection work to do along with its application work, so the collection is finely interleaved with application work. One way of thinking of this garbage collection work is as a 'tax' - threads have to do some garbage collection work in exchange for being able to allocate. As with incremental techniques, concurrent collections need a write or read barrier.
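As a sketch of the 'tax' idea (again, all invented names, and real collectors meter the tax far more carefully than this):

    // Each allocating thread pays for memory by first doing a little
    // collection work, so GC progress is tied to allocation pressure.
    class TaxingAllocator {
        private static final int WORK_UNITS_PER_BYTE = 2; // tax rate (made up)
        private final Collector collector;

        TaxingAllocator(Collector collector) {
            this.collector = collector;
        }

        long allocate(int sizeInBytes) {
            // Pay the tax first: a marking quantum proportional to the request.
            collector.doMarkingWork(sizeInBytes * WORK_UNITS_PER_BYTE);
            return bumpAllocate(sizeInBytes);
        }

        private long bumpAllocate(int sizeInBytes) {
            // stands in for bumping a pointer in the current allocation region
            return 0L;
        }
    }

    class Collector {
        void doMarkingWork(int units) {
            // trace and mark up to 'units' worth of references, then return
        }
    }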
Parallel collection
A parallel collector is one which divides up the collection work so that multiple collector threads collect simultaneously. This is counter-productive on single-processor systems, but increasingly necessary as the number of processors in a system grows: on systems with many processors, a single-threaded collector is unable to keep up with the amount of garbage the processors can produce.
Similar techniques are used to achieve incrementality, concurrency, and parallelism, so many collectors which have one of these properties also have some of the others.
Tuesday, 7 October 2008
Garbage collection algorithms
Garbage collection has been the subject of much academic research. In particular, the volume of new techniques with various claimed properties is high enough that it is tempting to coin the phrase YAGA (Yet Another GC Algorithm). For a pointer to many of these articles, see Richard Jones' garbage collection bibliography.
Despite the large number of variants, most garbage collection algorithms fall into a few simple categories.
Reference counting
Yes, reference counting is a form of garbage collection. It's sometimes seen as an alternative to garbage collection, and sometimes even as a morally superior alternative to garbage collection, but really, it's just a type of garbage collection algorithm.
Reference counting garbage collectors track the number of references to each object. Often the count is maintained in and by the object itself. When a new reference to an object is added, the reference count is incremented. When a reference is removed, the count is decremented. When the count reaches zero, the object is destroyed and the memory released. Reference counting fails when objects reference one another in a cycle. In these cases each object will be seen to be referenced and will never be freed, even if nothing references the cyclic structure.
Reference counting is unable to deal with cycles unless augmented with occasional invocations of another form of garbage collection. Reference counting has a number of other disadvantages, mostly to do with performance and bounding of work, and so it is rarely used in modern garbage collection systems. However, smart pointers in C++, while not traditionally considered a form of garbage collection, do make use of reference counting. Ad hoc garbage collectors introduced to manage object reachability in complex environments also tend to use reference counting since it is easily implemented in a hostile environment.
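As a sketch of the basic mechanism, manual reference counting might look like this in Java - purely illustrative, since Java's own collector makes it redundant, and the names are invented:

    // retain() when a new reference to the object is stored, release()
    // when one is dropped; the object is freed when the count hits zero.
    class RefCounted {
        private int refCount = 1; // the creator holds the first reference

        synchronized void retain() {
            refCount++;
        }

        synchronized void release() {
            if (--refCount == 0) {
                destroy();
            }
        }

        void destroy() {
            // release the underlying memory or resource
        }
    }

If two such objects reference each other, neither release() ever brings the count to zero - which is exactly the cycle problem described above.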
Tracing collectors
Most garbage collectors perform some sort of tracing to determine object liveness. Tracing does not have the same problem with collecting cycles as reference counting does. However, tracing collectors are more sensitive to changes in the object graph and usually require the suspension of the application threads or close monitoring of application activity to ensure correctness.
Mark-sweep
Mark-sweep collectors collect in two phases. In the first phase, all reachable objects are traced, beginning from a set of roots. The roots are the references which are guaranteed reachable, such as references on thread stacks and in static variables. Every object directly or indirectly reachable from a root is marked. Objects which do not end up marked are unreachable and therefore cannot be used again by the application, so they may safely be freed.
In the second phase of a mark-sweep collection, all unmarked objects are added to a free-list. Contiguous areas of free memory are merged on the free list.
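Here's a sketch of both phases over a toy object graph. The names are invented, a real collector works on raw memory rather than Java collections, and the merging of contiguous free areas is left out:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    class MarkSweep {
        // Phase 1: mark every object reachable from the roots.
        static void mark(List<Node> roots) {
            Deque<Node> pending = new ArrayDeque<Node>(roots);
            while (!pending.isEmpty()) {
                Node n = pending.pop();
                if (!n.marked) {
                    n.marked = true;
                    pending.addAll(n.references); // trace outgoing references
                }
            }
        }

        // Phase 2: sweep the whole heap; unmarked objects join the free list.
        static void sweep(List<Node> heap, List<Node> freeList) {
            for (Node n : heap) {
                if (n.marked) {
                    n.marked = false; // reset for the next collection
                } else {
                    freeList.add(n);  // unreachable, so reclaim it
                }
            }
        }
    }

    class Node {
        boolean marked;
        List<Node> references = new ArrayList<Node>();
    }

Note that the sweep visits the whole heap, live or dead - which is exactly the cost that copying collectors, below, avoid.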
Mark-compact collectors
Mark-compact collectors mark live objects using identical techniques to those used by mark-sweep collectors. However, mark-compact collectors do not use a free list. Instead, in the second phase all reachable objects are moved so that they are stored as compactly as possible. New allocation is then performed from the empty area of the heap.
Many collectors hybridise these two approaches, combining frequent sweeping with occasional compaction.
Copying collectors
Copying collectors divide the heap into two areas, a to-space and a from-space. All objects are allocated in the to-space. When the to-space is full, a collection is performed: all reachable objects are copied out of it into the from-space, the roles of the two spaces are swapped, and new allocation carries on in what is now the to-space.
Copying collectors have a number of advantages over marking collectors. Because copying is done in the same phase as tracing, it is not necessary to maintain a space-consuming list of which objects have been marked. Because no sweep is performed, the cost of the collection is proportional only to the amount of live data.
If most objects are unreachable, a copying collection can be very efficient. The corollary is that if most objects remain reachable, or if some very large objects remain reachable, the collection will be woefully inefficient, because a large amount of memory will need to be copied at every collection.
Copying collectors also keep the heap very compact, and this can boost allocation and application performance. However, because a completely empty from-space must be maintained at all times, copying collectors are not space-efficient. Modern collectors tend to estimate an object death-rate and maintain less than half the heap for the from-space accordingly. Copying collectors are also known as semispace collectors.
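To make the mechanics concrete, here's a sketch of the copying phase in the Cheney style, over a toy object graph. The names are invented, and a real collector copies raw memory rather than Java objects:

    import java.util.ArrayList;
    import java.util.List;

    // Cheney-style semispace copying. A forwarding pointer in each copied
    // object keeps shared references consistent.
    class CopyingCollector {
        private final List<Node> toSpace = new ArrayList<Node>();

        void collect(List<Node> roots) {
            // Copy the roots into to-space first.
            for (int i = 0; i < roots.size(); i++) {
                roots.set(i, copy(roots.get(i)));
            }
            // Scan to-space left to right, copying everything each scanned
            // object references. The scan pointer doubles as the work list,
            // so no auxiliary mark stack is needed.
            for (int scan = 0; scan < toSpace.size(); scan++) {
                List<Node> refs = toSpace.get(scan).references;
                for (int i = 0; i < refs.size(); i++) {
                    refs.set(i, copy(refs.get(i)));
                }
            }
            // Everything left in from-space is garbage; the whole space can
            // simply be reused. No sweep is needed.
        }

        private Node copy(Node obj) {
            if (obj == null) {
                return null;
            }
            if (obj.forwarded != null) {
                return obj.forwarded; // already copied: follow the forwarding pointer
            }
            Node clone = new Node(obj); // 'allocate' the copy in to-space
            obj.forwarded = clone;
            toSpace.add(clone);
            return clone;
        }
    }

    class Node {
        Node forwarded; // set once this object has been copied
        List<Node> references = new ArrayList<Node>();

        Node() { }

        Node(Node original) {
            // start with the original's references; collect() fixes them up
            references = new ArrayList<Node>(original.references);
        }
    }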
For a good overview of how this relates to the IBM JVM's garbage collection, see Mattias Persson's developerWorks article.
Sunday, 5 October 2008
Garbage collection myths
To start off with, what is garbage collection? Garbage collection is a system of automatic memory management. Memory which has been dynamically allocated but is no longer in use is reclaimed for future re-use, without intervention by the application. Garbage collection solves the otherwise difficult problem of determining object liveness by freeing memory only when it becomes unreachable.
Garbage collection is pretty ubiquitous in modern languages. Garbage collected languages include Java, the .NET languages, Lisp, Python, Perl, PHP, Ruby, Smalltalk, ML, Self, Modula-3, and Eiffel. Some languages which are not traditionally garbage collected offer garbage collection as a pluggable or configurable extension. For example, collectors are available for C++, and Objective-C was recently extended to allow garbage collection. Understanding the garbage collector is an important part of performance tuning in these languages.