Recap the Java(JVM) : after reading Optimizing Java
Java was initially designed to be “fast enough”; it benefitted greatly from Moore’s Law and is now capable of high performance
Key aspects of the platform
- Interpreter (interprets compiled bytecodes)
- JIT Compiler (HotSpot, the JVM used by OracleJDK and OpenJDK, has one; it only compiles the most commonly executed (i.e. important) parts of the program)
- Garbage Collection
The Java environment is defined by 2 specifications. The JVM is implemented as a stack machine: every time we execute a new method, a new evaluation stack is created and used as a temporary working area.
- Java language specification
- Java Virtual Machine specification (VMSpec), which defines the execution of JVM bytecode
JIT compilation must maintain correct execution. Benchmarking individual compiled methods is very hard; in general, most engineers should not try to benchmark a single method or a small piece of code. It is much better (and easier) to do performance analysis and tuning on entire applications. Small pieces of Java code may not react the way people expect because of how the JIT compiler works: it widens its scope as it compiles, bringing more code into view, so optimization decisions may differ the more code it sees. In general, micro-benchmarking of just a few lines of code should be avoided unless you know you have a genuine use case for it.
The JVM has classloading time, not just compile & run time. Classloading is somewhat similar to Unix dynamic linking but considerably more flexible & powerful. It is the only way to get new code into a running system, and it provides a “pinch (defensive) point” for the Java security model. Classloading does not typically have much impact on most Java applications.
Garbage collection: automatic reclamation of memory no longer in use. The algorithm is not defined by the VM or language spec; it is a pluggable subsystem with tradeoffs between application concerns.
Modern Processor Architecture
All data must be fetched over the Northbridge into the CPU before it can be used, and what is fetched is actually a cache line rather than an individual piece of data. In the book’s example, one method touches every element of a large int array, while a second method, touchEveryLine(), increments only every 16th element; Java’s `int` is 4 bytes, so that is effectively one `int` every 64 bytes, i.e. one per cache line. So despite the fact that we only access and modify one in every 16 elements of the array, we still fetch the entire array out of main memory and into cache, and the amount of work done moving memory across the Northbridge is the same for both methods. This example tells us:
- the vast majority of the work being done is moving memory around, rather than actually incrementing the individual elements.
- how noisy measurement on the JVM can become: the example benchmarked the speed of the Northbridge, not the code.
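The cache-line effect described above can be sketched roughly as follows (a reconstruction; the class name, array size and exact method bodies are assumptions based on the description):

```java
class CacheLines {
    static final int SIZE = 1 << 20;      // 1M ints (4 MB), larger than typical L1/L2 caches
    final int[] data = new int[SIZE];

    // Touch every element: one increment per int.
    void touchEveryItem() {
        for (int i = 0; i < SIZE; i++) {
            data[i]++;
        }
    }

    // Touch one int per cache line: Java's int is 4 bytes, so a stride
    // of 16 elements is one int per 64-byte cache line.
    void touchEveryLine() {
        for (int i = 0; i < SIZE; i += 16) {
            data[i]++;
        }
    }
}
```

Timing either method tends to measure how fast cache lines can be moved across the Northbridge, not the cost of the increments themselves, which is why both run in roughly the same time.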
Operating Systems and Code Execution
Memory layout of a Unix process and JVM
Text section(segment) : executable code for the program and libraries will be placed
Data section : pre-initialized data contained in the binary
Heap section : dynamically allocated memory which C and C++ programmers will typically use using malloc() and free()
Stack section : where individual user threads within the application allocate new frames as new functions are called
Every Java process runs within a JVM, and the JVM is a Unix process. The Java heap is a subsection of the overall heap provided by the Unix process model. This mapping is not required by any standard, but major implementations such as HotSpot tend to follow it. The Java heap for HotSpot is not typically resized; instead, when the JVM process starts up, an amount of memory within the overall C heap is reserved as the Java heap. Java bytecode is placed in the C heap, not the Java heap.
Each individual thread has its own stack and shares access to the heap. A Java thread is the same thing as an operating-system thread: Thread.start() makes a call to the OS, which creates a new thread, provides it with a new stack segment, and starts executing it. This means a Java application thread can and will be scheduled as an individual unit of execution alongside any other threads that exist, including other application threads and JVM threads such as JIT compiler threads.
OS scheduler, State model of a thread, Context switches
Overhead of the OS scheduler: putting a thread to sleep, waking it up one millisecond later, transferring the thread to the back of the run queue, waiting for it to come to the front again, and then placing it onto a core. Essentially there will be some additional time over and above the expected one second.
JVM Objects at Runtime
Objects in the heap
- Every object in the heap is represented by an OOP
- Ordinary Object Pointer
- A pointer in the C/C++ sense
- VM allocates from free space
- Technically an instanceOop
- Every Java reference points at an oop location
- An oop location consists of 2 machine words of header: the Mark word (instance-specific data) and the Klass word (type-specific data; metadata about the type of this object)
- In the case of an array, there is a third word of header: the array’s length
- Primitives do not have object headers, of course, and cannot live in the heap without boxing (generation of an object header to wrap the value)
In the diagram, the big blue box is a pointer to the start of the object: the first and second hatched boxes are the two words of header, the Mark word and the Klass word; the third box is an int field; then come two reference fields which are themselves represented as oops, i.e. two pointers to two other objects.
- for a 32-bit JVM the object header is 8 bytes
- for a 64-bit JVM the object header is 16 bytes
- while this seems trivial, the average Java application contains millions of objects, so to reduce this overhead, use CompressedOops: represent a 64-bit address in 32 bits by scaling by a factor of 8 and adding to a 64-bit base, so we can address ~4 billion objects rather than ~4 billion addresses
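The CompressedOops arithmetic can be sketched as follows (illustrative only; the real encode/decode lives inside the VM, and the method names here are invented):

```java
class CompressedOops {
    // Decode a 32-bit compressed oop: treat it as unsigned, scale by 8
    // (the default object alignment) and add the 64-bit heap base.
    static long decode(long heapBase, int narrowOop) {
        return heapBase + ((narrowOop & 0xFFFFFFFFL) << 3);
    }

    // Encode a 64-bit address back to 32 bits (the address must be 8-byte
    // aligned and within 32 GB of the heap base for this to round-trip).
    static int encode(long heapBase, long address) {
        return (int) ((address - heapBase) >>> 3);
    }
}
```

With a scale factor of 8, the 32-bit field can span 2^32 × 8 = 32 GB of heap: ~4 billion objects rather than ~4 billion byte addresses.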
The Mark word is used during garbage collection and for locking the object; its possible states record lock and GC information.
The Klass word is a pointer to the metadata for the class (type), which includes the bytecode for its methods. It points into Metaspace (outside the Java heap but within the C heap; the place where metadata lives, from Java 8 onwards) or PermGen (within the Java heap, before Java 8), and is used to implement type checks at the VM level by checking equality of Klass words.
Class objects & Klass Oops
Class objects (e.g. String.class) are Oops: they are just regular Java objects with regular Java behaviors
Klass Oops are VMs representation of class metadata which carry the methods of the class in a vtable structure and cannot be directly addressed by a Java reference
There is a connection between Class objects and Klass Oops, which involves Java reflection; see the family of Oop types in the JDK.
For people who are interested in performance, it is important to have a notion of what bytecodes are, as background information.
It is called bytecode because each operation code (opcode) is represented by 1 byte. Bytecode is architecture-independent: processors typically have a convention for how they interpret multi-byte numbers (big-endian or little-endian), and bytecode is defined to be big-endian regardless of the host. There are about 200 opcodes in use. Bytecode is typed (there are different opcodes for the same operation on different types). It is an abstract representation that runs on a stack machine; it is NOT “machine code for an imaginary CPU” (real CPUs are register-based, not stack-based). It can be further compiled to machine code by JIT or AOT compilers. It is not machine code but an intermediate representation (IR) of a program, “halfway” between human-readable source and machine code. It is easy to construct an Abstract Syntax Tree (AST) from bytecode, and many tools exist to analyse and modify bytecode; ASM and cglib are very common choices.
AST example with Fibonacci
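For reference, a recursive Fibonacci in Java, together with a sketch of the bytecode `javap -c` would show for it (abbreviated, offsets omitted; treat the disassembly comments as illustrative, not exact output):

```java
class Fib {
    // Approximate disassembly of fib (abbreviated):
    //   iload_0            // push n
    //   iconst_2
    //   if_icmpge  <else>  // if n >= 2, jump to the recursive branch
    //   iload_0
    //   ireturn            // return n
    //   iload_0
    //   iconst_1
    //   isub
    //   invokestatic fib   // fib(n - 1)
    //   iload_0
    //   iconst_2
    //   isub
    //   invokestatic fib   // fib(n - 2)
    //   iadd
    //   ireturn
    static int fib(int n) {
        if (n < 2) return n;
        return fib(n - 1) + fib(n - 2);
    }
}
```

Note how every instruction is typed (`iload`, `iadd` for int) and how all work happens on the evaluation stack — there are no registers in the model.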
A stack machine is a useful way to implement an interpreter.
javap : a tool for peeking inside classfiles, showing the disassembled bytecode and constant pool information to analyze many aspects of a classfile. It is a standard tool shipped in the bin directory of the JDK. By default javap shows public, protected and default methods; with -p it will also show private methods.
note : many modern languages have the notion that every expression, including every method call, must have a return value of some type. This is expressed as the design principle that “everything is an expression”, and because everything is an expression, everything has a type. This is a principle which the Java language, and especially the JVM, does not implement.
The constant pool
Similar to a symbol table in Unix or other languages, the constant pool is a list of the “useful things” present in a class: Strings, numeric constants, the classes & methods & fields referred to (including methods & fields in other classes), and descriptions of method signatures. It forms a self-consistent set; the aim is to define what this class refers to, not what is available from this class. You can show it using the -v switch to javap.
The interpreter is the initial JVM code-execution environment. It is stack-based, not register-based. The simplest implementation is a “switch inside a while loop”: while there are more bytecodes to be executed, take one bytecode at a time and switch based on which instruction it is.
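A minimal “switch inside a while loop” interpreter for a toy stack machine might look like this (the opcodes are invented for illustration — they are not real JVM bytecodes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ToyInterpreter {
    // Invented opcodes for illustration, not real JVM bytecodes.
    static final byte HALT = 0, PUSH = 1, ADD = 2, MUL = 3;

    static int run(byte[] code) {
        Deque<Integer> stack = new ArrayDeque<>();  // the evaluation stack
        int pc = 0;                                 // program counter
        while (true) {                              // "switch inside a while loop"
            byte op = code[pc++];
            switch (op) {
                case PUSH: stack.push((int) code[pc++]); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                case HALT: return stack.pop();
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }
}
```

The template interpreter replaces this dispatch loop with per-opcode assembly stubs, but the control structure is the same idea.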
HotSpot’s implementation is called the template interpreter: it builds a dynamic dispatch table when the VM starts and makes heavy use of native assembly code, which improves the performance of the interpreter
Performance metrics
- Latency — end-to-end time at a given workload; often quoted only for average load, but useful to consider across a range of loads
- Throughput — # of units of work processed at known resource levels
- Utilization — % of resources used to service requests
- Efficiency — Throughput / Resources used
- Capacity — # of units of work in flight through the system
- Scalability — delta in throughput or latency as resources are added
Main areas that are the root causes of most sources of slowness
- External Systems
Use a thread profiler to identify waiting threads and what these threads are waiting on: a DB query not performing well, poor network latency, a slow SOAP/REST call, poor throughput from an RMI service, an overwhelmed authentication/authorization system, etc.
Apply a code or architectural change — e.g. improve the throughput of the external resource (spin up more DB servers), communicate with the external system less often (cache the results of previous calls), or interact with the external system in a more efficient manner (batch updates).
- Operating Systems
Disk R/W problems, slow hardware, context switching (the OS spending too much time switching between tasks), threads contending for a software lock, contention for hardware resources.
Get SSDs, use an in-memory cache, batch writes, investigate the overall usage profile (are there large spikes at particular times?), move contending processes, investigate locking.
- JVM / GC
The usual JVM-caused performance issue is GC; occasionally other subsystems can cause bottlenecks. Solving these issues & tuning GC is non-trivial. Use an up-to-date JVM: Oracle JDK 6 & 7 are EOL (& 6 is a critical PROD risk), JDK 7 is 20–30% faster than 6 (& JDK 8 is 20% faster again), and JDK 8 has features with major performance increases.
- Application Code
How often is this the cause of performance problems? Most engineers would say 80–90%, but in practice it is lower than 10%.
- Too much traffic
Basic analysis principles
The CPU is an expensive, scarce resource; aim to keep it as busy as possible. Measure CPU in # of instructions executed (retired) per second. Keep data as close to the CPU as possible (caches); OS & VM tuning.
Measuring Effectiveness of Memory/CPU
The vmstat command provides information at regular intervals.
Check that the machine is not pushed into swap (vmstat 1); watch memory and CPU utilization with the machine under load. Is memory utilization healthy? Is CPU utilization close to 100%?
Garbage Collection
Freeing of memory that is no longer live, otherwise known as “collecting dead objects”
Mark & Sweep
A process to find dead objects. It uses 2 internal data structures: the Oop table (like a linked list), which references every object ever created by the VM and not yet garbage-collected, and the free list, which records the sizes of free chunks. First, iterate through the Oop table clearing the mark bit. Second, start from the GC roots — pointers that point into the heap from outside, typically from stack frames and the evaluation stacks belonging to each application thread — …
Stop the world
GC requires exclusive access to the heap: code is halted at a safepoint by the runtime while garbage is being collected. All application threads must pause whilst garbage is collected.
Concurrent : GC threads can run whilst application threads are running. Fully concurrent collection is not possible, but a multi-phase garbage collector can have some phases that stop the world and some that are concurrent.
Parallel : multiple threads are used to execute garbage collection
The WGH (Weak Generational Hypothesis) says that “most objects die young”, so we want to collect “young” & “old” objects separately: collect young objects often, using an evacuating collector, and only collect the old objects when necessary, using a compacting collector.
Simple generational algorithm
“Eden” (Young) & “Tenured” (Old) : first, evacuate any survivors from Eden to Tenured during the sweep phase; second, wipe Eden completely by filling it with zeros; last, clear the mark bit on any survivors.
Survivor spaces are additional areas of the Young space, used as a “temporary holding area” before Tenured. They solve the “young object created just before a GC” problem. HotSpot has 2 survivor spaces; the threshold of collections an object must survive before being moved into Tenured is 4.
Each thread has a private allocation area (TLAB, thread-local allocation buffer) within Eden. Object allocation is a pointer bump, O(1); a very short-lived object will never live anywhere else (it dies before the next GC cycle). The JVM will allocate a new TLAB if a thread exhausts its current one; a young GC is triggered if a new TLAB can’t be handed out.
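Pointer-bump allocation inside a TLAB can be sketched like this (a simulation in plain Java, not how HotSpot actually implements it):

```java
class Tlab {
    private final byte[] buffer;   // the thread-local chunk of Eden
    private int top;               // the bump pointer

    Tlab(int size) { buffer = new byte[size]; }

    // O(1) pointer-bump allocation: returns the offset of the new
    // "object", or -1 when the TLAB is exhausted (the JVM would then
    // hand the thread a new TLAB, possibly triggering a young GC).
    int allocate(int bytes) {
        if (top + bytes > buffer.length) return -1;
        int addr = top;
        top += bytes;
        return addr;
    }
}
```

Allocation is just a bounds check and an addition, which is why allocating short-lived objects in Java is so cheap.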
OutOfMemoryError is thrown when 98%+ of time is spent in GC and <2% of the heap is freed in a collection, or when direct allocation into Tenured fails (e.g. for a large byte array); it is usually a sign that GC cycles can’t keep up with the allocation rate. The majority of the VM’s heap is usually Tenured (3/4 by default).
Allocation & Lifetime
Allocation and lifetime are the two main drivers of JVM memory behavior.
One of the primary tuning techniques is to adjust the allocation rate, which measures the amount of memory used by newly created objects over some time period, usually in MB/s; it is not directly recorded by the JVM. The absolute maximum allocation rate available is set by the Northbridge, which is responsible for moving all memory to and from the processor, so we cannot allocate faster than the Northbridge can shuffle memory between the CPU and the actual memory chips.
Java tried to reduce the importance of object lifetime
Metrics to watch
Pause time, Throughput(%age), Pause frequency, Reclamation efficiency, Pause consistency
Hotspot’s Parallel Collectors
ParallelGC & ParallelOld are the default collectors as of version 8; from version 9 the default changes to a new collector (G1). They are fully stop-the-world collectors and rely upon two observations: young collections happen very often and are very short, evacuating survivors to the other survivor space & Old (also compacting onto the to-space); old collections are much less frequent and are compacting but not evacuating.
Drawbacks of Parallel Collectors
Inherently STW. Not so bad for young collections: pause time is O(# live objects), giving short pause times, but potentially bad for Tenured. The ParallelOld collector can efficiently use as many CPUs as possible and has very low overhead, but it has two potential drawbacks. First, the pause time scales linearly not with the number of live objects but with the heap size: O(heap size). This is characteristic of a compacting collector, so as the heap gets bigger and bigger, the overall stop-the-world time for a full collection continues to grow linearly; at some point the parallel collector will prove unsuitable for our applications, so for large heaps ParallelOld is a bad fit. The second, potentially more serious problem is that the ParallelOld algorithm is all-or-nothing: once a GC cycle has started in Tenured it must be allowed to finish; there is no mechanism for terminating GC early or for doing a partial tidy-up of OldGen and then allowing application threads to resume.
Concurrent Marking Algorithm
If we want to move beyond purely parallel STW collection, we need to start introducing concurrent aspects into the different phases of GC work. The simplest place to do this is the mark phase.

Tri-color marking: rather than a single mark bit to flip indicating whether an object is marked or not, we have three states. The algorithm proceeds by first finding the GC roots as the initial mark (colored grey); everything else within the heap is colored white. We then keep track of the set of grey objects: a marking thread takes a grey node and examines whether it has any children still colored white; if it does, it colors the child grey and puts the grey node back into the grey set. On the other hand, if the selected grey node has no white children of any kind, it is itself colored black and removed from the set of grey objects. Over time, because the heap is of bounded size, we reach a maximum number of grey nodes; after that point the number of grey nodes starts to descend, until the only things left in the heap (OldGen) are white and black nodes. At this point the white nodes are eligible for collection.

However, tri-color marking as described is still a STW algorithm, and the problem with tri-color marking under concurrent mutation is that live white objects might be collected (a black object can be updated to point at a white one after the black object has been scanned). To fix this, a couple of different approaches can be used; the one taken in the JVM is to reuse card tables to store extra information: essentially, when we update a black object to point at a white object, we mark the card of the black object, indicating that it requires rescanning.
There is then a second phase of marking, known as the remark, where we step through and analyze each of the possible cases where a black object could have been made to point at a white one; if that has occurred, the object graph is fixed up to prevent any breakage or deletion of live objects. For simplicity, and to avoid any possibility that the algorithm might not terminate, the remark phase is also a STW phase. Tri-color marking as implemented in the JVM therefore typically has two STW phases, but both are typically very short compared to the overall STW pause for the young generation, which means the majority of the GC work required to mark the heap can potentially run in a concurrent phase.
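The STW version of the marking phase described above can be sketched as follows (concurrency and the card-table rescan are omitted; the node and colour names are invented):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TriColorMark {
    enum Color { WHITE, GREY, BLACK }

    static class Node {
        Color color = Color.WHITE;
        final List<Node> children = new ArrayList<>();
    }

    // Mark phase: roots start grey, everything else white.
    // A grey node with white children greys them and stays grey;
    // a grey node with no white children turns black.
    static void mark(List<Node> roots) {
        Deque<Node> grey = new ArrayDeque<>();
        for (Node r : roots) { r.color = Color.GREY; grey.push(r); }
        while (!grey.isEmpty()) {
            Node n = grey.pop();
            boolean hasWhiteChild = false;
            for (Node c : n.children) {
                if (c.color == Color.WHITE) {
                    c.color = Color.GREY;
                    grey.push(c);
                    hasWhiteChild = true;
                }
            }
            if (hasWhiteChild) {
                grey.push(n);          // revisit once its children are scanned
            } else {
                n.color = Color.BLACK; // fully scanned
            }
        }
    }
}
```

After mark() returns, any node still WHITE is unreachable from the roots and eligible for collection.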
The JVM is not actually a fully pre-emptive multithreading environment. The general concept of pre-emptive multithreading/multitasking is that the OS scheduler, being a privileged component, can choose to remove a user-space thread from a core at any time. In addition to this mechanism, the JVM supports a form of cooperative multithreading. The basic underlying principle is that the VM needs points at which to perform these coordinated activities; these are called safepoints.
In general there are two rules: the JVM cannot force a thread to a safepoint (JVM threads are not anything special; only the kernel is capable of forcing a thread to leave a core), and the JVM can prevent a thread from leaving a safepoint. The overall mechanism is that the JVM sets a global flag (“time to safepoint”); when application threads see the flag has been set, they must pause and wait to be woken up again. Over time all app threads must stop; threads that stop quickly must wait for slower stoppers, and this stopping time may not be fully accounted for within the overall GC or STW time. It should also be noted that it is possible for a thread to take a long time to reach a safepoint.
Concurrent Mark and Sweep
A partially concurrent collector. This is an old collector which has been part of the JDK for some time; it is not necessarily considered a preferred collector for production use these days unless you need extremely low pause behavior. The reason it is introduced first after the parallel collectors is that it highlights and showcases many of the practical problems with using a concurrent collector, and makes clear some of the pitfalls present when you make the tradeoff to go with low pause times.
Observable effects of CMS
Application threads don’t stop for as long, but they do stop twice for each GC cycle. Full GC cycles take longer in wallclock time. Not all cores are devoted to GC when CMS is running: the cores are shared between application threads and GC threads, so it will take at least twice as long to collect the heap (the default is to allocate half the cores to GC and half to application threads), and application throughput is reduced while a CMS cycle runs. GC uses more memory for keeping track of objects, and considerably more CPU time is needed to perform GC.

The big issue is that CMS does not compact the heap: CMS does not move objects around, so Tenured can become fragmented. We may find circumstances where there is enough memory overall for a large object, but not all in one place. When that happens, CMS has a fallback strategy to prevent an out-of-memory error: if objects need to be promoted and no space is available, CMS will instead stop the world and do a full compacting collection (falling back to ParallelOld) in order to free up as much space as possible. This effect is known as `concurrent mode failure`. Only when your application needs to respond for definite within 50–100 milliseconds should CMS be considered.
Garbage First, G1
Production-quality in 8u40, default in 9. Originally intended to be a low-pause replacement for CMS, but in practice, while it has had some success in that area, the very low-pause end of the spectrum is still out of reach for G1 in most configurations; instead it has ended up becoming a general-purpose collector and a genuinely useful replacement for the parallel collectors. Its design aims: scalable to larger heaps (as discussed when looking at ParallelOld, pause time scales linearly the larger the heap gets), easy to tune (-XX:MaxGCPauseMillis), predictability and performance. G1 uses regions of memory, by default 1M in size, instead of a hemispherical heap (Eden, survivor spaces and old gen); as the heap gets larger, regions grow.
Balancing work between the collector and the allocator places additional work on allocator threads; from the Java application’s point of view this impacts performance and can interfere with application throughput.
AOT, JIT compilation
AOT compilation (“Ahead Of Time”) is just called “compilation” in C/C++: the compiler turns human-readable source into executable machine code. AOT has only a single opportunity to optimize; because the executable code is produced literally ahead of time, there is no opportunity to use runtime information to make better decisions. In particular, an AOT compiler usually does not know what CPU its code will run on, so it outputs generic x64 or i386 code without the possibility of using processor-specific instructions.
JIT compilation is built around the principle that runtime information is extremely important.
The Klass word is a pointer to class-specific metadata, the most important part of which is the method definitions written into a vtable; the vtable is key to the HotSpot JIT compiler working effectively. HotSpot compiles on a per-method basis: each individual unit of code (method) will be compiled, or not, as determined by the profiling counters. Roughly 90% of time is spent in ~2% of methods. HotSpot starts off in interpreted mode, with threads watching to see which methods are the most commonly called; once a method has been identified as important (a hot method), it can be handed to a compiler thread to convert into heavily optimized machine code, and the vtable pointer is then updated to point at the compiled machine code. This process is asynchronous and does not require the interpreter threads (the application threads) to stop at any point: the compiler thread works in the background as part of the VM, using an available core to produce fast machine code which is swapped in as soon as it is available.
The code cache is a single area of contiguous memory which holds JIT-compiled methods (native code); its size can be changed with -XX:ReservedCodeCacheSize. This is one reason methods should be kept as small as possible.
Compilation Strategies in HotSpot
Inlining — move a method body’s contents into the caller’s scope (cf. Kotlin’s inline), which avoids the cost of the call. Consider getter/setter methods: these typically wrap field access, and you might argue it is better to have a public field rather than a getter/setter, since the existence of getters/setters requires a virtual call. In practice this is not true: the JIT compiler’s inlining mechanism will replace the call to the method with direct field access. Typically only methods up to 35 bytecodes long will be inlined (inlining depth also counts). Benefits include reduced pointer indirection, eliminated virtual method lookups, no penalty for good coding practices, and a broadened scope for further optimizations.
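A source-level illustration of what inlining a getter effectively does (the JIT performs this transformation on compiled code, not on source; the classes here are invented):

```java
class Inlining {
    static class Point {
        private final int x;
        Point(int x) { this.x = x; }
        int getX() { return x; }   // tiny method, well under the ~35-bytecode limit
    }

    // What we write: a virtual call per element.
    static int sumViaGetter(Point[] ps) {
        int sum = 0;
        for (Point p : ps) sum += p.getX();
        return sum;
    }

    // What the JIT effectively produces after inlining: the call
    // disappears and the field is read directly.
    static int sumInlined(Point[] ps) {
        int sum = 0;
        for (Point p : ps) sum += p.x;  // direct field access
        return sum;
    }
}
```

Both versions compute the same result, which is the point: good encapsulation carries no runtime penalty once the getter is inlined.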
Loop Unrolling — reduce the cost of the back-branch jump by replacing a loop body with duplicated copies of the code; this typically benefits tight loops where the cost of the jump is a large proportion of the loop time. It is a size-vs-speed tradeoff: it increases native code size and uses up code cache (if the code cache fills up, JIT compilation stops).
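A source-level sketch of unrolling by a factor of 4 (the JIT does this on machine code; the factor and code here are illustrative):

```java
class Unrolling {
    // Plain loop: one compare-and-jump per element.
    static long sum(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // Unrolled by 4: one compare-and-jump per four elements,
    // at the cost of a larger loop body.
    static long sumUnrolled(int[] a) {
        long s = 0;
        int i = 0;
        int limit = a.length - 3;
        for (; i < limit; i += 4) {
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) s += a[i]; // leftover elements
        return s;
    }
}
```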
Monomorphic dispatch — classes usually have 0, 1 or many subclasses; as a result it is quite common for the first type seen at a particular call site to be the only one that is ever seen there. Every time the interpreter executes an invoke opcode, the profiling subsystem can make a note of what type is being called upon. In practice, potentially most call sites in an application are monomorphic; in some workloads this can be up to 95% of call sites.
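A sketch of a monomorphic call site (invented classes): if profiling only ever observes Circle at the call site below, HotSpot can replace the vtable lookup with a cheap type check plus a direct, inlinable call:

```java
class Monomorphic {
    interface Shape { double area(); }

    static class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // If Circle is the only Shape implementation ever seen here,
    // s.area() is a monomorphic call site: the JIT can devirtualize it
    // (guarded by a type check that deoptimizes if a new type appears).
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) total += s.area(); // monomorphic in practice
        return total;
    }
}
```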
Escape analysis — in principle every object must be heap-allocated, even very short-lived temporary objects. From Java 7, the VM performs scope-based analysis to determine whether an object is visible outside the method; if it is not, the allocation can be eliminated.
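A sketch of an allocation that escape analysis can eliminate (invented classes; whether the JIT actually scalar-replaces it depends on the compiler and flags):

```java
class Escape {
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // 'p' never escapes this method, so after escape analysis the JIT
    // may scalar-replace it: no heap allocation, x and y live in registers.
    static double distanceFromOrigin(double x, double y) {
        Point p = new Point(x, y);                 // candidate for elimination
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }
}
```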
When to use microbenchmarks(MBMs)
Most application devs should not use MBMs. There are 3 main use cases for MBMs: OpenJDK development, development of general-purpose libraries, and low-latency applications. Rules of thumb for when MBMs may be appropriate: the code path (transaction) time should be less than 5 ms and usually less than 100 µs, the allocation rate should be <1 MB/s, CPU utilization should be 100% of available CPU, and system (kernel) utilization should be low (under 10%).
GraalVM and Java9, 10
A String consists of some metadata and an array of chars; in Java the char data type is two bytes per char. However, text is normally thought of as being a single byte per char, so what of the other byte that Java has always had as part of the char data type? For ASCII or Latin-1 text those bytes are simply zero. For some applications — for example, a search engine like Lucene — we may have been carrying twice as much heap space as we actually need. To solve this, String in Java 9 is represented not as a char array but as a byte array, with a simple field indicating whether the byte array should be interpreted as Latin-1 or UTF-16. In the Latin-1 case the zero byte is omitted; only in the UTF-16 case does the byte array effectively double in size to take up the same role as the previous char array. Just by moving from Java 8 to Java 9 you pick up this string optimization, known as compact strings, automatically. This means any application making extensive use of strings can benefit: heap size cut (up to half for string data), reduced GC pause times, and less effort required to collect the heap.
JEP 312 (thread-local handshakes) — reduce the need for global safepoints
For profiling, one safepoint per thread is required; with global safepoints, ten threads in the application mean ten global safepoints — a huge overhead