Most software engineers have heard of Java or JVM garbage collection (GC) but probably not all of us have paid much attention to it, so we rely a lot on the default settings. Especially for me, being an engineer from a non-JVM language background, the majority of my time was spent on C/C++ development. I didn't get into how garbage collection works until I started working in a position in the IT industry.
After getting into Kotlin/Spring development in my career, I recently started gaining an interest in how the JVM works, especially the garbage collection mechanism. I spent some time researching this topic and found out that it's pretty interesting how the JVM handles garbage collection, which I think is worth knowing for all engineers who work with JVM based languages.
History of garbage collection
Before we dive into the details, it's interesting to know that GC was not first introduced by Java as most programmers think, actually, it was first developed by John McCarthy around 1956, with the purpose of making Lisp memory management easy. [1]
What needs to be collected
In short, objects in JVM heap is the main focus of garbage collection.
GC Roots
Early garbage collection mechanisms used reference counting to determine if an object needed to be collected. This means that when an object is created, the reference count for this object is set to 1. Every assignment of this object to another variable will increment the reference count by 1. When a reference to the object reaches the end of lifecycle or is assigned to another, different object, the reference count for this object decrements by 1. So when the reference count reaches 0, it will be collected by the JVM. But this mechanism breaks when circular reference exists. Modern JVMs use what's called GC Roots to determine what can be collected. Basically, if the JVM detects that there is no link or path more precisely (as it's developed from graph theory) between the object and the GC Root, it will be viewed as collectible.
Objects that can be viewed as GC Roots:
- Local variables
- Active threads
- Static variables
- JNI references
Common garbage collection algorithms
Mark-Sweep
The Mark-Sweep algorithm is straight forward; it starts scanning from the GC Root; it marks any survived objects, then a second scan will perform GC for those unmarked objects, as illustrated below:
As can be seen from above, this algorithm will cause memory fragmentation problems.
Copying
The Copying algorithm is also straight forward; it aims at solving the fragmentation problem caused by Mark-Sweep. It works by dividing the JVM heap space into two sections: one is active, which stores all objects, the other one is idle and not used. All new objects are allocated in the active subspace. Once the active subspace is full, GC kicks in. It will scan the active subspace starting from GC Root, all the surviving objects will be copied into the idle subspace. After copying is finished the roles of these 2 subspaces will be swapped, so the old idle subspace will be the new active subspace for object allocation. The process is shown below:
Mark-Compact
The Mark-Compact algorithm is very similar to Mark-Sweep. The difference is that after the unmarked objects are collected it will perform an additional compaction step to move all the marked objects to an end, so this also overcomes the fragmentation problem in Mark-Sweep, but extra cost is brought in.
Generations
The new algorithm used for modern garbage collection is generation based. The idea of this approach is that it divides the heap into multi-generations, which are:
- Young Generation
- Tenured Generation
- Permanent Generation
The Generations differ from each other on how they are used for garbage collection. Because of this, each Generation can employ a different approach to handle the collection process.
Young Generation (Copying)
- First, all new objects are allocated in the Young generation space.
- The Young generation is divided based on a ratio 8:1:1 into three subspace - Eden, Survivor 0 and Survivor 1. Objects are allocated in Eden space. When GC happens, all surviving objects in the Eden space will be copied into the Survivor 0 space, and the Eden space will be cleared. Then when the next GC kicks in, surviving objects from the Eden space will be copied to a Survivor space, but this time they will be copied into the Survivor one space. Also, objects from the Survivor 0 space will be copied to Survivor 1, then Eden & Survivor 0 spaces will be cleared. Then the GC process will repeat. Since Survivor 0 is empty right now the roles of Survivor 0 & Survivor 1 will be switched for the next GC, that is, surviving objects will be copied to Survivor 0, Eden & Survivor 1’s spaces will be cleared.
- When objects survive multiple GC cycles, they will be promoted to the Tenured space.
Tenured Generation (Mark-Compact)
Within the Tenured space, GC doesn't happen as often as the Young space. When a GC happens it will be a major GC which will clear and compact the whole space.
Since GC rarely happens in permanent space, it will not be covered in this blog.
The end
This blog only covers a tiny amount of garbage collection in JVM and lots of the content is summarized to keep it short, for more information you can visit Oracle's JVM page to read a more detailed description.