Most software engineers have heard of Java or JVM garbage collection (GC) but probably not all of us have paid much attention to it, so we rely a lot on the default settings. Especially for me, being an engineer from a non-JVM language background, the majority of my time was spent on C/C++ development. I didn't get into how garbage collection works until I started working in a position in the IT industry.
After getting into Kotlin/Spring development in my career, I recently started gaining an interest in how the JVM works, especially the garbage collection mechanism. I spent some time researching this topic and found out that it's pretty interesting how the JVM handles garbage collection, which I think is worth knowing for all engineers who work with JVM based languages.
Before we dive into the details, it's interesting to know that GC was not first introduced by Java as most programmers think, actually, it was first developed by John McCarthy around 1956, with the purpose of making Lisp memory management easy. [1]
In short, objects in JVM heap is the main focus of garbage collection.
Early garbage collection mechanisms used reference counting to determine if an object needed to be collected. This means that when an object is created, the reference count for this object is set to 1. Every assignment of this object to another variable will increment the reference count by 1. When a reference to the object reaches the end of lifecycle or is assigned to another, different object, the reference count for this object decrements by 1. So when the reference count reaches 0, it will be collected by the JVM. But this mechanism breaks when circular reference exists. Modern JVMs use what's called GC Roots to determine what can be collected. Basically, if the JVM detects that there is no link or path more precisely (as it's developed from graph theory) between the object and the GC Root, it will be viewed as collectible.
Objects that can be viewed as GC Roots:
The Mark-Sweep algorithm is straight forward; it starts scanning from the GC Root; it marks any survived objects, then a second scan will perform GC for those unmarked objects, as illustrated below:
As can be seen from above, this algorithm will cause memory fragmentation problems.
The Copying algorithm is also straight forward; it aims at solving the fragmentation problem caused by Mark-Sweep. It works by dividing the JVM heap space into two sections: one is active, which stores all objects, the other one is idle and not used. All new objects are allocated in the active subspace. Once the active subspace is full, GC kicks in. It will scan the active subspace starting from GC Root, all the surviving objects will be copied into the idle subspace. After copying is finished the roles of these 2 subspaces will be swapped, so the old idle subspace will be the new active subspace for object allocation. The process is shown below:
The Mark-Compact algorithm is very similar to Mark-Sweep. The difference is that after the unmarked objects are collected it will perform an additional compaction step to move all the marked objects to an end, so this also overcomes the fragmentation problem in Mark-Sweep, but extra cost is brought in.
The new algorithm used for modern garbage collection is generation based. The idea of this approach is that it divides the heap into multi-generations, which are:
The Generations differ from each other on how they are used for garbage collection. Because of this, each Generation can employ a different approach to handle the collection process.
Within the Tenured space, GC doesn't happen as often as the Young space. When a GC happens it will be a major GC which will clear and compact the whole space.
Since GC rarely happens in permanent space, it will not be covered in this blog.
This blog only covers a tiny amount of garbage collection in JVM and lots of the content is summarized to keep it short, for more information you can visit Oracle's JVM page to read a more detailed description.