A while back I published a detailed code walkthrough of CPython’s GC implementation, but there was a need for a higher level explanation of CPython’s overall memory management mechanism without discussing the code. This article fills that gap. It provides a detailed overview of the overall memory management mechanism in CPython. The main focus is on the cyclic garbage collector (GC); the article covers details about how and when the GC runs and ties this to its impact on the performance of an application.
This article was triggered by a performance regression discovered in CPython’s incremental GC implementation in the 3.13 release candidate. As I posted an explainer about it on Twitter, I realized there is a need for a simpler article on the design of the GC which makes it easy to derive performance insights around memory management in Python applications.
In the next live session we will learn about how bytecode compiled languages (like Python) work by live coding a compiler and virtual machine for a small subset of Python syntax. We will learn about the design of Python’s AST, its bytecode instructions and how they execute. More details at the link below:
In CPython, the garbage collector isn’t the primary driver behind reclaiming memory; it has a very limited role. CPython primarily depends on reference counting for reclaiming the memory of unused objects.
In reference counting, every object begins with a reference count of 1 on initialization. On assignment to other variables, or when passed as function arguments, the reference count of the object gets incremented. Similarly, when a variable goes out of scope, or a function call returns, the reference count gets decremented. Finally, when the last variable referring to that object goes out of scope (i.e. the reference count reaches 0), the runtime immediately reclaims the memory for that object.
This scheme is simple to implement but has one shortcoming. If a set of object references form a cycle amongst themselves, then their reference counts will never reach 0 even if nothing else is referring to them. As a result, such objects are left behind on the heap, which can quickly turn into a memory leak.
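A reference cycle is easy to create and observe from Python itself. This minimal sketch builds a two-object cycle, drops all external references, and lets the cyclic collector reclaim it:

```python
import gc

# Two lists referencing each other form a cycle: even after the last
# external reference is dropped, each keeps the other's refcount above zero.
a, b = [], []
a.append(b)
b.append(a)
del a, b  # the lists are now unreachable, but refcounting alone cannot free them

# The cyclic garbage collector finds and reclaims them; collect() returns
# the number of unreachable objects it found.
unreachable = gc.collect()
print(unreachable)  # at least 2 (the two lists)
```

Without the cycle (e.g. two independent lists), `del` alone would free the objects immediately via reference counting and `gc.collect()` would find nothing new.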
To reclaim the memory of such cyclic references, a garbage collector is needed which scans the objects on the heap, finds the cyclic references and cleans them up if they are not reachable from anywhere else in the program.
Read my article on CPython’s reference counting internals to learn how it works and its implementation details.
Programming languages with managed memory need to make it really cheap to create new objects. If they always called malloc and free to create and destroy objects, it would quickly get inefficient. Apart from the overhead associated with the increased number of system calls, it can also lead to fragmented memory, which can result in poorer performance. Therefore, most language runtimes implement object pools and custom object allocators.
CPython also has an object allocator designed to reduce the cost associated with frequent allocation and deallocation of objects. The design of this allocator is similar to arena based allocators. Internally, it maintains object pools of various sizes. Depending on the size of memory requested, the allocator allocates memory from an appropriately sized pool. When the object is no longer needed and is deallocated, its memory is returned back to the allocator.
However, this allocator is used only for allocating objects smaller than 512 bytes. For larger objects, CPython relies on the system’s malloc API.
Now let’s get back to discussing the CPython GC.
Programming languages which solely depend on the garbage collector for managing memory run the GC at a fixed frequency. However, in CPython most objects are freed as soon as they are no longer needed, thanks to reference counting. As a result, there is no need to run the GC at a fixed frequency.
Then how does the CPython runtime decide when to invoke the GC? It does so based on the number of objects currently alive in the young generation of the heap.
GC Generations: Conceptually, the heap is divided into a few segments called generations. Typically there is a young generation, which is where all newly created objects are placed. And then there are one or more old generations, where the objects which survived GC cycles while in the young generation are placed. The GC runs much less frequently for older generations, with the hypothesis that if an object survives a GC cycle, it is very likely to stay around for a long time and should be checked less frequently for collection.
As soon as a new object is allocated, the runtime puts it into the list of objects tracked by the young generation. Similarly, when an object is deallocated (for instance when its reference count goes to 0), it is unlinked from the young generation and the count of objects in the young generation is decremented.
When the total number of alive objects in the young generation crosses a threshold (default value 2000 in Python 3.12, but it is configurable), the runtime sets a flag in the current thread’s thread state structure. This flag is set in a field called eval_breaker, where flags for a few other events, such as signals, are also set.
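You can observe this per-generation bookkeeping from Python itself using the gc module. A small sketch (automatic collection is paused here only so the counters are easy to watch):

```python
import gc

gc.disable()  # pause automatic collection so the counters change predictably

# Allocation counts per generation; gen0 grows as container objects are created
before = gc.get_count()[0]
junk = [[] for _ in range(100)]
after = gc.get_count()[0]
print(before, "->", after)  # gen0 count increased by the new allocations

# Collection thresholds as (young, old1, old2): the first is the object-count
# threshold discussed above, the other two are scan-count thresholds
print(gc.get_threshold())

gc.enable()
```

Note that only container objects (lists, dicts, class instances, etc.) are tracked by the GC; ints and strings cannot participate in cycles and never enter these counts.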
This means that if most of your objects are short lived and do not have cycles, the population of objects within the young generation will remain under control and the GC will not be triggered very frequently.
So let’s say you create enough long lived objects that you eventually cross the young generation threshold and the GC flag is set. But that doesn’t invoke the GC automatically. The CPython virtual machine (VM) checks this flag at specific points during bytecode execution and invokes the GC if it finds the flag to be set.
The CPython VM checks the flag at the end of the execution of some very specific bytecode instructions. These instructions are essentially the different variations of function calls and backward jump instructions (i.e. at the end of a loop iteration).
This means that if, at the end of executing a function or a loop iteration, the GC threshold was crossed and the flag was set, then the VM will stop executing further bytecode instructions of your program and switch to executing the GC to free up memory.
If you want to understand how the CPython VM works and its internals, read my article on the design & implementation of the CPython VM.
The job of the GC is to check the currently living objects in the young generation for cycles. If it finds any unreachable cyclic references, it cleans them up. Any objects which are not cleaned up are moved to the old generation.
The cycle detection algorithm requires the GC to check the incoming references of each object in the collection it is scanning. Even though the object graph is typically not very dense, this is still a lot of pointer chasing, which can be expensive if those objects are not in the CPU cache. Read my article on the CPython GC implementation to understand the algorithm.
At the end of the GC cycle, the population of the young generation should reach 0 (or close to 0). Some objects which might have been created as part of a function or loop body would have been automatically destroyed by reference counting. The unreachable cyclic references would get cleaned up by the GC, and the surviving objects would get promoted to the old generation.
Whenever an older generation is being scanned, the GC scans all the generations younger than it as well. But when does the GC scan the older generations? It uses threshold values for the older generations as well. However, these thresholds are not based on the number of objects, but on how many times the previous generation has been scanned.
The default threshold value for the two old generations in CPython’s GC is 10. When the GC has run for the young generation 10 times, the GC will scan the first old generation plus the young generation during the next cycle.
Similarly, when the first old generation has been scanned 10 times, the oldest generation becomes eligible to be scanned. Scanning the oldest generation automatically includes scanning all the other generations; this is called a full heap scan because it essentially scans everything on the heap, and it can be quite an expensive operation.
Scanning the entire heap every few cycles can be very expensive in certain applications which tend to create many long lived objects. And in general it is observed that objects which survive GC in the young generation tend to live for a long duration, and therefore the full heap scan should be done more judiciously.
As a result, CPython puts an extra condition, apart from the threshold, before deciding to do a full heap scan. It tracks the total number of objects which have been included in the previous full scans (called long_lived_total), and it also tracks the total number of new objects which are awaiting their first full heap scan (called long_lived_pending). Only when the ratio of long_lived_pending to long_lived_total exceeds 25% will CPython actually perform a full heap scan.
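The decision rule can be sketched in plain Python. The names long_lived_pending and long_lived_total mirror the counters in the CPython source, but this function is only an illustration of the condition, not the actual C code:

```python
def should_do_full_heap_scan(long_lived_pending: int, long_lived_total: int) -> bool:
    """Sketch of CPython's extra gate on full heap scans: only scan when the
    objects awaiting their first full scan exceed 25% of those already scanned."""
    if long_lived_total == 0:
        # Nothing has ever been fully scanned yet; any pending objects qualify.
        return True
    return long_lived_pending / long_lived_total > 0.25

print(should_do_full_heap_scan(200, 1000))  # False: only 20% pending
print(should_do_full_heap_scan(300, 1000))  # True: 30% pending
```

The effect is that an application with a large, stable set of long lived objects does not pay for a full heap scan every time the oldest generation's scan-count threshold fires; it pays only when enough new long lived objects have piled up to be worth scanning.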
Incremental GC was an optimization recently introduced in CPython after the 3.12 release, with the goal of reducing the performance impact of a full heap scan.
As I mentioned at the beginning of the article, incremental GC has been reverted (it might come back in the 3.14 release), but its idea made sense and I will spend a moment to explain what it did.
Because the young generation is scanned only when the number of objects in it crosses a threshold, its performance cost is usually bounded and is largely uniform for a given application.
However, a full scan of the heap can make the overall GC cost unpredictable. The number of objects in the older generation can be unbounded if the application is creating too many long lived objects, and as a result the amount of time spent in GC can also be unpredictably high. This usually impacts the tail latency of applications and makes their performance unreliable.
The idea of incremental GC is to amortize the cost of a full heap scan by scanning a fraction of the old generation on each scan of the young generation. For instance, on every GC cycle the GC would scan the entire young generation and a small percentage of the oldest objects in the oldest generation. This way, the memory would still be reclaimed, but the overall cost of each GC cycle would become more predictable because the GC would always scan a similar number of objects, thus improving the tail latency of applications.
This idea showed improvements in the benchmarks, but it seems to have hit an edge case in Sphinx, where it made performance much worse than the previous GC algorithm.
Although the real underlying cause behind the regressions is yet to be verified, it is not too hard to think of a few scenarios where incremental GC might result in more work per cycle. For instance, if the application is generating many long lived objects, then on each GC cycle those objects would be promoted to the old generation and accumulate there. As a result, over time the GC would be doing increasingly more work, because the fraction of the old generation it needs to scan would steadily grow.
Now that we understand how GC works in CPython, I want to show the cost of a full heap scan by the GC using a worst case scenario. This is a hypothetical example with many simplifying assumptions; in a real-world system things are neither going to be this extreme, nor will such simplified assumptions hold true.
- GC Threshold: For CPython 3.12, the threshold for triggering a GC collection of the young generation (generation 0) is set to 2000 objects. The threshold for the two old generations is 10, which means that the first old generation is scanned after every 10 scans of the young generation, and the 2nd old generation is scanned after every 10 scans of the first old generation.
- Traffic: Let’s suppose we are running a web service which serves 100 requests per second.
- Object Creation: Each request creates 20 new long-lived objects which survive the GC cycle.
- Object Size: Each object is approximately 24 bytes in size. Normally, Python objects are much larger, but let’s use this for simplicity.
- Scalability:
  - If these 2000 objects do not reference each other, the GC’s task is reduced to a linked list traversal. (Internally, all objects within a GC generation are tracked using a linked list of pointers.)
  - Traversing a linked list requires at least two memory dereferences per node: one to dereference the current node pointer, and another to get the next node’s address.
  - 2000 objects at 24 bytes each amount to 48,000 bytes, which fits within the L1 cache. A single L1 cache lookup costs 4 cycles, so the two lookups per node take 8 CPU cycles.
  - As we are scanning 2000 objects, the full scan will take 16,000 cycles.
  - At 1 GHz, this translates to approximately 16 μs per GC cycle.
- Subsequent Collections:
  - After 10 scans of the young generation (i.e. within 10 seconds), the threshold for the first old generation will be reached.
  - At this point there will be a total of 22,000 objects combined in the two generations.
  - 22,000 objects would consume about 528,000 bytes, which will spill across the L1 and L2 caches. To keep things simple, let’s assume the cost of each memory dereference is still 4 cycles.
  - Cost of this scan: 22,000 objects × 2 dereferences × 4 cycles per dereference = 176,000 CPU cycles, roughly 0.176 ms.
- Escalating Costs:
  - After 10 scans of the first old generation (i.e. after 100 seconds), the threshold for the 2nd old generation will be reached.
  - At this point there will be a total of 220,000 objects to scan across all three generations.
  - Total memory: 220,000 × 24 bytes = 5.28 MB. This exceeds typical L1/L2 cache capacity.
  - If we assume that 20,000 objects fit within L1/L2, traversing them would consume 160,000 cycles.
  - The remaining 200,000 objects would be in the L3 cache. Assuming 30 cycles of latency for L3, traversing 200,000 objects would consume 200,000 objects × 2 dereferences × 30 cycles per dereference = 12,000,000 cycles.
  - Overall, the cycles consumed would be 160,000 + 12,000,000 = 12,160,000 cycles.
  - On a 1 GHz machine, this translates into roughly 12 ms spent inside the GC per full heap scan.
- If the web service normally takes 12 ms to serve a request, then whenever a full heap scan occurs, the latency can increase to 24 ms. That is a 100% latency increase during a full heap scan, which is quite a lot.
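The back-of-envelope arithmetic above is easy to reproduce. All the constants below are the hypothetical values assumed in this example (not measured numbers):

```python
CYCLES_PER_DEREF_L1 = 4    # assumed L1/L2 access latency in cycles
CYCLES_PER_DEREF_L3 = 30   # assumed L3 access latency in cycles
DEREFS_PER_NODE = 2        # linked list traversal: current node + next pointer
CPU_HZ = 1_000_000_000     # assumed 1 GHz clock

def scan_cost_seconds(objects_in_l1_l2: int, objects_in_l3: int) -> float:
    """Estimated wall time of a GC scan given where the objects sit in cache."""
    cycles = (objects_in_l1_l2 * DEREFS_PER_NODE * CYCLES_PER_DEREF_L1
              + objects_in_l3 * DEREFS_PER_NODE * CYCLES_PER_DEREF_L3)
    return cycles / CPU_HZ

print(scan_cost_seconds(2_000, 0))         # 1.6e-05   -> ~16 us young-gen scan
print(scan_cost_seconds(22_000, 0))        # 0.000176  -> ~0.176 ms first old gen
print(scan_cost_seconds(20_000, 200_000))  # 0.01216   -> ~12 ms full heap scan
```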
I’d emphasize again that this was a worst case example to put things into perspective about the costs of memory management. In real-world applications things are not this extreme.
Also, this does not mean that CPython’s GC is particularly bad. In most languages, the cost of GC tends to become a point of contention, and a lot of developer time and effort is spent trying to tune the GC, or optimize the application to reduce GC work.
Now that we understand how CPython manages memory, how can we use this knowledge to improve the performance of our applications? Let’s tie up our understanding of CPython’s GC internals with some actionable insights to reduce the cost of GC.
Note that there are a ton of different kinds of optimizations that you can do to reduce the memory usage of your program. The few insights that I am listing below are mostly about reducing the cost of GC or memory allocation, with the goal of illustrating how knowledge of the internals helps in analyzing and improving the performance of your applications. But you only want to look into these if your performance monitoring tools suggest that the GC is a bottleneck.
CPython uses an efficient object allocator for objects smaller than 512 bytes. Larger objects are allocated using the system’s malloc implementation, which can be more expensive.
One potential optimization strategy is to design your data structures to favor smaller objects. This could involve splitting larger objects into smaller ones. While this may result in more individual allocations, these allocations will use the more efficient object allocator.
However, this approach comes with trade-offs:
- It may increase overall memory usage due to additional object headers and pointers.
- It could complicate your code and potentially slow down access patterns if related data is split across multiple objects.
- It may result in more objects being created, and if they are long lived it may increase the GC frequency.
Always profile and measure before implementing such optimizations. Only consider this approach if you detect performance issues related to frequent malloc/free calls or memory fragmentation.
For applications that really need many large objects, consider using a more efficient malloc implementation such as jemalloc, mimalloc, or tcmalloc.
- Use slot classes. For classes with many instances, using __slots__ can reduce the memory footprint and speed up attribute access. It also prevents the creation of a per-instance __dict__, which the GC would otherwise also need to consider.
- Instead of using a list, consider using the array module if you know that all of your data is numeric. The array module packs the data much more efficiently and reduces the pressure on memory.
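Both techniques are easy to try out. A small sketch (the Point class names here are invented for illustration):

```python
import sys
from array import array

class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")  # fixed attribute layout; no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

p, q = PointDict(1, 2), PointSlots(1, 2)
print(hasattr(p, "__dict__"), hasattr(q, "__dict__"))  # True False

# An array of doubles stores raw 8-byte values in one contiguous buffer,
# instead of an array of pointers to full float objects like a list does.
nums_list = [float(i) for i in range(1000)]
nums_array = array("d", nums_list)
print(sys.getsizeof(nums_array))  # packed doubles
print(sys.getsizeof(nums_list))   # pointer array only; the 1000 float objects
                                  # it points to each add their own overhead
```

Note that sys.getsizeof on the list counts only the pointer array, so the list’s true footprint is even larger once the individual float objects are included.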
With the knowledge that every long living object has a potential CPU cost, in terms of how frequently the GC will run and how much work it will have to do, it helps to rethink object lifetimes.
- Avoiding unnecessary global objects is one obvious optimization. Global or module scope objects tend to stay around for longer durations; keeping them to a minimum is a good design choice, and keeps the GC content as well.
- Releasing resources eagerly is another optimization. For example, use context managers where possible to ensure underlying resources and objects are freed as soon as possible.
- Optimize the structure of long-lived objects. If such an object references many other objects, those references will persist on the heap. By restructuring these objects to contain only essential information, free of unnecessary references, you can release memory for the other objects and reduce GC frequency and cost.
- Use weak references. When we weakly reference an object, the object’s reference count is not incremented. This means that if the object is not used anywhere else, it will be freed even though we hold a weak reference to it.
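The weak reference behavior is easy to demonstrate with the standard weakref module. Because CPython frees objects immediately when their refcount hits zero, the weak reference goes dead as soon as the last strong reference is dropped:

```python
import weakref

class Cache:
    """Placeholder class; weakref needs a weakly-referenceable type."""

obj = Cache()
ref = weakref.ref(obj)  # does NOT increment obj's reference count
print(ref() is obj)     # True: the referent is still alive

del obj                 # last strong reference gone: refcount hits 0, object freed
print(ref())            # None: the weak reference no longer resolves
```

For the common pattern of caches keyed by objects, weakref.WeakValueDictionary and weakref.WeakKeyDictionary apply the same idea without manual ref() calls.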
By now you must know that the whole reason for the existence of the CPython GC is to free up memory associated with cyclic references; the rest of the objects are freed automatically via reference counting.
While the problem domain sometimes demands a design where objects have cycles between them, ensure that such cycles do not exist unintentionally.
If the requirements of your application allow it, you can also consider using weak references in places where cycles might form.
You may also consider tweaking the thresholds of the garbage collector using the gc module. You can increase the thresholds so that the GC is triggered less frequently, at the cost of increased memory usage. Or you may decrease the thresholds so that the GC runs more frequently and memory is reclaimed more aggressively.
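The gc module exposes the thresholds directly. The specific numbers below are arbitrary examples, not recommendations; measure before changing them in production:

```python
import gc

print(gc.get_threshold())   # defaults, e.g. (2000, 10, 10) on CPython 3.12

# Trigger the young-generation collection less often:
# fewer GC pauses, but more memory held between collections.
gc.set_threshold(10_000, 10, 10)

# Or collect more aggressively: memory reclaimed sooner, more GC work.
gc.set_threshold(500, 10, 10)

print(gc.get_threshold())   # (500, 10, 10)
```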
You should use telemetry tools to collect stats about heap and GC usage, such as the count of objects in each generation, the number of GC cycles triggered, and the amount of time spent in each GC cycle, to make a more informed decision about tuning the GC parameters.
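One lightweight way to collect such stats in-process is the gc.callbacks hook, which CPython invokes before and after every collection. A minimal sketch that records per-collection pause times:

```python
import gc
import time

pauses = []
_start = [0.0]

def on_gc(phase, info):
    # phase is "start" or "stop"; info carries the generation number being
    # collected plus counts of collected and uncollectable objects.
    if phase == "start":
        _start[0] = time.perf_counter()
    else:
        pauses.append((info["generation"], time.perf_counter() - _start[0]))

gc.callbacks.append(on_gc)
gc.collect()  # force a full collection so at least one pause is recorded
gc.callbacks.remove(on_gc)

print(pauses[-1])  # (generation scanned, pause duration in seconds)
```

In a real service you would export these samples to your metrics system instead of keeping them in a list, and sample gc.get_count() / gc.get_stats() periodically for generation populations.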
Finally, profile and observe the metrics of your application before jumping into these optimizations. Amdahl’s law tells us that the overall speedup we can achieve by optimizing a part of the system is limited by the fraction of time spent in that part. If the bottleneck of your application lies somewhere else, then optimizing memory is not going to bring much benefit.
Invest in memory profiling tools such as tracemalloc and scalene, and other performance analysis tools such as perf, cProfile, and py-spy. Check out my video on Python profilers to learn more.
People pick Python because of its syntax and productivity features, not because of its performance. But when you run into performance problems, having an understanding of the internal mechanisms of the language is the only way to fix them. In this post we discussed how and under what conditions CPython’s garbage collector runs and what its associated overheads are. Based on this knowledge we also came up with a list of techniques to optimize the memory usage of an application so that the runtime performs GC less frequently.
I hope you found this post insightful. I’ve written a lot of articles covering the implementation details of CPython, but I am realizing I also need to provide these kinds of insights, which are useful for developers deploying Python in production. Stay tuned for more.
If you have read the full article and are feeling brave, check out my article on the implementation of the CPython GC. It gives you a detailed walkthrough of the CPython code and explains the cycle detection algorithm and other juicy details.
If you find my work interesting and valuable, you can support me by opting for a paid subscription (it’s $6 monthly/$60 annual). As a bonus you get access to monthly live sessions, and all the past recordings.
Many people report failed payments, or don’t want a recurring subscription. For that I also have a buymeacoffee page, where you can buy me coffees or become a member. I will upgrade you to a paid subscription for the equivalent duration here.
I also have a GitHub Sponsors page. You will get a sponsorship badge, and also a complimentary paid subscription here.