-
Notifications
You must be signed in to change notification settings - Fork 116
Mitigate memory leak #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mitigate memory leak #188
Conversation
Since `Traverser` and `OutputRules` classes have a cyclic reference,
they can not be removed from the memory by PHP directly. Unless the GC
does its job. And theses classes holds a reference (stream) to the HTML
and it can be very huge (> 1Mb).
So if one uses this lib to parse lots of HTML, the leak lead to a
**massive** use of memory (>2Gb in less than 10s with my current
workload).
By "manually" removing the cyclic reference, the GC is not involved
anymore. PHP can "normally" free the two instances, and so the reference
to the HTML (stream) is also cleared.
Reproducer (with symfony/dom-crawler):
```php
$content = file_get_contents('https://www.php.net/');
$count = $argv[1] ?? 151;
$s = microtime(true);
for ($i = 0; $i < $count; $i++) {
$crawler = new Crawler($content);
$nodes =
$crawler->filterXPath('descendant-or-self::head/descendant-or-self::*/title');
$nodes->each(static function ($node): void {
$node->html(); // The faulty line
});
if (0 == $i % 10) {
preg_match('/^VmRSS:\s(.*)/m', file_get_contents('/proc/self/status'), $m);
printf("%03d - %.2fMb - %s - %.3fs\n", $i, memory_get_usage(true) / 1024 / 1024, trim($m[1]), microtime(true) - $s);
}
}
```
Results without the patch:
```
000 - 4.00Mb - 38052 kB - 0.085s
010 - 4.00Mb - 45732 kB - 0.513s
020 - 4.00Mb - 53124 kB - 0.941s
030 - 4.00Mb - 60780 kB - 1.370s
040 - 4.00Mb - 68436 kB - 1.798s
050 - 4.00Mb - 76092 kB - 2.224s
060 - 4.00Mb - 83484 kB - 2.659s
070 - 4.00Mb - 91140 kB - 3.086s
080 - 4.00Mb - 98796 kB - 3.514s
090 - 4.00Mb - 106452 kB - 3.944s
100 - 4.00Mb - 113844 kB - 4.384s
110 - 4.00Mb - 121500 kB - 4.816s
120 - 4.00Mb - 129156 kB - 5.246s
130 - 4.00Mb - 136812 kB - 5.680s
140 - 4.00Mb - 144468 kB - 6.114s
150 - 4.00Mb - 151860 kB - 6.550s
```
Results with the patch:
```
000 - 4.00Mb - 38020 kB - 0.086s
010 - 4.00Mb - 38868 kB - 0.551s
020 - 4.00Mb - 38868 kB - 0.987s
030 - 4.00Mb - 38868 kB - 1.422s
040 - 4.00Mb - 38868 kB - 1.860s
050 - 4.00Mb - 38868 kB - 2.295s
060 - 4.00Mb - 38868 kB - 2.734s
070 - 4.00Mb - 38868 kB - 3.167s
080 - 4.00Mb - 38868 kB - 3.603s
090 - 4.00Mb - 38868 kB - 4.040s
100 - 4.00Mb - 38868 kB - 4.476s
110 - 4.00Mb - 38868 kB - 4.919s
120 - 4.00Mb - 38868 kB - 5.373s
130 - 4.00Mb - 38868 kB - 5.820s
140 - 4.00Mb - 38868 kB - 6.270s
150 - 4.00Mb - 38868 kB - 6.715s
```
|
I believe that this is not a memory leak. If you call
With your approach, you remove manually the cycle, in this way PHP is able remove the object immediately without waiting for I think that this teaches us to call To be hones I'm happy with your solution, but I would ask you to add a comment on why that call is being done (and that by design is not needed). |
I think a clearer solution would be to leave the |
That was also my first idea, but that will be a BC break since adding a method to an interface will break compatibility for sure. |
|
@goetas calling |
|
What do you think about the approach in #190, is substantially the same but using a separate method in the concrete class... |
|
Closing in favor of #190 |
Since
TraverserandOutputRulesclasses have a cyclic reference,they can not be removed from the memory by PHP directly. Unless the GC
does its job. And theses classes holds a reference (stream) to the HTML
and it can be very huge (> 1Mb).
So if one uses this lib to parse lots of HTML, the leak lead to a
massive use of memory (>2Gb in less than 10s with my current
workload).
By "manually" removing the cyclic reference, the GC is not involved
anymore. PHP can "normally" free the two instances, and so the reference
to the HTML (stream) is also cleared.
Reproducer (with symfony/dom-crawler):
Results without the patch:
Results with the patch:
the full story: https://jolicode.com/blog/a-journey-to-find-a-memory-leak