Skip to content

Conversation

@mcourteaux
Copy link
Contributor

This PR adds a small change that allows you to .no_profiling() on a Func to prevent the corresponding ProducerConsumer node to not be instrumented with profiling. Therefore all the collected samples are attributed to the enclosing ProducerConsumer. I wanted to give this a try to see what the generated code looks like.

I was thinking this might be useful to get rid of this overhead manually when you are working on pipelines that have a Func without any loops sitting in the innermost loop nest level, e.g.: the Func gets compiled to only 1 to 5 instructions.

Feedback and opinions welcome!


My use case:

This code full of profiling injections (with an unrolled loop around it):
image

Became this:
image

Meanwhile, the profile report went from:

neonraw_highlight_ratio_extractor
 total time: 1389.949829 ms  samples: 1309  runs: 48  time/run: 28.957289 ms
 average threads used: 2.828113
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    ratio_map:                   20.969ms (72.4%)   threads: 2.685 
    repeat_edge:                  3.639ms (12.5%)   threads: 3.219  stack: 2040
    max_field_hpass:              0.617ms ( 2.1%)   threads: 3.678  stack: 1920
    max_field:                    0.263ms ( 0.9%)   threads: 2.000  stack: 1536
    min_field_hpass:              0.462ms ( 1.5%)   threads: 2.333  stack: 640
    max_channel:                  0.088ms ( 0.3%)   threads: 1.500  stack: 68
    min_field:                    0.044ms ( 0.1%)   threads: 1.500  stack: 512
    b:                            0.352ms ( 1.2%)   threads: 1.750  stack: 32
    g:                            0.154ms ( 0.5%)   threads: 1.571  stack: 32
    r:                            0.110ms ( 0.3%)   threads: 4.000  stack: 32
    lum:                          0.619ms ( 2.1%)   threads: 4.222  stack: 32
    lum_pow3:                     0.044ms ( 0.1%)   threads: 1.500  stack: 32
    lum4:                         1.547ms ( 5.3%)   threads: 3.542  stack: 32
    w_0:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_1:                          0.022ms ( 0.0%)   threads: 14.000 stack: 32
    w_2:                          0.022ms ( 0.0%)   threads: 2.000  stack: 32

to:

neonraw_highlight_ratio_extractor
 total time: 1376.182739 ms  samples: 1295  runs: 48  time/run: 28.670473 ms
 average threads used: 2.325097
 heap allocations: 0  peak heap usage: 0 bytes
    halide_malloc:                0.000ms ( 0.0%)   threads: 0.000 
    halide_free:                  0.000ms ( 0.0%)   threads: 0.000 
    ratio_map:                   24.013ms (83.7%)   threads: 2.245 
    repeat_edge:                  3.434ms (11.9%)   threads: 2.647  stack: 2040
    max_field_hpass:              0.515ms ( 1.7%)   threads: 3.130  stack: 1920
    max_field:                    0.176ms ( 0.6%)   threads: 3.500  stack: 1536
    min_field_hpass:              0.486ms ( 1.6%)   threads: 2.181  stack: 640
    max_channel:                  0.000ms ( 0.0%)   threads: 0.000  stack: 68
    min_field:                    0.044ms ( 0.1%)   threads: 8.500  stack: 512
    b:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    g:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    r:                            0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum_pow3:                     0.000ms ( 0.0%)   threads: 0.000  stack: 32
    lum4:                         0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_0:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_1:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32
    w_2:                          0.000ms ( 0.0%)   threads: 0.000  stack: 32

@abadams abadams added the dev_meeting Topic to be discussed at the next dev meeting label Mar 5, 2024
@mcourteaux mcourteaux marked this pull request as draft March 6, 2024 10:45
@mcourteaux
Copy link
Contributor Author

I still run in a situation where the information from calling Func::no_profiling() is not fully propagated to the ProducerConsumer node. I don't see where it goes wrong. It works for quite some of the Funcs I call this on, but not all.

image

In the screenshot above, w_0, w_1, and w_2 should not have the injection be done on the produce.

@steven-johnson steven-johnson requested a review from abadams March 6, 2024 17:18
@abadams
Copy link
Member

abadams commented Mar 6, 2024

The idea seems useful, but I don't think this should be a flag on ProducerConsumer nodes for such a niche purpose. The nice thing about ProducerConsumer nodes is that they give you the name of the Func being produced/consumed. We never make ones that don't correspond directly to a Func. So I think a better way to implement it is to look up the Func in the environment passed around during lowering and read the flag from its schedule.

@mcourteaux
Copy link
Contributor Author

Replaced by #8143. Thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dev_meeting Topic to be discussed at the next dev meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants