Use `broadcast_to` instead of `broadcast_like`
Closes #159. WIP.
Two questions:
- Changing `broadcast_like` to `broadcast_to` in `aesara/tensor/math_opt.py` gives me a lot of `BadOptimization` and `AssertionError`s, which I'm not knowledgeable enough to debug - can someone help me understand what's happening?
- ~I've also changed `broadcast_like` to `broadcast_to` in `aesara/tensor/basic_opt.py` - is this something we want to do?~
Hmm - more tests are failing than I anticipated. Running `pytest tests/tensor/test_math_opt.py` gives me the 9 errors I was originally referring to: https://gist.github.com/eigenfoo/8c2d5e7867fc23b82c98553c5300b054
The failures don't actually call `broadcast_to`, but most of them involve `toposort` or optimize using `local_pow_canonicalize`, which does call `broadcast_to`. Does that mean `broadcast_to` affected the graph optimizations, which is what these failed tests are testing for? Apologies for rambling, I'm trying to understand the codebase better.
`tests/tensor/test_basic.test_alloc_constant_folding` (the test that failed the current test run) also calls `toposort`.
> The failures don't actually call `broadcast_to`, but most of them involve `toposort` or optimize using `local_pow_canonicalize`, which does call `broadcast_to`. Does that mean `broadcast_to` affected the graph optimizations, which is what these failed tests are testing for?
It looks like they're failing because those unit tests make unnecessarily strong assumptions, aren't isolated/self-contained enough, don't assert the relevant changes directly, etc. In other words, you're seeing the long-standing deficiencies of our test suite.
If you could fix those issues as you address this PR's issue, you would be doing significantly more for this project than the change requested in the issue itself.
Aside from strong test assumptions, there are some optimization errors that are probably worth addressing first; those may actually involve bugs.
For instance, the first optimization error implies that the type generated by the newly introduced `BroadcastTo` `Op` doesn't match the type of the graph it's replacing in the optimization.
@brandonwillard the latest commit https://github.com/pymc-devs/aesara/pull/288/commits/a00ee11938d5ef0942b8f58ee5ca7643beedd056 gets rid of all `BadOptimization` errors, as far as I can tell - all that remains are `AssertionError`s.
However, I suspect I'm doing something ugly by adding the `dtype` parameter like this - can you take a look at the new parameter? If it all looks good, I'll move on to fixing the tests: I suspect I'll need a lot of help along the way, though 😬
Sure, that actually makes it easier. I assumed we wanted to mimic the broadcast_like signature, which took a dtype parameter. Sorry if that was a bad assumption.
The previous two commits revert the dtype commit, and use TensorVariable.astype to cast dtypes.
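At the NumPy level, the same shape of change can be sketched like this (a hedged illustration of the approach, not the PR's symbolic Aesara code):

```python
import numpy as np

# Illustrative NumPy analogue: broadcast first, then cast the result with
# astype, instead of threading a dtype parameter through broadcast_to itself.
x = np.arange(3, dtype="int64")
y = np.broadcast_to(x, (2, 3)).astype("float32")

assert y.shape == (2, 3)
assert y.dtype == np.float32
```

Casting after broadcasting also keeps `broadcast_to`'s signature aligned with NumPy's, which only takes an array and a shape.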
Also, do we want to remove the `broadcast_like` function now? The only remaining references are in its own definition:
```
$ rg broadcast_like
aesara/tensor/basic_opt.py
170:def broadcast_like(value, template, fgraph, dtype=None):
182:        "broadcast_like currently requires the "
```
@brandonwillard I think I've fixed all tests but two. I'm confused about the remaining failures - I've attached the output of `pytest test_math_opt.py` here: https://gist.github.com/eigenfoo/ddb7d2aad30550c0efbf64a1841881f9
1. `test_stacktrace` seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for? https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/aesara/graph/opt.py#L3188-L3189
2. What is `test_local_reduce_join` testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure... https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/tests/tensor/test_math_opt.py#L3472-L3473
> 1. `test_stacktrace` seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for?
That's probably referring to the stack traces carried by each variable in their tag. Those traces show where each variable was created, and, during optimizations, they're carried over to newly created replacement variables. Their only real purpose is for user-level debugging.
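As a rough plain-Python analogue of that mechanism (assuming the trace is captured with something like `traceback.extract_stack`; the `make_variable` helper here is hypothetical):

```python
import traceback

def make_variable():
    # Hypothetical stand-in for variable creation: record the stack at the
    # creation site, the way Aesara stashes a trace on a variable's tag.
    return {"trace": traceback.extract_stack()}

var = make_variable()

# The last frame points at where the "variable" was created - information
# that exists purely for user-level debugging.
assert var["trace"][-1].name == "make_variable"
```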
> 2. What is `test_local_reduce_join` testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure...
Those look like brittle test conditions. Instead of coming up with a direct check for something specific and relevant in the transformed graph, many tests simply assert the number of nodes in the expected output.
In almost all cases, these kinds of asserts are simply bad, so don't take them too seriously. If you changed an optimization, an `Op`, and/or one of an `Op`'s helper functions, and any of those things are used in one of these tests, then a change in the number of nodes could easily be justified.
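To illustrate the difference with plain Python stand-ins for a graph's `toposort()` output (the op names below are hypothetical):

```python
# Stand-in for fgraph.toposort(): a list of op names (hypothetical values).
topo = ["Elemwise{neg}", "CAReduce{max}", "Elemwise{neg}"]

# Brittle: any unrelated rewrite that adds or removes a node breaks this,
# even when the property the test cares about still holds.
assert len(topo) == 3

# More direct: assert the thing the optimization is actually supposed to
# guarantee, e.g. that no BroadcastTo node survives in the final graph.
assert not any("BroadcastTo" in op for op in topo)
```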
I'll try running these tests locally within the next few days and see if I can spot any genuine issues (i.e. ones that aren't due to brittle/overly-restrictive tests).
Thanks for helping @brandonwillard! Sorry for being so slow-moving on this PR - something has popped up in the real world (and not a lack of interest in finishing this work).
I've loosened the test conditions on `test_local_reduce_join` so that it passes now, and I've removed `test_stacktrace` entirely. I'm unsure why `test_stacktrace` fails whereas a similar test, `test_local_sum_prod_mul_by_scalar_stack_trace`, succeeds. I'd be happy to put the test back if I could get a pointer on what's gone wrong.
I'll wait for the test suite to finish, but I'd also appreciate a quick triage of whether there are other real test failures - otherwise I'll assume that everything that fails is a flaky test.
> Sorry for being so slow-moving on this PR
I only commented so that you knew I hadn't forgotten about this PR; there's absolutely no rush, though!
@brandonwillard `test_basic/test_stack_hessian` is failing - this seems like a real failure, but I'm unsure what's going wrong. Could you give some pointers?
https://github.com/pymc-devs/aesara/pull/288/checks?check_run_id=1884844321
> `test_basic/test_stack_hessian` is failing - this seems like a real failure
Yes, that does look like a real issue. My first impression was that it had to do with in-place operations.
The new `broadcast_to` uses NumPy's underlying view-based broadcasting via the `BroadcastTo` `Op`, and its `view_map` is supposed to indicate the view relationship between the `Op`'s first input, `a`, and its output. Given that the first value is 2, as expected, and all the remaining values are effectively 0, it seemed like an optimization might've been ignoring this view information and "in-placing" things that it shouldn't.
Unfortunately, the evaluated graph doesn't contain any `BroadcastTo` `Op`s, so, if the problem is related to `BroadcastTo`-generated views, it's manifesting in an indirect way.
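The underlying NumPy behaviour is easy to check directly: `np.broadcast_to` returns a read-only, zero-stride view of its input, which is exactly the input/output aliasing that the `view_map` is meant to declare:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.broadcast_to(a, (4, 3))

assert np.shares_memory(a, b)   # b is a view of a, not a copy
assert b.strides[0] == 0        # the broadcast dimension has stride 0
assert not b.flags.writeable    # NumPy marks the broadcast view read-only
```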
Also, with `aesara.config.compute_test_value` turned on, the test values for the `Ha` and `Hb` graphs are incorrect, and setting `aesara.config.optimizer_verbose = True` shows some validation failures during optimization.
~OK, the issue does appear to involve bad in-placing. Those `FunctionGraph` validation failures are specifically caused by `aesara.tensor.basic_opt.local_inplace_setsubtensor` producing the following exception:~

```
InconsistencyError('Attempting to destroy indestructible variables: [TensorConstant{(1,) of 0.0}]')
```
Actually, those aren't the exact problem. We probably need to find out how the `BroadcastTo` `Op`s are being replaced.
All right, I believe that the source of this issue is the change from `broadcast_like` to `broadcast_to` in `aesara.tensor.basic_opt.local_fill_to_alloc`. That's the rewrite that introduces the `BroadcastTo` `Op` in the first place.
In this case, a `fill` really shouldn't be replaced with a broadcasted view of the fill value; however, we might want a rewrite that turns unmodified (in-place) `alloc`s and/or `fill`s into `BroadcastTo`s.
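A NumPy-level sketch of why the replacement is unsafe: a broadcast view can't stand in for a freshly allocated `fill`/`alloc` result once something later tries to update it in place:

```python
import numpy as np

value = np.array([0.0])

# What the rewrite produced: a read-only broadcast view of the fill value.
# Any downstream in-place update fails (or, worse, would alias `value`).
view = np.broadcast_to(value, (3,))
try:
    view[0] = 1.0
    wrote_to_view = True
except ValueError:
    wrote_to_view = False

# What fill/alloc should produce: a real, writable allocation that can be
# destroyed by in-place optimizations without touching the original value.
alloc = np.full((3,), value[0])
alloc[0] = 1.0

assert not wrote_to_view
assert alloc[0] == 1.0 and value[0] == 0.0
```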
Now that I think about it, we could probably use a simple optimization that removes useless `BroadcastTo`s.
Looks like there are a few more brittle tests (e.g. `TestCrossEntropyCategorical1Hot.test_get_rid_of_advanced_indexing_version_of_xent` and `TestMinMax.test_optimization_max` do the same `len(topo) == x` nonsense).
Also, some of these optimizations might be having trouble because `BroadcastTo` isn't being lifted.
In the case of `TestMinMax.test_optimization_min`, it looks like the optimizations being tested aren't applied because they're looking for `neg(max(neg(...)))`, but now the graph is `neg(max(broadcast_to(neg(...))))`. If we lift the `BroadcastTo` `Op`s, it should fix that.
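The lift is valid because elementwise ops commute with broadcasting. In NumPy terms (a hedged illustration of the rewrite's correctness, not the rewrite itself):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

# neg(max(broadcast_to(neg(x)))) - the shape the graph currently has.
before = -np.max(np.broadcast_to(-x, (4, 3)), axis=1)

# broadcast_to(neg(max(neg(x)))) - BroadcastTo lifted above neg and max,
# restoring the inner neg(max(neg(...))) pattern the optimization matches.
after = np.broadcast_to(-np.max(-x), (4,))

assert np.array_equal(before, after)
```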
Just rebased this PR.