Use `broadcast_to` instead of `broadcast_like`
Closes #159. WIP.
Two questions:
- Changing `broadcast_like` to `broadcast_to` in `aesara/tensor/math_opt.py` gives me a lot of `BadOptimization` and `AssertionError`s, which I'm not knowledgeable enough to debug - can someone help me understand what's happening?
- ~I've also changed `broadcast_like` to `broadcast_to` in `aesara/tensor/basic_opt.py` - is this something we want to do?~
Hmm - more tests are failing than I anticipated. Running `pytest tests/tensor/test_math_opt.py` gives me the 9 errors I was originally referring to: https://gist.github.com/eigenfoo/8c2d5e7867fc23b82c98553c5300b054
The failures don't actually call `broadcast_to`, but most of them involve `toposort` or optimize using `local_pow_canonicalize`, which does call `broadcast_to`. Does that mean `broadcast_to` affected the graph optimizations, which is what these failed tests are testing for? Apologies for rambling, I'm trying to understand the codebase better.
`tests/tensor/test_basic.test_alloc_constant_folding` (the test that failed the current test run) also calls `toposort`.
> The failures don't actually call `broadcast_to`, but most of them involve `toposort` or optimize using `local_pow_canonicalize`, which does call `broadcast_to`. Does that mean `broadcast_to` affected the graph optimizations, which is what these failed tests are testing for?
It looks like they're failing because those unit tests make unnecessarily strong assumptions, aren't isolated/self-contained enough, don't assert the relevant changes directly, etc. In other words, you're seeing the long-standing deficiencies of our test suite.
If you could fix those issues as you address this PR's issue, you would be doing significantly more for this project than the change requested in the issue itself.
Aside from strong test assumptions, there are some optimization errors that are probably worth addressing first; those may actually involve bugs.
For instance, the first optimization error implies that the type generated by the newly introduced `BroadcastTo` `Op` doesn't match the type of the graph it's replacing in the optimization.
@brandonwillard the latest commit https://github.com/pymc-devs/aesara/pull/288/commits/a00ee11938d5ef0942b8f58ee5ca7643beedd056 gets rid of all `BadOptimization` errors, as far as I can tell - all that remains are `AssertionError`s.
However, I suspect I'm doing something ugly by adding the `dtype` parameter like this - can you take a look at the new parameter? If it all looks good, I'll move on to fixing the tests: I suspect I'll need a lot of help along the way, though 😬
Sure, that actually makes it easier. I assumed we wanted to mimic the broadcast_like signature, which took a dtype parameter. Sorry if that was a bad assumption.
The previous two commits revert the dtype commit, and use TensorVariable.astype to cast dtypes.
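At the NumPy level, the same shape of change can be sketched like this (a hedged illustration of the approach, not the PR's symbolic Aesara code):

```python
import numpy as np

# Illustrative NumPy analogue: broadcast first, then cast the result with
# astype, instead of threading a dtype parameter through broadcast_to itself.
x = np.arange(3, dtype="int64")
y = np.broadcast_to(x, (2, 3)).astype("float32")

assert y.shape == (2, 3)
assert y.dtype == np.float32
```

Casting after broadcasting also keeps `broadcast_to`'s signature aligned with NumPy's, which only takes an array and a shape.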
Also, do we want to remove the `broadcast_like` function now? The only remaining references are in its own definition:
```
$ rg broadcast_like
aesara/tensor/basic_opt.py
170:def broadcast_like(value, template, fgraph, dtype=None):
182:        "broadcast_like currently requires the "
```
@brandonwillard I think I've fixed all tests but two. I'm confused about the remaining failures - I've attached the output of `pytest test_math_opt.py` here: https://gist.github.com/eigenfoo/ddb7d2aad30550c0efbf64a1841881f9
1. `test_stacktrace` seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for? https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/aesara/graph/opt.py#L3188-L3189
2. What is `test_local_reduce_join` testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure... https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/tests/tensor/test_math_opt.py#L3472-L3473
> 1. `test_stacktrace` seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for?
That's probably referring to the stack traces carried by each variable in their tag. Those traces show where each variable was created, and, during optimizations, they're carried over to newly created replacement variables. Their only real purpose is for user-level debugging.
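As a rough plain-Python analogue of that mechanism (assuming the trace is captured with something like `traceback.extract_stack`; the `make_variable` helper here is hypothetical):

```python
import traceback

def make_variable():
    # Hypothetical stand-in for variable creation: record the stack at the
    # creation site, the way Aesara stashes a trace on a variable's tag.
    return {"trace": traceback.extract_stack()}

var = make_variable()

# The last frame points at where the "variable" was created - information
# that exists purely for user-level debugging.
assert var["trace"][-1].name == "make_variable"
```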
> 2. What is `test_local_reduce_join` testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure...
Those look like brittle test conditions. Instead of coming up with a direct check for something specific and relevant in the transformed graph, many tests simply assert the number of nodes in the expected output.
In almost all cases, these kinds of asserts are simply bad, so don't take them too seriously. If you changed an optimization, an `Op`, and/or one of an `Op`'s helper functions, and any of those things are used in one of these tests, then a change in the number of nodes could easily be justified.
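To illustrate the difference with plain Python stand-ins for a graph's `toposort()` output (the op names below are hypothetical):

```python
# Stand-in for fgraph.toposort(): a list of op names (hypothetical values).
topo = ["Elemwise{neg}", "CAReduce{max}", "Elemwise{neg}"]

# Brittle: any unrelated rewrite that adds or removes a node breaks this,
# even when the property the test cares about still holds.
assert len(topo) == 3

# More direct: assert the thing the optimization is actually supposed to
# guarantee, e.g. that no BroadcastTo node survives in the final graph.
assert not any("BroadcastTo" in op for op in topo)
```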
I'll try running these tests locally within the next few days and see if I can spot any genuine issues (i.e. ones that aren't due to brittle/overly-restrictive tests).
Thanks for helping @brandonwillard! Sorry for being so slow-moving on this PR - something has popped up in the real world (and not a lack of interest in finishing this work).
I've loosened the test conditions on `test_local_reduce_join` so that it passes now, and I've removed `test_stacktrace` entirely. I'm unsure why `test_stacktrace` fails whereas a similar test, `test_local_sum_prod_mul_by_scalar_stack_trace`, succeeds. I'd be happy to put the test back if I could get a pointer on what's gone wrong.
I'll wait for the test suite to finish, but I'd also appreciate a quick triage of whether there are other real test failures - otherwise I'll assume that everything that fails is a flaky test.
> Sorry for being so slow-moving on this PR
I only commented so that you knew I hadn't forgotten about this PR; there's absolutely no rush, though!
@brandonwillard `test_basic/test_stack_hessian` is failing - this seems like a real failure, but I'm unsure what's going wrong. Could you give some pointers?
https://github.com/pymc-devs/aesara/pull/288/checks?check_run_id=1884844321
> `test_basic/test_stack_hessian` is failing - this seems like a real failure
Yes, that does look like a real issue. My first impression was that it had to do with in-place operations.
The new `broadcast_to` uses NumPy's underlying view-based broadcasting via the `BroadcastTo` `Op`, and its `view_map` is supposed to indicate the view relationship between the `Op`'s first input, `a`, and its output. Given that the first value is 2, as expected, and all the remaining values are effectively 0, it seemed like an optimization might've been ignoring this view information and "in-placing" things that it shouldn't.
Unfortunately, the evaluated graph doesn't contain any `BroadcastTo` `Op`s, so, if the problem is related to `BroadcastTo`-generated views, it's manifesting in an indirect way.
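The underlying NumPy behaviour is easy to check directly: `np.broadcast_to` returns a read-only, zero-stride view of its input, which is exactly the input/output aliasing that the `view_map` is meant to declare:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.broadcast_to(a, (4, 3))

assert np.shares_memory(a, b)   # b is a view of a, not a copy
assert b.strides[0] == 0        # the broadcast dimension has stride 0
assert not b.flags.writeable    # NumPy marks the broadcast view read-only
```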
Also, with `aesara.config.compute_test_value` turned on, the test values for the `Ha` and `Hb` graphs are incorrect, and setting `aesara.config.optimizer_verbose = True` shows some validation failures during optimization.
~OK, the issue does appear to involve bad in-placing. Those `FunctionGraph` validation failures are specifically caused by `aesara.tensor.basic_opt.local_inplace_setsubtensor` producing the following exception:~

```
InconsistencyError('Attempting to destroy indestructible variables: [TensorConstant{(1,) of 0.0}]')
```
Actually, those aren't the exact problem. We probably need to find out how the `BroadcastTo` `Op`s are being replaced.
All right, I believe that the source of this issue is the change from `broadcast_like` to `broadcast_to` in `aesara.tensor.basic_opt.local_fill_to_alloc`. That's the rewrite that introduces the `BroadcastTo` `Op` in the first place.
In this case, a `fill` really shouldn't be replaced with a broadcasted view of the fill value; however, we might want a rewrite that turns unmodified (in-place) `alloc`s and/or `fill`s into `BroadcastTo`s.
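A NumPy-level sketch of why the replacement is unsafe: a broadcast view can't stand in for a freshly allocated `fill`/`alloc` result once something later tries to update it in place:

```python
import numpy as np

value = np.array([0.0])

# What the rewrite produced: a read-only broadcast view of the fill value.
# Any downstream in-place update fails (or, worse, would alias `value`).
view = np.broadcast_to(value, (3,))
try:
    view[0] = 1.0
    wrote_to_view = True
except ValueError:
    wrote_to_view = False

# What fill/alloc should produce: a real, writable allocation that can be
# destroyed by in-place optimizations without touching the original value.
alloc = np.full((3,), value[0])
alloc[0] = 1.0

assert not wrote_to_view
assert alloc[0] == 1.0 and value[0] == 0.0
```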
Now that I think about it, we could probably use a simple optimization that removes useless `BroadcastTo`s.
Looks like there are a few more brittle tests (e.g. `TestCrossEntropyCategorical1Hot.test_get_rid_of_advanced_indexing_version_of_xent` and `TestMinMax.test_optimization_max` do the same `len(topo) == x` nonsense).
Also, some of these optimizations might be having trouble because `BroadcastTo` isn't being lifted.
In the case of `TestMinMax.test_optimization_min`, it looks like the optimizations being tested aren't applied because they're looking for `neg(max(neg(...)))`, but now the graph is `neg(max(broadcast_to(neg(...))))`. If we lift the `BroadcastTo` `Op`s, it should fix that.
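The lift is valid because elementwise ops commute with broadcasting. In NumPy terms (a hedged illustration of the rewrite's correctness, not the rewrite itself):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

# neg(max(broadcast_to(neg(x)))) - the shape the graph currently has.
before = -np.max(np.broadcast_to(-x, (4, 3)), axis=1)

# broadcast_to(neg(max(neg(x)))) - BroadcastTo lifted above neg and max,
# restoring the inner neg(max(neg(...))) pattern the optimization matches.
after = np.broadcast_to(-np.max(-x), (4,))

assert np.array_equal(before, after)
```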
Just rebased this PR.