Skip to content

ICU-22941 Revert ICU-22112, untailoring root word break#3249

Merged
eggrobin merged 1 commit intounicode-org:mainfrom
eggrobin:intestine
Nov 5, 2024
Merged

ICU-22941 Revert ICU-22112, untailoring root word break#3249
eggrobin merged 1 commit intounicode-org:mainfrom
eggrobin:intestine

Conversation

@eggrobin
Copy link
Copy Markdown
Member

This brings the colon back into MidLetter (with no tailoring on top of the UCD), instead of its inclusion in MidLetter being an fi & sv tailoring.

Checklist

@eggrobin
Copy link
Copy Markdown
Member Author

@markusicu, I am running into the same problem as in #3028 (comment): the documentation only tells me how to regenerate the old icudata.jar, not the ICU4J .brk files.

Can you perform the same thaumaturgy you did in 169023a? (At some point it would be good to update the documentation, too…)

@eggrobin
Copy link
Copy Markdown
Member Author

@markusicu post-UTW poke

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Nov 4, 2024

@markusicu in fcd04fc I tried copying over the .brk files that get generated when I rebuild on my machine; these don’t seem to work:

java.lang.AssertionError
	at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:574)
	at com.ibm.icu.impl.RBBIDataWrapper.get(RBBIDataWrapper.java:295)

But genbrk.cpp does not seem to have any options for output format, so I don’t understand how I can be generating brk files that work for ICU4C but not for ICU4J.

@markusicu
Copy link
Copy Markdown
Member

I just fetched your branch and ran the rbbi tests in Eclipse. 66 tests, 2 failed. So 64 tests worked :-)

The code fails to load "brkitr/word_fi_sv.brk". ICUBinary.getData() tries to find it two ways but ends up returning null because it's not there, and it doesn't throw an exception because the caller didn't ask for it. This is a bug --> ICU-22960

I see that you updated that .brk file, but I don't see why ICU can't find it :-(

@markusicu
Copy link
Copy Markdown
Member

Oh, wait, you are deleting that file...

@markusicu
Copy link
Copy Markdown
Member

I refreshed all of the ICU4J data on my Linux box. It still fails for me in Eclipse because it still tries to load the word_fi_sv file. I don't see where it still has that registered. Pushing my files to your branch in the hope that my Eclipse is just wedged...

@markusicu
Copy link
Copy Markdown
Member

If this works, then I suspect that updating the res_index.res file did the trick.
The failing BreakIteratorTest.TestT5615() loops over all locales; if there is anything anywhere leftover that refers to the old file then we have a problem.

@markusicu
Copy link
Copy Markdown
Member

I got it to work locally. You deleted the ICU4C brkitr/fi.txt & sv.txt files, but the repo still had the ICU4J .res versions. So when asked for Finnish word breaks, it found & loaded fi.res which referred to the deleted word_fi_sv.res file.

Hopefully this is it.

@markusicu
Copy link
Copy Markdown
Member

Bingo! 🎉
Please squash all but the first commit before you get ready to merge.

@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/res_index.res is no longer changed in the branch
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word_POSIX.brk is different
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word.brk is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Nov 5, 2024

Thanks a lot. I squashed it all, dropping ca87360 (but keeping 081efef) to see if the brk files I generated were fine after all—it looks like it works, so I shouldn’t need to ask you to turn the crank next time.

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Nov 5, 2024

@markusicu As discussed over virtual tea, you might still want to flip the bytes.

@markusicu
Copy link
Copy Markdown
Member

I regenerated the data and pushed the updated files. When the tests pass, please squash again.

Explanation for others: We want to keep the Java data in big-endian format, so that different people generating the data don't flip-flop on no-op data changes, and wonder why what they are doing affects BreakIterator data files.

@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Nov 5, 2024

please squash again

Squished.

@eggrobin eggrobin merged commit 8d86ca1 into unicode-org:main Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants