Skip to content

Implement simple lexicograpic sorting#271

Closed
gunchleoc wants to merge 9 commits intosingularity:masterfrom
gunchleoc:lexicographic-sorting
Closed

Implement simple lexicograpic sorting#271
gunchleoc wants to merge 9 commits intosingularity:masterfrom
gunchleoc:lexicographic-sorting

Conversation

@gunchleoc
Copy link
Contributor

This adds a new dependency on unidecode.

Real lexicographic sorting is locale-dependent, but I couldn't find a library that can do that. There is https://github.com/miurahr/unihandecode but that's for Chinese script only and will crash on other locales.

Screenshot of the problem - "Ìsleachadh" on the knowledge screen should be sorted next to "Innleachd", not appear on the bottom:

string-order

And a screenshot with this branch:
correct_sorting

This adds a new dependency on unidecode.
Copy link
Member

@nthykier nthykier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at improving the locale support in singularity; you are really going above and beyond on this. :)

I have added to remarks for changes that I would like to see resolved before we merge this. But otherwise, solid work. :)

README.txt Outdated
* Python 3 (https://python.org/download/)
* pygame (https://www.pygame.org/download.shtml)
* NumPy (https://www.scipy.org/install.html)
* unidecode (https://pypi.org/project/Unidecode/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that we use the built-in locale.strxfrm (or its counterpart locale.strcoll) by default - possibly ICU as an optional alternative, which is allegedly the only truly reliable way to sort strings locale-aware (notably, there is trouble with locale.strxfrm on Mac OS).

Example of how to use ICU is at: https://stackoverflow.com/questions/11121636/sorting-list-of-string-with-specific-locale-in-python

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for those libraries - I found this difficult to search for, since must search results for "lexicographic sorting" will just give you sorting by unicode code point, which is fine for English.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have switched to ICU. Note that none of the libraries (I tested all of them) can handle correct sorting order for German umlauts, so I have added some special code for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at it. It looks like ICU is now mandatory rather than optional as I had hoped for. That is the only bit that is missing for me. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the packaging info now - I already had ICU installed, so I didn't notice.

This branch can't be tested right now because I merged master, which is broken.

@nthykier
Copy link
Member

nthykier commented Jun 7, 2020 via email

@gunchleoc
Copy link
Contributor Author

No problem, things happen.

This branch should be good to go now if you're OK with using ICU. I don't have a Mac though to test if the installation of ICU for Python works.

@nthykier
Copy link
Member

Thanks.

AFAICT, there are two missing bits:

  1. It will need a rebase (github says it has conflicts)
  2. It uses ICU unconditionally; I am fine with it being an optional add-on/feature but I am not sure we are ready to commit to ICU as a strict dependency of singularity.

@gunchleoc
Copy link
Contributor Author

I'm not fond of conditional stuff, since it will give us 2 different code paths and could lead to difficulty with bug fixing.

I could try switching to locale.strxfrm (or its counterpart locale.strcoll) instead and Mac users will just have to wait until the Python programming language fixes their bug.

@nthykier
Copy link
Member

Ok. :) Ping me when you have had time to migrate to locale.strxfrm / locale.strcoll and test the things you want to test.

@gunchleoc
Copy link
Contributor Author

Ok. :) Ping me when you have had time to migrate to locale.strxfrm / locale.strcoll and test the things you want to test.

@nthykier Done :)

Best squash the commits when merging.

@nthykier nthykier closed this in 194f7ce Jun 20, 2020
@gunchleoc gunchleoc deleted the lexicographic-sorting branch June 21, 2020 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants