I think it'd be good to take stock of where we stand on cjklib's current codebase. Do we want to use it? As it stands, I can't tell whether I'm failing to grasp the complexities of commingling our data, or whether there are architectural mistakes in it that would be best fixed by a rewrite.
If that is the case, I wonder if you could take some time to document what is what from a data perspective. Here are a few questions that'd be helpful to have answers to:
- In cjklib.data's CSV and SQL files: what are these datasets? How are they used? Are they used in the same way? What data do/can they hold?
More specifically, what are the following:
- edict
- cedict
- cedictgr
- handedict
- cfdict
- unihan
- kanjidic2
and
- cantoneseipainitialfinal
- cantoneseipainitialfinal
- cantoneseyaleinitialnucleuscoda
- cantoneseyalesyllables
- characterdecomposition
- charactershanghaineseipa
- grabbreviation
- grrhotacisedfinals
- grsyllables
- jyutpinginitialfinal
- jyutpingipamapping
- jyutpingsyllables
- jyutpingyalemapping
- kangxiradical
- localecharacterglyph
- mandarinipainitialfinal
- pinyinbraillefinalmapping
- pinyinbrailleinitialmapping
- pinyingrmapping
- pinyininitialfinal
- pinyinipamapping
- pinyinsyllables
- radicalequivalentcharacter
- shanghaineseipasyllables
- strokeorder
- strokes
- Unihan.zip (is this downloaded to here?)
- wadegilesinitialfinal
- wadegilespinyinmapping
- wadegilessyllables
What are the above? Why are some included while others are downloaded remotely? Can we package any/all of the remote data in cjklib? Is it a matter of licensing, or of making sure fresh data gets downloaded?
Which data in the above datasets intersect, and where?
Where the data does intersect, I'm assuming we're massaging it in some way so we can match it to a lookup? Maybe it'd help to have a spreadsheet / table on this?
I think that if we mapped the data we have to a spreadsheet it'd give us all a better view of the picture, imo. Then we can step back from legacy assumptions and be in a better position to make pull requests for larger architecture changes.
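To make the spreadsheet idea concrete, here's a rough sketch of a script that could seed it: it lists every CSV and SQL file in a directory and writes one row per file. The function names, the column choice, and the assumption that blank lines and lines starting with `#` hold no data are all mine, not anything cjklib defines, so treat it as a starting point only.

```python
import csv
from pathlib import Path

FIELDS = ["dataset", "format", "bytes", "content_lines"]


def inventory(data_dir):
    """Return one dict per .csv/.sql file found directly inside data_dir."""
    rows = []
    for path in sorted(Path(data_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix not in (".csv", ".sql"):
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            # Assumption: blank lines and '#' comment lines carry no data.
            content = [
                line for line in handle
                if line.strip() and not line.lstrip().startswith("#")
            ]
        rows.append({
            "dataset": path.stem,
            "format": suffix.lstrip("."),
            "bytes": path.stat().st_size,
            "content_lines": len(content),
        })
    return rows


def write_inventory(rows, out):
    """Write the inventory rows as CSV to an open text file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Calling `write_inventory(inventory(some_dir), sys.stdout)` on the data directory would give us a CSV we can open in a spreadsheet and then annotate by hand with source, license, and which other datasets each one overlaps with.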
I realize the above is pretty time-consuming. Think you could take a stab at it, though?