State of the cjklib / understanding our datasets #3

@tony

Description

I think it'd be good to get a picture of where we stand on cjklib in terms of its current codebase. Do we want to use it? As it stands, I'm not sure whether I'm failing to grasp the complexities of commingling our data, or whether there are architectural mistakes within it that would make a rewrite the best option.

If that is the case, I wonder if you could take some time to document what is what from a data perspective. Here are a few questions it'd be helpful to have answers to:

  • In cjklib.data's CSV and SQL files, what are these datasets? How are they used? Are they used in the same way? What data do/can they hold?

More specifically, what are the following:

  • edict
  • cedict
  • cedictgr
  • handedict
  • cfdict
  • unihan
  • kanjidic2

and

  • cantoneseipainitialfinal
  • cantoneseyaleinitialnucleuscoda
  • cantoneseyalesyllables
  • characterdecomposition
  • charactershanghaineseipa
  • grabbreviation
  • grrhotacisedfinals
  • grsyllables
  • jyutpinginitialfinal
  • jyutpingipamapping
  • jyutpingsyllables
  • jyutpingyalemapping
  • kangxiradical
  • localecharacterglyph
  • mandarinipainitialfinal
  • pinyinbraillefinalmapping
  • pinyinbrailleinitialmapping
  • pinyingrmapping
  • pinyininitialfinal
  • pinyinipamapping
  • pinyinsyllables
  • radicalequivalentcharacter
  • shanghaineseipasyllables
  • strokeorder
  • strokes
  • Unihan.zip (is this downloaded here?)
  • wadegilesinitialfinal
  • wadegilespinyinmapping
  • wadegilessyllables

What are the above? Why are some included while others are downloaded remotely? Can we package any/all of the remote data in cjklib? Is it a matter of licensing, or of ensuring that fresh data is downloaded?
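For what it's worth, one quick way to see what a built database actually ends up containing is to query SQLite's catalogue table. This is only a sketch: the two tables below are stand-ins (their names and columns are my assumption, not the real schemas), and since the path to a real cjklib build varies by install, it connects to an in-memory database instead.

```python
import sqlite3

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all user tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]

# Demo against an in-memory database with two stand-in tables; against a
# real build you would connect to the cjklib SQLite file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PinyinSyllables (Pinyin TEXT)")
conn.execute("CREATE TABLE KangxiRadical (RadicalIndex INTEGER, Form TEXT)")
print(list_tables(conn))  # → ['KangxiRadical', 'PinyinSyllables']
```

Running this against the real build would at least tell us which of the files above become tables and which stay as downloaded artifacts.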

What data in the above datasets intersects, and where?

Where the data intersects, I'm assuming we're often massaging it in some sense so we can match it to a lookup? Maybe it'd help to have a spreadsheet / table of this?
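As a toy illustration of the kind of intersection check I mean, here is a sketch that computes the overlap of one shared column across two datasets. The inline data and the `Pinyin` / `GR` column names are assumptions for the example, not the actual cjklib.data schemas, which is exactly the thing I'd like documented.

```python
import csv
import io

# Toy stand-ins for e.g. pinyinsyllables and pinyingrmapping.
pinyin_syllables = "Pinyin\nma\nmo\nme\n"
pinyin_gr = "Pinyin,GR\nma,ma\nmi,mi\n"

def column(csv_text: str, name: str) -> set[str]:
    """Collect the values of one named column from CSV text."""
    return {row[name] for row in csv.DictReader(io.StringIO(csv_text))}

# Which syllables appear in both datasets?
shared = column(pinyin_syllables, "Pinyin") & column(pinyin_gr, "Pinyin")
print(sorted(shared))  # → ['ma']
```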

I think that if we mapped the data we have to a spreadsheet it'd offer us all a better view of the picture, imo. Then we could step back from legacy assumptions and be in a better position to make pull requests for larger architectural changes.
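If it helps, the raw material for such a spreadsheet could probably be generated mechanically, one row per dataset with its header columns and row count. A sketch, run here on throwaway files; pointing it at the real cjklib data directory (and its actual filenames) is left as an assumption.

```python
import csv
from pathlib import Path
from tempfile import TemporaryDirectory

def summarize_csvs(data_dir: Path) -> list[tuple[str, int, str]]:
    """One row per CSV file: (name, data row count, comma-joined header)."""
    summary = []
    for path in sorted(data_dir.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        summary.append((path.name, len(data), ", ".join(header)))
    return summary

# Demo on two throwaway files with made-up columns; pointed at the real
# cjklib data directory this would produce the overview table.
with TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "grsyllables.csv").write_text("GR\nma\nmi\n", encoding="utf-8")
    (d / "strokes.csv").write_text(
        "ChineseCharacter,StrokeCount\nyi,1\n", encoding="utf-8"
    )
    for row in summarize_csvs(d):
        print(row)
```

The output rows could be pasted straight into the shared spreadsheet as a starting point.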

I realize the above is a pretty time-consuming ask; do you think you could take a stab at it, though?
