Description
If an export is run from a UNIX server that has a case-sensitive filesystem, the import process may let through folders that have the same name but different casing.
For example, imagine we have the following URLs exposed on https://docs.webplatform.org/wiki/tutorial/Information Architecture, given here by Wiki page "title":
- tutorial/Information Architecture/Planning out a website
- tutorial/Information Architecture/ja (notice the lowercase "i", and the last part of the URL. It denotes a Japanese translation. It's currently the only way WebPlatform handles localization.)
- tutorial/information_architecture
- concepts/IA/planning a website
When we run the export script, which already converts titles to filesystem-safe names, we end up with the following folder and file hierarchy:
- tutorial/Information_Architecture/Planning_out_a_website/index.html
- tutorial/Information_Architecture/ja.html
- tutorial/information_architecture/index.html
- concepts/IA/planning_a_website/index.html
Notice that the tutorial/ folder contains the same name twice with different casing: "Information_Architecture" and "information_architecture". This is not a problem on a case-sensitive filesystem, but it is on one that isn't, because both names resolve to the same folder.
We have to make sure we store content without creating this problem.
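The collision described above can be detected before anything is written to disk. The following is a minimal sketch, not the export script's actual code: `find_case_collisions` is a hypothetical helper that lowercases every folder prefix and reports the prefixes that occur with more than one casing. It checks folder names only, not file names.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group folder prefixes that differ only by letter casing."""
    variants = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Check every directory prefix; the last part is the file itself.
        for depth in range(1, len(parts)):
            prefix = "/".join(parts[:depth])
            variants[prefix.lower()].add(prefix)
    # Keep only the prefixes that appear with more than one casing.
    return {key: sorted(names) for key, names in variants.items() if len(names) > 1}


exported = [
    "tutorial/Information_Architecture/Planning_out_a_website/index.html",
    "tutorial/Information_Architecture/ja.html",
    "tutorial/information_architecture/index.html",
    "concepts/IA/planning_a_website/index.html",
]
collisions = find_case_collisions(exported)
print(collisions)
```

With the hierarchy from the example, the only entry reported is the `tutorial/information_architecture` prefix, with both of its casings.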
Expected outcome
During import, do the following:
- For each wiki page, get the Wiki page "title" (e.g. tutorial/Information_Architecture/Planning_out_a_website)
- Normalize the title:
  - replace any special characters (e.g. ?, !, :, @, (, ), space, etc.) (N.B. Yes, we do have this)
  - strip anything not from the US-ASCII character set
- Create an associative map of paths:
  - Split the title by /, assign the new array to a paths variable (e.g. ['tutorial', 'Information_Architecture', 'Planning_out_a_website'])
  - Send each member to an associative map so that everything at index 0 is grouped together, same for index 1, and so on.
Note that this part of the problem handles only the file name. We'll have to set up a configuration file that takes care of serving the right file, even though the name of the file and the URL aren't exactly the same.
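The import steps listed above could be sketched as follows. This is an illustration under assumptions, not the project's implementation: `normalize_title` and `build_paths_map` are hypothetical names, each run of special characters is assumed to become a single underscore (as spaces do in the example hierarchy), and ".", "-" and "_" are assumed to be safe to keep.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Return a filesystem-friendly version of a wiki page title."""
    # Strip anything outside the US-ASCII character set.
    ascii_only = title.encode("ascii", "ignore").decode("ascii")
    # Assumption: each run of special characters becomes one underscore;
    # "/" is kept because it separates the path segments.
    return re.sub(r"[^A-Za-z0-9_/.-]+", "_", ascii_only)


def build_paths_map(titles):
    """Map each path index (0, 1, 2, ...) to the set of names seen there."""
    by_index = defaultdict(set)
    for title in titles:
        paths = normalize_title(title).split("/")
        for index, part in enumerate(paths):
            by_index[index].add(part)
    return by_index


titles = [
    "tutorial/Information Architecture/Planning out a website",
    "tutorial/Information Architecture/ja",
    "tutorial/information_architecture",
    "concepts/IA/planning a website",
]
paths_map = build_paths_map(titles)
for index in sorted(paths_map):
    print(index, sorted(paths_map[index]))
```

Grouping names by index puts "Information_Architecture" and "information_architecture" in the same set (index 1), which is where a casing comparison can then be made.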
Expected deliverables
- Sorted list of all words that are part of a URL, taken from pages that aren't deleted/redirected at their last revision (url_parts.txt)
- Steps
  - From the list, find words that are used more than once but not always with the same casing
  - Put in place a rewrite filter to enforce consistent names. It should only affect the filename, and the destination of a redirect (if applicable)
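The two steps above might look like the sketch below. The function names and the sample `url_parts` list are hypothetical, and the rule for picking the canonical spelling (most frequent wins, ties go to the spelling that sorts first) is an assumption; the issue only asks that names end up consistent.

```python
from collections import Counter, defaultdict


def find_casing_variants(words):
    """Return {lowercased word: Counter of spellings} for words with 2+ casings."""
    spellings = defaultdict(Counter)
    for word in words:
        spellings[word.lower()][word] += 1
    return {key: counts for key, counts in spellings.items() if len(counts) > 1}


def build_rewrite_map(words):
    """Map every non-canonical spelling to one canonical spelling."""
    rewrites = {}
    for counts in find_casing_variants(words).values():
        # Assumption: the most frequent spelling wins; ties go to the
        # spelling that sorts first.
        canonical = min(counts, key=lambda w: (-counts[w], w))
        for spelling in counts:
            if spelling != canonical:
                rewrites[spelling] = canonical
    return rewrites


def rewrite_path(path, rewrites):
    """Apply the rewrite map to each segment of a file path."""
    return "/".join(rewrites.get(part, part) for part in path.split("/"))


# Hypothetical contents of url_parts.txt, one URL part per entry.
url_parts = [
    "tutorial", "Information_Architecture", "Planning_out_a_website",
    "tutorial", "Information_Architecture", "ja",
    "tutorial", "information_architecture",
    "concepts", "IA", "planning_a_website",
]
rewrites = build_rewrite_map(url_parts)
print(rewrites)
print(rewrite_path("tutorial/information_architecture/index.html", rewrites))
```

The same map could be applied to redirect destinations. Merging two folders this way can still make two files land on the same path, so a real filter would need to check for that as well.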