Ensure there are no filesystem naming collisions for folders #2

@renoirb

Description

If an export is run from a UNIX server that has a case-sensitive filesystem, the import process may let through folders that have the same name but different casing.

For example, imagine we have the following Wiki page "titles" exposed as URLs under https://docs.webplatform.org/wiki/:

  1. tutorial/Information Architecture/Planning out a website
  2. tutorial/Information Architecture/ja (notice the lowercase "i", and the last part of the URL. It denotes a Japanese translation. It's currently the only way WebPlatform handles localization.)
  3. tutorial/information_architecture
  4. concepts/IA/planning a website

When we run the export script, which already normalizes names for the filesystem, we end up with the following folder and file hierarchy:

  1. tutorial/Information_Architecture/Planning_out_a_website/index.html
  2. tutorial/Information_Architecture/ja.html
  3. tutorial/information_architecture/index.html
  4. concepts/IA/planning_a_website/index.html

Notice that the tutorial/ folder will contain two entries that differ only by case: "Information_Architecture" and "information_architecture". This is not a problem on a case-sensitive filesystem, but on a case-insensitive one the two names collide.
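
To make the collision concrete, here is a minimal Python sketch that flags folder paths differing only by case. The function name find_case_collisions is hypothetical, and str.casefold() is only an approximation of how a case-insensitive filesystem compares names. Only directory prefixes are checked, not the final file name.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return folder prefixes that differ only by letter case."""
    seen = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Walk every directory prefix; the last part is the file itself.
        for depth in range(1, len(parts)):
            prefix = "/".join(parts[:depth])
            seen[prefix.casefold()].add(prefix)
    # Keep only the case-folded keys that map to more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in seen.items()
        if len(variants) > 1
    }


# The exported hierarchy listed above.
exported = [
    "tutorial/Information_Architecture/Planning_out_a_website/index.html",
    "tutorial/Information_Architecture/ja.html",
    "tutorial/information_architecture/index.html",
    "concepts/IA/planning_a_website/index.html",
]
collisions = find_case_collisions(exported)
```

With the four paths from this issue, the only reported collision should be the pair of folders under tutorial/.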

We have to make sure we store content without creating this problem.

Expected outcome

During import, do the following:

  1. For each wiki page, get the Wiki page "title" (e.g. tutorial/Information_Architecture/Planning_out_a_website)
  2. Normalize the title:
    1. replace any special characters (e.g. ?, !, :, @, (, ), space, etc.) (N.B. Yes, we do have these)
    2. strip anything that is not in the US-ASCII character set
  3. Create an associative map of paths:
    1. Split the title on /, and assign the resulting array to a paths variable (e.g. ['tutorial', 'Information_Architecture', 'Planning_out_a_website'])
    2. Add each member to an associative map keyed by position, so that all members at index 0 are grouped together, likewise for index 1, and so on.
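
The steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual import code: the names normalize_segment and build_position_map are hypothetical, the issue does not say what special characters are replaced with (an underscore is assumed here, matching the exported names above), and each segment is normalized after splitting so that the / separator is preserved.

```python
import re
from collections import defaultdict


def normalize_segment(segment):
    """Normalize one path segment of a wiki page title."""
    # Strip anything outside US-ASCII first, so those characters
    # are dropped rather than turned into underscores.
    ascii_only = segment.encode("ascii", "ignore").decode("ascii")
    # Assumption: runs of special characters become a single "_".
    replaced = re.sub(r"[^A-Za-z0-9_.-]+", "_", ascii_only)
    return replaced.strip("_")


def build_position_map(titles):
    """Group normalized segments by their index in the path."""
    by_index = defaultdict(set)
    for title in titles:
        paths = [normalize_segment(part) for part in title.split("/")]
        for index, member in enumerate(paths):
            by_index[index].add(member)
    return by_index


titles = [
    "tutorial/Information Architecture/Planning out a website",
    "tutorial/Information Architecture/ja",
    "tutorial/information_architecture",
    "concepts/IA/planning a website",
]
position_map = build_position_map(titles)
```

Grouping by index keeps same-depth names side by side, so "Information_Architecture" and "information_architecture" both land in the set for index 1, where they can be compared.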

Note that this part of the problem handles only the file name. We’ll have to set up a configuration file that takes care of serving the right file, even when the name of the file and the URL aren’t exactly the same.

Expected deliverables

  • Sorted list of all words that are part of a URL, taken from pages that aren’t deleted/redirected at their last revision (url_parts.txt)
  • Steps
    • From the list, find words that appear more than once with different casing
    • Put a rewrite filter in place to enforce consistent names. It should only affect the filename and the destination of a redirect (if applicable)
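
As a rough sketch of the two steps, the Python below groups words by their case-folded form and maps every minority spelling to one canonical spelling. The function name build_rewrite_map is hypothetical, and so is the rule for picking the canonical form (the most frequent spelling wins, ties broken alphabetically). The issue does not specify either.

```python
from collections import Counter, defaultdict


def build_rewrite_map(words):
    """Map inconsistent spellings of a URL part to one canonical form."""
    counts = Counter(words)
    variants = defaultdict(list)
    for word in counts:
        variants[word.casefold()].append(word)

    rewrite = {}
    for group in variants.values():
        if len(group) < 2:
            continue  # only one casing in use, nothing to rewrite
        # Assumption: the most frequent spelling is canonical;
        # ties are broken by plain string ordering.
        canonical = min(group, key=lambda w: (-counts[w], w))
        for word in group:
            if word != canonical:
                rewrite[word] = canonical
    return rewrite


# Hypothetical sample of what url_parts.txt might contain.
url_parts = [
    "IA",
    "Information_Architecture",
    "Information_Architecture",
    "concepts",
    "information_architecture",
    "tutorial",
    "tutorial",
    "tutorial",
]
rewrite = build_rewrite_map(url_parts)
```

If url_parts.txt holds one word per line (an assumption about its format), the list could be read with something like `open("url_parts.txt").read().split()` before being passed to the function.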
