word tokenizer works only for 7-bit ASCII

the word tokenizer for `WordDiff` and `WordWithSpaceDiff` uses `\b` in its regular expression. in JavaScript, `\b` treats only `[a-zA-Z0-9_]` as word characters, so any letter outside 7-bit ASCII counts as a non-word character and words get split around it.

for example, the german phrase "wir üben" is split into:

```
'wir üben'.split(/\b/);
// -> ["wir", " ü", "ben"]
```

replacing the tokenizer with `value.split(/(\s+)/)` is sufficient for my use case, but my text contains no newlines, so that case is untested. i think some further testing is needed.
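
a minimal sketch of that whitespace-based replacement, to make the behaviour concrete. the function name `tokenizeByWhitespace` is hypothetical and not part of the library; only the `split(/(\s+)/)` call comes from the suggestion above. the second call shows what happens with a newline, since `\s` also matches `\n`.

```javascript
// Hypothetical helper, not the library's actual tokenizer.
function tokenizeByWhitespace(value) {
  // The capturing group keeps each whitespace run as its own token,
  // so joining the tokens reproduces the input exactly.
  return value.split(/(\s+)/);
}

const tokens = tokenizeByWhitespace('wir üben');
// tokens is ["wir", " ", "üben"]

const multiline = tokenizeByWhitespace('wir üben\njeden tag');
// multiline is ["wir", " ", "üben", "\n", "jeden", " ", "tag"]
```

one trade-off of this approach: punctuation stays attached to the neighbouring word (`"üben,"` is a single token), and input that starts with whitespace yields an empty first token, so it is not a drop-in equivalent of a `\b` split.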

further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620

