Use ICU for text search#15858
Conversation
This comment has been minimized.
This comment has been minimized.
| template<typename T> | ||
| T _adjustBackward(T column) const noexcept; | ||
| template<typename T> | ||
| T _adjustForward(T column) const noexcept; |
There was a problem hiding this comment.
These were made into templates so that I can use ptrdiff_t indices with them.
src/buffer/out/search.cpp
Outdated
| _needle(s_CreateNeedleFromString(str)), | ||
| _renderData(renderData), | ||
| _coordAnchor(s_GetInitialAnchor(renderData, direction)) | ||
| // - reverse - True when searching backward/upwards in the buffer |
There was a problem hiding this comment.
After going through all the callers of Search I found that just using bool parameters is more ergonomic. None of the callers, not even conhost, actually used the two enums. They just converted them back/forth.
I would suggest reviewing this file without the diff viewer. It's a mostly new implementation.
| _fpDirectionDownAverage = 0; | ||
| _fpMatchCaseAverage = 0; | ||
| _uiFindNextClickedTotal = 0; | ||
| } |
There was a problem hiding this comment.
It seems no one ever noticed that we passed reverse == true (= upwards_search == true) as the fDirectionDown parameter. In other words, the telemetry for DirectionDownAverage was broken. Since I feel like no one is using this telemetry anyways, I chose to remove it instead of fixing it.
src/types/UiaTextRangeBase.cpp
Outdated
| auto searchAnchor = _start; | ||
| if (searchBackward) | ||
| const std::wstring_view queryText{ text, SysStringLen(text) }; | ||
| const auto results = _pData->GetTextBuffer().SearchText(queryText, ignoreCase != FALSE, _start.y, _end.y + 1); |
There was a problem hiding this comment.
Previously, this code could use the Search class, because Search would stop searching once it found something. The new Search class will go through the entire buffer however.
To fix that we could either add 2 parameters to Search to specify a bounding range for the search or use TextBuffer's function directly. I felt like the latter is a better option, because while the 2 for loops below are verbose and somewhat difficult to understand initially, they're also easier to verify for their correctness IMO.
There was a problem hiding this comment.
@carlos-zamora I didn't quite understand the UiaTextRangeBase::FindText implementation. Could you review that file for me? 🙂 Also, how do I best test that function?
There was a problem hiding this comment.
Reviewing 😊
As for how to test it on your own, you can do that using "Accessibility Insights":
- open A11y insights
- Click on the terminal
- Under "Patterns", Click "Explore..." button next to "TextPattern"
- A "TextPattern Explorer" window should appear
- Click the "..." button next to the "Document" range
- Click "Find..."
This new window should allow you to directly interact with the API much like any screen reader could.
There was a problem hiding this comment.
It seems Accessibility Insights is kinda buggy? I can't really tell how it works, but if you open the Find dialog and ever reach a "not found" situation, it'll continue to say "not found" forever.
After testing it, I think that my implementation works correctly. If I type "eeeee" the find dialog will find all "e"s no matter where they are or whether I use forwards or backwards search. So there don't seem to be any off-by-1 errors in the code.
However, there's one big difference to the old code: My new implementation returns a "not found" result when it has reached the end of the buffer. The old code continued to circle around the buffer if you reached the end. (This didn't always work for the backward search due to the start > _start check. It's supposed to be start >= _start, otherwise it can't find results that are at the very start of the buffer at 0,0.)
Was that intentional? Should the UIA search circle around in the buffer when it has reached either end?
There was a problem hiding this comment.
Previously, this code could use the
Searchclass, becauseSearchwould stop searching once it found something. The newSearchclass will go through the entire buffer however.
Won't this cause a decent performance hit? _getOptimizedBufferSize() used to reduce the search space by ignoring anything past the mutable bottom. Now that we're searching through the whole buffer again, won't FindText('a', ...) be slower. And if the user calls that function repeatedly, the user may have to wait quite a bit before being able to perform the next search? (FYI I haven't verified the performance question via a11y insights yet, so this could all be false)
It seems Accessibility Insights is kinda buggy? I can't really tell how it works, but if you open the Find dialog and ever reach a "not found" situation, it'll continue to say "not found" forever.
You could also use inspect.exe. It's the predecessor to a11y insights and can be found in the Windows SDK's \bin\<version>\<platform> folder. To interact with the API, do the following:
- open inspect.exe
- click on the terminal (the terminal should be highlighted in the UIA tree in inspect.exe)
- Click "Action" > "Text Pattern Explorer..." (a new "Text Explorer" window should appear)
- Click "Range" > "Find..."
Was that intentional? Should the UIA search circle around in the buffer when it has reached either end?
Interesting! I was originally going to say that we should keep the same behavior as before, but it looks like this experience is different from MS Word's! There were two notable differences with your implementation:
- circling the buffer: MS Word doesn't circle the text area whereas we do in Terminal (old and new impl). So, we probably shouldn't do that anymore.
- no more results: MS Word returns something that is interpreted by inspect.exe as "ProcessFind: FindText couldn't find matching text.". But Terminal (new impl) is interpreted as "ProcessFind: MoveEndpointByRange failed. Error -2147024809".
We should align these experiences (which is a bit annoying because your PR wasn't really intended to deal with any of this). If that's easy to do, it'd be best to do that as a part of this PR or in a follow-up before the next release. If not, I think we should keep the old experience, but I know that's not really possible anymore given that the Search class is now completely different. Thoughts?
There was a problem hiding this comment.
Previously, this code could use the
Searchclass, becauseSearchwould stop searching once it found something. The newSearchclass will go through the entire buffer however.Won't this cause a decent performance hit?
_getOptimizedBufferSize()used to reduce the search space by ignoring anything past the mutable bottom.
The new ICU-based searcher can search the entire text buffer in 4ms.
It also avoids searching any part of the text buffer that has not been "committed" to memory. :)
There was a problem hiding this comment.
ICU also has usearch_previous. We could use it for UIA specifically since it doesn't need the total number of results unlike WT (which is why it uses uregex because it's way way faster when searching for all results). Just like uregex it's very easy to use. I can make it fit. 🙂
And I'll try to make our UIA function behave like Word.
There was a problem hiding this comment.
Nevermind, it seems that searching through a character-iterator is a C++ API feature and Windows only ships the C API. Well it's probably fine. I feel like this is not going to be a bottleneck no matter what we do.
Does it just so happen to work across wrapped lines now? @asklar literally reported that to me on Teams this weekend 😄 |
|
Oh wait, I think you meant a different problem, right? I think that's something like |
zadjii-msft
left a comment
There was a problem hiding this comment.
leaving notes before lunch
| ROW& GetScratchpadRow(const TextAttribute& attributes); | ||
| const ROW& GetRowByOffset(til::CoordType index) const; | ||
| ROW& GetRowByOffset(til::CoordType index); | ||
| ROW& GetMutableRowByOffset(til::CoordType index); |
There was a problem hiding this comment.
seems like a ton of this diff is just this rename
There was a problem hiding this comment.
Yeah. But it's a necessary change because many callers have a TextBuffer& but only need a const ROW&. (TextBufferCellIterator for instance is widely used and does this.) 😟
src/buffer/out/search.cpp
Outdated
| _results = textBuffer.SearchText(str, caseInsensitive); | ||
| _mutationCount = textBuffer.GetMutationCount(); | ||
|
|
There was a problem hiding this comment.
presumably, the caller has a write lock on the text buffer so the MutationCount can't change between the search and getting the count, but figured I'd mention it here. If there's a way to guarantee that, it might be worthwhile.
There was a problem hiding this comment.
We could only assert that the lock is being held, but I think that this might be better as a separate PR that does this for all the parts that implicitly rely on the console lock being held. The mutation count could be made atomic, but this won't realistically improve anything, because you need to hold the lock anyways to initiate any kind of search (either to do a new search or to select stuff in the buffer).
src/buffer/out/Row.cpp
Outdated
| const til::CoordType columnCount = _columnCount; | ||
| const til::CoordType scale = _lineRendition != LineRendition::SingleWidth; | ||
| const til::CoordType padding = _doubleBytePadded; | ||
| return (columnCount - padding) >> scale; |
There was a problem hiding this comment.
I'm surprised padding comes before un-scaling, but I think it makes sense? A double-byte padded character that doesn't fit on a double-width row didn't fit because it took up 4 columns.. hm.
There was a problem hiding this comment.
Yeah, I was really unsure in what order these need to be, and I still am a little.
Thinking about it some more, does it have to be like this?
const auto padding = _doubleBytePadded << scale;
return (columnCount - padding) >> scale;There was a problem hiding this comment.
That would be the same as...
return (columnCount >> scale) - padding;right?
There was a problem hiding this comment.
Not quite since the rounding is different. In your code it would subtract 2 no matter what. For instance, imagine padding is 2 and columnCount is 11. In my code snippet it would return 4 and with yours it would return 3. I think 4 is correct because 4*2+3 = 11 (3 being the max. wide glyph padding).
|
|
||
| // This structure is basically an inverse of ROW::_charOffsets. If you have a pointer | ||
| // into a ROW's text this class can tell you what cell that pointer belongs to. | ||
| struct CharToColumnMapper |
There was a problem hiding this comment.
nobody consumes this externally, do they? Can it be a forward declaration?
There was a problem hiding this comment.
I was planning to use this class in other parts of the code base in the future (the renderer primarily).
| <ClCompile Include="..\precomp.cpp"> | ||
| <PrecompiledHeader>Create</PrecompiledHeader> | ||
| </ClCompile> | ||
| <ClCompile Include="..\UTextAdapter.cpp" /> |
There was a problem hiding this comment.
update sources! and probably the sources for the UT too.z
| _searcherGoForward = goForward; | ||
| _searcherCaseSensitive = caseSensitive; | ||
| } |
There was a problem hiding this comment.
Assuming you don't want to make these getters on Search?
There was a problem hiding this comment.
After thinking about it some more, I'd like to do that a different time if anything. Other editors reset the search when there's any change whatsoever. Technically we shouldn't need to track these members - we should just reset the search whenever any field in the dialog is changed. I just haven't done the work for that.
src/buffer/out/textBuffer.hpp
Outdated
|
|
||
| TextAttribute _currentAttributes; | ||
| til::CoordType _firstRow = 0; // indexes top row (not necessarily 0) | ||
| uint64_t _mutationCount = 0; |
There was a problem hiding this comment.
I can fathom a textbuffer having more than 900 quin-zill-trillion mutations. Is this a count, or an opaque number? If it's an opaque number, which is not orderable, wraparound is implicitly OK.
If somebody is going to use > on it, we're in trouble.
| if (text.size() == 0) | ||
| auto lock = _terminal->LockForWriting(); | ||
|
|
||
| if (_searcher.IsStale() || _searcherText != text || _searcherGoForward != goForward || _searcherCaseSensitive != caseSensitive) |
There was a problem hiding this comment.
hmm. we need some way to tell the searcher that it is stale when we switch into the alt buffer.
There was a problem hiding this comment.
The alt buffer will not have the same mutation count.
There was a problem hiding this comment.
There's a one in 900 quin-jillion-millio nchance that it will, but... fair.
There was a problem hiding this comment.
Yeah, I'll take that chance! 😄 I mean the worst-case scenario is that the search finds results that are stale.
@lhecker yes this sounds about right - will this PR address hyperlinks being broken up across lines (and launching to the wrong url)? if not, is there a bug tracking that already? |
This comment has been minimized.
This comment has been minimized.
DHowett
left a comment
There was a problem hiding this comment.
I love it. The mutable row stuff is great, and Search is completely understandable by a single human now. Thank you.
I need @carlos-zamora to sign off on the UIA changes -- my biggest concern is searching backwards off a previous search (?) because of accidental off-by-one errors. I am not sure how that API is supposed to work.
Blocking until Carlos signs, but otherwise I am a ✔️
src/buffer/out/Row.cpp
Outdated
| const til::CoordType columnCount = _columnCount; | ||
| const til::CoordType scale = _lineRendition != LineRendition::SingleWidth; | ||
| const til::CoordType padding = _doubleBytePadded; | ||
| return (columnCount - padding) >> scale; |
| // | ||
| // @param ut the UText to get the length of. | ||
| // @return the length, in the native units of the original text string. | ||
| static int64_t U_CALLCONV utextNativeLength(UText* ut) noexcept |
There was a problem hiding this comment.
should these functions return a failure if the underlying buffer was mutated since the utext adapter was created? Or is that not related to the purpose of these guys?
There was a problem hiding this comment.
I feel like that this would provide a false sense of security that people might end up relying on. Unlocking the console at all during a search is likely a bug or at least a bad idea because it's pretty error prone (not all state relevant to searching is contained in TextBuffer after all - selections for instance). I think we could make it an assertion? Not sure if it should be assert() or WI_ASSERT(). Probably the former, because searching without holding the lock is still somewhat predictable.
|
|
||
| // An excerpt from the ICU documentation: | ||
| // | ||
| // Extract text from a UText into a UChar buffer. |
There was a problem hiding this comment.
No. I'll keep the code but remove it from the utextFuncs struct and add a comment that the correctness of the function hasn't been verified yet. I don't think it's worth it to finish that function, because we don't need it right now. But I do think it's worth keeping because the code has already been written and is probably 99% correct (if not 100%). I initially thought I had to implement it so that uregex can work, but it seems I was wrong about that. It uses the ut->chunkContents directly (via UTEXT_NEXT32 and friends).
Basically same here. I think I've got like 3 files left but I didn't want to ✅ without him reading that (cause obviously no one knows how that works) |
src/types/UiaTextRangeBase.cpp
Outdated
| auto searchAnchor = _start; | ||
| if (searchBackward) | ||
| const std::wstring_view queryText{ text, SysStringLen(text) }; | ||
| const auto results = _pData->GetTextBuffer().SearchText(queryText, ignoreCase != FALSE, _start.y, _end.y + 1); |
There was a problem hiding this comment.
| const auto results = _pData->GetTextBuffer().SearchText(queryText, ignoreCase != FALSE, _start.y, _end.y + 1); | |
| const auto results = _pData->GetTextBuffer().SearchText(queryText, !!ignoreCase, _start.y, _end.y + 1); |
I think we do !! generally. I'm not really attached to this one though. Idk if one is better than the other.
src/types/UiaTextRangeBase.cpp
Outdated
| auto searchAnchor = _start; | ||
| if (searchBackward) | ||
| const std::wstring_view queryText{ text, SysStringLen(text) }; | ||
| const auto results = _pData->GetTextBuffer().SearchText(queryText, ignoreCase != FALSE, _start.y, _end.y + 1); |
There was a problem hiding this comment.
Previously, this code could use the
Searchclass, becauseSearchwould stop searching once it found something. The newSearchclass will go through the entire buffer however.
Won't this cause a decent performance hit? _getOptimizedBufferSize() used to reduce the search space by ignoring anything past the mutable bottom. Now that we're searching through the whole buffer again, won't FindText('a', ...) be slower. And if the user calls that function repeatedly, the user may have to wait quite a bit before being able to perform the next search? (FYI I haven't verified the performance question via a11y insights yet, so this could all be false)
It seems Accessibility Insights is kinda buggy? I can't really tell how it works, but if you open the Find dialog and ever reach a "not found" situation, it'll continue to say "not found" forever.
You could also use inspect.exe. It's the predecessor to a11y insights and can be found in the Windows SDK's \bin\<version>\<platform> folder. To interact with the API, do the following:
- open inspect.exe
- click on the terminal (the terminal should be highlighted in the UIA tree in inspect.exe)
- Click "Action" > "Text Pattern Explorer..." (a new "Text Explorer" window should appear)
- Click "Range" > "Find..."
Was that intentional? Should the UIA search circle around in the buffer when it has reached either end?
Interesting! I was originally going to say that we should keep the same behavior as before, but it looks like this experience is different from MS Word's! There were two notable differences with your implementation:
- circling the buffer: MS Word doesn't circle the text area whereas we do in Terminal (old and new impl). So, we probably shouldn't do that anymore.
- no more results: MS Word returns something that is interpreted by inspect.exe as "ProcessFind: FindText couldn't find matching text.". But Terminal (new impl) is interpreted as "ProcessFind: MoveEndpointByRange failed. Error -2147024809".
We should align these experiences (which is a bit annoying because your PR wasn't really intended to deal with any of this). If that's easy to do, it'd be best to do that as a part of this PR or in a follow-up before the next release. If not, I think we should keep the old experience, but I know that's not really possible anymore given that the Search class is now completely different. Thoughts?
zadjii-msft
left a comment
There was a problem hiding this comment.
I'm giving this a ✅ on the assumption that the UIA stuff is fine. I don't feel like any of these comments are blocking. And this makes searchv2 so much nicer
src/buffer/out/Row.cpp
Outdated
| auto mapper = _createCharToColumnMapper(offset); | ||
| return mapper.GetLeadingColumnAt(offset); |
There was a problem hiding this comment.
it seems weird to me that you need to pass the offset into the ctor (via _createCharToColumnMapper), then immediately just call Get*ColumnAt(offset) with the same offset. I don't think I'll block over that but it feels weird
There was a problem hiding this comment.
_createCharToColumnMapper only uses the offset to guess the initial column. GetLeadingColumnAt will then actually compute the proper offset. This approach avoids GetLeadingColumnAt being O(n) most of the time.
The reason the CharToColumnMapper exists as a separate class is because I want to use it for DirectWrite to map the character positions that DirectWrite returns directly back to columns. Instead, right now, we got a really ugly and inefficient approach. You can see it if you search for bufferLineColumn.
| } | ||
| } | ||
|
|
||
| // Interns URegularExpression instances so that they can be reused. This method is thread-safe. |
There was a problem hiding this comment.
Okay so I thinks I get what this does, but it might deserve a bit of a longer comment than this.
If I'm not mistaken, this basically creates a process-wide cache of unique_uregexs. That way, if someone comes asking for a particular regex, (on any thread), we can hand back the pre-compiled version of it, if we've already seen it. There's also some affordance for flushing out old cache entries, if there's more than 128.
There was a problem hiding this comment.
This is a weird one. In the compiler/language runtime field, it pretty much means means "caching". You almost always only hear it in the context of strings (e.g. "string interning").
There was a problem hiding this comment.
TIL i guess. Feel free to ignore me
There was a problem hiding this comment.
I turned it into a small class so we can more easily extract it in the future. I left it in the same place though because I still feel like it's better to keep this relatively "new" code close to where we're going to add custom patterns for the first time soon.
| const auto guard = shared.lock.lock_shared(); | ||
| if (const auto it = shared.set.find(pattern); it != shared.set.end()) | ||
| { | ||
| return ICU::unique_uregex{ uregex_clone(it->second.re.get(), &status) }; |
There was a problem hiding this comment.
does this need to get std::move'd on the way out? I always forget how exactly move semantics work
(and below too)
There was a problem hiding this comment.
Returned objects are always being moved, but with much stronger guarantees than regular std::move. return will behave as if there was no function call at all. In other words, returning a std::string by value is like constructing that string in the caller directly.
std::move is basically just a T& with extra syntactic sugar. So, if you return std::move(string); it will actually be forced to call the string(string&&) constructor on the way out, because you're not returning an object value anymore - you're returning a reference which now needs to be converted to a value. That's why one should never return std::move.
The reason it can be seen anyways sometimes is because when you have code like this:
struct StringWrapper {
// implicit constructor!
StringWrapper(std::string&& str) {}
};
StringWrapper foobar() {
std::string str = "...";
// calls the implicit constructor
return std::move(str);
}getFontCollection() in FontCache.h is an example for that.
This comment has been minimized.
This comment has been minimized.
| // | ||
| // An alternative approach would be to not make this method thread-safe and give each | ||
| // Terminal instance its own cache. I'm not sure which approach would've been better. | ||
| ICU::unique_uregex Intern(const std::wstring_view& pattern) |
There was a problem hiding this comment.
FYI suppress whitespace changes. The conversion to a class barely changed anything.
This is a resurrection of #8588. That PR became painfully stale after the `ControlCore` split. Original description: > ## Summary of the Pull Request > This is a PoC for: > * Search status in SearchBox (aka number of matches + index of the current match) > * Live search (aka search upon typing) > ## Detailed Description of the Pull Request / Additional comments > * Introduced this optionally (global setting to enable it) > * The approach is following: > * Every time the filter changes, enumerate all matches > * Upon navigation just take the relevant match and select it > I cleaned it up a bit, and added support for also displaying the positions of the matches in the scrollbar (if `showMarksOnScrollbar` is also turned on). It's also been made SUBSTANTIALLY easier after #15858 was merged. Similar to before, searching while there's piles of output running isn't _perfect_. But it's pretty awful currently, so that's not the end of the world. Gifs below. * closes #8631 (which is a bullet point in #3920) * closes #6319 Co-authored-by: Don-Vito <[email protected]> --------- Co-authored-by: khvitaly <[email protected]>
…row (#16775) ## Summary of the Pull Request URL detection was broken again in #15858. When the regex matched, we calculate the column(cell) by its offset, we use forward or backward iteration of the column to find the correct column that displays the glyphs of `_chars[offset]`. https://github.com/microsoft/terminal/blob/abf5d9423a23b13e9af4c19ca35858f9aaf0a63f/src/buffer/out/Row.cpp#L95-L104 However, when calculating the `currentOffset` we forget that MSB of `_charOffsets[col]` could be `1`, or col is pointing to another glyph in preceding column. https://github.com/microsoft/terminal/blob/abf5d9423a23b13e9af4c19ca35858f9aaf0a63f/src/buffer/out/Row.hpp#L223-L226


The ultimate goal of this PR was to use ICU for text search to
Previously we used
towlowerand only supported BMP glphs.This allows us to search for all results in the entire text buffer
at once without having to do so asynchronously.
Unfortunately, this required some significant changes too:
mapped back to buffer coordinates. This required the introduction of
CharToColumnMapperto implement sort of a reverse-_charOffsetsmapping. It turns text (character) positions back into coordinates.
It used the current selection as the starting position for the new
search. But since ICU's
uregexcannot search backwards we'rerequired to accumulate all results in a vector first and so we
need to cache that vector in between searches.
to track any changes made to
TextBuffer. The way this commit solvesit is by splitting
GetRowByOffsetintoGetRowByOffsetforconst ROWaccess andGetMutableRowByOffsetwhich increments amutation counter on each call. The
Searchinstance can then compareits cached mutation count against the previous mutation count.
Finally, this commit makes 2 semi-unrelated changes:
text search anyways. This significantly improves performance at
large window sizes.
UiaTracingwere fixed. In particular2 functions which passed strings as
wstringby copy are nowusing
wstring_viewandTraceLoggingCountedWideString.Related to #6319 and #8000
Validation Steps Performed