[WIP] Perf: construct & reuse clusters in output thread#3438
[WIP] Perf: construct & reuse clusters in output thread#3438skyline75489 wants to merge 7 commits intomicrosoft:masterfrom
Conversation
|
And another reason is that I can't get one thing working. The rendering seems fine. But when I started typing something, the last line gets printed twice: I'm extremely confused about this. How does rendering somehow affect what's got printed by I could really use some advice here. @miniksa @DHowett-MSFT |
|
To show how good this is and get everybody's attention, I'm gonna attach some evidence. Using code on master at 94213ad Duration: 5.1s Using this PR: Duration: 1.2s Both result are with Release build and on the same machine. |
|
|
||
| private: | ||
| // This is the UTF-16 string of characters that form a particular drawing cluster | ||
| const std::wstring_view _text; |
There was a problem hiding this comment.
I love the const here but I want to reuse existing _text and update the columns. So I had to make it this way.
|
|
||
| // Ask the helper to paint through this specific line. | ||
| _PaintBufferOutputHelper(pEngine, it, screenLine.Origin()); | ||
| const auto& rowData = buffer.GetRowByOffset(row); |
There was a problem hiding this comment.
This method _PaintBufferOutput is supposed to handle all kinds of dirty rect refreshing. I kind of made it only working for rows, which is basically all that's needed right now. Should make it better, though.
There was a problem hiding this comment.
This is probably what ended up causing the duplicate conpty output - by just re-rendering the entire row at the time, that's causing the text to get duplicated on the Terminal side
|
|
||
| size_t count = 0; | ||
| std::vector<Cluster> clusters; | ||
| clusters.reserve(runLength); |
There was a problem hiding this comment.
The reserve here is intended to reduce vector reallocation and it works great. Without reserve, the performance is heavily penalized.
| _charRow.DbcsAttrAt(currentIndex) = it->DbcsAttr(); | ||
| _charRow.GlyphAt(currentIndex) = it->Chars(); | ||
|
|
||
| if (it->DbcsAttr().IsDbcs()) |
There was a problem hiding this comment.
This is kind tricky but I need it to handle dbcs since I'm in CJK environment myself. A cluster with column set to zero will be ignored for rendering.
|
|
||
| size_t runStart = 0; | ||
| // This outer loop will continue until we reach the end of the text we are trying to draw. | ||
| while (runIndex < runCount) |
There was a problem hiding this comment.
This is where magic happens. We have already processed the text buffer and got the clusters we need. All that left is to walk through the row and render each run.
There's a longer explanation I can give you, but the tl;dr is that conpty works by effectively rendering the contents of the buffer to a stream of characters. This stream of characters includes VT sequences, which let the terminal on the other side re-construct the state of the console buffer. I'd take a cursory look at |
|
I like the idea and appreciate focus on this area as I know it's definitely a performance sink. However, for this particular TODO, I was hoping we could get away without having a Clusters vector at all. It looks like you saved us by going from generating-it-every-time to generating-on-change and caching it. I was hoping that we could implement a solution that perhaps provided an iterator (like the TextBufferCellIterator) as the first parameter on PaintBufferLine (instead of a span) and the internal loop would iterate until the iterator was false instead of from begin to end of the span. Then we wouldn't have to "store" an array of clusters at all. We could perhaps wrap a That should mean that we don't have to pre-pass anything and we don't have to allocate any additional memory (besides a little bit for the iterator overhead and logic). Then we'd in theory eliminate the CPU time from the vector overhead while not blowing up our memory storage. Probably not exactly as good on CPU time as the extra memory, but a better balance. |
|
@zadjii-msft Thanks for the explanation. @miniksa I think I know what you mean. I'll give it a try and left this PR open here. |
|
Another observation that surprised me. The vector thing is affecting my Ryzen 2400G desktop PC not my i7-8750H gaming laptop, which means this PR doesn't really make a difference(less than 1 second saved) on my laptop. Seems Win-tel thing is still there somehow. Update: never mind. I guess it's about the VtEngine @zadjii-msft mentioned. On my laptap I'm using plain |
|
Closed in favor of #3458 |



Summary of the Pull Request
By constructing clusters for rendering early in the output thread, the rendering thread will be relieved, making overall performance better when under heavy load.
References
#3075
PR Checklist
Detailed Description of the Pull Request / Additional comments
This is an attempt to tackle the TODO listed here:
terminal/src/renderer/base/renderer.cpp
Line 601 in 94fc40e
I may be doing a lots of thing wrong but the performance boost is so good that I gotta share it with you guys. With this PR, catting my
/etc/servicestakes only about 1.5 second, comparing to current release which takes about 5 seconds to do the same thing.Validation Steps Performed