More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

Submitted by: @hvlad

The case below shows that SUBSTRING performs much worse for UTF8 than for \(legacy\) UNICODE\_FSS:

a\) UNICODE\_FSS

execute block
as
declare str1 varchar\(8000\) character set unicode\_fss;
declare str2 varchar\(10\)   character set unicode\_fss;
declare n int = 100000;
begin
  str1 = LPAD\('abcd', 8000, '\-\-'\);
  while \(n \> 0\) do
  begin
    str2 = SUBSTRING\(str1 from 1 FOR 10\);
    n = n \- 1;
  end
end

Execute time = 62ms


b\) UTF8

execute block
as
declare str1 varchar\(8000\) character set utf8;
declare str2 varchar\(10\)   character set utf8;
declare n int = 100000;
begin
  str1 = LPAD\('abcd', 8000, '\-\-'\);
  while \(n \> 0\) do
  begin
    str2 = SUBSTRING\(str1 from 1 FOR 10\);
    n = n \- 1;
  end
end

Execute time = 983ms

The case is simplified and based on an end\-user report\. In the user's case, the same query on the system tables runs much longer with FB4 than with FB3
\(the test database was restored from the same backup\)\. The origin of the problem is that FB4 uses UTF8 for metadata while FB3 uses UNICODE\_FSS\.

The SUBSTRING implementation for UNICODE\_FSS \(internal\_fss\_substring\(\)\) is straightforward and logical: it first skips POSITION characters
from the start of the source string and then copies LENGTH characters into the destination string\.
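
The skip\-then\-copy idea can be sketched directly on UTF\-8 bytes\. This is a minimal illustration in Python, not the Firebird code: the names char\_width and substring\_skip\_copy are hypothetical, and position is assumed to be a zero\-based character offset, so SQL's FROM 1 corresponds to 0 here\.

```python
def char_width(lead: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def substring_skip_copy(src: bytes, position: int, length: int) -> bytes:
    """Hypothetical helper: skip `position` characters, then copy `length`.

    Only the bytes up to the end of the requested substring are inspected;
    the rest of the source string is never touched.
    """
    i = 0
    # Step 1: skip `position` characters from the start.
    for _ in range(position):
        if i >= len(src):
            return b""  # start lies beyond the end of the string
        i += char_width(src[i])
    start = i
    # Step 2: advance over at most `length` characters and copy them.
    for _ in range(length):
        if i >= len(src):
            break
        i += char_width(src[i])
    return src[start:i]
```

With this approach the cost depends on POSITION and LENGTH rather than on the total length of the source string\.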

The UTF8 implementation \(MultiByteCharSet::substring\(\)\) converts the whole source string into UTF16 and only then takes the substring of the UTF16 string\. 
This is simple but very inefficient, especially for long strings and small POSITION values: in the case above it is roughly 16 times slower \(983ms versus 62ms\)\.
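
To make the cost visible, the following Python sketch mimics the convert\-everything strategy and reports how many UTF16 code units were produced\. It is an illustration under assumptions, not the engine code: substring\_via\_utf16 is a hypothetical name, and position is again taken as a zero\-based character offset\.

```python
def substring_via_utf16(src: bytes, position: int, length: int):
    """Hypothetical helper: convert everything to UTF-16, then cut.

    Returns the substring (as UTF-8 bytes) and the number of UTF-16
    code units that had to be produced to get it.
    """
    # The whole source string is converted, whatever the requested range is.
    utf16 = src.decode("utf-8").encode("utf-16-le")
    units = len(utf16) // 2

    def advance(i: int, count: int) -> int:
        # Move forward by `count` characters; a high surrogate starts
        # a two-unit character.
        while count > 0 and i < units:
            unit = int.from_bytes(utf16[2 * i:2 * i + 2], "little")
            i += 2 if 0xD800 <= unit <= 0xDBFF else 1
            count -= 1
        return i

    start = advance(0, position)
    end = advance(start, length)
    result = utf16[2 * start:2 * end].decode("utf-16-le").encode("utf-8")
    return result, units


# Same shape as the test case above: 8000 characters, first 10 requested.
src = ("-" * 7996 + "abcd").encode("utf-8")
result, units = substring_via_utf16(src, 0, 10)
print(result, units)
```

For the 8000\-character string, all 8000 code units are converted to extract 10 characters, and that work is repeated on every call inside the loop\.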


Commits: FirebirdSQL/firebird@9c566c006c9cf98d95272a6284f5893cf8192fad FirebirdSQL/firebird@333412807b2c06814ad0a0ec3a3d3c216982e875 FirebirdSQL/firebird@f1fe0eeff5a75341c54efd3330e962b37e921fab FirebirdSQL/firebird@ac2532fc9c461984cd1ca1e0ed22be358607bc04 FirebirdSQL/firebird@b5407c3af37ee98e5df07f9f091a45b8485cda7b
