182 - 06: All About Strings

The Internal Structure of Strings


While you can generally use strings without knowing much about their internals, it
is interesting to have a look at the actual data structure behind this data type. In the
early days of the Pascal language, strings had a maximum of 255 elements of one
byte each and would use the first byte (or zero byte) for storing the string length. A
lot of time has passed since those early days, but the concept of having some extra
information about the string stored as part of its data remains a specific approach of
the Object Pascal language (unlike many languages that derive from C and use the
concept of a string terminator).

note ShortString is the name of the traditional Pascal string type, a string of one-byte characters
(AnsiChar) limited to 255 characters. The ShortString type is still available in the desktop
compilers, but not in the mobile ones. You can represent a similar data structure with a dynamic
array of bytes (TBytes) or a plain static array of Byte elements.
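
The length-byte layout described above can be illustrated with a minimal sketch. It assumes a desktop compiler where ShortString is still available; the program and variable names are made up for this illustration:

```pascal
program ShortStringSketch;

{$APPTYPE CONSOLE}

var
  S: ShortString;
begin
  S := 'Pascal';
  // Element 0 holds the length as a byte, not a printable character
  Writeln ('Length byte: ', Ord (S[0]));   // 6
  Writeln ('Length:      ', Length (S));   // 6
  // 1 length byte + 255 character bytes, regardless of the content
  Writeln ('SizeOf:      ', SizeOf (S));   // 256
end.
```

Because the length lives in a single byte, 255 is the largest value it can hold, which is where the traditional size limit comes from.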

As I already mentioned, a string variable is nothing but a pointer to a data structure
allocated on the heap. Actually, the value stored in the string is not a reference to the
beginning of the data structure, but a reference to the first of the characters of the
string, with string metadata available at negative offsets from that location.
The in-memory representation of the data of the string type is the following:
Offset:  -12         -10         -8          -4       String reference address
Field:   Code page   Elem size   Ref count   Length   First char of string
The first element (counting backwards from the beginning of the string itself) is an
Integer with the string length; the second element holds the reference count. The
remaining fields (used on desktop compilers) are the element size in bytes (either 1
or 2) and the code page for the older Ansi-based string types.
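
As a rough sketch of this layout, the header fields can be read directly through the negative offsets listed above. This assumes a Delphi compiler using exactly that layout; the program name and the Base variable are illustrative only, and production code should not depend on these offsets:

```pascal
program StringHeaderSketch;

{$APPTYPE CONSOLE}

var
  S: string;
  Base: NativeUInt;
begin
  // Built at run time, so the header carries a real reference count
  S := StringOfChar ('o', 3);
  // The string variable holds the address of the first character
  Base := NativeUInt (Pointer (S));
  // Each field is read at its negative offset from that address
  Writeln ('Length:    ', PInteger (Base - 4)^);
  Writeln ('Ref count: ', PInteger (Base - 8)^);
  Writeln ('Elem size: ', PWord (Base - 10)^);
  Writeln ('Code page: ', PWord (Base - 12)^);
end.
```

The sketch only reads the fields. Writing to them would bypass the memory manager and the reference counting logic.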
Quite surprisingly, it is possible to access most of these fields with specific low-level
string metadata functions, besides the rather obvious Length function:
function StringElementSize(const S: string): Word;
function StringCodePage(const S: string): Word;
function StringRefCount(const S: string): Longint;
As an example, you can create a string and ask for some information about it, as I
did in the StringMetaTest application project:
var
  str1: string;
begin
  str1 := 'F' + string.Create ('o', 2);
  Show ('SizeOf: ' + SizeOf (str1).ToString);
  Show ('Length: ' + str1.Length.ToString);
  Show ('StringElementSize: ' +
    StringElementSize (str1).ToString);
  Show ('StringRefCount: ' +
    StringRefCount (str1).ToString);
  Show ('StringCodePage: ' +
    StringCodePage (str1).ToString);
  if StringCodePage (str1) = DefaultUnicodeCodePage then
    Show ('Is Unicode');
  Show ('Size in bytes: ' +
    (Length (str1) * StringElementSize (str1)).ToString);
  Show ('ByteLength: ' +
    ByteLength (str1).ToString);

note There is a specific reason the program builds the 'Foo' string dynamically rather than assigning a
constant, and that is because constant strings have the reference count disabled (or set to -1). In
the demo I preferred showing a proper value for the reference count, hence the dynamic string
construction.

This program produces output similar to the following when running on Windows:
SizeOf: 4
Length: 3
StringElementSize: 2
StringRefCount: 1
StringCodePage: 1200
Is Unicode
Size in bytes: 6
ByteLength: 6
The following is the output if you run the same program on Android:
SizeOf: 4
Length: 3
StringElementSize: 2
StringRefCount: 1
StringCodePage: 1200
Is Unicode
Size in bytes: 6
ByteLength: 6
The code page returned by a UnicodeString is 1200, a number stored in the global
variable DefaultUnicodeCodePage. In the code above (and its output) you can
clearly notice the difference between the size of a string variable (the size of a
pointer: 4 bytes in the 32-bit builds shown here, 8 bytes on 64-bit targets), the
logical length, and the physical length in bytes.
The physical length can be obtained by multiplying the size in bytes of each character
by the number of characters, or by calling ByteLength. This latter function, however,
doesn't support some of the string types of the older desktop compiler.
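
Returning to the earlier note on constant strings, the following sketch compares a string assigned from a literal with one built at run time. The program and variable names are hypothetical; based on the note, the literal should report -1 and the run-time string 1:

```pascal
program RefCountSketch;

{$APPTYPE CONSOLE}

var
  Literal, Built: string;
begin
  Literal := 'Foo';                      // refers to constant data
  Built := 'F' + StringOfChar ('o', 2);  // allocated on the heap at run time
  Writeln ('Literal ref count: ', StringRefCount (Literal));
  Writeln ('Built ref count:   ', StringRefCount (Built));
  // Logical length versus physical size of the character data
  Writeln ('Length: ', Length (Built));                              // 3
  Writeln ('Bytes:  ', Length (Built) * StringElementSize (Built));  // 6
end.
```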

Marco Cantù, Object Pascal Handbook



Looking at Strings in Memory


The ability to look into a string's metadata can be used to better understand how
string memory management works, particularly in relation to reference counting.
For this purpose, I've added some more code to the StringMetaTest application
project.
The program has two global strings: MyStr1 and MyStr2. The program assigns a
dynamic string to the first of the two variables (for the reason explained earlier in
the note) and then assigns the first variable to the second:
MyStr1 := string.Create(['H', 'e', 'l', 'l', 'o']);
MyStr2 := MyStr1;
Besides working on the strings, the program shows their internal status, using the
following StringStatus function:
function StringStatus (const Str: string): string;
begin
  // The Integer casts assume a 32-bit target, where a pointer fits in an Integer
  Result := 'Addr: ' +
    IntToStr (Integer (Str)) +
    ', Len: ' +
    IntToStr (Length (Str)) +
    ', Ref: ' +
    IntToStr (PInteger (Integer (Str) - 8)^) +
    ', Val: ' + Str;
end;
It is important in the StringStatus function to pass the string parameter as a const
parameter. Passing this parameter by value has the side effect of adding one
extra reference to the string while the function is being executed. By contrast,
passing the parameter by reference (var) or as a constant (const) doesn't imply a
further reference to the string. In this case I've used a const parameter, as the
function is not supposed to modify the string.
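
The effect of the parameter passing style on the reference count can be observed with a small sketch like the following. The two helper functions are hypothetical, written only for this illustration; the by-value version is expected to report one reference more than the const version:

```pascal
program ParamRefSketch;

{$APPTYPE CONSOLE}

// By value: the function holds its own reference for the duration of the call
function RefCountByValue (S: string): Integer;
begin
  Result := StringRefCount (S);
end;

// By const: no additional reference is taken
function RefCountByConst (const S: string): Integer;
begin
  Result := StringRefCount (S);
end;

var
  Str: string;
begin
  // Built at run time, so the reference count is a real value
  Str := StringOfChar ('x', 5);
  Writeln ('Caller:   ', StringRefCount (Str));
  Writeln ('By value: ', RefCountByValue (Str));
  Writeln ('By const: ', RefCountByConst (Str));
end.
```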
To obtain the memory address of the string (useful to determine its actual identity
and to see when two different strings refer to the same memory area), I've simply
made a hard-coded typecast from the string type to the Integer type. Strings are
references (in practice, they're pointers): their value holds the actual memory
location of the string, not the string itself.
The code used for testing what happens to the string is the following:
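  Show ('MyStr1 - ' + StringStatus (MyStr1));
  Show ('MyStr2 - ' + StringStatus (MyStr2));
  // Index 1 is the second character only with zero-based string indexing;
  // with one-based indexing it would be the first
  MyStr1 [1] := 'a';
  Show ('Change 2nd char');
  Show ('MyStr1 - ' + StringStatus (MyStr1));
  Show ('MyStr2 - ' + StringStatus (MyStr2));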


Initially, you should get two strings with the same content, the same memory
location, and a reference count of 2.
MyStr1 - Addr: 51837036, Len: 5, Ref: 2, Val: Hello
MyStr2 - Addr: 51837036, Len: 5, Ref: 2, Val: Hello
As the application changes the value of one of the two strings (it doesn't matter
which one), the memory location of the updated string will change. This is the effect
of the copy-on-write technique. This is the second part of the output:
Change 2nd char
MyStr1 - Addr: 51848300, Len: 5, Ref: 1, Val: Hallo
MyStr2 - Addr: 51837036, Len: 5, Ref: 1, Val: Hello
You can freely extend this example and use the StringStatus function to explore the
behavior of long strings in many other circumstances, with multiple references,
when they are passed as parameters, assigned to local variables, and more.
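
The copy-on-write behavior can also be checked without printing raw addresses, by comparing the two string pointers before and after a write. This is a minimal sketch with made-up names, assuming a compiler that supports Low on strings:

```pascal
program CopyOnWriteSketch;

{$APPTYPE CONSOLE}

var
  A, B: string;
begin
  A := StringOfChar ('x', 5);
  B := A;               // both variables share the same data block
  Writeln ('Shared before write: ', Pointer (A) = Pointer (B));
  B[Low (B)] := 'y';    // the write gives B a private copy first
  Writeln ('Shared after write:  ', Pointer (A) = Pointer (B));
  Writeln (A, ' / ', B);
end.
```

Using Low (B) keeps the sketch independent of whether string indexing is zero-based or one-based.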

Strings and Encodings


As we have seen, the string type in Object Pascal is mapped to the Unicode UTF-16
format, with 2 bytes per element and management of surrogate pairs for code points
outside of the BMP (Basic Multilingual Plane).
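
To make the surrogate pair idea concrete, here is a small sketch using U+1F600, a code point outside the BMP. It assumes the character helper from System.Character is available; the program name is illustrative:

```pascal
program SurrogateSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

var
  S: string;
begin
  // U+1F600 is encoded in UTF-16 as the pair $D83D $DE00
  S := #$D83D#$DE00;
  // One code point, but two string elements of 2 bytes each
  Writeln ('Elements:       ', Length (S));       // 2
  Writeln ('Bytes:          ', ByteLength (S));   // 4
  Writeln ('High surrogate: ', S[Low (S)].IsHighSurrogate);
  Writeln ('Low surrogate:  ', S[High (S)].IsLowSurrogate);
end.
```

The practical consequence is that Length counts UTF-16 elements, not Unicode code points.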
There are many cases, though, in which you need to save to file, load from file,
transmit over a socket connection, or receive textual data from a connection that
uses a different representation, like ANSI or UTF-8.
To convert files and in-memory data among different formats (or encodings), the
Object Pascal RTL has a handy TEncoding class, defined in the System.SysUtils
unit along with several derived classes.

note There are several other handy classes in the Object Pascal RTL that you can use for reading and
writing data in text formats. For example, the TStreamReader and TStreamWriter classes offer
support for text files with any encoding. These classes will be introduced in Chapter 18.

Although I still haven't introduced classes and inheritance, this set of encoding
classes is very easy to use, as there is already a global object for each encoding,
automatically created for you.
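
As a minimal sketch of how these ready-made objects can be used, the following program converts a string to UTF-8 and UTF-16 byte arrays and back. The program and variable names are made up for this example:

```pascal
program EncodingSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Original, RoundTrip: string;
  Utf8Bytes, Utf16Bytes: TBytes;
begin
  // 'Cantù', with the accented letter written as U+00F9
  Original := 'Cant' + #$00F9;
  // TEncoding.UTF8 and TEncoding.Unicode are ready-made encoding objects
  Utf8Bytes := TEncoding.UTF8.GetBytes (Original);
  Utf16Bytes := TEncoding.Unicode.GetBytes (Original);
  // 6: four ASCII letters of 1 byte each, plus 2 bytes for U+00F9
  Writeln ('UTF-8 bytes:  ', Length (Utf8Bytes));
  // 10: five elements of 2 bytes each
  Writeln ('UTF-16 bytes: ', Length (Utf16Bytes));
  // Decode the UTF-8 bytes and compare with the starting string
  RoundTrip := TEncoding.UTF8.GetString (Utf8Bytes);
  Writeln ('Round trip OK: ', RoundTrip = Original);
end.
```

There is no need to create or free the encoding objects: they are accessed directly through the TEncoding class.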
In other words, an object of each of these encoding classes is available within the
TEncoding class, as a class property:
type
TEncoding = class
