On immutability

An immutable structure is a structure that does not change once it has been created. If the concept is nothing new, it has been used more and more for its upsides. Let’s review how Python handles immutability.

Immutable numbers

This has already been addressed in a previous post: numbers in Python are immutable. You do not change the content of a variable, you make it point to a new number object. There are several other languages that manage numbers as an object (Smalltalk, Objective C, Ruby). I don’t however know if they have immutable numbers (although Ruby seems to)

Immutable strings

The first immutable type that a programming language typically implements is strings. When a string gets created in Python, it never gets updated ever again – only deleted when it is not used anymore. You want to change a single character from that string? A new string will be created. What happens to the older string? It gets garbage-collected – provided it is not used somewhere else.

Immutable strings can however generate a lot of garbage objects (in a future post we will analyze how the CPython garbage collector works). Imagine you want to read a file and store the result in a variable.

file = open('myfile.txt', 'r')

text = ""
for line in file:
    text += line

Every loop generates 2 garbage objects: one for the line being read from the file, and one for the old version of “text”. If you are reading a file composed of 100 lines of 10 bytes each (so a 1000 bytes file), the whole operation generates 100 garbage objects for each line read (for a total of 1000 bytes) and 99 garbage objects for all the precious versions of “text” (for a total of 10 + 20 + 30 + … + 990 = 49500 bytes), for a grand total of 1000 + 49500 = 50500 bytes worth of garbage objects. Or over 50 times the size of the file (plus the overhead for all these objects) !

Because strings are immutable, this however mean that the same string can be shared by several variables. For example:

>>> def func():
...     print(id('Hello' + 'World'))
...     print(id('HelloWorld'))
...
>>> func()
42900400
42900400

The two strings generated have the same id, which means they point to the exact same object (id() actually returns the memory address of the object). Considering strings are immutable, it does not matter to the user whether two strings that contain the same text share the same object or not (except if you compare ids or use “string1 is string2”). Now, how does Python detects these are the same string? Let’s disassemble the function:

>>> import dis
>>> dis.dis(func)
  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_GLOBAL              1 (id)
              6 LOAD_CONST               4 ('HelloWorld')
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             15 POP_TOP

  3          16 LOAD_GLOBAL              0 (print)
             19 LOAD_GLOBAL              1 (id)
             22 LOAD_CONST               3 ('HelloWorld')
             25 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             28 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             31 POP_TOP
             32 LOAD_CONST               0 (None)
             35 RETURN_VALUE
>>>
>>> func.__code__.co_consts
(None, 'Hello', 'World', 'HelloWorld', 'HelloWorld')

As you can see, the compiler detects the ‘Hello’ + ‘World’ operation and compiles it as ‘HelloWorld’. Interestingly, it is nonetheless using two different constants for the two strings (ids 4 and 3, and as shown at line 20). But Python has another optimization:

>>> string1 = 'HelloWorld'
>>> string2 = 'HelloWorld'
>>> id(string1)
35924656
>>> id(string2)
35924656

CPython keeps a dictionary of so-called interned strings which it can reuse when using variable names, attribute/method names or constants (here, the string ‘HelloWorld’ is compiled as a constant). This however does not work with strings that contain white spaces:

>>> string1 = 'Hello W'
>>> string2 = 'Hello W'
>>> id(string1)
35934312
>>> id(string2)
35934256

When it comes to constants, CPython is indeed only interning strings which are “valid name characters”. That is, contain only either alphanumeric characters and/or underscore characters(1). I assume the rationale is to only try to reuse strings used for symbols as they are the most likely to be reused. More complex strings, on the other hand, are more likely to be unique, to there is less to be gained by interning them.

It is however possible to force Python to use the interned dictionary for any type of string:

>>> from sys import intern
>>> a = intern('Hello W')
>>> b = intern('Hello W')
>>> id(a)
42651352
>>> id(b)
42651352

Immutable collections

Immutable collections are primarily used by functional programming languages (e.g. Scala, Haskell, F#). You do not insert an element in a collection, you create a new collection which contains the new element. In the case of a tree structure you do not even need to recreate a complete new tree and can reuse whole parts of the previous tree (the subtrees not affected by the change)

Python has some immutable collections with tuples and frozensets. Once you create a tuple, you cannot add elements to it. You cannot replace its elements. You can concatenate tuples, but this will only create a new tuple and won’t modify any of the source tuples. The only modification you can do is if you modify any mutable type inside the tuple, such as a list or a user class instance.

Because dictionary keys in Python are required to be immutable, you can use a tuple or a frozenset as a dictionary key.

Why immutability?

Using immutable structures has some implications. Consider the following code:

>>> class MyClass1(object):
...     allusers = ['John']
...     def add_user(self, user):
...             self.allusers.append(user)
...     def get_users(self):
...             return self.allusers
...
>>> obj = MyClass1()
>>> users = obj.get_users()
>>> users
['John']
>>> obj.add_user('Mary')
>>> users
['John', 'Mary']
>>> users.clear()
>>> obj.get_users()
[]

The get_users() method returns a dynamic list of users. Adding a new user afterwards gets reflected in “users”. However, because the result is mutable, the caller of get_users() can use functions such as append() or clear() that mutate the list. Let’s now look at a similar class which is instead using tuples:

>>> class MyClass2(object):
...     allusers = ('John',)
...     def add_user(self, user):
...             self.allusers += (user,)
...     def get_users(self):
...             return self.allusers
...
>>> obj = MyClass2()
>>> users = obj.get_users()
>>> users
('John',)
>>> obj.add_user('Mary')
>>> users
('John',)
>>> obj.get_users()
('John', 'Mary')

The caller of get_result() cannot update the result anymore as it is immutable. The “+=” operation does not mutable “users”. It creates a new tuple and makes “‘obj.allusers” point to that new structure. However, get_result() returns a snapshot of “obj.allusers” at the time it was called. That result may not even be up-to-date by the time the next instruction is executed. Adding new users indeed does not update “users”.

Immutability is mostly used when one wants to avoid that different various parts of a program step on each others toes – in particular in the case of high concurrency programs(2). The caller of MyClass2.get_users() knows that, no matter what, the result will not change. It can for example store “len(users)” in a variable and use it later on, knowing it will always be consistent with “users”.

The flip side is that the result may not be up-to-date, but in some cases that is OK. For example when rendering a Web page. After all, a Web page can be seen as an immutable view of the server: it was a snapshot at the time it was being generated, and may not be up-to-date by the time it has finished downloading (unless you’re using Ajax but that’s another story)

Note that you do not necessarily need to use immutable structures to have an immutable effect. If you are getting the list through a SQL query, the result can be considered immutable in the sense any subsequent change to the database after the query is performed will not be reflected in the result. Sure, you can modify the results, but those changes will not be seen by anybody else.

But in the case of a resource in memory (like a list that stays in memory and which can be updated by several threads), it may be useful to use an immutable structure.

Last but not least, for those who are interested in the use of immutability in high-concurrency environments, Pat Helland (who worked at Microsoft, Amazon and Salesforce.com) gave a great talk called Immutability Changes Everything. He argues that, a few decades ago, computer resources were expensive but there was no problem grabbing a lock. Nowadays, hardware resources are pretty cheap (provided you don’t care about the quality). But try to grab a lock and all hell breaks loose.

 


(1) For those of you who really want to know how it is implemented, check out Objects/codeobject:

  • The C function PyCode_New() is calling intern_strings() on the names, variable names and other symbols stored in the bytecode (lines 83 to 86)
    • intern_strings() (line 36) is in turn calling PyUnicode_InternInPlace() which is managing the interned dictionary (see Objects/unicodeobject.c line 14992)
  • For constants however, they must first clear all_name_chars() (line 90) before PyUnicode_InternInPlace() is called (line 92)
    • all_name_chars() (line 11) checks that all the characters from the string belong to the list stored in NAME_CHARS (line 5)

(2) There is always some level of concurrency and mutability. At the very least the program needs to update the variable that references the immutable structure – and you want this operation to be atomic. But immutability nevertheless strongly reduces concurrency problems.

Leave a comment