Releases · ashvardanian/less_slow.py

Release: v0.5.1 [skip ci]

Patch

Docs: New Emoji & NumPy highlights (d52cb8c)

Release: v0.5.0 [skip ci]

Minor

Add: Control-flow & GC (e1803f3)
Add: Pandas vs PyArrow (6646307)
Add: Reflection & Inspection (c0620de)
Add: Strings & NumPy layouts (aec8370)
Add: Heterogenous Collections (ad8a2b2)

Patch

Make: Fix deps (cddb005)
Docs: Table of contents (8a5f43d)
Docs: Typos (0f2d99f)
Make: Bump CI (8838230)
Docs: Logging environment info (3188859)

Release: v0.4.0 [skip ci]

Minor

Add: Decompositions (f66c9fc)

Patch

Make: Update uv deps (e4ecf54)

v0.3: Cost of `dict()` vs `{}`

This snippet was contributed by @jmg-duarte 🙌

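There's a difference between calling dict() and writing {}. For the call, Python has to look up the name dict at runtime, because it can be rebound; the {} literal syntax cannot be overridden, so it compiles straight to a map-building instruction. The bytecode says it all: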

for {}:

  2           LOAD_CONST               1 ('x')
              LOAD_CONST               2 (10)
              BUILD_MAP                1
              RETURN_VALUE

for dict:

  6           LOAD_GLOBAL              1 (dict + NULL)
              LOAD_CONST               1 (10)
              LOAD_CONST               2 (('x',))
              CALL_KW                  1
              RETURN_VALUE
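Listings like the two above can be reproduced with the standard `dis` module. This is a minimal sketch: the function names `make_literal` and `make_call` are made up for illustration and are not taken from the repository, and the exact opcode names and constant indices vary between CPython versions.

```python
import dis


def make_literal():
    # Dict literal: compiled to a map-building opcode, no name lookup.
    return {"x": 10}


def make_call():
    # Constructor call: `dict` is resolved as a global name at runtime.
    return dict(x=10)


def opnames(func):
    """Return the opcode names of a function, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


literal_ops = opnames(make_literal)
call_ops = opnames(make_call)

# Names that the call variant has to look up in globals/builtins.
global_names = [
    ins.argval
    for ins in dis.get_instructions(make_call)
    if ins.opname == "LOAD_GLOBAL"
]

if __name__ == "__main__":
    dis.dis(make_literal)
    dis.dis(make_call)
```

The literal variant contains no `LOAD_GLOBAL` at all, which is why rebinding `dict` elsewhere in the module cannot affect it.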

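It turns out the difference is big enough that the dict() call covers most of the distance between a dict literal and the dataclass and friends. Compare test_structs_dict (about 106 ns on average) with test_structs_dict_fun (about 131 ns) and test_structs_slots_dataclass (about 148 ns):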

----------------------------------------------------------------------------------- benchmark 'composite-structs': 10 tests ------------------------------------------------------------------------------------
Name (time in ns)                     Min                    Max                Mean              StdDev              Median                IQR             Outliers  OPS (Mops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_structs_tuple_unpacking      52.4998 (1.0)       1,118.3391 (1.61)      55.3057 (1.0)       11.7465 (1.67)      55.0004 (1.0)       1.6601 (3.94)     1175;1736       18.0813 (1.0)      180473         100
test_structs_tuple_indexing       53.7490 (1.02)        692.9203 (1.0)       56.7784 (1.03)       7.0220 (1.0)       56.6605 (1.03)      0.4214 (1.0)     1220;37751       17.6123 (0.97)     161081         100
test_structs_dict                101.6690 (1.94)        870.4094 (1.26)     106.3716 (1.92)      10.7967 (1.54)     105.4199 (1.92)      2.5099 (5.96)     1670;2190        9.4010 (0.52)      95238         100
test_structs_dict_fun            124.9998 (2.38)      1,146.2497 (1.65)     130.9408 (2.37)      10.0044 (1.42)     130.4201 (2.37)      1.2503 (2.97)     952;12269        7.6370 (0.42)      77671         100
test_structs_slots_dataclass     142.0791 (2.71)        875.8390 (1.26)     148.4811 (2.68)      11.4265 (1.63)     147.9208 (2.69)      1.2503 (2.97)     721;11753        6.7349 (0.37)      65933         100
test_structs_dataclass           147.1235 (2.80)      1,933.5966 (2.79)     156.0895 (2.82)      18.9933 (2.70)     154.9706 (2.82)      1.3169 (3.12)    1247;32677        6.4066 (0.35)     199999          32
test_structs_class               147.4994 (2.81)        767.9209 (1.11)     153.8915 (2.78)      10.7097 (1.53)     153.7497 (2.80)      2.0803 (4.94)      738;4708        6.4981 (0.36)      62500         100
test_structs_namedtuple          189.9991 (3.62)      2,618.3575 (3.78)     201.7096 (3.65)      24.9604 (3.55)     200.0015 (3.64)      3.3202 (7.88)    1857;15508        4.9576 (0.27)     195124          25
test_structs_attrs               457.9779 (8.72)     32,833.8938 (47.38)    533.1721 (9.64)     233.8386 (33.30)    540.9820 (9.84)     42.0259 (99.72)     371;2292        1.8756 (0.10)     130446           1
test_structs_pydantic            582.8915 (11.10)     3,000.0228 (4.33)     644.9098 (11.66)     98.3162 (14.00)    625.0339 (11.36)    40.9782 (97.24)        11;11        1.5506 (0.09)        779           1
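For a quick check without pytest-benchmark, the same comparison can be approximated with the standard `timeit` module. This is only a sketch: the helper `ns_per_call` and the two functions are illustrative names, not the repository's benchmarks, and the absolute numbers will differ from the table depending on the machine and Python version.

```python
import timeit


def make_literal():
    return {"x": 10}


def make_call():
    return dict(x=10)


def ns_per_call(func, number=200_000, repeat=5):
    """Best-of-`repeat` time per call, in nanoseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9


literal_ns = ns_per_call(make_literal)
call_ns = ns_per_call(make_call)

print(f"{{}} literal: {literal_ns:.1f} ns, dict() call: {call_ns:.1f} ns")
```

Taking the minimum over several repeats reduces noise from other processes, but it still includes the overhead of the wrapping function call in both measurements.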

v0.2: Networking in Python 🐍

Benchmarking IO throughput for networking- and storage-intensive workloads in Python is trickier than in other languages due to the single-threaded nature of the Python interpreter. So, we are forced to spawn sibling processes that run the client and server code independently, so that they don't affect each other or compete for resources inside a single interpreter instance.
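As a rough illustration of the sibling-process idea, the sketch below starts a UDP echo server in a separate interpreter and talks to it from the current one. Everything here is an assumption made for the example: the `SERVER_CODE` script, the `echo_via_sibling_process` helper, and the `b"quit"` shutdown message are hypothetical and are not how the repository wires up its benchmarks.

```python
import socket
import subprocess
import sys

# Hypothetical server script, run in its own interpreter process.
SERVER_CODE = """
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))            # let the OS pick a free port
print(sock.getsockname()[1], flush=True)
while True:
    data, addr = sock.recvfrom(1024)
    if data == b"quit":
        break
    sock.sendto(data, addr)            # echo the datagram back
"""


def echo_via_sibling_process(payload: bytes = b"ping") -> bytes:
    """Send one datagram to an echo server running in a sibling process."""
    server = subprocess.Popen(
        [sys.executable, "-c", SERVER_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        # The server reports the port it bound to on its first stdout line.
        port = int(server.stdout.readline())
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
            client.settimeout(5.0)
            client.sendto(payload, ("127.0.0.1", port))
            reply, _ = client.recvfrom(1024)
            client.sendto(b"quit", ("127.0.0.1", port))
        return reply
    finally:
        server.terminate()
        server.wait()
        server.stdout.close()


reply = echo_via_sibling_process(b"hello")
```

Because the server lives in its own process, it has its own interpreter and its own GIL, so the client's timing loop does not compete with the server's receive loop.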

To write Less Slow software in Python, I wanted to cover the following topics:

UDP vs TCP sockets
asyncio costs
fastapi vs uvicorn server-side abstraction layers
requests vs httpx client-side implementation differences

Most developers don't even consider stateless UDP sockets and go straight for the heavy, stateful TCP/IP stack. Similarly, many assume that async code will make IO faster, but it isn't free and can backfire in a well-connected data center. Moreover, state management in high-level libraries is often more expensive than sending a packet to a neighboring country or running inference on an AI model 🤯

Running these benchmarks on an AWS c7i instance with 4th Gen Intel Xeon (Sapphire Rapids) CPUs, we get:

$ uv run --python="3.12" --no-sync --with-requirements requirements.in pytest -ra -q less_slow.py -k rpc

---------------------------------------------------------------------------------------------------- benchmark 'echo': 10 tests ----------------------------------------------------------------------------------------------------
Name (time in us)                              Min                    Max                   Mean                StdDev                 Median                   IQR              Outliers          OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rpc_udp_loopback                       7.7320 (1.0)         841.0759 (3.60)         22.4711 (1.0)          3.7709 (1.76)         22.2980 (1.0)          0.6569 (1.0)       4870;8454  44,501.6398 (1.0)      100000           1
test_rpc_udp_public                        10.9550 (1.42)        581.0489 (2.49)         22.8188 (1.02)         2.7904 (1.30)         22.4309 (1.01)         0.6965 (1.06)      3507;6951  43,823.6065 (0.98)     100000           1
test_rpc_tcp_public                        17.2070 (2.23)        274.0650 (1.17)         26.4949 (1.18)         2.2590 (1.05)         25.9890 (1.17)         0.8119 (1.24)     9706;11257  37,743.1265 (0.85)     100000           1
test_rpc_tcp_loopback                      17.2630 (2.23)        233.7879 (1.0)          26.4878 (1.18)         2.1469 (1.0)          25.9880 (1.17)         0.8739 (1.33)    10523;11359  37,753.1842 (0.85)     100000           1
test_batch16_rpc_asyncio_unordered        499.9969 (64.67)     1,614.7849 (6.91)        637.3076 (28.36)       48.6703 (22.67)       631.1065 (28.30)       22.2565 (33.88)         64;79   1,569.1011 (0.04)       1000           1
test_batch16_rpc_asyncio_ordered          524.8130 (67.88)     2,071.1240 (8.86)        643.1911 (28.62)       81.8031 (38.10)       632.6414 (28.37)       24.6380 (37.50)         30;74   1,554.7478 (0.03)       1000           1
test_batch16_rpc_uvicorn_requests       7,762.3690 (>1000.0)   9,353.0880 (40.01)     7,972.3333 (354.78)      84.5799 (39.40)     7,969.1625 (357.39)      74.1920 (112.94)       137;23     125.4338 (0.00)       1000           1
test_batch16_rpc_fastapi_requests       7,936.2941 (>1000.0)  12,207.9119 (52.22)     8,162.6604 (363.25)     277.2606 (129.14)    8,123.9245 (364.33)     156.6890 (238.52)        45;46     122.5091 (0.00)       1000           1
test_batch16_rpc_uvicorn_httpx          9,187.5041 (>1000.0)  45,975.7270 (196.66)   11,298.3767 (502.80)   2,557.2035 (>1000.0)   9,838.5875 (441.23)   4,082.6655 (>1000.0)       164;3      88.5083 (0.00)       1000           1
test_batch16_rpc_fastapi_httpx         10,339.5160 (>1000.0)  46,581.8760 (199.25)   13,243.6710 (589.37)   2,175.6974 (>1000.0)  13,776.3910 (617.83)   2,728.9370 (>1000.0)        81;9      75.5078 (0.00)       1000           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
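As a worked check of the legend, the OPS column can be recomputed from the Mean column. The numbers below are copied from the table above; the helper name `ops_from_mean` is just for illustration.

```python
def ops_from_mean(mean_us: float) -> float:
    """Operations per second for a mean latency given in microseconds."""
    return 1_000_000 / mean_us


# test_rpc_udp_loopback: Mean = 22.4711 us, reported OPS = 44,501.6398
udp_ops = ops_from_mean(22.4711)

# test_batch16_rpc_asyncio_unordered: Mean = 637.3076 us per batch of 16
batch_per_packet_us = 637.3076 / 16

print(f"UDP loopback: {udp_ops:,.1f} ops/s")
print(f"asyncio batch: {batch_per_packet_us:.2f} us per packet")
```

The small difference from the reported OPS value comes from the mean being rounded to four decimal places in the table.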

Frankly, pytest-benchmark is not the most accurate benchmarking tool, but most numbers here are so large that timer precision isn't an issue. Here are the highlights:

UDP and TCP variants have comparable mean latency, between 20 and 30 microseconds per round trip.
Batching 16 requests with asyncio results in a batch latency of around 640 microseconds, or about 40 microseconds per packet, which is higher than vanilla blocking IO.
Using high-level libraries results in more than another 10x throughput reduction compared to asyncio batching (roughly 12x to 21x by mean latency) 😭
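The "ordered" and "unordered" batch variants in the table are not shown in these notes, so the following is only a guess at the general shape: a batch of 16 echo requests is awaited either with `asyncio.gather`, which returns replies in request order, or with `asyncio.as_completed`, which yields them as they finish. The TCP echo server and the helpers `handle_echo`, `echo_once`, and `run_batches` are assumptions made for this sketch.

```python
import asyncio


async def handle_echo(reader, writer):
    # Echo a single line back to the client, then close the connection.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def echo_once(port, message):
    # One request/response round trip over a fresh TCP connection.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(message + b"\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply.rstrip(b"\n")


async def run_batches(batch_size=16):
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    messages = [f"msg-{i}".encode() for i in range(batch_size)]

    # Ordered: gather returns results in the order the requests were given.
    ordered = await asyncio.gather(*(echo_once(port, m) for m in messages))

    # Unordered: as_completed yields results in completion order.
    unordered = []
    for future in asyncio.as_completed([echo_once(port, m) for m in messages]):
        unordered.append(await future)

    server.close()
    await server.wait_closed()
    return messages, list(ordered), unordered


messages, ordered, unordered = asyncio.run(run_batches())
```

Every request here pays for task creation, scheduling, and event-loop bookkeeping, which is the kind of overhead the per-packet figure above reflects.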

Do you want to know how fast web servers in C and C++ are? Check out less_slow.cpp and feel free to help me port more examples to less_slow.rs in Rust 🦀

Release: v0.1.1 [skip ci]

Patch

Fix: Missing VERSION file (e709540)
Make: Semantic Versioning (2cbd2d7)
