Releases: ashvardanian/less_slow.py
Release v0.5.1
Release v0.5.0
Release v0.4.0
v0.3: Cost of `dict()` vs `{}`
This snippet was contributed by @jmg-duarte 🙌
There's a difference between calling `dict()` and writing `{}`: for the former, Python needs to look up the name `dict`, while the `{}` literal syntax cannot be overridden. The bytecode says it all:
- for `{}` (source line 2): `LOAD_CONST 1 ('x')`, `LOAD_CONST 2 (10)`, `BUILD_MAP 1`, `RETURN_VALUE`
- for `dict` (source line 6): `LOAD_GLOBAL 1 (dict + NULL)`, `LOAD_CONST 1 (10)`, `LOAD_CONST 2 (('x',))`, `CALL_KW 1`, `RETURN_VALUE`
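A disassembly like the one above can be reproduced with the standard `dis` module. The sketch below is only an illustration: `make_literal`, `make_call`, and `opnames` are hypothetical helper names, not functions from the repository, and the exact opcode names vary between CPython versions.

```python
import dis
import timeit


def make_literal():
    # Dictionary display syntax: compiled straight to a map-building opcode.
    return {"x": 10}


def make_call():
    # Constructor call: the name `dict` has to be looked up at run time.
    return dict(x=10)


def opnames(func):
    """Return the opcode names of a function's bytecode (hypothetical helper)."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


literal_ops = opnames(make_literal)
call_ops = opnames(make_call)
print("literal:", literal_ops)
print("call:   ", call_ops)

# Rough timing only; absolute numbers depend on the machine and interpreter.
literal_seconds = timeit.timeit(make_literal, number=100_000)
call_seconds = timeit.timeit(make_call, number=100_000)
print(f"literal: {literal_seconds:.4f}s, call: {call_seconds:.4f}s per 100k runs")
```

The global lookup shows up only in the `dict(...)` variant, which is also why that variant can be affected by rebinding the name `dict`, while the literal cannot.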
It turns out the difference is big enough that the `dict()` variant almost closes the gap to the dataclass and friends; see `test_structs_dict_fun`:
benchmark 'composite-structs': 10 tests

| Name (time in ns) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS (Mops/s) | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_structs_tuple_unpacking | 52.4998 (1.0) | 1,118.3391 (1.61) | 55.3057 (1.0) | 11.7465 (1.67) | 55.0004 (1.0) | 1.6601 (3.94) | 1175;1736 | 18.0813 (1.0) | 180473 | 100 |
| test_structs_tuple_indexing | 53.7490 (1.02) | 692.9203 (1.0) | 56.7784 (1.03) | 7.0220 (1.0) | 56.6605 (1.03) | 0.4214 (1.0) | 1220;37751 | 17.6123 (0.97) | 161081 | 100 |
| test_structs_dict | 101.6690 (1.94) | 870.4094 (1.26) | 106.3716 (1.92) | 10.7967 (1.54) | 105.4199 (1.92) | 2.5099 (5.96) | 1670;2190 | 9.4010 (0.52) | 95238 | 100 |
| test_structs_dict_fun | 124.9998 (2.38) | 1,146.2497 (1.65) | 130.9408 (2.37) | 10.0044 (1.42) | 130.4201 (2.37) | 1.2503 (2.97) | 952;12269 | 7.6370 (0.42) | 77671 | 100 |
| test_structs_slots_dataclass | 142.0791 (2.71) | 875.8390 (1.26) | 148.4811 (2.68) | 11.4265 (1.63) | 147.9208 (2.69) | 1.2503 (2.97) | 721;11753 | 6.7349 (0.37) | 65933 | 100 |
| test_structs_dataclass | 147.1235 (2.80) | 1,933.5966 (2.79) | 156.0895 (2.82) | 18.9933 (2.70) | 154.9706 (2.82) | 1.3169 (3.12) | 1247;32677 | 6.4066 (0.35) | 199999 | 32 |
| test_structs_class | 147.4994 (2.81) | 767.9209 (1.11) | 153.8915 (2.78) | 10.7097 (1.53) | 153.7497 (2.80) | 2.0803 (4.94) | 738;4708 | 6.4981 (0.36) | 62500 | 100 |
| test_structs_namedtuple | 189.9991 (3.62) | 2,618.3575 (3.78) | 201.7096 (3.65) | 24.9604 (3.55) | 200.0015 (3.64) | 3.3202 (7.88) | 1857;15508 | 4.9576 (0.27) | 195124 | 25 |
| test_structs_attrs | 457.9779 (8.72) | 32,833.8938 (47.38) | 533.1721 (9.64) | 233.8386 (33.30) | 540.9820 (9.84) | 42.0259 (99.72) | 371;2292 | 1.8756 (0.10) | 130446 | 1 |
| test_structs_pydantic | 582.8915 (11.10) | 3,000.0228 (4.33) | 644.9098 (11.66) | 98.3162 (14.00) | 625.0339 (11.36) | 40.9782 (97.24) | 11;11 | 1.5506 (0.09) | 779 | 1 |

v0.2: Networking in Python 🐍
Benchmarking IO throughput for networking- and storage-intensive workloads in Python is trickier than in other languages due to the single-threaded nature of the Python interpreter. So, we are forced to spawn sibling processes that run the client and server code independently, so that they don't affect each other or compete for resources inside a single interpreter instance.
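As a rough illustration of the sibling-process idea, the sketch below launches a UDP echo server in a separate interpreter via `subprocess` and talks to it from the parent. This is an assumption about how such a setup could look, not the repository's actual harness; `SERVER_CODE` and the port hand-off over stdout are made up for the example.

```python
import socket
import subprocess
import sys

# Child program: bind to a free loopback port, report it, echo one datagram.
SERVER_CODE = """
import socket
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
print(server.getsockname()[1], flush=True)
data, address = server.recvfrom(1024)
server.sendto(data, address)
server.close()
"""

# Run the server in its own interpreter, so it has its own GIL and event loop.
process = subprocess.Popen(
    [sys.executable, "-c", SERVER_CODE],
    stdout=subprocess.PIPE,
    text=True,
)
port = int(process.stdout.readline())  # blocks until the child is listening

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(5.0)  # avoid hanging forever if a datagram is lost
client.sendto(b"ping", ("127.0.0.1", port))
reply, _ = client.recvfrom(1024)
client.close()

return_code = process.wait(timeout=5)
process.stdout.close()
print(reply, return_code)
```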
To write Less Slow software in Python, I wanted to cover the following topics:
- UDP vs TCP sockets
- `asyncio` costs
- `fastapi` vs `uvicorn` server-side abstraction layers
- `requests` vs `httpx` client-side implementation differences
Most developers don't even consider using UDP stateless sockets and go directly for the heavy TCP/IP stateful stack. Similarly, many assume that async code will make IO faster, but it's not free and will backfire in a well-connected data center. Moreover, state management in high-level libraries is often more expensive than sending a packet to a neighboring country or running the inference of an AI model 🤯
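To make the UDP vs TCP comparison concrete, here is a minimal blocking echo sketch over loopback using only the standard `socket` module. It runs the server in a thread for brevity, which differs from the multi-process setup described above; the helper names (`udp_echo_server`, `tcp_echo_server`, `round_trips`) are hypothetical, and the timings it prints are only indicative.

```python
import socket
import threading
import time


def udp_echo_server(sock, rounds):
    # Stateless: every datagram carries its own return address.
    for _ in range(rounds):
        data, address = sock.recvfrom(1024)
        sock.sendto(data, address)


def tcp_echo_server(listener, rounds):
    # Stateful: accept one connection, then echo over it.
    connection, _ = listener.accept()
    with connection:
        for _ in range(rounds):
            connection.sendall(connection.recv(1024))


def round_trips(kind, rounds=200, payload=b"ping"):
    """Return (replies, mean round-trip latency in microseconds)."""
    server = socket.socket(socket.AF_INET, kind)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    if kind == socket.SOCK_STREAM:
        server.listen(1)
        target = tcp_echo_server
    else:
        target = udp_echo_server
    thread = threading.Thread(target=target, args=(server, rounds), daemon=True)
    thread.start()

    client = socket.socket(socket.AF_INET, kind)
    client.settimeout(2.0)
    client.connect(server.getsockname())  # for UDP this only fixes the peer
    if kind == socket.SOCK_STREAM:
        # Disable Nagle's algorithm so tiny packets are sent immediately.
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    replies = []
    start = time.perf_counter_ns()
    for _ in range(rounds):
        client.sendall(payload)
        replies.append(client.recv(1024))
    elapsed_ns = time.perf_counter_ns() - start

    client.close()
    thread.join(timeout=2.0)
    server.close()
    return replies, elapsed_ns / rounds / 1_000


udp_replies, udp_us = round_trips(socket.SOCK_DGRAM)
tcp_replies, tcp_us = round_trips(socket.SOCK_STREAM)
print(f"UDP: {udp_us:.1f} us per round trip, TCP: {tcp_us:.1f} us per round trip")
```

Sharing one interpreter between client and server threads skews the numbers, which is exactly the interference the sibling-process approach is meant to avoid.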
Running these benchmarks on an AWS c7i instance with Intel Xeon 4 Sapphire Rapids CPUs, we get:
$ uv run --python="3.12" --no-sync --with-requirements requirements.in pytest -ra -q less_slow.py -k rpc
benchmark 'echo': 10 tests

| Name (time in us) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_rpc_udp_loopback | 7.7320 (1.0) | 841.0759 (3.60) | 22.4711 (1.0) | 3.7709 (1.76) | 22.2980 (1.0) | 0.6569 (1.0) | 4870;8454 | 44,501.6398 (1.0) | 100000 | 1 |
| test_rpc_udp_public | 10.9550 (1.42) | 581.0489 (2.49) | 22.8188 (1.02) | 2.7904 (1.30) | 22.4309 (1.01) | 0.6965 (1.06) | 3507;6951 | 43,823.6065 (0.98) | 100000 | 1 |
| test_rpc_tcp_public | 17.2070 (2.23) | 274.0650 (1.17) | 26.4949 (1.18) | 2.2590 (1.05) | 25.9890 (1.17) | 0.8119 (1.24) | 9706;11257 | 37,743.1265 (0.85) | 100000 | 1 |
| test_rpc_tcp_loopback | 17.2630 (2.23) | 233.7879 (1.0) | 26.4878 (1.18) | 2.1469 (1.0) | 25.9880 (1.17) | 0.8739 (1.33) | 10523;11359 | 37,753.1842 (0.85) | 100000 | 1 |
| test_batch16_rpc_asyncio_unordered | 499.9969 (64.67) | 1,614.7849 (6.91) | 637.3076 (28.36) | 48.6703 (22.67) | 631.1065 (28.30) | 22.2565 (33.88) | 64;79 | 1,569.1011 (0.04) | 1000 | 1 |
| test_batch16_rpc_asyncio_ordered | 524.8130 (67.88) | 2,071.1240 (8.86) | 643.1911 (28.62) | 81.8031 (38.10) | 632.6414 (28.37) | 24.6380 (37.50) | 30;74 | 1,554.7478 (0.03) | 1000 | 1 |
| test_batch16_rpc_uvicorn_requests | 7,762.3690 (>1000.0) | 9,353.0880 (40.01) | 7,972.3333 (354.78) | 84.5799 (39.40) | 7,969.1625 (357.39) | 74.1920 (112.94) | 137;23 | 125.4338 (0.00) | 1000 | 1 |
| test_batch16_rpc_fastapi_requests | 7,936.2941 (>1000.0) | 12,207.9119 (52.22) | 8,162.6604 (363.25) | 277.2606 (129.14) | 8,123.9245 (364.33) | 156.6890 (238.52) | 45;46 | 122.5091 (0.00) | 1000 | 1 |
| test_batch16_rpc_uvicorn_httpx | 9,187.5041 (>1000.0) | 45,975.7270 (196.66) | 11,298.3767 (502.80) | 2,557.2035 (>1000.0) | 9,838.5875 (441.23) | 4,082.6655 (>1000.0) | 164;3 | 88.5083 (0.00) | 1000 | 1 |
| test_batch16_rpc_fastapi_httpx | 10,339.5160 (>1000.0) | 46,581.8760 (199.25) | 13,243.6710 (589.37) | 2,175.6974 (>1000.0) | 13,776.3910 (617.83) | 2,728.9370 (>1000.0) | 81;9 | 75.5078 (0.00) | 1000 | 1 |
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean

Frankly, pytest-benchmark is not the most accurate benchmarking tool, but most numbers here are so large that timing isn't an issue. Here are the highlights:
- UDP and TCP variants have comparable latency between 20 and 30 microseconds.
- Batching 16 requests with `asyncio` results in a batch latency of around 650 microseconds or over 40 microseconds per packet, higher than vanilla blocking IO.
- Using high-level libraries results in another 10x throughput reduction compared to `asyncio` batching 😭
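The "ordered" and "unordered" batch variants in the table can be approximated as follows. This is a sketch under assumptions, not the benchmark's own code: it uses `asyncio` TCP streams against a local echo server, collects ordered results with `asyncio.gather` and unordered ones with `asyncio.as_completed`, and the names `handle_echo`, `echo_once`, and `run_batches` are hypothetical.

```python
import asyncio


async def handle_echo(reader, writer):
    # Server side: echo a single newline-terminated message and hang up.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def echo_once(host, port, message):
    # Client side: one connection per request, for simplicity.
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(message + b"\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply.rstrip(b"\n")


async def run_batches(batch_size=16):
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    messages = [f"msg-{i}".encode() for i in range(batch_size)]

    # Ordered: gather returns replies in the order the requests were listed.
    ordered = await asyncio.gather(*(echo_once(host, port, m) for m in messages))

    # Unordered: as_completed yields replies as soon as each one finishes.
    unordered = []
    for future in asyncio.as_completed([echo_once(host, port, m) for m in messages]):
        unordered.append(await future)

    server.close()  # stop accepting new connections
    return messages, list(ordered), unordered


messages, ordered, unordered = asyncio.run(run_batches())
print(len(ordered), len(unordered))
```

Every request here passes through the event loop's scheduling and stream buffering, which hints at where the per-packet overhead relative to blocking sockets may come from.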
Do you want to know how fast web servers in C and C++ are? Check out less_slow.cpp and feel free to help me port more examples to less_slow.rs in Rust 🦀