Poor RPC performance

I'm running a Westend RPC node that shows poor performance when serving RPC calls (both WS and HTTP).

RPC call execution time varies greatly depending on whether it is the first or a subsequent call. The most notable example is the `state_getMetadata` method, which can take anywhere from 8ms to 18s. I recorded two profiles using `perf` during the execution of `state_getMetadata` and rendered a flamegraph for each:

1. Slow (first) RPC call
![perf-slow](https://user-images.githubusercontent.com/8650477/181222206-514c4cd3-97c9-4a19-bb23-3719dee9a250.svg)
Source: https://transfer.sh/e52S3b/perf-slow.svg

2. Fast (subsequent) RPC call
![perf-fast](https://user-images.githubusercontent.com/8650477/181236973-cf6948ea-67ee-415d-97d2-73ca5090c01a.svg)
Source: https://transfer.sh/RQXIUs/perf-fast.svg

In the first flamegraph, most of the CPU time was spent compiling WASM code and only 2.39% of the time actually executing it. During WASM compilation I observe high CPU utilization (100% per vCPU, depending on the number of in-flight requests).
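
The gap between the first and subsequent calls can be checked from the client side with a small timing script. The following is a minimal sketch, not the tooling used for the measurements in this report. It assumes the node's HTTP RPC endpoint is reachable at `http://127.0.0.1:9933` (the `--rpc-port` value from the CLI parameters); the helper names `build_request`, `http_send`, `time_calls`, and `summarize` are made up for illustration. Whether the first call actually hits the slow path depends on the node's state at that moment.

```python
import json
import statistics
import time
import urllib.request

# Assumption: HTTP RPC endpoint of the node under test (port from --rpc-port).
RPC_URL = "http://127.0.0.1:9933"


def build_request(method, params=None, request_id=1):
    """Encode a JSON-RPC 2.0 request body as bytes."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params or [],
    }
    return json.dumps(body).encode("utf-8")


def http_send(payload, url=RPC_URL):
    """POST the payload to the node and return the raw response bytes."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    # Generous timeout, since slow calls were observed to take up to 18s.
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def time_calls(send, payload, repeats=5):
    """Call send(payload) `repeats` times; return each duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(payload)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations):
    """Separate the first call from the rest (needs at least two samples)."""
    return {
        "first_ms": durations[0],
        "median_rest_ms": statistics.median(durations[1:]),
    }


# Example against a live node (not executed here):
# print(summarize(time_calls(http_send, build_request("state_getMetadata"))))
```

The `send` argument is injectable so that the timing logic can be exercised without a running node, or swapped for a WS client.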

I plotted the median RPC call execution time over the past two days (y-axis in microseconds):
![image](https://user-images.githubusercontent.com/8650477/181239724-4c212563-2b56-4cbd-9502-8c462f1a8620.png)

The `state_getMetadata` method alone is executed 1.23 times per second on average.
![image](https://user-images.githubusercontent.com/8650477/181241053-51a6d9ce-c0ea-4f16-9324-d56c96c935cd.png)

Each time a method execution falls into the `slow RPC call` bucket and requires WASM compilation, the node uses 100% of a vCPU. As other methods are being called at the same time, this significantly increases resource usage and degrades the end-user experience. It also opens up the possibility of a DoS by exhausting all available resources.

The above behavior is observed both on a GCP VM instance and inside a container running in a K8s cluster.
System info 1 (VM):
Debian 10
Linux kernel: 4.19.0-18-cloud-amd64
Polkadot binary v0.9.26
CLI parameters: `polkadot --detailed-log-output --name westend-rpc-1 --unsafe-ws-external --rpc-methods Safe --rpc-cors * --chain westend --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<redacted>/tcp/30333 --in-peers 25 --out-peers 25 --db-cache 512 --telemetry-url wss://<redacted> --pruning=archive -lsync=trace,rpc_metrics=debug --prometheus-external --prometheus-port 9615 --ws-port 9944 --ws-max-connections 5000 --rpc-port 9933`

System info 2 (container):
Google Kubernetes Engine: v1.22.10-gke.600
Linux kernel 5.10.109+
[Polkadot image](https://hub.docker.com/r/parity/polkadot) v0.9.26
CLI parameters: `polkadot --name=westend-rpc-node-0 --base-path=/data/ --chain=westend --database=rocksdb --pruning=archive --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --unsafe-ws-external --rpc-methods=safe --rpc-cors=all --ws-max-connections 5000 --telemetry-url wss://<redacted> --db-cache 512 --in-peers 25 --out-peers 25 -lrpc_metrics=debug --listen-addr=/ip4/0.0.0.0/tcp/30333`

Poor RPC performance #5821
