cuda_miner.py is a GPU-accelerated, single-worker mining script for Denaro. It uses PyCUDA to JIT-compile (just-in-time) a CUDA kernel for a SHA-256-based nonce search over a constant block prefix and submits candidate blocks to a specified Denaro node.
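The search loop is easiest to see on the CPU side. The sketch below is illustrative only, not the script's CUDA kernel: the prefix bytes, 4-byte little-endian nonce width, and `difficulty_prefix` check are assumptions for demonstration, not Denaro's actual header layout or difficulty rule.

```python
import hashlib

def nonce_search(prefix: bytes, difficulty_prefix: bytes, max_nonce: int = 1_000_000):
    """Search for a nonce n such that sha256(prefix + n) starts with difficulty_prefix."""
    for nonce in range(max_nonce):
        digest = hashlib.sha256(prefix + nonce.to_bytes(4, "little")).digest()
        if digest.startswith(difficulty_prefix):
            return nonce, digest
    return None  # no nonce found in the searched range

# Example: find a digest whose first byte is 0x00 (a very low difficulty).
result = nonce_search(b"block-header-prefix", b"\x00")
```

The GPU version parallelizes exactly this loop: the prefix stays constant and each thread tests a disjoint slice of the nonce space.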
- NVIDIA GPU with CUDA support.
- Linux x86-64 (Windows and WSL2 may work but are not covered here).
- NVIDIA driver compatible with your CUDA Toolkit.
- CUDA Toolkit 11.8 or newer (for modern GPUs).
- Python 3.8+.
Install NVIDIA Driver + CUDA Toolkit:
Follow NVIDIA’s official Linux installation guide for your distribution: CUDA Installation Guide for Linux.

Ensure your CUDA Toolkit installation is compatible with your GPU architecture.
Validate the installation:

```
nvidia-smi
nvcc --version
```
-
Ensure CUDA is discoverable:

Add the following to your shell profile if missing:

```
export PATH=/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
```
-
- Only perform this step if you do not already have a Python virtual environment set up for Denaro.
- For more details, refer to README.md.
-
Create and activate a Python virtual environment:

```
python3 -m venv ../venv
source ../venv/bin/activate
python -m pip install --upgrade pip
```

Install dependencies:

```
pip install -r requirements.txt
```
-
Syntax:

```
cuda_miner.py [-h] --address ADDRESS [--node NODE] [--max-blocks MAX_BLOCKS] [--gpu-blocks GPU_BLOCKS] [--gpu-threads GPU_THREADS] [--gpu-iterations GPU_ITERATIONS] --gpu-arch GPU_ARCH [--no-tui] [--verbose]
```
-
Options:

- `--help`, `-h`: Shows the help message and exits.
- `--address`, `-a`: Mining address to receive rewards (required).
- `--node`, `-n`: URL of the Denaro node (Default: http://127.0.0.1:3006/).
- `--max-blocks`, `-m`: Maximum number of blocks to mine before exiting. Runs indefinitely when not specified.
- `--gpu-blocks`: CUDA grid blocks per launch (Default: 256).
- `--gpu-threads`: CUDA threads per block (Default: 256).
- `--gpu-iterations`: Iterations per thread per kernel batch (Default: 10000).
- `--gpu-arch`: Sets the `nvcc` architecture flag (required). To determine the correct architecture flag for your GPU, refer to the SM architecture reference.
- `--no-tui`: Disables the Textual TUI and prints logs to stdout instead.
- `--verbose`: Enables DEBUG-level logging.
-
-
Example:

```
python3 cuda_miner.py -a DZ8CxkXKwcnwQh6aXidBBWE75qvymfg4zZfLqZVP6Qh5A --node https://node.denaro.network --gpu-arch sm_86
```
This section provides guidance on fine-tuning the GPU settings to improve speed, stability, and system responsiveness.
-
Only adjust these settings if you know what you are doing; otherwise, use the defaults.
-
The parameters below map directly to the script’s CLI flags and are specific to this script’s CUDA launch model (single worker, PyCUDA JIT):
- Blocks (`--gpu-blocks`): Increasing this parameter increases the number of independent work units available to the scheduler. More blocks can hide latency when the GPU has headroom. However, if the GPU is saturated or register-limited, more blocks can add overhead without any apparent gain.
- Threads (`--gpu-threads`): The practical default for this parameter is 256. When setting this parameter, use multiples of 32 but do not exceed 1024.
- Iterations (`--gpu-iterations`): Higher values reduce host overhead (fewer launches) but lengthen each kernel, which can cause desktop stutter and slow response to new work.
- Kernel batch size: `Blocks × Threads × Iterations` nonces are attempted in one kernel launch. In this script each thread advances the nonce by `global_step = Blocks × Threads` on every iteration, so `Iterations` directly scales single-kernel runtime.
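The grid-stride launch model described above can be mimicked on the CPU. `batch_nonces` is a hypothetical helper for illustration, not part of cuda_miner.py:

```python
def batch_nonces(blocks: int, threads: int, iterations: int, start_nonce: int = 0):
    """Enumerate the nonces one kernel batch covers under the grid-stride model:
    thread t starts at start_nonce + t and advances by global_step each iteration."""
    global_step = blocks * threads          # total threads in the grid
    for t in range(global_step):            # one entry per GPU thread
        for i in range(iterations):
            yield start_nonce + t + i * global_step

# A tiny batch: 2 blocks x 2 threads x 3 iterations covers 12 distinct nonces.
covered = sorted(batch_nonces(2, 2, 3))
```

Note that every nonce in the range is covered exactly once, which is why Blocks × Threads × Iterations equals the attempts per launch with no overlap between threads.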
-
5.1 Default Profile:
-
When running the miner, use the default GPU settings first, then fine-tune them based on observed behavior:
- Blocks: 256
- Threads per block: 256
- Iterations per thread: 10,000
-
Approximate work per batch:
- Global threads = 256 × 256 = 65,536
- Attempts per batch = 65,536 × 10,000 ≈ 6.55e8
-
You can also choose the closest profile to your hardware. However, it is safer to use the default settings first before adjusting any parameters.

| Hardware class | Blocks | Threads | Iterations |
| --- | --- | --- | --- |
| Laptops or older desktop GPUs | 128 | 256 | 5,000 |
| Most mainstream desktop GPUs | 256 | 256 | 10,000 |
| Faster desktop GPUs | 512 | 256 | 15,000 |
| RTX 40 or 50 Series | 1024 | 512 | 20,000 |
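To compare profiles numerically, the attempts-per-batch arithmetic can be scripted. This is a small illustrative helper, not part of the miner:

```python
# Profiles from the table above: (Blocks, Threads, Iterations).
profiles = {
    "Laptops or older desktop GPUs": (128, 256, 5_000),
    "Most mainstream desktop GPUs":  (256, 256, 10_000),
    "Faster desktop GPUs":           (512, 256, 15_000),
    "RTX 40 or 50 Series":           (1024, 512, 20_000),
}

def attempts_per_batch(blocks: int, threads: int, iterations: int) -> int:
    """Nonces attempted in a single kernel launch."""
    return blocks * threads * iterations

default = attempts_per_batch(256, 256, 10_000)  # default profile: ~6.55e8
```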
-
5.2 Practical Procedure:
-
Run the miner with the default values for 60–90 seconds while monitoring GPU utilization, clocks, temperature, and power:
```
nvidia-smi --loop=1 --query-gpu=temperature.gpu,utilization.gpu,power.draw,clocks.sm --format=csv
```
-
If utilization is low and the system is responsive:
- Increase Iterations in small steps (e.g., +2,000 to +5,000).
- If utilization remains low, increase Blocks in small steps (e.g., +64 or +128).
- Recheck temperature and interactivity after each change.
-
If you see lag, stutter, or a driver reset:
- Reduce Iterations by 25–50%.
- If needed, reduce Blocks (e.g., 256 → 192 → 128) to ease scheduler pressure.
- If you encounter launch/resource errors or unusually low occupancy, set Threads to 128 and retest.
-
Repeat the process until you reach a stable point with acceptable utilization and temperature. Prefer changing Iterations first. Then change Blocks. Finally, change Threads only when resource limits require it.
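The adjust-and-retest loop above can be sketched as a heuristic. The thresholds and step sizes below are illustrative assumptions, not values taken from the script:

```python
def next_settings(blocks: int, threads: int, iterations: int,
                  utilization: float, stutter: bool):
    """One step of the manual tuning loop: raise Iterations first, then Blocks,
    and back off sharply when the desktop stutters."""
    if stutter:
        # Reduce Iterations by ~50% to shorten each kernel.
        return blocks, threads, max(iterations // 2, 1_000)
    if utilization < 0.80:          # low utilization, system stable
        if iterations < 20_000:     # prefer lengthening kernels first
            return blocks, threads, iterations + 2_000
        return blocks + 64, threads, iterations  # then deepen the grid
    return blocks, threads, iterations  # stable point reached; stop adjusting

step = next_settings(256, 256, 10_000, utilization=0.5, stutter=False)
```

Threads are deliberately left untouched here; per the procedure, change them only when resource limits require it.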
-
What to watch while adjusting parameters:
- Utilization: A healthy miner shows high multiprocessor utilization with minimal dips. Sustained low utilization with low temperature often means the kernel is too short or the grid is too small.
- Clocks: Stable multiprocessor clocks indicate the workload is within power and thermal budgets. Large drops mean power or thermal limits are active.
- Temperature and power: Keep steady-state temperature comfortably below the throttle point. If power draw pegs near the limit, shorten the kernel with fewer Iterations.
-
5.3 Estimating a target kernel duration (optional):
- If you know the approximate throughput (`H`) in hashes per second and want a target single-kernel time (`T`) in seconds:
  - Attempts per batch: `Blocks × Threads × Iterations`
  - Choose `Iterations = (H × T) / (Blocks × Threads)`
- Example: If `H = 400 MH/s`, `Blocks = 256`, `Threads = 256`, and target `T = 1.5 s`, then `Iterations = (400e6 × 1.5) / (256 × 256) ≈ 9,155`. Use 9,200 as a rounded value.
- If `H` is unknown, apply the procedure in 5.2 and monitor the GPU while you adjust the parameters.
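The formula above is easy to wrap in a small calculator. `target_iterations` is a hypothetical helper for illustration; rounding up to the nearest 100 is an assumption matching the worked example:

```python
def target_iterations(hashrate_hps: float, kernel_seconds: float,
                      blocks: int, threads: int, round_to: int = 100) -> int:
    """Iterations = (H x T) / (Blocks x Threads), rounded up to a friendly step."""
    raw = (hashrate_hps * kernel_seconds) / (blocks * threads)
    return int(-(-raw // round_to) * round_to)  # ceiling to the nearest round_to

# The worked example: H = 400 MH/s, T = 1.5 s, 256 blocks x 256 threads -> 9,200.
its = target_iterations(400e6, 1.5, 256, 256)
```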
-
5.4 Targets and limits:
-
Display-attached GPUs:
- Keep kernel batches short to avoid UI stutter and watchdog/TDR timeouts. Prefer moderate Iterations. Make gradual changes and test responsiveness after each adjustment.
-
Headless or secondary GPUs:
- Longer kernel batches are acceptable if temperatures, clocks, and stability remain within limits.
-
Thermals and power:
- If you approach limits, lower Iterations first to shorten kernel batches, then lower Blocks if needed. This reduces sustained load without changing correctness.
-
5.5 Troubleshooting:
-
Desktop lags, stutters, or triggers a driver reset:
- Lower Iterations (e.g., 10,000 → 7,500 → 5,000).
- If necessary, lower Blocks (e.g., 256 → 192 → 128).
- As a last step, set Threads to 128 to reduce per-block resource pressure and allow more resident blocks per streaming multiprocessor.
-
Low GPU utilization (system remains stable):
- Increase Blocks in small increments to deepen the work queue.
- Optionally increase Iterations to amortize launch overhead and lengthen kernels within safe time limits.
-
Launch failures / resource errors / poor occupancy:
- Reduce Threads (256 → 128) to ease register pressure and increase the number of active blocks per streaming multiprocessor.
- Reduce Blocks if errors persist or if the device reports insufficient resources for the launch.
-
High temperatures or loud fans:
- Reduce Iterations to shorten kernels and lower sustained load.
- If needed, also reduce Blocks or improve cooling. Confirm that power limiting is not capping clocks.
-