Conversation
Thank you for the PR.

```yaml
- name: Launch job script
  run: |
    RUNNER_NAME=${{ runner.name }}
    RUNNER_LABEL=${{ inputs.runner }}
    bash ./runners/launch_${RUNNER_NAME%%_*}.sh ${{ inputs.exp-name }}
```

and in `bash benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh`
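The `${RUNNER_NAME%%_*}` expansion strips everything from the first underscore onward, so the hardware prefix of the runner name selects the launch script. A minimal sketch of how it resolves, using a hypothetical runner name (real values come from `runner.name`):

```shell
# Hypothetical runner name, for illustration only
RUNNER_NAME="b200_dgx05"
# %%_* removes the longest suffix matching "_*", leaving the hardware prefix
echo "launch_${RUNNER_NAME%%_*}.sh"   # prints: launch_b200.sh
```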
Thanks for the review, @kimbochen! Made these changes: ✅ uncommented vLLM
Testing shows the script doesn't pick up
No worries, reverted!
@kimbochen B200 TRT jobs are failing because the TRT sqsh file shares a name with the vLLM one. Made a fix, and also temporarily removed vLLM and the other configs just to test whether B200 TRT is working. Can you please cancel the current job and re-run with these fixes?

```
salloc: Granted job allocation 1919
salloc: Waiting for resource configuration
salloc: Nodes dgx05-b200 are ready for job
+ srun --jobid=1919 bash -c 'enroot import -o /raid/image_70b_b200.sqsh docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0'
Error: File already exists: /raid/image_70b_b200.sqsh
srun: error: dgx05-b200: task 0: Exited with exit code 1
+ srun --jobid=1919 --container-image=/raid/image_70b_b200.sqsh --container-mounts=/home/gharunnerb1/actions-runner/_work/InferenceMAX/InferenceMAX:/workspace/,/raid/hf_hub_cache/:/mnt/hf_hub_cache/ --container-mount-home --container-workdir=/workspace/ --no-container-entrypoint --export=ALL bash benchmarks/70b_b200-trt_slurm.sh
JOB 1919 running on dgx05-b200
+ hf download nvidia/Llama-3.3-70B-Instruct-FP8
Fetching 25 files: 0%| | 0/25 [00:00
```
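One way to avoid the `File already exists` failure on re-runs is to guard the import on the file's presence. This is only a sketch: the path and image mirror the log above, the `srun` line is commented out because it needs the cluster, and the PR's actual fix (renaming the sqsh per backend) may differ:

```shell
SQSH=/raid/image_70b_b200.sqsh   # path from the log above
IMAGE=docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0

# Import the container image only when the squashfs is not already on the
# node, so a re-run does not hard-fail on an existing file.
if [ -f "$SQSH" ]; then
  echo "reusing existing $SQSH"
else
  echo "importing $IMAGE to $SQSH"
  # srun --jobid="$SLURM_JOB_ID" enroot import -o "$SQSH" "$IMAGE"
fi
```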
…summarize.py to reflect backend, fix issue with result filename
Hello @kedarpotdar-nv, thank you so much for the updates.
Thanks, Kimbo. I am going to close this and refactor the TRT code into 70b.
Add upstream sync workflow
Added B200 TRT-LLM runner configuration and consolidated runner logic
Changes Made: