We should support using reward model based environments (ones that potentially need GPUs).
@abukharin-nv was explaining how the logic for different models may look very different:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "What is 1+1?"
good_response = "1+1=2"
bad_response = "1+1=3"
for response in [good_response, bad_response]:
messages = [{'role': "user", "content": prompt}, {'role': "assistant", "content": response}]
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(),attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=1, return_dict_in_generate=True, output_scores=True)
reward = response_token_ids['scores'][0][0][0].item()
print(reward)
# reward for good_response = -3.28125
# reward for bad_response = -4.28125
vs
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load model and tokenizer
device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm = AutoModelForSequenceClassification.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map=device,
attn_implementation="flash_attention_2",
num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
# Format and tokenize the conversations
conv1_formatted = rm_tokenizer.apply_chat_template(conv1, tokenize=False)
conv2_formatted = rm_tokenizer.apply_chat_template(conv2, tokenize=False)
conv1_tokenized = rm_tokenizer(conv1_formatted, return_tensors="pt").to(device)
conv2_tokenized = rm_tokenizer(conv2_formatted, return_tensors="pt").to(device)
# Get the reward scores
with torch.no_grad():
score1 = rm(**conv1_tokenized).logits[0][0].item()
score2 = rm(**conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")
# Output:
# Score for response 1: 12.625
# Score for response 2: -15.25
Given the dynamism we should make sure:
- reward environments can claim GPUs (initially split placement for simplicity)
- define the OAI interface
- to consider external module based logic
We should support using reward model based environments (ones that potentially need GPUs).
@abukharin-nv was explaining how the logic for different models may look very different:
vs
Given the dynamism we should make sure: