Reward Model based Environments

We should support using reward model based environments (ones that potentially need GPUs).

@abukharin-nv was explaining how the logic for different models may look very different:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is 1+1?"
good_response = "1+1=2"
bad_response = "1+1=3"

for response in [good_response, bad_response]:
    messages = [{'role': "user", "content": prompt}, {'role': "assistant", "content": response}]
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True)
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(),attention_mask=tokenized_message['attention_mask'].cuda(),  max_new_tokens=1, return_dict_in_generate=True, output_scores=True)
    reward = response_token_ids['scores'][0][0][0].item()
    print(reward)

# reward for good_response = -3.28125
# reward for bad_response = -4.28125
```
vs
```py
import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model and tokenizer
device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

# Format and tokenize the conversations
conv1_formatted = rm_tokenizer.apply_chat_template(conv1, tokenize=False)
conv2_formatted = rm_tokenizer.apply_chat_template(conv2, tokenize=False)
conv1_tokenized = rm_tokenizer(conv1_formatted, return_tensors="pt").to(device)
conv2_tokenized = rm_tokenizer(conv2_formatted, return_tensors="pt").to(device)

# Get the reward scores
with torch.no_grad():
    score1 = rm(**conv1_tokenized).logits[0][0].item()
    score2 = rm(**conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 12.625
# Score for response 2: -15.25
```

Given the dynamism we should make sure:
1. reward environments can claim GPUs (initially split placement for simplicity)
2. define the OAI interface
3. to consider external module based logic

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reward Model based Environments #670

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Reward Model based Environments #670

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions