🥈 Solo Silver Medal Solution (Top 3%)
Thanks to The Learning Agency LLC for hosting such a fun and rewarding competition! I learned a lot about LLM finetuning and especially enjoyed exploring creative data augmentation and feature engineering.
For each question in the data, I determined the answer options and their letters by looking at the student responses. Many responses revealed which option the student chose with wording such as "It's c because ..." (this wasn't 100% consistent, but it worked decently overall). With the options fixed, I also determined which letter each student chose by matching the MC_Answer field in the data against the options I constructed. My input prompt to the LLM was the same across all my models; I wrapped key information in XML tags:
```xml
<question>{Question Text}</question>
<choices>
<choice id='A'>{Option A Text}</choice>
<choice id='B'>{Option B Text}</choice>
<choice id='C'>{Option C Text}</choice>
<choice id='D'>{Option D Text}</choice>
</choices>
<student_answer_letter>{Student Chosen Letter}</student_answer_letter>
<student_answer_text>{MC_Answer}</student_answer_text>
<correct_answer>{yes/no}</correct_answer>
<explanation>{Student Explanation}</explanation>
```
No Prompt Truncation: Most public notebooks truncate the input sequence to a maximum length of 256 tokens, even if the prompt is longer. Since my prompt was longer due to the added features, I could not afford to do this, so during training I padded every sequence to the length of the longest training sequence. During inference, to save time, I tokenize the test prompts, sort them by length, and use DataCollatorWithPadding to pad only to the longest sequence in each batch. This sped up my submission by nearly 2x and allowed me to fit 4 fairly large models in my ensemble.
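A minimal sketch of the length-sorted inference loop (assuming `tokenizer`, `model`, and a `test_prompts` list already exist; the batch size is illustrative):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Tokenize without padding; pad dynamically per batch instead.
encodings = [tokenizer(p, truncation=False) for p in test_prompts]

# Sort by token count so each batch groups similar lengths,
# minimizing wasted padding tokens.
order = np.argsort([len(e["input_ids"]) for e in encodings])
collator = DataCollatorWithPadding(tokenizer)
loader = DataLoader([encodings[i] for i in order], batch_size=16,
                    collate_fn=collator, shuffle=False)

probs = []
model.eval()
with torch.no_grad():
    for batch in loader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        logits = model(**batch).logits
        probs.append(torch.softmax(logits, dim=-1).cpu())

# Undo the sort so predictions line up with the original row order.
probs = torch.cat(probs)[np.argsort(order)]
```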
Label Space Reduction: Another trick I used was mentioned in the discussion forums: since all the questions in the test set are present in the training set, we can hardcode the True/False part of the target for each prediction, so the model only needs to predict the misconception. This reduces the label space from 65 to 37, making the problem slightly easier for the LLM.
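A sketch of how this reduction might look in code, assuming the combined target is a string with a True/False prefix and that column names like `QuestionId` and `MC_Answer` match the data:

```python
import pandas as pd

# Split each full target into its True/False prefix and the remainder
# (the 37-class label the model actually predicts).
train["prefix"] = train["target"].str.split("_", n=1).str[0]   # "True" / "False"
train["label37"] = train["target"].str.split("_", n=1).str[1]

# The prefix is fully determined by the (question, chosen answer) pair,
# so it can be looked up from training data instead of predicted.
prefix_map = (train.groupby(["QuestionId", "MC_Answer"])["prefix"]
                   .agg(lambda s: s.mode().iloc[0])
                   .to_dict())

def full_label(question_id, mc_answer, predicted_label37):
    """Reattach the hardcoded True/False prefix to the 37-class prediction."""
    return f"{prefix_map[(question_id, mc_answer)]}_{predicted_label37}"
```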
The training dataset contained many student typos, grammatical mistakes, etc. To mitigate overfitting on these spurious features, I added heavy data augmentation by artificially introducing typos, punctuation changes, and so on. More specifically, for each student explanation, I loop through each augmentation I defined and apply it with 20% probability (a sketch of the loop follows the list). Here is a brief description of each augmentation in my final training pipeline:
- Typos: Randomly introduce typos in the explanation. For each word, with 6% probability I add a typo, randomly chosen from duplicating a character, removing a character, or replacing a character with one adjacent on the keyboard.
- Whitespace: Randomly add additional whitespace between words.
- Number Form: Convert spelled out numbers into numerical form or vice versa (twenty one -> 21 or 13 -> thirteen).
- Contractions: Replace some common phrases with contractions or expand them (do not -> don't or can't -> can not).
- Punctuation: If the explanation does not end with punctuation, randomly add sentence-ending punctuation such as a period or exclamation mark. If it already ends with punctuation, remove it instead.
- Capitalization: Randomly capitalize the first letter of the explanation, or lowercase it if it is already capitalized.
- Operators: Convert some operators, like multiplication, into verbal form or vice versa (2 x 3 -> 2 times 3). I debated whether to include this since it can significantly change the explanation and didn't lead to much improvement, but I ultimately decided to keep it.
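Here is the augmentation loop sketched with three representative augmentations (the helper names are mine; the probabilities match the description above):

```python
import random

# Rows of adjacent keys used for keyboard-neighbor substitutions.
KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def adjacent_key(ch):
    """Return a key next to ch on a QWERTY keyboard (or ch unchanged)."""
    for row in KEYBOARD_ROWS:
        i = row.find(ch.lower())
        if i != -1:
            return row[i - 1] if i > 0 else row[1]
    return ch

def add_typos(text, word_prob=0.06):
    """Duplicate, remove, or replace a character in ~6% of words."""
    words = text.split()
    for k, w in enumerate(words):
        if w and random.random() < word_prob:
            j = random.randrange(len(w))
            op = random.choice(["duplicate", "remove", "replace"])
            if op == "duplicate":
                words[k] = w[:j] + w[j] + w[j:]
            elif op == "remove" and len(w) > 1:
                words[k] = w[:j] + w[j + 1:]
            else:
                words[k] = w[:j] + adjacent_key(w[j]) + w[j + 1:]
    return " ".join(words)

def toggle_end_punctuation(text):
    """Add sentence-ending punctuation if missing, otherwise strip it."""
    if text and text[-1] in ".!?":
        return text[:-1]
    return text + random.choice([".", "!"])

def toggle_capitalization(text):
    """Flip the case of the first character."""
    if not text:
        return text
    first = text[0]
    return (first.lower() if first.isupper() else first.upper()) + text[1:]

AUGMENTATIONS = [add_typos, toggle_end_punctuation, toggle_capitalization]

def augment_explanation(text, prob=0.2):
    """Apply each augmentation independently with 20% probability."""
    for aug in AUGMENTATIONS:
        if random.random() < prob:
            text = aug(text)
    return text
```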
Tag Removal Augmentation: I also have another augmentation, independent of the student explanation, which I apply with 20% probability to a given training sample: I randomly remove one of the XML tags in my prompt (answer options, correct answer feature, etc.). I always make sure the question and explanation stay in the prompt regardless of the augmentation chosen. During validation and my submission, however, I keep all information in the prompt.
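A sketch of the tag-removal augmentation, assuming the prompt is the XML string shown earlier (the list of droppable tags is illustrative):

```python
import random
import re

# XML blocks that may be dropped; question and explanation never are.
OPTIONAL_TAGS = ["choices", "student_answer_letter",
                 "student_answer_text", "correct_answer"]

def drop_random_tag(prompt, prob=0.2):
    """With 20% probability, remove one optional <tag>...</tag> block."""
    if random.random() < prob:
        tag = random.choice(OPTIONAL_TAGS)
        prompt = re.sub(rf"<{tag}>.*?</{tag}>\n?", "", prompt, flags=re.S)
    return prompt
```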
Model Architecture: All models were decoder-only LLMs adapted for sequence classification via a classification head with 37 targets. Training used LoRA with bfloat16 (16-bit) precision.
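A sketch of the model setup, assuming a recent Hugging Face `transformers` + `peft` stack; the `target_modules` list and dropout value are assumptions, and the LoRA hyperparameters shown match model 2 in the table below:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"  # one of the ensemble members

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=37,            # reduced label space
    torch_dtype=torch.bfloat16,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```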
Hardware: Most training ran on a single node with 4× RTX 3090 GPUs using DDP. Some runs also used a single A100 80 GB GPU depending on resource availability.
Hyperparameters: Each model trained for 3 epochs.
For full-dataset models, I selected the final checkpoint by public LB performance (though I’d carefully consider this decision in the future; it has a strong risk of overfitting).
For cross-validation models, I used 5-fold stratified CV by misconception class. I selected the best checkpoint per fold based on OOF MAP@3 during training, evaluating every 100 parameter updates.
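A sketch of the fold construction and the MAP@3 metric used for checkpoint selection (`random_state` and the label column name are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def map_at_3(probs, labels):
    """MAP@3 with a single true label: credit 1, 1/2, or 1/3 by rank."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    hits = top3 == labels[:, None]
    return (hits / np.array([1.0, 2.0, 3.0])).sum(axis=1).mean()

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(train, train["label37"]))
```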
A list of the final models in my ensemble:
| # | Model | Training Setup | Notes / Approx. CV |
|---|---|---|---|
| 1 | Qwen3-32B | Full data, 3 epochs; LoRA r = 8, α = 16, lr = 2e-4 | — |
| 2 | Qwen3-4B-Instruct-2507 | 5-fold CV; LoRA r = 32, α = 16, lr = 8e-5 | CV ≈ 0.947-0.948 |
| 3 | Gemma2-9B-it | Full data, 3 epochs; LoRA r = 32, α = 16, lr = 8e-5 | — |
| 4 | Qwen3-1.7B | 5-fold CV; LoRA r = 8, α = 16, lr = 5e-4 | CV ≈ 0.945-0.956 |
Class Filtering: I restricted my top 3 predicted misconception targets to only those present in the training data for that question. I was unsure whether this would hurt generalization to unseen question-target pairs, so one of my picked submissions used this filtering and the other did not.
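A sketch of the filtering step, assuming integer label indices and a `QuestionId` column:

```python
import numpy as np

# Allowed label indices per question, built from the training data.
allowed = train.groupby("QuestionId")["label_idx"].agg(set).to_dict()

def filter_top3(question_id, probs):
    """Mask out labels never seen for this question, then take the top 3."""
    mask = np.full(len(probs), -np.inf)
    mask[list(allowed[question_id])] = 0.0
    return np.argsort(-(probs + mask))[:3]
```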
Exact Match: Some (question, mc_answer, explanation) triplets in the test set exactly matched triplets in the train set. When I detected these, I reordered my top 3 predictions to place the training label first, as long as there were no inconsistencies in the training data.
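A sketch of the exact-match reordering, with assumed column names for the triplet:

```python
# Training label per exact triplet, kept only where training is consistent.
triplet_labels = (
    train.groupby(["QuestionId", "MC_Answer", "StudentExplanation"])["target"]
         .agg(lambda s: s.iloc[0] if s.nunique() == 1 else None)
         .dropna()
         .to_dict()
)

def reorder_top3(key, top3):
    """Promote the known training label to rank 1 when the triplet matches."""
    label = triplet_labels.get(key)
    if label is None:
        return top3
    return [label] + [t for t in top3 if t != label][:2]
```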
Final Submission Selection: For my final 2 scoring submissions, I selected a simple uniform blend of the probabilities predicted by each model (weights 0.25, 0.25, 0.25, 0.25), and a version combining a rank-weighted average with the probability average (0.7 × probability-weighted prediction + 0.3 × rank-weighted prediction).
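Roughly, the blend looks like this (the rank-average normalization shown is one reasonable choice):

```python
import numpy as np
from scipy.stats import rankdata

# probs_list: one (n_samples, n_labels) probability matrix per model.
prob_blend = np.mean(probs_list, axis=0)             # uniform 0.25 weights

# Convert each model's probabilities to row-wise ranks, then average.
rank_blend = np.mean([rankdata(p, axis=1) for p in probs_list], axis=0)
rank_blend /= rank_blend.sum(axis=1, keepdims=True)  # probability-like scale

final = 0.7 * prob_blend + 0.3 * rank_blend
top3 = np.argsort(-final, axis=1)[:, :3]
```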
Feel free to take a look at my training code (it's all in `train_fold.py`) if you want to see how I preprocessed the data and handled the distributed training. After setting up the environment, you can run training with `bash train_ddp.sh` and customize your hyperparameters in the bash script.
- During development, I filtered out training examples with a character edit distance < 5 from validation samples to reduce data leakage. I later removed this filter in final runs. Keeping the filter on could result in better checkpoint selection and generalization.
- I intended to determine ensemble weights via hill climbing on OOF predictions (`hill_climbing_stacker.py`), but limited time prevented full CV runs for the larger models. Uniform ensembling worked well but likely left room for improvement.
- The label distribution was pretty imbalanced, so I explored oversampling some rare classes, but it didn't improve my CV. Augmenting rare classes with synthetic data or other approaches could definitely lead to some improvement.