MAP - Charting Student Math Misunderstandings

🥈 Solo Silver Medal Solution (Top 3%)

Thanks to The Learning Agency LLC for hosting such a fun and rewarding competition! I learned a lot about LLM finetuning and especially enjoyed exploring creative data augmentation and feature engineering.

Feature Engineering:

For each question in the data, I determined the answer options and their letters by looking at the student responses. Many student responses revealed which option the student chose through wording such as "It's c because ..." (this wasn't 100% consistent, but it was decent overall). With these options fixed, I determined which letter the student chose by matching the MC_Answer field in the data against the options I constructed. My input prompt to the LLM was the same for all my models: key information wrapped in xml tags.

<question>{Question Text}</question>
<choices>
  <choice id='A'>{Option A Text}</choice>
  <choice id='B'>{Option B Text}</choice>
  <choice id='C'>{Option C Text}</choice>
  <choice id='D'>{Option D Text}</choice>
</choices>
<student_answer_letter>{Student Chosen Letter}</student_answer_letter>
<student_answer_text>{MC_Answer}</student_answer_text>
<correct_answer>{yes/no}</correct_answer>
<explanation>{Student Explanation}</explanation>
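
For illustration, here is a rough sketch of how such a prompt can be assembled. The column names (QuestionText, MC_Answer, StudentExplanation) and the option/letter lookup are stand-ins for my actual preprocessing, not the exact code from train_fold.py:

```python
def build_prompt(row, options, student_letter, is_correct):
    """options: dict like {'A': '...', 'B': '...', 'C': '...', 'D': '...'},
    recovered per question by parsing student responses."""
    choice_lines = "\n".join(
        f"  <choice id='{letter}'>{text}</choice>"
        for letter, text in sorted(options.items())
    )
    return (
        f"<question>{row['QuestionText']}</question>\n"
        f"<choices>\n{choice_lines}\n</choices>\n"
        f"<student_answer_letter>{student_letter}</student_answer_letter>\n"
        f"<student_answer_text>{row['MC_Answer']}</student_answer_text>\n"
        f"<correct_answer>{'yes' if is_correct else 'no'}</correct_answer>\n"
        f"<explanation>{row['StudentExplanation']}</explanation>"
    )
```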

No Prompt Truncation: Most public notebooks truncate the input sequence to a maximum length of 256, even if the prompt is longer. Since my prompt was longer due to the added features, I could not afford to do this, and instead added fixed padding to match the longest training sequence. During inference, to save time, I tokenize the test prompts, sort them by length, and use DataCollatorWithPadding to pad only to the longest sequence in each batch. This sped up my submission by nearly 2x and allowed me to fit 4 fairly large models in my ensemble.
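
A minimal sketch of this length-sorted, dynamically padded inference loop (assuming a Hugging Face tokenizer and sequence-classification model are already loaded; variable names are illustrative):

```python
import torch
from transformers import DataCollatorWithPadding

def predict_sorted(prompts, tokenizer, model, batch_size=16):
    # Tokenize without padding, then sort by length so each batch only pads
    # to its own longest sequence instead of a global maximum.
    encoded = [tokenizer(p, truncation=False) for p in prompts]
    order = sorted(range(len(encoded)), key=lambda i: len(encoded[i]["input_ids"]))
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    probs = [None] * len(prompts)
    model.eval()
    with torch.no_grad():
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            batch = collator([encoded[i] for i in idx]).to(model.device)
            logits = model(**batch).logits
            for i, p in zip(idx, logits.softmax(-1).cpu()):
                probs[i] = p  # restore original sample order
    return torch.stack(probs)
```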

Label Space Reduction: Another trick I used was mentioned in the discussion forums: since all the questions in the test set are present in the training set, we can hardcode the True/False part of the target for each prediction and only need to predict the misconception. This reduces the label space from 65 to 37, making the problem slightly easier for the LLM.
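
Conceptually, the reduction looks something like this (the exact target string format below is an assumption; the point is that the True/False prefix comes from a training-set lookup rather than from the model):

```python
def build_truth_lookup(train_df):
    # (QuestionId, MC_Answer) -> "True" or "False", taken from the Category prefix
    # seen in training (column names assumed from the competition data).
    prefix = train_df["Category"].str.split("_").str[0]
    return dict(zip(zip(train_df["QuestionId"], train_df["MC_Answer"]), prefix))

def to_full_label(question_id, mc_answer, reduced_label, lookup):
    # reduced_label is one of the 37 remaining targets the model predicts;
    # the hardcoded True/False prefix is prepended at submission time.
    return f"{lookup[(question_id, mc_answer)]}_{reduced_label}"
```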

Data Augmentation:

The training dataset contained a lot of student typos, grammatical mistakes, etc. To mitigate overfitting on these spurious features, I added heavy data augmentation by artificially introducing typos, punctuation changes, and so on. More specifically, for each student explanation, I loop through each augmentation I defined and apply it with 20% probability (a sketch follows the list below). Here is a brief description of each augmentation in my final training pipeline:

  • Typos: Randomly introduce typos in the explanation. Each word gets a typo with 6% probability; if a typo is added, I randomly pick between duplicating a character, removing a character, or replacing a character with an adjacent one on the keyboard.
  • Whitespace: Randomly add additional whitespace between words.
  • Number Form: Convert spelled out numbers into numerical form or vice versa (twenty one -> 21 or 13 -> thirteen).
  • Contractions: Replace some common phrases with contractions or expand them (do not -> don't or can't -> can not).
  • Punctuation: If the explanation does not end with punctuation, randomly add sentence-ending punctuation such as a period or exclamation mark. If the explanation already ends with punctuation, remove it instead.
  • Capitalization: Randomly capitalize the first letter of the explanation, or lowercase it if it is already capitalized.
  • Operators: Change some operators such as multiplication into verbal form or vice versa (2 x 3 -> 2 times 3). I debated whether to include this since it can significantly change the explanation and it didn't lead to much improvement, but I ultimately decided to keep it.
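
Here is a simplified sketch of that augmentation loop. Only the typo augmentation is fleshed out and the keyboard map is truncated, so treat it as illustrative rather than the exact pipeline:

```python
import random
import string

KEYBOARD_NEIGHBORS = {"a": "qws", "s": "awed", "e": "wrd", "o": "ipl"}  # truncated map

def add_typos(text, word_prob=0.06):
    words = []
    for word in text.split():
        if word and random.random() < word_prob:
            i = random.randrange(len(word))
            op = random.choice(["duplicate", "remove", "replace"])
            if op == "duplicate":
                word = word[:i] + word[i] + word[i:]
            elif op == "remove":
                word = word[:i] + word[i + 1:]
            else:
                neighbors = KEYBOARD_NEIGHBORS.get(word[i].lower(), string.ascii_lowercase)
                word = word[:i] + random.choice(neighbors) + word[i + 1:]
        words.append(word)
    return " ".join(words)

def augment_explanation(text, aug_prob=0.2):
    # Each augmentation is considered independently with 20% probability.
    augmentations = [add_typos]  # plus whitespace, number form, contractions, ...
    for aug in augmentations:
        if random.random() < aug_prob:
            text = aug(text)
    return text
```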

Tag Removal Augmentation: I also have another augmentation, independent of the student explanation, which I apply with 20% probability to a given training sample: I randomly remove one of the xml tags in my prompt (answer options, correct-answer feature, etc.). I always keep the question and explanation in the prompt regardless of the augmentation chosen. During validation and my submission, I keep all information in the prompt.
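
A small sketch of what the tag removal can look like (the regex-based removal is a simplification of my actual prompt construction):

```python
import random
import re

OPTIONAL_TAGS = ["choices", "student_answer_letter", "student_answer_text", "correct_answer"]

def drop_random_tag(prompt, prob=0.2):
    if random.random() >= prob:
        return prompt
    tag = random.choice(OPTIONAL_TAGS)
    # Remove the whole <tag>...</tag> block (including multi-line content);
    # the question and explanation tags are never candidates for removal.
    return re.sub(rf"<{tag}>.*?</{tag}>\n?", "", prompt, flags=re.DOTALL)
```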

Training Details

Model Architecture: All models were decoder-only LLMs adapted for sequence classification via a classification head with 37 targets. Training used LoRA adapters in bfloat16 precision.
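
A hedged sketch of this setup using transformers + peft (the dropout and target modules are assumptions; the r, α, and learning-rate values correspond to one of the configurations in the table below):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Decoder-only LLM with a 37-way classification head in bfloat16.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=37, torch_dtype=torch.bfloat16
)
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32, lora_alpha=16, lora_dropout=0.05,          # dropout is illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```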

Hardware: Most training ran on a single node with 4× RTX 3090 GPUs using DDP. Some runs also used a single A100 80 GB GPU depending on resource availability.

Hyperparameters: Each model trained for 3 epochs.

For full-dataset models, I selected the final checkpoint by public LB performance (though I’d carefully consider this decision in the future; it has a strong risk of overfitting).

For cross-validation models, I used 5-fold stratified CV by misconception class. I selected the best checkpoint per fold based on OOF MAP@3 during training, evaluating every 100 parameter updates.
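
For reference, here is a minimal MAP@3 implementation of the kind used for checkpoint selection (written from the metric definition, not copied from the repo):

```python
import numpy as np

def map_at_3(probs, labels):
    """probs: (n_samples, n_classes) array; labels: (n_samples,) true class ids."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    score = 0.0
    for preds, label in zip(top3, labels):
        for rank, pred in enumerate(preds):
            if pred == label:
                score += 1.0 / (rank + 1)  # credit of 1, 1/2, or 1/3 by rank
                break
    return score / len(labels)
```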

A list of the final models in my ensemble:

| # | Model | Training Setup | Notes / Approx. CV |
|---|-------|----------------|--------------------|
| 1 | Qwen3-32B | Full data, 3 epochs | LoRA r = 8, α = 16, lr = 2e-4 |
| 2 | Qwen3-4B-Instruct-2507 | 5-fold CV | LoRA r = 32, α = 16, lr = 8e-5; CV ≈ 0.947-0.948 |
| 3 | Gemma2-9B-it | Full data, 3 epochs | LoRA r = 32, α = 16, lr = 8e-5 |
| 4 | Qwen3-1.7B | 5-fold CV | LoRA r = 8, α = 16, lr = 5e-4; CV ≈ 0.945-0.956 |

Post Processing:

Class Filtering: I restricted my top 3 predicted misconception targets to only include targets present in the training data for that question. I was unsure if this would hurt generalization to unseen question-target pairs, so one submission I picked had this filtering, and another did not.
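
A sketch of the per-question filter (the data structures here are illustrative):

```python
import numpy as np

def filter_by_question(probs, question_ids, allowed_by_question):
    """Zero out classes never seen with a question in training before taking top 3.
    allowed_by_question: dict QuestionId -> set of allowed class ids."""
    filtered = probs.copy()
    for i, qid in enumerate(question_ids):
        mask = np.ones(probs.shape[1], dtype=bool)
        mask[list(allowed_by_question[qid])] = False
        filtered[i, mask] = 0.0
    return filtered
```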

Exact Match: Some (question, mc_answer, explanation) triplets in the test set were exact matches of triplets in the train set. When I detected these, I reordered my top 3 predictions to place the training label as the top prediction, as long as there were no inconsistencies in the training data.

Final Submission Selection: For my final 2 scoring submissions, I selected a simple uniform blend of the probabilities predicted by each model (weights 0.25, 0.25, 0.25, 0.25). I also selected a version that combines a rank-weighted average with the probability average (0.7 * probability-weighted prediction + 0.3 * rank-weighted prediction).
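
A sketch of the two blends (the rank normalization details are an assumption):

```python
import numpy as np

def rank_normalize(p):
    # Convert each row of probabilities to ranks in [0, 1] (higher prob -> higher rank).
    order = p.argsort(axis=1).argsort(axis=1)
    return order / (p.shape[1] - 1)

def blend(prob_list, rank_weight=0.3):
    probs = np.mean(prob_list, axis=0)                      # uniform probability blend
    ranks = np.mean([rank_normalize(p) for p in prob_list], axis=0)
    return (1 - rank_weight) * probs + rank_weight * ranks  # 0.7 / 0.3 mix
```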

Running the Code

Feel free to look at my training code (it's all in train_fold.py) if you want a closer look at how I preprocessed the data and handled distributed training. After setting up the environment, you can run training with bash train_ddp.sh and customize hyperparameters in the bash script.

Miscellaneous Discussion and Potential Improvement

  • During development, I filtered out training examples with a character edit distance < 5 from validation samples to reduce data leakage. I later removed this filter in final runs. Keeping the filter on could result in better checkpoint selection and generalization.
  • I intended to determine ensemble weights via hill climbing on OOF predictions (hill_climbing_stacker.py; a simplified sketch follows this list), but limited time prevented full CV runs for the larger models. Uniform ensembling worked well but likely left room for improvement.
  • The label distribution was pretty imbalanced, so I explored oversampling some rare classes, but it didn't improve my CV. Augmenting rare classes with synthetic data or other approaches could definitely lead to some improvement.
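
For completeness, here is a greedy sketch of the kind of hill climbing I had in mind. It is a simplified stand-in for hill_climbing_stacker.py, not the repo's exact code, and it can use the map_at_3 function above as the score:

```python
import numpy as np

def hill_climb_weights(oof_probs, labels, score_fn, steps=200, step_size=0.05, seed=0):
    """oof_probs: list/array of shape (n_models, n_samples, n_classes)."""
    oof_probs = np.asarray(oof_probs)
    rng = np.random.default_rng(seed)
    n_models = oof_probs.shape[0]
    weights = np.full(n_models, 1.0 / n_models)
    best = score_fn(np.tensordot(weights, oof_probs, axes=1), labels)
    for _ in range(steps):
        # Propose a small random perturbation of the weights and keep it only
        # if the OOF score improves.
        candidate = np.clip(weights + rng.normal(scale=step_size, size=n_models), 0, None)
        candidate /= candidate.sum()
        score = score_fn(np.tensordot(candidate, oof_probs, axes=1), labels)
        if score > best:
            best, weights = score, candidate
    return weights, best
```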
