After completing the Adversarial Machine Learning course, here are a few tips that could be of use to the community.
Assessments:

- Evasion
I couldn't get the label to pass the autograder; since you only need a score above 90 to pass, I just ignored it. Given enough time and patience I could probably have gotten it to work. I implemented the Carlini & Wagner L2 attack. I had to loop over the attack to reach the target class: I wrote the resized image to disk, reloaded it, and if the classifier still did not predict the target class, I continued adding more perturbations. It took around 1,000 steps for the classifier to predict the target class after reloading (it still failed the autograder). A rough sketch of the loop is below.
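For reference, here is a minimal sketch of that loop, assuming ART's CarliniL2Method, with `classifier` being an ART wrapper around the target model and `x_orig` the original image as a channels-last float array in [0, 1]; the target class, number of classes, resize dimensions and file name are all placeholders, not the assessment's actual values:

```python
# Hypothetical sketch, not the exact assessment code: repeatedly run C&W L2,
# round-trip the result through a resized file on disk, and keep attacking
# the reloaded image until the classifier predicts the target class.
import numpy as np
from PIL import Image
from art.attacks.evasion import CarliniL2Method

TARGET = 7          # placeholder target class index
NUM_CLASSES = 1000  # placeholder number of classes
SIZE = (224, 224)   # placeholder resize used before writing to disk

attack = CarliniL2Method(classifier=classifier, targeted=True, max_iter=10)

y_target = np.zeros((1, NUM_CLASSES), dtype=np.float32)
y_target[0, TARGET] = 1.0

x_adv = x_orig.copy()  # assumed shape (1, H, W, C), floats in [0, 1]
for step in range(1000):
    x_adv = attack.generate(x=x_adv, y=y_target)

    # Round-trip through disk the same way the submission will be read back.
    img = Image.fromarray((x_adv[0] * 255).astype(np.uint8)).resize(SIZE)
    img.save("adv.png")
    reloaded = np.asarray(Image.open("adv.png"), dtype=np.float32) / 255.0

    if np.argmax(classifier.predict(reloaded[None, ...])) == TARGET:
        break  # the perturbation survives the save/resize/reload round-trip

    # Otherwise keep adding perturbations on top of the reloaded image.
    x_adv = reloaded[None, ...]
```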
- Extraction
It takes a while (around 10 minutes for the dataset to load and map everything). The dataloader and extract-labels functions are quite simple; just follow the signature hint. To relabel ds['train'] with the extracted labels, I wrote a function that maps a label to 0 if it is f101_hotdog and 1 otherwise, then passed that function to .map(), as sketched below. The only catch is during training: make sure you get the syntax right, remembering that we query the victim model and we only care about the two classes.
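A minimal sketch of the relabelling step, assuming a Hugging Face datasets object; the HOTDOG_ID value and the "label" column name are placeholders for whatever your notebook actually uses:

```python
# Hypothetical sketch: collapse the extracted labels into a binary
# hotdog / not-hotdog problem using datasets.map().
HOTDOG_ID = 55  # placeholder: id of the f101_hotdog class in the extracted labels

def to_binary(example):
    # 0 if the victim labelled it as hotdog, 1 otherwise
    example["label"] = 0 if example["label"] == HOTDOG_ID else 1
    return example

ds["train"] = ds["train"].map(to_binary)
```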
- Assessments
Probably the easiest one; just follow the course content, but make sure you use the test string variable defined in the grading notebook.
- Inversion
Another relatively simple task; the only catch is to write to file only the exact three classes listed in the assessment description.
- Poisoning
This one was quite difficult: getting the autograder to pass without hitting OOM errors. Poisoning the data took around 30 minutes. We need to make sure we perform a clean-label attack, i.e. only modify the data, not the labels. To keep memory under control I had to split the poisoning (GradientMatchingAttack) into four chunks: I created a dict, ran the attack on a quarter of the data at a time, and updated the dict after each iteration, as sketched below. I used a manual epsilon value set higher than what was covered in the lessons. In the end I got an image of a cat in black pixels; the rest was just artefacted.
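A rough sketch of that chunked approach, assuming ART's GradientMatchingAttack with classifier, x_train/y_train and the trigger samples x_trigger/y_trigger already defined; the hyperparameters and chunk count shown are illustrative, not the values the assessment expects:

```python
# Hypothetical sketch: run the clean-label GradientMatchingAttack on the
# training data in four chunks to avoid OOM, collecting results in a dict.
import numpy as np
from art.attacks.poisoning import GradientMatchingAttack

attack = GradientMatchingAttack(
    classifier,
    percent_poison=0.5,
    epsilon=16 / 255,  # manual epsilon, higher than the value used in the lessons
)

poisoned = {}
for idx in np.array_split(np.arange(len(x_train)), 4):
    # Clean-label: only the images are modified, the labels pass through untouched.
    x_p, y_p = attack.poison(x_trigger, y_trigger, x_train[idx], y_train[idx])
    poisoned.update({int(i): img for i, img in zip(idx, x_p)})

# Reassemble the poisoned training set in its original order.
x_train_poisoned = np.stack([poisoned[i] for i in range(len(x_train))])
```

Running the attack on a quarter of the data at a time keeps peak memory down; the dict just makes it easy to stitch the chunks back together in order afterwards.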
- LLMs
This one was tricky, and I spent most of my time getting the autograder to work. The catch here is that the evaluation is done server-side. While it was relatively easy after injection to get the model to output the warden's name and color (I used darkblue; you can also go with powderblue etc., as others mentioned in the forums), it would not pass grading, since the evaluation happens on the server.
I eventually got it working by inserting the exact sentence I needed, i.e. "Danny Shaffer is the Warden and his favorite color is DarkBlue", using the s.embed_model defined in the init method to get the embeddings, and then writing the output sentence to a txt file. You can test whether it works using get_similar_strs and answer_question (I went with "what is Danny Shaffer's favorite color?"). It is finicky. A rough sketch is below.
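A rough sketch of that last step, with the caveat that s.embed_model, get_similar_strs and answer_question are the grading notebook's own objects and I'm only assuming their exact call signatures here:

```python
# Hypothetical sketch: s, get_similar_strs and answer_question come from the
# grading notebook; the .encode() call is an assumed sentence-transformers-style
# API, so adapt it to whatever s.embed_model actually exposes.
sentence = "Danny Shaffer is the Warden and his favorite color is DarkBlue"

# Embed the injected sentence; store the embedding however the notebook's
# retrieval store expects it to be added.
embedding = s.embed_model.encode(sentence)

with open("output.txt", "w") as f:  # the exact output file name is a placeholder
    f.write(sentence + "\n")

# Local sanity check before the server-side grading run.
print(get_similar_strs("what is Danny Shaffer's favorite color?"))
print(answer_question("what is Danny Shaffer's favorite color?"))
```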
Hope this helps.