Playwright-based scraper for princetoncourses.com that:
- launches with a persistent browser profile from ./profile
- waits for Princeton login to complete before scraping
- uses Playwright for headed login/session handling and then switches to PrincetonCourses' authenticated JSON API for speed
- partitions discovery to get around the site's 150-result search cap
- scrapes course metadata plus evaluation comments/reviews for every course instance
- writes both SQLite and JSON outputs to ./output
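
One way to work around a capped search is to narrow any query whose result count hits the cap. The sketch below illustrates that idea only; it is not necessarily how this scraper partitions discovery. `search`, `discoverAll`, the prefix-style queries, and `maxDepth` are hypothetical names and assumptions introduced for the example.

```javascript
// Hypothetical sketch: the partitioning strategy used in this repo may differ.
const RESULT_CAP = 150; // the site's search cap mentioned above

// `search(query)` is an assumed async function that returns an array of
// { id, ... } objects, truncated by the server to RESULT_CAP entries.
async function discoverAll(search, alphabet, seeds = alphabet, maxDepth = 4) {
  const seen = new Map(); // id -> result, so overlapping queries are deduplicated
  const queue = seeds.map((query) => ({ query, depth: 1 }));

  while (queue.length > 0) {
    const { query, depth } = queue.shift();
    const results = await search(query);
    for (const result of results) seen.set(result.id, result);

    // A full page suggests truncation, so split the query into narrower ones.
    if (results.length >= RESULT_CAP && depth < maxDepth) {
      for (const ch of alphabet) {
        queue.push({ query: query + ch, depth: depth + 1 });
      }
    }
  }
  return [...seen.values()];
}
```

The `maxDepth` guard keeps the expansion from running away if a query can never be narrowed below the cap.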
npm install playwright
npx playwright install chromium
npm run scrape

The browser opens in headed mode by default. If Princeton authentication is required, log in in that window; the scraper continues automatically once /api/semesters becomes available for the authenticated session.
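
The "continue once /api/semesters becomes available" behavior can be pictured as a polling loop. This is a minimal sketch under stated assumptions, not the repo's actual code: `waitForAuth`, `probe`, and the interval and timeout defaults are hypothetical.

```javascript
// Hypothetical sketch of waiting for an authenticated session.
// `probe` is an assumed async function that resolves to true once the
// authenticated endpoint answers successfully.
async function waitForAuth(probe, { intervalMs = 2000, timeoutMs = 300000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await probe()) return true; // session is authenticated
    } catch {
      // Network errors while the user is still logging in are expected.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for Princeton login');
}
```

With Playwright, a probe could look like `() => context.request.get('https://www.princetoncourses.com/api/semesters').then((r) => r.ok())`, where `context` is the persistent browser context and the full URL is an assumption. If the login redirect itself returns a successful HTML page, the probe would also need to confirm that the body is JSON.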
npm run inspect
npm run scrape -- --max-courses=25 --search-concurrency=3 --course-concurrency=6
npm run scrape -- --headless=true

- --inspect-current validates authentication and writes an API inspection payload to ./inspections
- --max-courses is useful for trial runs
- --search-concurrency controls discovery parallelism
- --course-concurrency controls detail-fetch parallelism
- --min-delay-ms / --max-delay-ms add jitter between requests so the crawl stays fast without hammering the site
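
To illustrate how a concurrency limit and a jittered delay can work together, here is a minimal sketch. The helper names (`sleep`, `jitterMs`, `mapWithConcurrency`) are hypothetical, and the scraper's own scheduling may be implemented differently.

```javascript
// Hypothetical sketch: bounded parallelism with a random delay before each request.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Uniform integer delay in [minDelayMs, maxDelayMs]; `random` is injectable for tests.
function jitterMs(minDelayMs, maxDelayMs, random = Math.random) {
  return minDelayMs + Math.floor(random() * (maxDelayMs - minDelayMs + 1));
}

// Runs `worker` over `items` with at most `concurrency` calls in flight,
// returning results in the same order as `items`.
async function mapWithConcurrency(items, concurrency, worker,
  { minDelayMs = 0, maxDelayMs = 0 } = {}) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  async function lane() {
    while (next < items.length) {
      const index = next++;
      await sleep(jitterMs(minDelayMs, maxDelayMs));
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

Under this sketch, `--course-concurrency=6` would correspond to `concurrency = 6`, and `--min-delay-ms` / `--max-delay-ms` to the two delay options.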
- output/princetoncourses.sqlite
- output/princetoncourses-data.json
- output/princetoncourses-discovery-log.json
The JSON payload groups semester instances under stable courseId values, preserves website-native tags, and includes all review comments pulled from the course API. The SQLite DB mirrors the same dataset in queryable tables so the later AI-tagging pass can build on top of it.
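
As a rough picture of "semester instances grouped under stable courseId values", the following sketch groups flat records by `courseId`. The field names other than `courseId` (`semester`, `title`, `tags`, `comments`) and the function name `groupByCourseId` are assumptions for illustration; the actual payload schema may differ.

```javascript
// Hypothetical sketch of the grouping step; real field names may differ.
function groupByCourseId(instances) {
  const courses = new Map(); // courseId -> { courseId, instances: [...] }
  for (const inst of instances) {
    if (!courses.has(inst.courseId)) {
      courses.set(inst.courseId, { courseId: inst.courseId, instances: [] });
    }
    courses.get(inst.courseId).instances.push({
      semester: inst.semester,
      title: inst.title,
      tags: inst.tags ?? [],         // website-native tags, kept as-is
      comments: inst.comments ?? [], // review comments from the course API
    });
  }
  return [...courses.values()];
}
```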
Amazon Bedrock now supports API keys as Bearer tokens. This repo includes a direct HTTP Bedrock client in src/bedrock.js and a tagging scaffold in src/tag-courses.js.
The Bedrock key is not written anywhere in this repo. Use the official environment variable at runtime:
export AWS_BEARER_TOKEN_BEDROCK='...your key...'
npm run tag:bedrock -- --max-courses=20

By default the tagger targets amazon.nova-micro-v1:0, a relatively cheap Bedrock model suited to the later meta-tagging pass. Override it with BEDROCK_MODEL_ID or --model-id to use a different model.
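
For orientation, a direct HTTP call to Bedrock with a Bearer token can be sketched as below. It assumes the Bedrock Runtime Converse endpoint and a `us-east-1` default region. The function name `buildConverseRequest` and the inference settings are hypothetical, and src/bedrock.js may be structured differently.

```javascript
// Hypothetical sketch: builds (but does not send) a Converse API request.
function buildConverseRequest(prompt, {
  region = 'us-east-1', // assumed default region
  modelId = process.env.BEDROCK_MODEL_ID || 'amazon.nova-micro-v1:0',
  token = process.env.AWS_BEARER_TOKEN_BEDROCK,
} = {}) {
  if (!token) throw new Error('AWS_BEARER_TOKEN_BEDROCK is not set');
  return {
    url: `https://bedrock-runtime.${region}.amazonaws.com/model/${encodeURIComponent(modelId)}/converse`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`, // the key is read at runtime, never stored
      },
      body: JSON.stringify({
        messages: [{ role: 'user', content: [{ text: prompt }] }],
        inferenceConfig: { maxTokens: 256, temperature: 0 }, // illustrative values
      }),
    },
  };
}
```

The request could then be sent with `const { url, init } = buildConverseRequest('...'); const res = await fetch(url, init);`, with the reply text expected under `output.message.content[0].text` in the JSON response.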