anishesg/easyprincetoncourses

PrincetonCourses Scraper

Playwright-based scraper for princetoncourses.com that:

  • launches with a persistent browser profile from ./profile
  • waits for Princeton login to complete before scraping
  • uses Playwright for headed login/session handling and then switches to PrincetonCourses' authenticated JSON API for speed
  • partitions discovery to get around the site's 150-result search cap
  • scrapes course metadata plus evaluation comments/reviews for every course instance
  • writes both SQLite and JSON outputs to ./output
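The partitioned-discovery bullet is the least obvious of these, so here is a hedged sketch of the idea, not the repo's actual code: `search`, the result shape, and the prefix-splitting strategy are all assumptions. When a query hits the 150-result cap, it is split into narrower sub-queries until every leaf query returns an uncapped result set.

```javascript
// Illustrative reconstruction of partitioned discovery (names are assumptions).
const RESULT_CAP = 150;

function discover(search, query, found = new Map()) {
  const results = search(query); // injected search function -> array of { id, ... }
  if (results.length < RESULT_CAP) {
    // Uncapped result set: safe to collect everything.
    for (const course of results) found.set(course.id, course);
    return found;
  }
  // Capped: narrow the query by appending each letter and recurse.
  for (const letter of 'abcdefghijklmnopqrstuvwxyz') {
    discover(search, query + letter, found);
  }
  return found;
}
```

Deduplicating into a `Map` keyed by course id keeps overlapping sub-queries from double-counting results.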

Install

npm install playwright
npx playwright install chromium

Run

npm run scrape

The browser opens in headed mode by default. If Princeton authentication is required, log in from that window; the scraper continues automatically once /api/semesters becomes available for the authenticated session.
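The "continues automatically" step amounts to polling the authenticated endpoint until it responds. A minimal sketch of that wait loop, with the helper name, retry defaults, and injected `checkAuth` all being assumptions rather than the repo's actual code:

```javascript
// Illustrative login wait: poll an authenticated endpoint (e.g. a check that
// GET /api/semesters returns 200) until it succeeds or we give up.
async function waitForLogin(checkAuth, { retries = 60, delayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    if (await checkAuth()) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('Timed out waiting for Princeton login');
}
```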

Useful Flags

npm run inspect
npm run scrape -- --max-courses=25 --search-concurrency=3 --course-concurrency=6
npm run scrape -- --headless=true

  • --inspect-current validates authentication and writes an API inspection payload to ./inspections
  • --max-courses is useful for trial runs
  • --search-concurrency controls discovery parallelism
  • --course-concurrency controls detail-fetch parallelism
  • --min-delay-ms / --max-delay-ms add jitter between requests so the crawl stays fast without hammering the site
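The jitter flags can be pictured as a uniformly random pause drawn from the configured window before each request; a small sketch (function names are illustrative, not the repo's actual code):

```javascript
// Uniform random delay inside the [minDelayMs, maxDelayMs] window.
function jitterMs(minDelayMs, maxDelayMs) {
  return minDelayMs + Math.random() * (maxDelayMs - minDelayMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Before each request, something like:
//   await sleep(jitterMs(200, 800));
```

Randomizing the gap avoids a fixed request cadence while keeping the average throughput predictable.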

Outputs

  • output/princetoncourses.sqlite
  • output/princetoncourses-data.json
  • output/princetoncourses-discovery-log.json

The JSON payload groups semester instances under stable courseId values, preserves website-native tags, and includes all review comments pulled from the course API. The SQLite DB mirrors the same dataset in queryable tables so the later AI-tagging pass can build on top of it.
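The grouping of semester instances under a stable courseId can be pictured roughly as below; the field names are assumptions for illustration, not the repo's actual schema:

```javascript
// Illustrative grouping: collapse per-semester rows into one record per
// stable courseId, each carrying its list of semester instances.
function groupByCourseId(instances) {
  const courses = new Map();
  for (const inst of instances) {
    if (!courses.has(inst.courseId)) {
      courses.set(inst.courseId, { courseId: inst.courseId, instances: [] });
    }
    courses.get(inst.courseId).instances.push(inst);
  }
  return [...courses.values()];
}
```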

Bedrock Integration

Amazon Bedrock now supports API keys as Bearer tokens. This repo includes a direct HTTP Bedrock client in src/bedrock.js and a tagging scaffold in src/tag-courses.js.

The Bedrock key is not written anywhere in this repo. Use the official environment variable at runtime:

export AWS_BEARER_TOKEN_BEDROCK='...your key...'
npm run tag:bedrock -- --max-courses=20

By default the tagger targets amazon.nova-micro-v1:0, which is a relatively cheap Bedrock model for the later meta-tagging pass. Override with BEDROCK_MODEL_ID or --model-id if you want a different model.
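A bearer-token call to the Bedrock runtime can be sketched as follows. This assumes the documented AWS_BEARER_TOKEN_BEDROCK variable; the region default, helper name, and request-body shape are illustrative assumptions, not the actual contents of src/bedrock.js:

```javascript
// Build an invoke request for the Bedrock runtime HTTP API using an API key
// as a Bearer token (region and body shape are illustrative).
function buildInvokeRequest(modelId, body, { region = 'us-east-1' } = {}) {
  return {
    url: `https://bedrock-runtime.${region}.amazonaws.com/model/${encodeURIComponent(modelId)}/invoke`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.AWS_BEARER_TOKEN_BEDROCK}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage sketch:
//   const { url, options } = buildInvokeRequest('amazon.nova-micro-v1:0', payload);
//   const res = await fetch(url, options);
```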

About

AI-powered course discovery for Princeton students
