Kurdish Data Collector is a Django-based platform for collecting and processing Kurdish (Kurmanji) texts from uploaded PDF documents. Submitted texts are extracted, reviewed, and stored in a managed database, then published to a Hugging Face dataset to support open Kurdish language resources.
- PDF upload and automatic text extraction (PyPDF2 + PDFMiner)
- Supabase Storage integration for file handling
- Admin panel for reviewing, accepting, or rejecting submissions
- Automatic Hugging Face dataset updates for accepted submissions
- Secure admin authentication
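
The feature list names two extraction libraries but does not say how they are combined. The sketch below shows one plausible arrangement, not the project's actual code: try PyPDF2 first and fall back to PDFMiner when the first extractor fails or returns no text. The function names (`extract_with_pypdf2`, `extract_with_pdfminer`, `extract_text_with_fallback`) and the fallback order are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def extract_with_pypdf2(path: str) -> str:
    """Extract text page by page with PyPDF2 (assumes PyPDF2 >= 2.0, which provides PdfReader)."""
    from PyPDF2 import PdfReader  # imported lazily so the module loads without the library

    reader = PdfReader(path)
    # extract_text() can return None or an empty string for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_with_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six's high-level helper."""
    from pdfminer.high_level import extract_text  # imported lazily, as above

    return extract_text(path)


def extract_text_with_fallback(
    path: str,
    extractors: Optional[Iterable[Callable[[str], str]]] = None,
) -> str:
    """Return the first non-empty result from the extractors, or "" if all fail.

    Hypothetical helper: the order (PyPDF2, then PDFMiner) is an assumption.
    """
    if extractors is None:
        extractors = (extract_with_pypdf2, extract_with_pdfminer)
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # A broken or unsupported PDF in one library should not stop the next attempt.
            continue
        if text and text.strip():
            return text.strip()
    return ""
```

An empty return value is a useful signal that a submission is probably a scanned document and may need manual review rather than automatic acceptance.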
git clone https://github.com/HappyHackingSpace/Kurdish-Dataset.git
cd Kurdish-Dataset/backend
python -m venv venv
# Windows
.\venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
pip install -r requirements.txt

Copy .env.example and rename it to .env, then fill in your credentials:

cp ../.env.example ../.env
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver localhost:8000

Access the app at http://localhost:8000
DJANGO_DEBUG=1
SECRET_KEY=your-secret-key
DJANGO_ALLOWED_HOSTS=127.0.0.1,localhost
DJANGO_CSRF_TRUSTED=http://127.0.0.1:8000,http://localhost:8000
SUPABASE_URL=https://your-ref.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
SUPABASE_KEY=your-supabase-key
SUPABASE_BUCKET=your-bucket
HUGGINGFACE_TOKEN=your-huggingface-token
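
As a rough illustration of how the variables above might be consumed, the sketch below parses the flag and the comma-separated values into the shapes Django expects for `DEBUG`, `ALLOWED_HOSTS`, and `CSRF_TRUSTED_ORIGINS`. The helper names (`env_bool`, `env_list`, `load_settings`) are hypothetical, and the project's real settings module may read these values differently.

```python
import os
from typing import Dict, List, Mapping, Optional


def env_bool(name: str, default: bool = False, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret values such as "1" or "true" as True; a missing variable gives the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_list(name: str, environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Split a comma-separated variable into a list, dropping blanks and stray spaces."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Hypothetical helper: collect the Django-related values from the environment."""
    environ = os.environ if environ is None else environ
    return {
        "DEBUG": env_bool("DJANGO_DEBUG", environ=environ),
        # Raises KeyError when SECRET_KEY is missing, so a misconfigured deploy fails early.
        "SECRET_KEY": environ["SECRET_KEY"],
        "ALLOWED_HOSTS": env_list("DJANGO_ALLOWED_HOSTS", environ=environ),
        "CSRF_TRUSTED_ORIGINS": env_list("DJANGO_CSRF_TRUSTED", environ=environ),
    }
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment, for example `load_settings({"SECRET_KEY": "x", "DJANGO_DEBUG": "1"})`.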
Licensed under the MIT License. See the LICENSE file for details.