Data Extraction Methods for www.mosdac.gov.in
Overview of Website Content
---------------------------
The MOSDAC (Meteorological and Oceanographic Satellite Data Archival Centre)
website includes:
- Satellite imagery and data products (static + dynamic)
- Interactive maps and charts
- FAQs and documentation
- Searchable data archives
- Downloadable files (PDF, NetCDF, GeoTIFF, etc.)
Data Extraction Methods
------------------------
| Type | Description | Tools |
|------------------|---------------------------------------------|-----------------------------|
| Static Scraping | HTML pages, FAQs, documents | BeautifulSoup, Scrapy |
| Dynamic Scraping | JavaScript-rendered content (maps, charts) | Selenium, Playwright |
| API Access | Hidden API endpoints | Browser DevTools, Requests |
| File Downloads | PDFs, NetCDF, GeoTIFFs | wget, curl, requests |
| Geospatial Data | Map layers, GeoTIFF | GDAL, rasterio |
Procedure
---------
1. Inspect Website Structure:
- Use browser DevTools to inspect the page HTML, the network (API) requests, and any JavaScript-loaded content.
2. Static Data Extraction:
Example using BeautifulSoup:
```
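import requests
from bs4 import BeautifulSoup

url = 'https://www.mosdac.gov.in/site/content/faq'
res = requests.get(url, timeout=30)
res.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(res.text, 'html.parser')

# '.faq-question-class' is a placeholder; replace it with the selector
# found when inspecting the page in DevTools.
questions = soup.select('.faq-question-class')
for q in questions:
    print(q.get_text(strip=True))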
```
3. Dynamic Content Extraction:
Example using Selenium:
```
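from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.mosdac.gov.in/live')
    data = driver.page_source  # HTML of the page as currently rendered
finally:
    driver.quit()  # always release the browser process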
```
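Once the page HTML is available (`res.text` in step 2 or `data` in the Selenium example above), links to downloadable files can be collected from it. The following is a minimal sketch that uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs without extra dependencies. The extension list, the `LinkCollector` class, the `find_file_links` helper, and the sample HTML are illustrative assumptions, not MOSDAC's actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extensions for the file types listed in the overview.
FILE_EXTENSIONS = ('.pdf', '.nc', '.tif', '.tiff')


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def find_file_links(html, base_url):
    """Return absolute URLs of links whose path ends in a known file extension."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(FILE_EXTENSIONS):
            links.append(absolute)
    return links


# Hypothetical markup and base URL, for demonstration only.
sample_html = '''
<a href="/docs/guide.pdf">Guide</a>
<a href="/site/content/faq">FAQ</a>
<a href="data/sst.nc">SST</a>
'''
print(find_file_links(sample_html, 'https://example.org/catalog/'))
```

The same function can be given `driver.page_source` and the URL that was loaded, so that relative links resolve against the correct base.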
4. API Access via Network Interception:
```
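import requests

# Illustrative endpoint; find the real one in the DevTools Network tab.
url = 'https://www.mosdac.gov.in/api/data?type=...'
headers = {'User-Agent': 'Mozilla/5.0'}
params = {'date': '2025-07-08', 'product': 'temp'}
res = requests.get(url, headers=headers, params=params, timeout=30)
res.raise_for_status()  # stop early on HTTP errors
print(res.json())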
```
5. Automating File Downloads:
```
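import requests

file_url = 'https://www.mosdac.gov.in/file_download/sample_data.tif'
# Stream the response so large files are not held in memory at once.
with requests.get(file_url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open('data.tif', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)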
```
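When several files need to be fetched, the single-file download above can be wrapped in a loop that derives a local filename from each URL and skips files already on disk. This is a sketch under stated assumptions: `local_name`, `download_all`, and the `fetch` parameter are hypothetical names, and `fetch` stands for any caller-supplied function that takes a URL and returns bytes.

```python
import time
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def local_name(url):
    """Use the last path segment of the URL as the local filename."""
    name = PurePosixPath(urlparse(url).path).name
    if not name:
        raise ValueError(f'no filename in URL: {url}')
    return name


def download_all(urls, dest_dir, fetch, delay_seconds=1.0):
    """Save each URL into dest_dir, skipping files that already exist.

    `fetch` is a caller-supplied function: URL in, bytes out.
    Returns the list of paths that were newly written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        target = dest / local_name(url)
        if target.exists():
            continue  # already downloaded on an earlier run
        if written:
            time.sleep(delay_seconds)  # pause between real requests
        target.write_bytes(fetch(url))
        written.append(target)
    return written
```

With the Requests library, `fetch` could be as simple as `lambda u: requests.get(u, timeout=60).content`. Passing the fetch function in keeps the loop easy to test without network access.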
Tool Summary
-------------
| Tool | Use Case |
|------------------|--------------------------------|
| BeautifulSoup | HTML parsing |
| Scrapy | Large-scale scraping |
| Selenium | JavaScript/dynamic data |
| Requests | APIs and file downloads |
| Postman | API testing |
| GDAL/rasterio | Geospatial data processing |
Important Considerations
-------------------------
- Check robots.txt at: https://www.mosdac.gov.in/robots.txt
- Respect site policies and licenses
- Add delays between automated requests to avoid overloading the server
- Prefer official APIs if available
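
The robots.txt check can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses rules from a string so that it runs offline. The helper names `parse_robots` and `filter_allowed`, the sample rules, and the example.org URLs are hypothetical and do not reflect the contents of MOSDAC's actual robots.txt.

```python
from urllib import robotparser


def parse_robots(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def filter_allowed(parser, urls, user_agent='*'):
    """Keep only the URLs that the robots rules permit for user_agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical rules for demonstration; read the real file before scraping.
sample_rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
parser = parse_robots(sample_rules)
candidates = [
    'https://example.org/site/content/faq',
    'https://example.org/private/report.pdf',
]
print(filter_allowed(parser, candidates))  # URLs the rules permit
print(parser.crawl_delay('*'))             # requested delay in seconds, if any
```

To use the live file instead of a string, call `set_url('https://www.mosdac.gov.in/robots.txt')` and then `read()` on a `RobotFileParser` instance. A crawl delay reported by the parser is a reasonable lower bound for the pause between requests.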