In this article, we will show you how to build a scalable, safe, and realistic synthetic data generation system.
To do this, we will use a self-hosted LLM in a LangChain-based producer-consumer workflow. To ensure privacy compliance, we will leverage differentially private data generation.
Why you need synthetic data generation at scale
Imagine you need thousands of rows of realistic test data for a project.
Large language models (LLMs) can help, but generating high-quality data at scale takes more than a simple prompt; it takes a well-engineered solution.
But there is a catch: If the source dataset contains sensitive information, that information can leak into the generated synthetic data, making it possible to trace details back to real individuals.
On top of that, sending source data with sensitive information to commercial cloud LLMs like OpenAI’s ChatGPT is often a no-go due to privacy laws or internal data protection policies.
This is how to generate synthetic data
Rather than choosing between realism and privacy, we designed a solution that gives us both.
This solution involves running a self-hosted LLM and combining it with a differentially private data generation strategy. Self-hosting means sensitive source data never leaves our infrastructure, and differential privacy sharply limits the risk of that data leaking into our synthetic datasets.
To make this approach practical at scale, we pair it with a lightweight producer-consumer workflow that parallelises data generation.
The result is a system that can generate thousands of high-quality, realistic test records efficiently, securely, and fully under our control.
What is Differential Privacy?
Differential Privacy is a way to protect individual data in a dataset by adding randomness. It ensures that the output does not reveal whether any specific person's data was used.
Differential Privacy is useful when you need to leverage real data in various use cases, such as generating synthetic data or training models on it.
How does Differential Privacy work?
Differential Privacy adds controlled noise to data or results, so that outputs look similar whether or not a given person's data is included.
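For intuition, here is a minimal sketch of the classic Laplace mechanism applied to a count query. It is illustrative only and not part of our pipeline; `epsilon` controls the privacy/accuracy trade-off:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    # A count changes by at most 1 when one person's data is added or
    # removed (sensitivity = 1), so Laplace noise with scale 1/epsilon
    # gives epsilon-differential privacy.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print(noisy_count(1000, epsilon=0.5))  # close to 1000, but not exact
```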
Why use Differential Privacy to generate synthetic data
The Differential Privacy technique allows you to protect sensitive data, such as personally identifiable information (PII). The main advantages are:
- It enables safe data sharing or model training without exposing individuals.
- It ensures compliance with privacy laws like GDPR.
Dataset selection
The dataset we chose is the Adult Census Income dataset from Kaggle. We wanted data that felt real enough to matter but was simple enough to work with, and this dataset ticked all the boxes.
With almost 49k rows of information about people’s age, work, education, and income, it gives us a rich picture of everyday life in a neatly structured table. It is the kind of dataset that has been tried and tested by countless Kaggle users, so we could be confident we were building on solid ground.
Moreover, this dataset would give us numbers and categories but also valuable context. The mix of personal details like age and occupation allowed us to explore how privacy can be protected while still learning from the data.
Finally, the data is open for research under a friendly license, so we could use it freely.
The solution: How to build a data synthesis pipeline
Hosting our LLM
We tested our solution in a Runpod-hosted environment. We explored multiple LLM options and selected TinyLlama/TinyLlama-1.1B-Chat-v1.0. This model suits resource-constrained environments: it offers fast, high-quality text generation and is compatible with Llama 2-based projects.
While we understand a small model might not give the best performance on the task, we can always switch to a different model later.
With sufficient guardrails, which we will explain below, this model ticks the boxes for our data generation needs.
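For illustration, loading the model could look like the sketch below, assuming the `langchain-huggingface` integration and a GPU; the parameter values are illustrative rather than our exact configuration:

```python
from langchain_huggingface import HuggingFacePipeline

# Expose TinyLlama to LangChain via a local Hugging Face
# text-generation pipeline.
llm = HuggingFacePipeline.from_model_id(
    model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    task="text-generation",
    device=0,  # first GPU
    pipeline_kwargs={"max_new_tokens": 256, "do_sample": True, "temperature": 0.7},
)
```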
The Prompt: Few-shot examples and guardrails
It was time to prompt the LLM.
We prompted it with a number of randomly selected examples from the dataset and instructed it to generate strictly one record per invocation. This made it less likely to hallucinate or deviate from the guardrails we provided in the prompt.
It took several iterations of trial and error to craft a prompt that suited the use case.
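The sketch below shows the general shape of that prompt, assuming LangChain's `PromptTemplate`; the instructions and examples are illustrative, and the exact wording lives in the repository:

```python
from langchain_core.prompts import PromptTemplate

# Few-shot prompt: sampled example records anchor the format, and the
# guardrails instruct the model to emit exactly one JSON record.
prompt = PromptTemplate.from_template(
    "You are a data generator. Study the example records below and "
    "generate EXACTLY ONE new, realistic record in the same JSON format.\n"
    "Rules:\n"
    "- Output only the JSON object, with no commentary.\n"
    "- Use only categorical values that appear in the examples.\n\n"
    "Examples:\n{examples}\n\nNew record:"
)

# On every invocation, `examples` is filled with a few randomly
# sampled rows from the seed dataset, serialised as JSON.
```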
Building the app with LangChain and a Producer / Consumer pattern
Structuring the application around a producer/consumer pattern paired well with LangChain for the multi-stage pipeline we were building.
This architecture allows you to:
- Decouple stages: Separate prompt construction, LLM inference, parsing and validation, and storage. Change or redeploy each stage independently with different resources or configurations.
- Scale with elastic concurrency: Add more consumers to scale the slow step, such as LLM calls.
- Maintain backpressure and stability: Set a maximum queue size to throttle upstream producers and prevent the system from becoming overwhelmed.
Pipeline stages
Stage 1: Producer: Build prompts by combining system and user instructions with example records from the dataset. Submit each prompt to the prompt queue. This stage runs quickly.
Stage 2: Consumer–producers: Worker processes read prompts from the prompt queue and send them to the self-hosted LLM using LangChain. After inference, validate outputs with Pydantic and apply differential privacy sampling. This stage runs slowly because it depends on LLM inference. Increase throughput by adding more workers and scaling horizontally.
Stage 3: Consumer: Read processed results from the results queue. Append validated rows to a CSV file and write errors to a JSONL log file.
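A stripped-down sketch of this three-stage layout, using Python's `multiprocessing` queues; the real pipeline adds LangChain inference, Pydantic validation, and differential privacy in the worker step:

```python
import multiprocessing as mp

NUM_WORKERS = 6

def producer(prompt_q):
    # Stage 1: enqueue prompts; the queue's maxsize applies backpressure.
    for i in range(1000):
        prompt_q.put(f"prompt {i}")   # placeholder for a real few-shot prompt
    for _ in range(NUM_WORKERS):
        prompt_q.put(None)            # one stop signal per worker

def worker(prompt_q, result_q):
    # Stage 2: the slow step -- LLM inference, validation, DP sampling.
    while (prompt := prompt_q.get()) is not None:
        result_q.put(prompt.upper())  # placeholder for llm.invoke(prompt)
    result_q.put(None)

def writer(result_q):
    # Stage 3: drain results until every worker has signalled completion.
    done = 0
    while done < NUM_WORKERS:
        row = result_q.get()
        if row is None:
            done += 1
        else:
            print(row)                # placeholder for appending to CSV

if __name__ == "__main__":
    prompt_q, result_q = mp.Queue(maxsize=100), mp.Queue()
    procs = [mp.Process(target=producer, args=(prompt_q,))]
    procs += [mp.Process(target=worker, args=(prompt_q, result_q))
              for _ in range(NUM_WORKERS)]
    procs += [mp.Process(target=writer, args=(result_q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```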
Pydantic output parser
Even with careful prompt design, the LLM could still produce outputs that did not meet the specification. It could hallucinate, return the wrong data type, or include extra text.
To control this, we used an output parser based on Pydantic.
Pydantic validates unpredictable LLM outputs and converts them into typed Python objects. This gives us structured, validated data rather than free text.
The process works as follows:
- Define a schema, such as the Row model in the example code below.
- Parse the LLM output into this schema using LangChain.
- Let Pydantic validate the result. If the output contains the wrong type, missing fields, or invalid values, Pydantic raises a validation error.
This approach makes the pipeline robust. It either returns a valid object or fails with a clear validation error.
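A minimal sketch of the schema-plus-parser combination, assuming Pydantic v2 and LangChain's `PydanticOutputParser`; the field set follows the Adult dataset, though the exact `Row` model in our code may differ:

```python
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class Row(BaseModel):
    # Schema for one synthetic census record.
    age: int = Field(ge=17, le=90)
    workclass: str
    education: str
    occupation: str
    gender: str
    income: str

parser = PydanticOutputParser(pydantic_object=Row)

# Valid JSON becomes a typed object; anything else raises an error.
row = parser.parse(
    '{"age": 42, "workclass": "Private", "education": "Bachelors", '
    '"occupation": "Sales", "gender": "Male", "income": "<=50K"}'
)
print(row.age)  # 42
```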
Differentially Private data generation
As explained earlier in this blog post, the idea behind Differential Privacy is to add randomness (noise) to data so that individual records cannot be identified directly. It is like adding static to a photo: you can still see the crowd, but you cannot recognise individuals.
Let’s dive deep into the code:
Step 1: Add noise to probabilities
This function takes a list of probabilities, adds Laplace noise (controlled by epsilon), and then cleans the result up so the numbers stay non-negative and sum to 1.
The output is a slightly “shaken up” probability distribution. John D. Cook explains this well on his blog.
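A sketch of such a function, assuming NumPy; the version in our repository may differ in details:

```python
import numpy as np

def add_laplace_noise(probs: np.ndarray, epsilon: float) -> np.ndarray:
    # Perturb each probability with Laplace noise scaled by 1/epsilon;
    # a smaller epsilon means more noise and stronger privacy.
    noisy = probs + np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=len(probs))
    noisy = np.clip(noisy, 0.0, None)   # keep probabilities non-negative
    if noisy.sum() == 0:                # degenerate case: fall back to uniform
        noisy = np.ones_like(probs)
    return noisy / noisy.sum()          # renormalise so they sum to 1
```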
Step 2: Default options
This defines the allowed categorical values for each column. It is a dictionary that lists all permitted categories per field.
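For illustration, an abridged version using real categories from the Adult dataset (the full dictionary in our code lists every permitted value):

```python
# Allowed categorical values per field (abridged for illustration).
DEFAULT_OPTIONS = {
    "workclass": ["Private", "Self-emp-not-inc", "Local-gov", "State-gov"],
    "education": ["HS-grad", "Some-college", "Bachelors", "Masters"],
    "occupation": ["Sales", "Craft-repair", "Exec-managerial", "Prof-specialty"],
    "gender": ["Male", "Female"],
    "income": ["<=50K", ">50K"],
}
```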
Step 3: Randomising the row
The function below brings the process together. It takes a single row and, for each categorical field such as workclass, education, occupation, gender, and income, it:
- Creates a uniform distribution across the allowed categories;
- Adds noise;
- Samples one category at random.
It returns a noisy version of the row.
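A sketch of that per-row randomisation, built on the two pieces above (the names `add_laplace_noise` and `DEFAULT_OPTIONS` refer to the earlier sketches):

```python
import numpy as np

def randomise_row(row: dict, epsilon: float = 1.0) -> dict:
    # Resample each categorical field from a noisy uniform distribution
    # over its allowed values, detaching it from the original record.
    noisy_row = dict(row)
    for field, options in DEFAULT_OPTIONS.items():
        uniform = np.full(len(options), 1.0 / len(options))  # uniform prior
        probs = add_laplace_noise(uniform, epsilon)          # Step 1 function
        noisy_row[field] = np.random.choice(options, p=probs)
    return noisy_row
```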
This approach prevents anyone from tracing a row back to the original record. At the same time, the dataset remains realistic and suitable for aggregate analysis.
Running the pipeline
When you execute the pipeline, it runs a full producer–consumer data generation workflow. It samples from a seed dataset and distributes LLM-based generation across multiple worker processes. The system streams generated records to a CSV file and writes validation errors to a JSONL log file.
Key scaling parameters:
- n_batches: Set the total number of synthetic records to generate.
- num_workers: Set the number of parallel worker processes used for generation. Increase this gradually and monitor memory usage.
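Putting it together, the entry point might look like this sketch; `run_pipeline` and `load_seed_dataset` are hypothetical names standing in for the repository's actual functions:

```python
if __name__ == "__main__":
    seed_df = load_seed_dataset("adult.csv")  # seed rows for few-shot prompts
    run_pipeline(
        seed_df=seed_df,
        n_batches=1000,                 # total synthetic records to generate
        num_workers=6,                  # parallel consumer processes
        output_csv="synthetic_adult.csv",
        error_log="errors.jsonl",
    )
```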
Execution and results
| Setting | Value |
|---|---|
| GPU | A100 (80 GB) |
| Worker nodes | 6 |
| Synthetic records requested | 1,000 |
| Records passing Pydantic validation | ~850 |
| Time taken | ~12 minutes |
| Other observations | High GPU utilisation (100%) and low memory utilisation. The tiny 1.1 B-parameter model is memory efficient and offered a good trade-off between accuracy and efficiency. |
Keep in mind
We built the proof of concept in a Jupyter notebook. This works well for experimentation, but it does not suit large-scale or production use.
To scale the approach, we would need a stronger orchestration layer. Tools such as Apache Airflow, Prefect, or Kubernetes can manage workflows, scheduling, and fault tolerance in a production setting.
Model size is another limitation. We used a small 1.1 billion parameter model, TinyLlama. It was sufficient to demonstrate the concept, but it may not capture the detail required for complex, domain-specific tasks.
We also used the Adult Census Income dataset, which is clean and well structured. Real-world datasets often contain missing values, anomalies, and inconsistent schemas. To handle this, we would need to add preprocessing, anomaly detection, and schema validation before sending records to the model.
Final word on synthetic data
In this article, we demonstrated how to generate realistic synthetic data at scale using a self-hosted LLM. We protected privacy with a differentially private data generation approach, ensuring that sensitive source information remained secure.
By combining a few-shot prompting strategy with a producer–consumer workflow, we efficiently produced large volumes of synthetic test data.
The full source code is available in this GitHub repository.
If you have any questions, reach out to the team via the contact form or on LinkedIn.
