⏱️ Read Time | 🛠️ Implementation + Runtime | 💵 Cost | 🎓 Experience |
---|---|---|---|
10–20 min | ~2–3 hours | $40 | Intermediate |
The Goal
We want to create a synthetic dataset that captures the relevant characteristics and signals of another, sensitive dataset. By the end, we’ll have a final, derived synthetic dataset that retains over 95% of the original’s signal while reducing PII occurrences by over 97%, plus a multi-step pipeline, written with the Sutro Python SDK, that generates it easily, quickly, and inexpensively. Let’s get started!
Choosing our Original Dataset
In our previous example, we created a heterogeneous, 20,000-row synthetic dataset of product reviews. Product reviews are typically public-facing and rarely a major privacy concern. However, we can use this dataset to create simulated customer support questions and dialogues, which are often sensitive, internal, and private. Because this will itself be a synthetic dataset, we’re relieved of privacy concerns for this demo. The step is relatively straightforward, and we can use a pipeline similar to the one from the previous example. We prototyped with a few low-priority (job_priority=0) jobs on several different models before settling on a larger one; despite its size, generating the full dataset cost only about $8.
Once the job finished, we grabbed the resulting data, appended it to our original product review dataset, and saved the relevant columns to create a new synthetic dataset, as sketched below.
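Here is a minimal sketch of that generation-and-wrangling step. It assumes the Sutro SDK exposes an `so.infer(...)`-style call; the file paths, column names, prompt wording, and model name are illustrative stand-ins rather than the exact values used, while `job_priority=0` for cheap prototyping runs comes straight from the step above.

```python
import polars as pl
import sutro as so  # Sutro Python SDK; the exact so.infer(...) signature here is an assumption

# Load the 20,000 synthetic product reviews from the previous example (hypothetical path/columns)
reviews = pl.read_parquet("product_reviews.parquet")

SYSTEM_PROMPT = (
    "You are simulating an internal customer support transcript. Given a product and one of its "
    "reviews, write a realistic multi-turn dialogue between the customer and a support agent, "
    "including a plausible customer name and contact details."
)

# Prototype cheaply with job_priority=0 jobs, as in the original run
dialogues = so.infer(
    reviews,
    column="review_text",        # hypothetical column name
    system_prompt=SYSTEM_PROMPT,
    model="llama-3.3-70b",       # stand-in for the larger model settled on after prototyping
    job_priority=0,
)

# Assuming infer returns one generation per input row, append the dialogues
# to the review data and keep only the relevant columns
support_df = reviews.with_columns(pl.Series("support_dialogue", dialogues)).select(
    ["product_name", "product_description", "support_dialogue"]
)
support_df.write_parquet("synthetic_support_dialogues.parquet")
```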
Creating a Feature Map
Here’s where the mad science begins! The overall goal is to create a synthetic dataset that captures the relevant characteristics and signals of the original dataset, so we first need to decide what those important characteristics and signals are. Once we have a good understanding of the distinct features we want to preserve, we can extract each of them from every record in the original dataset and use the result as a “feature map” to guide the generation of the final synthetic dataset. In this case, we’re dealing with customer support dialogues, and let’s say our goal is to train a helpful downstream assistant chatbot that answers questions about the products in our catalog. So, what are the important signals we want to preserve from each record? For this example (see the schema sketch after this list), we’ll preserve:
- product name
- product description
- issue type (open-set label)
- issue severity (low, medium, high)
- issue description
- resolution path description
- outcome (success, failure)
- customer sentiment (positive, negative, neutral)
- customer satisfaction (low, medium, high)
And, just as importantly, what we want to strip out rather than preserve:
- customer name and other personally identifiable information
- sensitive personal details that aren’t relevant to the issue
- any other information that isn’t relevant to the issue
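One way to express that feature map is as a structured-output schema for the extraction pass. The sketch below mirrors the list above in a Pydantic model; the `output_schema` parameter and the rest of the Sutro call shape are assumptions to verify against the SDK docs.

```python
from typing import Literal

import polars as pl
import sutro as so  # assumed interface; check the SDK docs for exact structured-output support
from pydantic import BaseModel

# The feature map schema: only the signals we want to preserve.
# PII and irrelevant details are deliberately absent, so they never reach the map.
class FeatureMap(BaseModel):
    product_name: str
    product_description: str
    issue_type: str                                   # open-set label
    issue_severity: Literal["low", "medium", "high"]
    issue_description: str
    resolution_path_description: str
    outcome: Literal["success", "failure"]
    customer_sentiment: Literal["positive", "negative", "neutral"]
    customer_satisfaction: Literal["low", "medium", "high"]

EXTRACTION_PROMPT = (
    "Extract the requested features from the customer support dialogue. "
    "Never include names, contact details, or any other personally identifiable information."
)

support_df = pl.read_parquet("synthetic_support_dialogues.parquet")

# Assumed call shape: structured outputs via a Pydantic schema
feature_maps = so.infer(
    support_df,
    column="support_dialogue",
    system_prompt=EXTRACTION_PROMPT,
    output_schema=FeatureMap,
    job_priority=0,
)
```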

Let the Map Lead the Way
Now that we’ve extracted the relevant features from the original dataset, we can use them to guide the generation of a new, synthetic dataset. A sketch of this generation pass follows.
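This sketch makes the same assumptions about the SDK interface as the earlier snippets; the path, column name, prompt wording, and model name are illustrative.

```python
import polars as pl
import sutro as so  # assumed interface, as in the earlier sketches

# Hypothetical: one serialized FeatureMap per row
feature_df = pl.read_parquet("feature_maps.parquet")

GENERATION_PROMPT = (
    "You are writing a fictional customer support dialogue. Using only the feature map provided, "
    "produce a realistic multi-turn conversation that matches the product, issue, severity, "
    "resolution path, outcome, and customer sentiment and satisfaction. Invent any names or "
    "personal details from scratch; never include real-looking PII."
)

synthetic_dialogues = so.infer(
    feature_df,
    column="feature_map_json",   # hypothetical column holding the JSON feature map
    system_prompt=GENERATION_PROMPT,
    model="llama-3.3-70b",       # stand-in model name
)
```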
Evaluating the Results
We claim to have created a dataset that’s just as useful, but without the privacy concerns. Let’s see if that’s the case! To do so, we’ll use a lightweight LLM-as-a-judge method (sketched below) to evaluate:
- The extent to which the synthetic dataset captures the same features as the original dataset
- The drop in PII and sensitive personal details between the two datasets
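A sketch of the judge pass, under the same assumed SDK interface; the judge schema, prompt wording, and paired-dataframe layout are illustrative rather than the exact ones used.

```python
import polars as pl
import sutro as so  # assumed interface
from pydantic import BaseModel

class JudgeResult(BaseModel):
    similarity_score: int      # 0-100: how much of the original signal the synthetic dialogue keeps
    original_pii_count: int    # pieces of PII / sensitive detail in the original dialogue
    synthetic_pii_count: int   # pieces of PII / sensitive detail in the synthetic dialogue

JUDGE_PROMPT = (
    "You are shown an ORIGINAL customer support dialogue and a SYNTHETIC dialogue derived from it. "
    "Score from 0 to 100 how well the synthetic one preserves the product, issue, resolution, "
    "outcome, and customer sentiment of the original, then count the pieces of PII or sensitive "
    "personal detail in each dialogue."
)

# Hypothetical paired dataframe with one column containing both dialogues per row
pairs = pl.read_parquet("paired_dialogues.parquet")

judge_results = so.infer(
    pairs,
    column="pair_text",
    system_prompt=JUDGE_PROMPT,
    output_schema=JudgeResult,
    job_priority=0,
)
```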
Feature similarity score (0–100), synthetic vs. original:
statistic | value |
---|---|
count | 20000.0 |
null_count | 0.0 |
mean | 95.93085 |
std | 3.517469 |
min | 20.0 |
25% | 95.0 |
50% | 95.0 |
75% | 98.0 |
max | 100.0 |
PII / sensitive detail count per dialogue, original dataset:
statistic | value |
---|---|
count | 19999.0 |
null_count | 1.0 |
mean | 1.131107 |
std | 0.534079 |
min | 0.0 |
25% | 1.0 |
50% | 1.0 |
75% | 1.0 |
max | 5.0 |
PII / sensitive detail count per dialogue, synthetic dataset:
statistic | value |
---|---|
count | 20000.0 |
null_count | 0.0 |
mean | 0.0314 |
std | 0.183618 |
min | 0.0 |
25% | 0.0 |
50% | 0.0 |
75% | 0.0 |
max | 3.0 |
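The three tables above are plain summary statistics of the judge’s outputs. Assuming those outputs were parsed into a Polars DataFrame (the path and column names below are hypothetical), they can be reproduced with `describe()`:

```python
import polars as pl

# Hypothetical: judge outputs parsed into one row per dialogue pair
scores = pl.read_parquet("judge_results.parquet")

# Each table above is a describe() summary of one judge field
print(scores.select("similarity_score").describe())
print(scores.select("original_pii_count").describe())
print(scores.select("synthetic_pii_count").describe())
```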
Conclusion
We’ve demonstrated that, using LLMs alone, we can create synthetic datasets that capture the signal, intent, and other important characteristics of an original dataset while minimizing privacy concerns. This can be a powerful tool for organizations that want to unlock the value of their sensitive data without compromising privacy. That said, these methods aren’t perfect, so it’s important to make sure they’re appropriate for your use case. In summary:
✅ We created a synthetic dataset that captures the same features as the original dataset (95.93% average similarity score).
✅ We decreased the PII count from 1.13 to 0.03 per dialogue (a 97% reduction).
✅ We did this using a handful of small Python scripts and no infrastructure setup.
✅ We did all of this in a few hours.
✅ For less than $40 worth of tokens!
The final dataset is available on HuggingFace here: