Learn how to create useful synthetic data from the relevant characteristics of another dataset while reducing privacy concerns.
15 min read · ~2-3 hour project · ~$40 · Medium
It’s common for organizations to sit on vast amounts of sensitive data that can’t be analyzed, shared, or used to train AI models because of privacy concerns and regulatory constraints.

However, in certain cases, synthetic data can unlock otherwise off-limits data by creating derived versions that capture the intent, signal, and other important characteristics of the original dataset while reducing the privacy concerns that come with it.

For example, consider healthcare or financial data such as clinical notes and financial transactions. Such records are often extremely sensitive, yet they can be useful for training models that help with patient care or financial planning. This data is often off-limits for sharing, analysis, training, or even moving off of physical servers. With synthetic data, many of these concerns can be mitigated.

Note: it’s important to consult with legal and privacy professionals before using synthetic data for privacy purposes. This example is for educational purposes only.
We want to create a synthetic dataset that captures the relevant characteristics and signals of another, sensitive dataset.

By the end, we’ll have a final, derived synthetic dataset that retains over 95% of the original dataset’s signal while reducing PII occurrences by over 97% - along with a multi-step pipeline, written with the Sutro Python SDK, that generates it easily, quickly, and inexpensively.

Let’s get started!
In our previous example, we created a heterogeneous, 20,000-row synthetic dataset of product reviews. Product reviews are typically public-facing and rarely a major privacy issue.

However, we can use this dataset to create simulated customer support questions and dialogues, which are often sensitive, internal, and private. Because the source dataset is itself synthetic, we’re relieved of privacy concerns for this demo.

This step is relatively straightforward, and we can use a pipeline similar to the one from the previous example.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel
from typing import List

# Login using e.g. `huggingface-cli login` to access this dataset
df = pl.read_parquet('hf://datasets/sutro/synthetic-product-reviews-20k/results.parquet')

system_prompt = """
You will be shown a product review. It contains the product name, product description, review title, review, product rating, and product reviewer's name.

Your job is to write a realistic customer support dialogue between the product reviewer and a customer support agent.

The dialogue should start with a realistic question about the product from the reviewer. You should return the dialogue as a list of strings.
"""

class CustomerSupportQuestion(BaseModel):
    dialogue: List[str]

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Review Title: ", "review_title",
        " Review: ", "review_text",
        " Product Rating: ", "rating_out_of_5",
        " Product Reviewer: ", "review_author",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportQuestion,
    job_priority=1,
)
```
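The `column` argument above interleaves literal prompt fragments with column names. To make that concrete, here’s a rough illustration of the single prompt string we’d expect it to assemble for each row - the exact joining behavior is our assumption about the SDK, not something documented here:

```python
# Illustrative only: mimics how we assume Sutro interleaves literals and column values.
row = {"product_name": "Trailblazer Bottle", "product_description": "32oz insulated steel bottle."}
parts = ["Product Name: ", "product_name", " Product Description: ", "product_description"]

# Literal fragments pass through; known column names are replaced by the row's values.
prompt = "".join(row.get(p, p) for p in parts)
print(prompt)
# Product Name: Trailblazer Bottle Product Description: 32oz insulated steel bottle.
```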
We opted for a 32B model in this case: a good middle ground between speed and quality for creating nuanced, realistic customer support dialogues. We iterated with prototyping (job_priority=0) jobs on a few different models before settling on this one. Despite the larger model, the job only cost about $8 to generate.

We’re also using Sutro’s helpful column concatenation feature, sketched above, to assemble a single input string from multiple columns.

Once the job finished, we grabbed the resulting data, appended it to the original product review dataset, and saved the relevant columns to create a new, synthetic dataset.
```python
# Grab the results from the job
results = so.get_job_results('job-b1e75fa1-821e-4479-8009-5b6b22616513')
dialogues = results['dialogue'].to_list()

# Append the dialogues to the original dataset
df = df.with_columns(pl.Series(dialogues).alias('customer_support_dialogue'))
df = df.select(['product_name', 'product_description', 'customer_support_dialogue'])

# Save the new dataset
df.write_parquet('20k-customer-support-dialogues.parquet')
```
Here’s where the mad science begins! The overall goal is to create a synthetic dataset that captures the relevant characteristics and signals of the original dataset. To do that, we need to decide what those important characteristics and signals are.

Once we have a good understanding of the important, distinct features we want to preserve, we can extract each of them from every record in the original dataset and use the result as a “feature map” to guide the generation of the final synthetic dataset (a concrete example follows the lists below).

In this case, we’re dealing with customer support dialogues. Let’s say, for the sake of this example, that our goal is to train a helpful, downstream assistant chatbot to answer questions about the products in our catalog. So what are the important signals we want to preserve from each record?

For this example, we’ll preserve:
product name
product description
issue type (open-set label)
issue severity (low, medium, high)
issue description
resolution path description
outcome (success, failure)
customer sentiment (positive, negative, neutral)
customer satisfaction (low, medium, high)
Things we’ll want to filter out or avoid:
customer name and other personally identifiable information
sensitive personal details that aren’t relevant to the issue
any other information that isn’t relevant to the issue
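To make the idea concrete, here’s what a single record’s feature map might look like once extracted - the values are invented purely for illustration:

```python
# Hypothetical feature map for one dialogue (illustrative values only).
feature_map = {
    "product_name": "Trailblazer Bottle",
    "product_description": "32oz insulated steel bottle.",
    "issue_type": "lid_leak",          # open-set label
    "issue_severity": "medium",        # low / medium / high
    "issue_description": "Bottle leaks when carried sideways in a bag.",
    "resolution_path_description": "Agent confirmed a known gasket defect and shipped a replacement lid.",
    "outcome": "success",              # success / failure
    "customer_sentiment": "positive",  # positive / negative / neutral
    "customer_satisfaction": "high",   # low / medium / high
}
```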
If we can effectively extract these features while avoiding the privacy-compromising information, we’ll retain what’s actually useful about the original dataset. Using this feature map, we can reconstruct new, synthetic dialogues that are just as useful for training a helpful assistant chatbot, while simultaneously avoiding privacy concerns.

Let’s do that now!
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('hf://datasets/sutro/synthetic-customer-support-dialogues-20k/20k-customer-support-dialogues.parquet')

# Join the customer support dialogue into a single string
df = df.with_columns(
    pl.col("customer_support_dialogue").list.join("\n"),
)

system_prompt = """
You will be shown a customer support dialogue about a product ordered by a customer. It contains the product name, product description, and customer support dialogue.

Your goal is to extract out the following important features from the dialogue:
- issue type (open-set label)
- issue severity (low, medium, high)
- issue description
- resolution path description
- outcome (success, failure)
- customer sentiment (positive, negative, neutral)
- customer satisfaction (low, medium, high)

You should avoid extracting any personally identifiable information or sensitive personal details that aren't relevant to the issue.
"""

class CustomerSupportFeatureExtraction(BaseModel):
    issue_type: str
    issue_severity: str
    issue_description: str
    resolution_path_description: str
    outcome: str
    customer_sentiment: str
    customer_satisfaction: str

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Customer Support Dialogue: ", "customer_support_dialogue",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportFeatureExtraction,
    job_priority=1,
)
```
You can easily explore the results in the Sutro Web UI.

These results look good! There is plenty of variation in issue type, issue description, and resolution path description. In a real dataset we’d likely see more variation in customer sentiment, customer satisfaction, and outcome, but this should be sufficient for our demo.

And despite using a 32B model, this feature extraction job only cost $3.11 to run. For the numbers: this 20,000-row job processed 10.8M input tokens and generated 2.3M output tokens.
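If you’d rather check the variation programmatically than eyeball it in the UI, here’s a short sketch. It assumes, as the later snippets do, that `get_job_results` returns a polars DataFrame with one column per schema field:

```python
import sutro as so

features = so.get_job_results('job-63e3cd5a-0e32-4583-bf1e-b6cc72c844ad')

# Open-set labels: how many distinct issue types did the model produce?
print(features['issue_type'].n_unique())

# Closed-set labels: check the distributions we expect to be skewed.
for col in ['issue_severity', 'outcome', 'customer_sentiment', 'customer_satisfaction']:
    print(features[col].value_counts(sort=True))
```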
Now that we’ve extracted the relevant features from the original dataset, we can use them to guide the generation of a new, synthetic dataset.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel
from typing import List

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
results = so.get_job_results('job-63e3cd5a-0e32-4583-bf1e-b6cc72c844ad')

# Horizontally concatenate df with the extracted features
df = df.with_columns(results.select(
    'issue_type', 'issue_severity', 'issue_description',
    'resolution_path_description', 'outcome',
    'customer_sentiment', 'customer_satisfaction',
))

system_prompt = """
You will be shown features extracted from a customer dialogue about a product. It contains the product name, product description, issue type, issue severity, issue description, resolution path description, outcome, customer sentiment, and customer satisfaction.

Your goal is to generate a new, multi-turn, realistic customer dialogue that captures the same features as the original dialogue between the customer and the customer support agent. It should start with a question about the product from the customer.

You should return the dialogue as a list of strings.
"""

class CustomerSupportDialogue(BaseModel):
    dialogue: List[str]

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Issue Type: ", "issue_type",
        " Issue Severity: ", "issue_severity",
        " Issue Description: ", "issue_description",
        " Resolution Path Description: ", "resolution_path_description",
        " Outcome: ", "outcome",
        " Customer Sentiment: ", "customer_sentiment",
        " Customer Satisfaction: ", "customer_satisfaction",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportDialogue,
    job_priority=1,
)
```
This produced a new, synthetic dataset of 20,000 customer support dialogues that captures the same features as the original dataset. The total cost of this job was $5.97.
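Before evaluating anything formally, it’s worth spot-checking a generated dialogue by hand. A quick sketch - the job ID is the same one referenced in the evaluation step below, and we assume the results come back as a polars DataFrame with a `dialogue` list column:

```python
# Spot-check one regenerated dialogue (job ID from the generation job above).
sample = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
for turn in sample['dialogue'][0]:
    print(turn)
```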
We claim to have created a dataset that’s just as useful, but without the privacy concerns. Let’s see if that’s the case!

To do so, we’ll use a lightweight LLM-as-a-judge method to evaluate:
The extent to which the synthetic dataset captures the same features as the original dataset
The drop in PII and sensitive personal details between the two datasets
First, we’ll evaluate the extent to which the synthetic dataset captures the same features as the original dataset.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
new_dialogues = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
df = df.with_columns(new_dialogues['dialogue'].alias('new_customer_support_dialogue'))

system_prompt = """
You will be shown two customer support dialogues about a product.

Your goal is to evaluate the similarity between the two dialogues, with respect to the following features:
- product name
- product description
- issue type
- issue severity
- issue description
- resolution path description
- outcome
- customer sentiment
- customer satisfaction

You should return a score between 0 and 100, where 100 means the two dialogues capture the same features.
"""

class CustomerSupportDialogueEvaluation(BaseModel):
    score: int

results = so.infer(
    df,
    column=[
        "Customer Support Dialogue 1: ", "customer_support_dialogue",
        " Customer Support Dialogue 2: ", "new_customer_support_dialogue",
    ],
    system_prompt=system_prompt,
    model="qwen-3-4b-thinking",
    output_schema=CustomerSupportDialogueEvaluation,
    job_priority=1,
)
```
This job was a bit token-heavy, since the judge compares the dialogues and reasons about them (15.2M input tokens and 55M output tokens). Generally speaking, LLMs are strong at comparing text, so we opted for a 4B model for this evaluation. Despite the heavy token usage, it only cost $11.81 to run. Most proprietary models would cost significantly more - likely 5-10x more for this job - showing the cost-saving potential of using small, open-source models at scale.

Once the job finishes, we can grab the results and evaluate the similarity score.
```python
similarity_results = so.get_job_results('job-38b880e0-41bf-41aa-83b7-277668086e5f')

# Extract the score from the content field
df = df.with_columns(similarity_results['content'].struct.field('score').alias('similarity_score'))

print(df['similarity_score'].describe())
```
| statistic | value |
| --- | --- |
| count | 20000.0 |
| null_count | 0.0 |
| mean | 95.93085 |
| std | 3.517469 |
| min | 20.0 |
| 25% | 95.0 |
| 50% | 95.0 |
| 75% | 98.0 |
| max | 100.0 |
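As an aside: the summary shows a long low tail (the min score is 20.0). If we wanted to be conservative before any downstream training, we could drop the lowest-scoring rows. A minimal, hypothetical sketch using the `similarity_score` column built above - the threshold is arbitrary:

```python
# Hypothetical cleanup: keep only rows the judge scored at least 80.
filtered_df = df.filter(pl.col('similarity_score') >= 80)
print(f"Kept {filtered_df.height} of {df.height} rows")
```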
If our LLM-as-a-judge is to be trusted, this is a good result: we’re averaging a 95.93% similarity score between the original and synthetic dialogues.

Next, we’ll evaluate for PII and sensitive personal details.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
new_dialogues = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
df = df.with_columns(new_dialogues['dialogue'].alias('new_customer_support_dialogue'))

# Join each dialogue's turns into a single string for prompting
df = df.with_columns(
    pl.col("customer_support_dialogue").list.join("\n")
    .alias('customer_support_dialogue_prompt')
)
df = df.with_columns(
    pl.col("new_customer_support_dialogue").list.join("\n")
    .alias('new_customer_support_dialogue_prompt')
)

system_prompt = """
You will be shown a customer support dialogue about a product.

Your goal is to review the dialogue for any personally identifiable information (PII) or sensitive personal details.

Return the number of PII or sensitive personal details in the dialogue.
"""

class CustomerSupportDialoguePIIReview(BaseModel):
    pii_count: int

# Run the same PII review over the original and synthetic dialogues,
# keeping the two result sets in separate variables
original_results = so.infer(
    df,
    column="customer_support_dialogue_prompt",
    system_prompt=system_prompt,
    model="qwen-3-14b-thinking",
    output_schema=CustomerSupportDialoguePIIReview,
    job_priority=1,
)
synthetic_results = so.infer(
    df,
    column="new_customer_support_dialogue_prompt",
    system_prompt=system_prompt,
    model="qwen-3-14b-thinking",
    output_schema=CustomerSupportDialoguePIIReview,
    job_priority=1,
)
```
Each of these jobs was lighter weight; together they cost less than $6.

We’ll gather the results and evaluate the PII counts.
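We don’t show that step’s code above, so here’s a minimal sketch, assuming the two PII jobs return DataFrames shaped like the earlier ones (the job IDs are placeholders, not real IDs):

```python
# Placeholder job IDs: substitute the IDs of the two PII review jobs above.
original_pii = so.get_job_results('job-<original-dialogues-pii>')
synthetic_pii = so.get_job_results('job-<synthetic-dialogues-pii>')

# Average PII/sensitive-detail count per dialogue in each dataset.
print(original_pii['pii_count'].mean())   # ~1.13 in the original
print(synthetic_pii['pii_count'].mean())  # ~0.03 in the synthetic version
```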
We’ve demonstrated that, using LLMs alone, we can create synthetic datasets that capture the signal, intent, and other important characteristics of an original dataset while simultaneously minimizing privacy concerns. This can be a powerful tool for organizations that want to unlock the value of their sensitive data without compromising privacy. Again: these methods aren’t perfect, so make sure they’re appropriate for your use case.

In summary:
✅ We created a synthetic dataset that captures the same features as the original dataset (95.93% similarity score).
✅ We decreased the PII count from 1.13 to 0.03 per dialogue (a 97% reduction).
✅ We did this using a handful of small Python scripts and no infrastructure setup.
✅ We did all of this in a few hours.
✅ We did it for less than $40 worth of tokens!

The final dataset is available on HuggingFace.
It’s worth noting that this method can reduce privacy concerns, but it does not carry the same mathematical guarantees as methods such as differential privacy or k-anonymity. Similarly, LLM-as-a-judge methods are fast, cheap, and scalable proxies for human judgment, but they are not a perfect substitute in critical applications.

Make sure to use the appropriate tools for your use case.