Learn how to create useful synthetic data from the relevant characteristics of another dataset while reducing privacy concerns.
15 min read · ~2-3 hour project · ~$40 · Medium
It’s common for organizations to sit on vast amounts of sensitive data that can’t be analyzed, shared, or used to train AI models because of privacy concerns and regulatory constraints.

However, in certain cases, synthetic data can unlock otherwise off-limits data by creating derived versions that capture the intent, signal, and other important characteristics of the original dataset while reducing the privacy concerns that come with it.

For example, consider healthcare or financial data such as clinical notes and financial transactions. Such records are often extremely sensitive, yet they can be useful for training models that help with patient care or financial planning. This data is often off-limits for sharing, analysis, training, or even moving off of physical servers. With synthetic data, many of these concerns can be mitigated.

Note: it’s important to consult with legal and privacy professionals before using synthetic data for privacy purposes. This example is for educational purposes only.
We want to create a synthetic dataset that captures the relevant characteristics and signals of another, sensitive dataset.

By the end, we’ll have a final, derived synthetic dataset that retains over 95% of the original dataset’s signal while reducing PII occurrences by over 97% - along with a multi-step pipeline, written with the Sutro Python SDK, that generates it easily, quickly, and inexpensively.

Let’s get started!
In our previous example, we created a heterogeneous, 20,000-row synthetic dataset of product reviews. Product reviews are typically public-facing and rarely a major privacy issue.

However, we can use this dataset to create simulated customer support questions and dialogues, which are often sensitive, internal, and private. Because the source dataset is itself synthetic, we’re relieved of privacy concerns for this demo.

This step is relatively straightforward, and we can use a pipeline similar to the one from the previous example.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel
from typing import List

# Login using e.g. `huggingface-cli login` to access this dataset
df = pl.read_parquet('hf://datasets/sutro/synthetic-product-reviews-20k/results.parquet')

system_prompt = """
You will be shown a product review. It contains the product name, product description, review title, review, product rating, and product reviewer's name.

Your job is to write a realistic customer support dialogue between the product reviewer and a customer support agent.

The dialogue should start with a realistic question about the product from the reviewer. You should return the dialogue as a list of strings.
"""

class CustomerSupportQuestion(BaseModel):
    dialogue: List[str]

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Review Title: ", "review_title",
        " Review: ", "review_text",
        " Product Rating: ", "rating_out_of_5",
        " Product Reviewer: ", "review_author",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportQuestion,
    job_priority=1,
)
```
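The `column` argument above interleaves literal prompt fragments with column names. To make that concrete, here’s a rough illustration of the single prompt string we’d expect it to assemble for each row - the exact joining behavior is our assumption about the SDK, not something documented here:

```python
# Illustrative only: mimics how we assume Sutro interleaves literals and column values.
row = {"product_name": "Trailblazer Bottle", "product_description": "32oz insulated steel bottle."}
parts = ["Product Name: ", "product_name", " Product Description: ", "product_description"]

# Literal fragments pass through; known column names are replaced by the row's values.
prompt = "".join(row.get(p, p) for p in parts)
print(prompt)
# Product Name: Trailblazer Bottle Product Description: 32oz insulated steel bottle.
```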
We opted for a 32B model in this case: a good middle ground between speed and quality for creating nuanced, realistic customer support dialogues. We iterated with prototyping (job_priority=0) jobs on a few different models before settling on this one. Despite the larger model, the job only cost about $8 to generate.

We’re also using Sutro’s helpful column concatenation feature, sketched above, to assemble a single input string from multiple columns.

Once the job finished, we grabbed the resulting data, appended it to the original product review dataset, and saved the relevant columns to create a new, synthetic dataset.
```python
# Grab the results from the job
results = so.get_job_results('job-b1e75fa1-821e-4479-8009-5b6b22616513')
dialogues = results['dialogue'].to_list()

# Append the dialogues to the original dataset
df = df.with_columns(pl.Series(dialogues).alias('customer_support_dialogue'))
df = df.select(['product_name', 'product_description', 'customer_support_dialogue'])

# Save the new dataset
df.write_parquet('20k-customer-support-dialogues.parquet')
```
Here’s where the mad science begins! The overall goal is to create a synthetic dataset that captures the relevant characteristics and signals of the original dataset. To do that, we need to decide what those important characteristics and signals are.

Once we have a good understanding of the important, distinct features we want to preserve, we can extract each of them from every record in the original dataset and use the result as a “feature map” to guide the generation of the final synthetic dataset (a concrete example follows the lists below).

In this case, we’re dealing with customer support dialogues. Let’s say, for the sake of this example, that our goal is to train a helpful, downstream assistant chatbot to answer questions about the products in our catalog. So what are the important signals we want to preserve from each record?

For this example, we’ll preserve:
product name
product description
issue type (open-set label)
issue severity (low, medium, high)
issue description
resolution path description
outcome (success, failure)
customer sentiment (positive, negative, neutral)
customer satisfaction (low, medium, high)
Things we’ll want to filter out or avoid:
customer name and other personally identifiable information
sensitive personal details that aren’t relevant to the issue
any other information that isn’t relevant to the issue
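To make the idea concrete, here’s what a single record’s feature map might look like once extracted - the values are invented purely for illustration:

```python
# Hypothetical feature map for one dialogue (illustrative values only).
feature_map = {
    "product_name": "Trailblazer Bottle",
    "product_description": "32oz insulated steel bottle.",
    "issue_type": "lid_leak",          # open-set label
    "issue_severity": "medium",        # low / medium / high
    "issue_description": "Bottle leaks when carried sideways in a bag.",
    "resolution_path_description": "Agent confirmed a known gasket defect and shipped a replacement lid.",
    "outcome": "success",              # success / failure
    "customer_sentiment": "positive",  # positive / negative / neutral
    "customer_satisfaction": "high",   # low / medium / high
}
```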
If we can effectively extract these features while avoiding the privacy-compromising information, we’ll retain what’s actually useful about the original dataset. Using this feature map, we can reconstruct new, synthetic dialogues that are just as useful for training a helpful assistant chatbot, while simultaneously avoiding privacy concerns.

Let’s do that now!
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('hf://datasets/sutro/synthetic-customer-support-dialogues-20k/20k-customer-support-dialogues.parquet')

# Join the customer support dialogue into a single string
df = df.with_columns(
    pl.col("customer_support_dialogue").list.join("\n"),
)

system_prompt = """
You will be shown a customer support dialogue about a product ordered by a customer. It contains the product name, product description, and customer support dialogue.

Your goal is to extract out the following important features from the dialogue:
- issue type (open-set label)
- issue severity (low, medium, high)
- issue description
- resolution path description
- outcome (success, failure)
- customer sentiment (positive, negative, neutral)
- customer satisfaction (low, medium, high)

You should avoid extracting any personally identifiable information or sensitive personal details that aren't relevant to the issue.
"""

class CustomerSupportFeatureExtraction(BaseModel):
    issue_type: str
    issue_severity: str
    issue_description: str
    resolution_path_description: str
    outcome: str
    customer_sentiment: str
    customer_satisfaction: str

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Customer Support Dialogue: ", "customer_support_dialogue",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportFeatureExtraction,
    job_priority=1,
)
```
You can easily explore the results in the Sutro Web UI.

These results look good! There is plenty of variation in issue type, issue description, and resolution path description. In a real dataset we’d likely see more variation in customer sentiment, customer satisfaction, and outcome, but this should be sufficient for our demo.

And despite using a 32B model, this feature extraction job only cost $3.11 to run. For the numbers: this 20,000-row job processed 10.8M input tokens and generated 2.3M output tokens.
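If you’d rather check the variation programmatically than eyeball it in the UI, here’s a short sketch. It assumes, as the later snippets do, that `get_job_results` returns a polars DataFrame with one column per schema field:

```python
import sutro as so

features = so.get_job_results('job-63e3cd5a-0e32-4583-bf1e-b6cc72c844ad')

# Open-set labels: how many distinct issue types did the model produce?
print(features['issue_type'].n_unique())

# Closed-set labels: check the distributions we expect to be skewed.
for col in ['issue_severity', 'outcome', 'customer_sentiment', 'customer_satisfaction']:
    print(features[col].value_counts(sort=True))
```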
Now that we’ve extracted the relevant features from the original dataset, we can use them to guide the generation of a new, synthetic dataset.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel
from typing import List

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
results = so.get_job_results('job-63e3cd5a-0e32-4583-bf1e-b6cc72c844ad')

# Horizontally concatenate df with the extracted features
df = df.with_columns(results.select(
    'issue_type', 'issue_severity', 'issue_description',
    'resolution_path_description', 'outcome',
    'customer_sentiment', 'customer_satisfaction',
))

system_prompt = """
You will be shown features extracted from a customer dialogue about a product. It contains the product name, product description, issue type, issue severity, issue description, resolution path description, outcome, customer sentiment, and customer satisfaction.

Your goal is to generate a new, multi-turn, realistic customer dialogue that captures the same features as the original dialogue between the customer and the customer support agent. It should start with a question about the product from the customer.

You should return the dialogue as a list of strings.
"""

class CustomerSupportDialogue(BaseModel):
    dialogue: List[str]

results = so.infer(
    df,
    column=[
        "Product Name: ", "product_name",
        " Product Description: ", "product_description",
        " Issue Type: ", "issue_type",
        " Issue Severity: ", "issue_severity",
        " Issue Description: ", "issue_description",
        " Resolution Path Description: ", "resolution_path_description",
        " Outcome: ", "outcome",
        " Customer Sentiment: ", "customer_sentiment",
        " Customer Satisfaction: ", "customer_satisfaction",
    ],
    system_prompt=system_prompt,
    model="qwen-3-32b",
    output_schema=CustomerSupportDialogue,
    job_priority=1,
)
```
This produced a new, synthetic dataset of 20,000 customer support dialogues that captures the same features as the original dataset. The total cost of this job was $5.97.
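Before evaluating anything formally, it’s worth spot-checking a generated dialogue by hand. A quick sketch - the job ID is the same one referenced in the evaluation step below, and we assume the results come back as a polars DataFrame with a `dialogue` list column:

```python
# Spot-check one regenerated dialogue (job ID from the generation job above).
sample = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
for turn in sample['dialogue'][0]:
    print(turn)
```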
We claim to have created a dataset that’s just as useful, but without the privacy concerns. Let’s see if that’s the case!

To do so, we’ll use a lightweight LLM-as-a-judge method to evaluate:
The extent to which the synthetic dataset captures the same features as the original dataset
The drop in PII and sensitive personal details between the two datasets
First, we’ll evaluate the extent to which the synthetic dataset captures the same features as the original dataset.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
new_dialogues = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
df = df.with_columns(new_dialogues['dialogue'].alias('new_customer_support_dialogue'))

system_prompt = """
You will be shown two customer support dialogues about a product.

Your goal is to evaluate the similarity between the two dialogues, with respect to the following features:
- product name
- product description
- issue type
- issue severity
- issue description
- resolution path description
- outcome
- customer sentiment
- customer satisfaction

You should return a score between 0 and 100, where 100 means the two dialogues capture the same features.
"""

class CustomerSupportDialogueEvaluation(BaseModel):
    score: int

results = so.infer(
    df,
    column=[
        "Customer Support Dialogue 1: ", "customer_support_dialogue",
        " Customer Support Dialogue 2: ", "new_customer_support_dialogue",
    ],
    system_prompt=system_prompt,
    model="qwen-3-4b-thinking",
    output_schema=CustomerSupportDialogueEvaluation,
    job_priority=1,
)
```
This job was a bit token-heavy, since the judge compares the dialogues and reasons about them (15.2M input tokens and 55M output tokens). Generally speaking, LLMs are strong at comparing text, so we opted for a 4B model for this evaluation. Despite the heavy token usage, it only cost $11.81 to run. Most proprietary models would cost significantly more - likely 5-10x more for this job - showing the cost-saving potential of using small, open-source models at scale.

Once the job finishes, we can grab the results and evaluate the similarity score.
```python
similarity_results = so.get_job_results('job-38b880e0-41bf-41aa-83b7-277668086e5f')

# Extract the score from the content field
df = df.with_columns(similarity_results['content'].struct.field('score').alias('similarity_score'))

print(df['similarity_score'].describe())
```
| statistic | value |
| --- | --- |
| count | 20000.0 |
| null_count | 0.0 |
| mean | 95.93085 |
| std | 3.517469 |
| min | 20.0 |
| 25% | 95.0 |
| 50% | 95.0 |
| 75% | 98.0 |
| max | 100.0 |
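As an aside: the summary shows a long low tail (the min score is 20.0). If we wanted to be conservative before any downstream training, we could drop the lowest-scoring rows. A minimal, hypothetical sketch using the `similarity_score` column built above - the threshold is arbitrary:

```python
# Hypothetical cleanup: keep only rows the judge scored at least 80.
filtered_df = df.filter(pl.col('similarity_score') >= 80)
print(f"Kept {filtered_df.height} of {df.height} rows")
```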
If our LLM-as-a-judge is to be trusted, this is a good result: we’re averaging a 95.93% similarity score between the original and synthetic dialogues.

Next, we’ll evaluate for PII and sensitive personal details.
```python
import sutro as so
import polars as pl
from pydantic import BaseModel

df = pl.read_parquet('20k-customer-support-dialogues.parquet')
new_dialogues = so.get_job_results('job-16f9f7f2-1247-4fed-bbb0-46742e77f2a8')
df = df.with_columns(new_dialogues['dialogue'].alias('new_customer_support_dialogue'))

# Join each dialogue's turns into a single string for prompting
df = df.with_columns(
    pl.col("customer_support_dialogue").list.join("\n")
    .alias('customer_support_dialogue_prompt')
)
df = df.with_columns(
    pl.col("new_customer_support_dialogue").list.join("\n")
    .alias('new_customer_support_dialogue_prompt')
)

system_prompt = """
You will be shown a customer support dialogue about a product.

Your goal is to review the dialogue for any personally identifiable information (PII) or sensitive personal details.

Return the number of PII or sensitive personal details in the dialogue.
"""

class CustomerSupportDialoguePIIReview(BaseModel):
    pii_count: int

# Run the same PII review over the original and synthetic dialogues,
# keeping the two result sets in separate variables
original_results = so.infer(
    df,
    column="customer_support_dialogue_prompt",
    system_prompt=system_prompt,
    model="qwen-3-14b-thinking",
    output_schema=CustomerSupportDialoguePIIReview,
    job_priority=1,
)
synthetic_results = so.infer(
    df,
    column="new_customer_support_dialogue_prompt",
    system_prompt=system_prompt,
    model="qwen-3-14b-thinking",
    output_schema=CustomerSupportDialoguePIIReview,
    job_priority=1,
)
```
Each of these jobs was lighter weight; together they cost less than $6.

We’ll gather the results and evaluate the PII counts.
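We don’t show that step’s code above, so here’s a minimal sketch, assuming the two PII jobs return DataFrames shaped like the earlier ones (the job IDs are placeholders, not real IDs):

```python
# Placeholder job IDs: substitute the IDs of the two PII review jobs above.
original_pii = so.get_job_results('job-<original-dialogues-pii>')
synthetic_pii = so.get_job_results('job-<synthetic-dialogues-pii>')

# Average PII/sensitive-detail count per dialogue in each dataset.
print(original_pii['pii_count'].mean())   # ~1.13 in the original
print(synthetic_pii['pii_count'].mean())  # ~0.03 in the synthetic version
```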
We’ve demonstrated that, using LLMs alone, we can create synthetic datasets that capture the signal, intent, and other important characteristics of an original dataset while simultaneously minimizing privacy concerns. This can be a powerful tool for organizations that want to unlock the value of their sensitive data without compromising privacy. Again: these methods aren’t perfect, so make sure they’re appropriate for your use case.

In summary:
✅ We created a synthetic dataset that captures the same features as the original dataset (95.93% similarity score).
✅ We decreased the PII count from 1.13 to 0.03 per dialogue (a 97% reduction).
✅ We did this using a handful of small Python scripts and no infrastructure setup.
✅ We did all of this in a few hours.
✅ We did it for less than $40 worth of tokens!

The final dataset is available on HuggingFace.
It’s worth noting that this method can reduce privacy concerns, but it does not carry the same mathematical guarantees as methods such as differential privacy or k-anonymity. Similarly, LLM-as-a-judge methods are fast, cheap, and scalable proxies for human judgment, but they are not a perfect substitute in critical applications.

Make sure to use the appropriate tools for your use case.