Cleaning messy data

Every data export looks clean until you actually try to use it. Duplicate rows, inconsistent formatting, missing fields, categories that don't match. Fixing this manually is tedious. Writing a script is overkill for a one-off.

An agent handles it in one run.

The approach

Create a single agent with file I/O. Upload the messy CSV, describe exactly what "clean" means, download the result.

The key is specificity. "Clean this data" is vague. "Deduplicate by email, standardize phone numbers, categorize by industry" is actionable.

Example: cleaning a lead list

You have a CSV export from a trade show scanner. 500 rows. Inconsistent company names ("Acme Inc", "ACME", "Acme, Inc."), phone numbers in four different formats, some rows missing emails entirely.

Create an agent. Go to Create and describe the cleaning job, or create one manually at Agents with no MCP servers (this is pure data work, no web access needed). System prompt:

You are a data cleaning specialist. Read CSV files from input, apply the requested transformations, and write cleaned results to output. Preserve all original columns unless told otherwise.

Task:

Read /home/daytona/input/tradeshow_leads.csv. Apply these transformations:

  1. Deduplicate rows where the email address matches (keep the row with the most filled columns)
  2. Normalize company names: trim whitespace, title case, remove trailing "Inc", "LLC", "Ltd" variations
  3. Standardize phone numbers to E.164 format (+1XXXXXXXXXX for US)
  4. Remove rows with no email AND no phone number
  5. Add a quality column: "high" if email + phone + company all present, "medium" if two of three, "low" if one
  6. Sort by quality descending, then company name ascending

Write result to /home/daytona/output/leads_cleaned.csv. Write a summary to /home/daytona/output/cleaning_report.txt with: rows before, rows after, duplicates removed, rows removed for missing data.

Upload the file at Files, then attach it in agent settings under Advanced Settings > Files.

Run it. Takes 30-60 seconds. Download both output files from the run artifacts.
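If you want to sanity-check what the agent did, the six transformations map to a short Python script. This is a rough sketch using only the standard library — the CSV reading/writing and the report file are left out, and the column names `email`, `phone`, and `company` are assumed from the example:

```python
import re

QUALITY_RANK = {"high": 0, "medium": 1, "low": 2}

def normalize_company(name):
    # 2. Trim whitespace, title case, drop trailing Inc/LLC/Ltd variations
    name = name.strip().title()
    return re.sub(r"[\s,]*\b(Inc|Llc|Ltd)\b\.?\s*$", "", name).strip(" ,")

def normalize_phone(phone):
    # 3. E.164; a bare 10-digit number is assumed to be US (+1)
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 10:
        digits = "1" + digits
    return "+" + digits if digits else ""

def quality(row):
    # 5. "high" if all of email/phone/company present, "medium" if two, else "low"
    filled = sum(bool(row.get(k)) for k in ("email", "phone", "company"))
    return {3: "high", 2: "medium"}.get(filled, "low")

def clean(rows):
    deduped, no_email = {}, []
    for row in rows:
        row = dict(row,
                   company=normalize_company(row.get("company", "")),
                   phone=normalize_phone(row.get("phone", "")))
        key = row.get("email", "").strip().lower()
        if not key:
            no_email.append(row)
            continue
        # 1. Deduplicate by email, keeping the row with more filled columns
        kept = deduped.get(key)
        if kept is None or sum(map(bool, row.values())) > sum(map(bool, kept.values())):
            deduped[key] = row
    result = list(deduped.values()) + no_email
    # 4. Drop rows with neither email nor phone
    result = [r for r in result if r.get("email") or r.get("phone")]
    for r in result:
        r["quality"] = quality(r)
    # 6. Sort by quality descending, then company name ascending
    result.sort(key=lambda r: (QUALITY_RANK[r["quality"]], r["company"]))
    return result
```

Wrap `clean` with `csv.DictReader` and `csv.DictWriter` to go from the input file to the output file.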

Tips for data cleaning tasks

Be explicit about edge cases. "If the phone number has no country code, assume US (+1)." "If company name contains parentheses, remove the parenthetical."
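Stated as code, those two edge-case rules are one regex each (illustrative helpers, not anything the agent actually ships):

```python
import re

def strip_parenthetical(company):
    # "If company name contains parentheses, remove the parenthetical."
    return re.sub(r"\s*\([^)]*\)", "", company).strip()

def assume_us(phone):
    # "If the phone number has no country code, assume US (+1)."
    digits = re.sub(r"\D", "", phone)
    return ("+1" + digits) if len(digits) == 10 else ("+" + digits)
```

Spelling the rules out this precisely in the task description is what lets the agent handle the weird rows the same way every run.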

Ask for a cleaning report. The summary file tells you what changed and why. Useful for verifying the agent did what you expected.

Start with a sample. If you have 10,000 rows, test with the first 100. Check the output, adjust the task description, then run the full dataset.
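A quick way to cut that sample locally before uploading (a sketch; swap in your own filenames):

```python
import csv
from itertools import islice

def sample_csv(src_path, dst_path, n=100):
    # Copy the header row plus the first n data rows
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(islice(csv.reader(src), n + 1))
```

For example, `sample_csv("tradeshow_leads.csv", "leads_sample.csv")`, run the agent on the sample, then repeat with the full file once the output looks right.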

Use playbooks for recurring patterns. If you clean the same type of data regularly (CRM exports, form submissions, survey results), create a playbook with your standard cleaning rules. Attach it to the agent and every run follows the same methodology.

Add mcp-sqlite-tools for heavier data work. If you need to query, join, or aggregate across tables, add the mcp-sqlite-tools MCP server to your agent. A couple of clicks in agent settings. The agent can load CSVs into SQLite, run queries, and export results. For simple cleaning like the example above, file I/O alone is faster — but for anything that feels like it needs a database, give it one.
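The load-and-query step the agent performs looks roughly like this — a sketch with Python's built-in sqlite3 module, not the MCP server's actual implementation; the table and file names are illustrative:

```python
import csv
import sqlite3

def load_csv(conn, table, path):
    # Create a table from the CSV header row and bulk-insert the data rows
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        marks = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)

# Load, query, export — e.g. leads per company, largest first:
# conn = sqlite3.connect(":memory:")
# load_csv(conn, "leads", "tradeshow_leads.csv")
# rows = conn.execute(
#     "SELECT company, COUNT(*) AS n FROM leads GROUP BY company ORDER BY n DESC"
# ).fetchall()
```

Once the data is in a table, joins and aggregations are one SQL statement each instead of a pile of dictionary bookkeeping.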

Try it

Download the sample tradeshow leads CSV and run it through the example above. It has all the problems described — duplicate emails, inconsistent company names, mixed phone formats, missing fields.

Beyond CSV

The same approach works for JSON, XML, plain text logs, or any structured data the agent can parse. Upload the file, describe the transformation, download the result.
