Cleaning and Validating Data
Why Messy Data Breaks Everything
In Lesson 4.1, you learned that data quality determines output quality. Now let's get specific about what goes wrong when data is messy and how to prevent it.
Imagine a spending tracker with these entries:
- $25.00, 25, Twenty-five, 25.0, $25 — five different ways to write the same amount
- food, Food, FOOD, Food & Drink, Eating Out — five different versions of the same category
- 9/15/24, 2024-09-15, Sept 15, 15 Sep 2024 — four date formats
Every one of these inconsistencies breaks something. The total spending calculation fails because it can't add "Twenty-five" to 25. The category filter misses entries because "food" and "Food" are different strings. The date sort puts September before January because it's sorting text, not dates.
The problem isn't the data itself — it's the inconsistency. When every entry follows the same format, everything works. When they don't, everything breaks.
How This Applies to Your Project — Track 1
What happens if a user adds a habit called " " (just spaces)? Or types a 500-character habit name? Or marks something complete and then refreshes — does it save? Validation means your app handles these edge cases instead of breaking. You'll add input validation in Building Session 2.
How This Applies to Your Project — Track 2
Your content data needs to be consistent. If some items have categories and others don't, your filters will break. If descriptions vary wildly in length, your layout will look uneven. Cleaning your data means making sure every item has all required fields filled in consistently before you display it.
How This Applies to Your Project — Track 3
Dirty data = misleading charts. If a user enters "five" instead of "5" in an amount field, your chart calculations break. If dates are formatted inconsistently, your time-series charts won't render correctly. Validation prevents bad data from entering; cleaning fixes data that's already there.
Answer: Some amounts might include dollar signs ("$25") or be stored as text instead of numbers. The calculation can't add "$25" (text) to 25 (number), so it either errors out or shows zero. Consistent numeric formatting fixes this.
Prevention: Designing for Clean Data
The best time to fix data quality is before bad data enters your system. Here are five strategies:
1. Use dropdowns instead of text fields.
If categories should be "Food," "Transport," or "Entertainment," don't let users type freely. Give them a dropdown with exactly those options. Zero misspellings, zero inconsistencies.
2. Enforce consistent formats.
Use date pickers instead of text fields for dates. Use number inputs for amounts (no dollar signs in the data — add those in the display). Format consistency should be built into the interface, not left to the user.
3. Set required fields.
If every expense needs an amount and a date, make those fields required. Don't allow saving an entry with missing critical data.
4. Set sensible defaults.
If 90% of entries will use today's date, pre-fill it. If most habits are daily, default the frequency to "daily." Defaults reduce errors and speed up data entry.
5. Limit input ranges.
A spending amount should be a positive number. A date shouldn't be in the future (for a log) or the distant past. Setting limits catches obvious mistakes.
AI can help you design these validations. Tell AI about your data fields and ask: "What validation rules should each field have to prevent bad data?" It'll suggest specific rules for each field.
Cleanup: Fixing Messy Data
Sometimes you'll inherit messy data or realize your existing data has problems. Here's how to clean it:
Standardize formats: Pick one format for each field type and convert everything to it. All dates become YYYY-MM-DD. All amounts become plain numbers without symbols. All categories use Title Case.
Remove duplicates: Look for entries that represent the same thing entered twice. In spreadsheets, sort by key fields to spot them.
Fill in gaps: Identify entries with missing data. Can you fill in the blanks from other information? If not, decide: is the entry still useful, or should it be removed?
Fix typos and inconsistencies: Search for variations ("Food" vs "food" vs "FOOD") and standardize. In spreadsheets, use find-and-replace.
AI is excellent at data cleanup. You can paste messy data into an AI tool and ask: "Clean this data. Standardize date formats to YYYY-MM-DD, make all categories Title Case, and flag any entries that look like duplicates." AI handles the tedious work while you review the results.
Answer: Prevention. Cleaning data is possible but time-consuming and risky (you might miss things). Designing your input to enforce consistency from the start is faster, more reliable, and means you never have to clean. Prevention is always the better investment.
Validation: Making Sure Data Is What You Expect
Validation is the checkpoint between the user and your data storage. Before any entry gets saved, validation checks: is this data correct and properly formatted?
Types of validation your project might use:
- Type check: Is this supposed to be a number? A date? Text? Reject if it's the wrong type.
- Range check: Is this number between 0 and 10,000? Is this date within the last year? Reject if out of range.
- Required check: Are all mandatory fields filled in? Block saving if not.
- Format check: Does this email have an @ symbol? Does this date match YYYY-MM-DD? Reject bad formats.
- Uniqueness check: Has this exact entry already been added? Flag potential duplicates.
When you're vibe coding, tell AI what to check: "Add validation to this form that checks [specific rules] before saving." Describe what valid and invalid input looks like, and AI will implement the checks for you.
Key Concepts
- Inconsistent data (mixed formats, typos, duplicates) breaks calculations, searches, and displays.
- Prevention beats cleanup: use dropdowns, date pickers, required fields, defaults, and range limits to keep data clean from the start.
- When cleanup is needed: standardize formats, remove duplicates, fill gaps, fix typos. AI is excellent at this.
- Validation checks data before it's saved: type, range, required, format, and uniqueness checks.
- Design your data entry to make it hard to enter bad data. Your future self will thank you.
Try It: Clean a Messy Dataset
Practice cleanup with a real (simulated) mess.
- Download the sample messy dataset (CSV). It's a mock spending tracker with intentional data quality problems.
- Open it in Google Sheets or any spreadsheet program.
- Identify at least 5 data quality issues (inconsistent formats, duplicates, missing data, typos).
- Use AI to help clean it: paste a sample of the messy data and ask AI to identify problems and suggest fixes.
- Apply the fixes and compare your clean version to the original.
This exercise builds the instinct to spot data problems. You'll use this skill every time you work with data in your project.
Check Your Understanding
1. A spreadsheet has dates in three formats: "9/15/24," "2024-09-15," and "Sept 15." What's the best fix?
Explanation: One consistent format solves every downstream problem. YYYY-MM-DD is ideal because it sorts correctly as text. Going forward, use a date picker to prevent mixed formats from entering your data.
2. Which data entry method prevents the most quality issues?
Explanation: Dropdowns eliminate typos, inconsistent capitalization, and unexpected values entirely. Users can only pick from valid options. This is the single most effective data quality prevention technique.
3. How can AI help with data cleaning?
Explanation: AI excels at pattern-based cleanup: standardizing formats, spotting inconsistencies, and suggesting fixes. You paste the data, describe what you need, and review AI's suggestions. It handles the tedious work.
Reflect & Write
Write 2–3 sentences: Think about your project's data. What's the most likely data quality problem you'll face? What prevention strategy from this lesson would address it?
Project Checkpoint
Design validation rules for your project:
- For each field in your data schema, define: data type, whether it's required, acceptable range/format, and input method (text field, dropdown, date picker, etc.).
- Identify which fields need dropdowns to prevent inconsistency.
- Ask AI: "Here's my data schema [paste]. What validation rules should each field have?"
- Document your validation plan — you'll implement these when you build.
Level Up: Coming Next
Lesson 4.4 — Privacy and Consent. The most important data lesson: what you can and can't collect, why consent matters, and how to build privacy-first.
Continue to Lesson 4.4 →