Session 3: Financial Data Cleaning & Validation - Making Your Data Analysis-Ready#
Learning Objectives#
By the end of this session, you will be able to:
Identify and fix common data quality issues in financial datasets that cause analysis errors
Clean messy stock data using simple, systematic Python techniques
Handle missing values and outliers without complex programming
Validate your cleaned data to ensure it’s ready for analysis
Document your cleaning process for transparency and reproducibility
Present your data quality improvements clearly in a professional video
Section 1: The Financial Hook#
The $440 Million Data Error#
In August 2012, Knight Capital Group lost $440 million in 45 minutes. The cause? A faulty software deployment that sent streams of erroneous orders into the market, proof of how quickly unvalidated data and systems can destroy value.
Here’s a simpler example you might face:
Your Analysis: “Tesla stock crashed 67% on August 25, 2022!”
Reality: Tesla had a 3-for-1 stock split. The price went from $900 to $300, but shareholders now had 3x more shares.
Without proper data cleaning, you’d panic sell. With cleaned data, you’d see no actual loss occurred.
Common Data Problems That Cost Money#
FINANCIAL DATA NIGHTMARES:
├── Stock splits (fake crashes)
├── Missing dividend data (undervalued returns)
├── Weekend gaps (analysis errors)
├── Wrong decimal places ($0.01 vs $1.00)
└── Outliers from data errors (not real trades)
Why This Matters for Your Career#
Entry-Level Analyst: Clean data manually, one stock at a time
Senior Analyst: Build automated cleaning systems for portfolios
Portfolio Manager: Trust clean data for million-dollar decisions
Your Goal: Learn systematic cleaning that scales with your career
Real-World Timeline#
Day 1: Download messy data
Day 2: Try analysis, get nonsense results
Day 3: Learn data cleaning (today!)
Day 4: Produce accurate analysis
Week 2: Automate for multiple stocks
Month 2: Handle complex corporate actions
Section 2: Foundational Financial Concepts & Models#
What is Data Cleaning?#
Data Cleaning: The process of detecting and correcting errors in financial data to ensure accurate analysis.
Think of it like this:
Raw Data = Ingredients from the store (might have bad items)
Data Cleaning = Washing and preparing ingredients
Clean Data = Ready to cook with
Types of Data Quality Issues#
1. Missing Values#
Date AAPL Price
2024-01-01 $192.35
2024-01-02 $191.80
2024-01-03 NaN ← Missing! Market was closed?
2024-01-04 $190.50
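Here is a minimal sketch (assuming pandas is available) of how you might detect and fill that gap; the series mirrors the table above, and forward-fill is just one reasonable choice.
# Minimal sketch: detect and fill the missing value from the table above
import pandas as pd
prices = pd.Series(
    [192.35, 191.80, None, 190.50],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']),
    name='AAPL Close',
)
print(prices.isna().sum())       # 1 -> one missing value detected
filled = prices.ffill()          # carry the last known price forward
print(filled['2024-01-03'])      # 191.8
Forward-fill works for short gaps; for longer gaps, investigate why the data is missing before filling anything.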
2. Outliers and Errors#
Date MSFT Price
2024-02-01 $405.20
2024-02-02 $4.05 ← Decimal error! Should be $405.00
2024-02-03 $406.10
3. Corporate Actions#
What are Corporate Actions? Corporate actions are events initiated by a company that affect its stock price or structure. Common examples include:
Stock splits: Dividing shares into more pieces (like cutting a pizza into more slices)
Dividends: Cash payments to shareholders
Mergers: Two companies combining into one
Spin-offs: One company splitting into separate companies
These events can make prices look wrong if you don’t know they happened!
Date TSLA Price
2022-08-24 $891.30
2022-08-25 $297.10 ← 3:1 split, not a crash!
2022-08-26 $299.50
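A minimal sketch of how code might catch this: an extreme one-day move flags a likely split, and dividing the earlier prices by the split ratio restores continuity. The 50% threshold matches the rule of thumb in the next section.
# Minimal sketch: flag a likely split and back-adjust pre-split prices
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]
for i in range(1, len(prices)):
    ret = (prices[i] - prices[i - 1]) / prices[i - 1]
    if abs(ret) > 0.5:  # moves beyond +/-50% rarely come from real trading
        ratio = round(prices[i - 1] / prices[i])  # ~3 here -> a 3-for-1 split
        print(f"{dates[i]}: {ret:.1%} move, likely a {ratio}-for-1 split")
        for j in range(i):
            prices[j] = prices[j] / ratio  # adjust earlier prices for continuity
print(prices)  # pre-split price now comparable to post-split prices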
Mathematical Framework#
What is an Outlier? An outlier is a data point that’s extremely different from other data points. In stock prices, this could be a real event (like a crash) or a data error (like a missing decimal point).
Outlier Detection Using Standard Deviation:
The Z-score tells us how many standard deviations away from average a value is:
$Z = \frac{x - \bar{x}}{\sigma}$
where $x$ is the value, $\bar{x}$ is the average, and $\sigma$ is the standard deviation.
Understanding Z-scores:
Z-score = 0: The value is exactly average
Z-score = 1: The value is 1 standard deviation above average
Z-score = -2: The value is 2 standard deviations below average
|Z-score| > 3: The value is so extreme it’s likely an outlier (happens less than 0.3% of the time)
Example: If Apple’s average daily return is 0.1% with standard deviation 2%, and today it shows 10% return: Z-score = (10% - 0.1%) / 2% = 4.95 → Likely an outlier!
Data Completeness Score:
$Completeness = \frac{\text{Non-missing values}}{\text{Total expected values}} \times 100\%$
Return Calculation Check:
$Daily\ Return = \frac{Today's\ Price - Yesterday's\ Price}{Yesterday's\ Price}$
If daily return > 50% or < -50%, investigate for splits!
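Putting the framework together, here is a minimal sketch that computes z-scores on daily returns, flags likely outliers and split suspects, and reports a completeness score. It assumes `data` is a pandas DataFrame with a 'Close' column (e.g. downloaded with yfinance, as in Section 4).
# Minimal sketch: z-score outlier flags plus a completeness score
# Assumes `data` is a DataFrame with a 'Close' column
import numpy as np
returns = data['Close'].pct_change().dropna()
z_scores = (returns - returns.mean()) / returns.std()
outliers = returns[np.abs(z_scores) > 3]          # |Z| > 3 -> likely outliers
split_suspects = returns[np.abs(returns) > 0.5]   # >50% daily move -> check for splits
completeness = data['Close'].notna().sum() / len(data) * 100
print(f"Outlier days: {len(outliers)}, split suspects: {len(split_suspects)}")
print(f"Completeness: {completeness:.1f}%")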
🎯 AI Learning Support - Understanding Data Quality#
Learning Goal: Develop the ability to identify data quality issues systematically.
Starting Prompt: “I have stock price data with some weird values. How do I find problems?”
🚀 Hints to Improve Your Prompt:
Add specific context (what stock? what time period?)
Describe the weird values you’re seeing
Mention what analysis you’re trying to do
Ask for a systematic approach, not just a quick fix
💡 Better Version Hints:
Include actual numbers from your data
Specify if you’re looking for splits, errors, or missing data
Ask for both detection AND correction methods
Request validation steps to confirm fixes worked
🎯 Your Challenge: Rewrite the prompt to get comprehensive help for YOUR specific data issues.
Section 3: The Financial Gym - Partner Practice & AI Copilot Learning#
Solo Warm-Up (15 minutes)#
Exercise 1: Spot the Data Problem
# Apple stock data - can you spot the issues?
# Each line has a comment explaining what we're doing
# Step 1: Create sample data with problems
dates = ['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
prices = [192.35, 0.0, 191.80, 1915.50] # Problems hidden here!
# Step 2: Look at each price
print("Apple Daily Prices:")
for i in range(len(dates)):
    print(f"{dates[i]}: ${prices[i]}")
# What problems do you see?
# Problem 1: ________________
# Problem 2: ________________
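Once you have filled in the blanks, a minimal sketch like this can confirm your answers; it flags suspicious rows without naming the problems.
# Minimal sketch: two sanity checks for the warm-up data above
for i in range(len(prices)):
    if prices[i] <= 0:
        print(f"{dates[i]}: non-positive price -- investigate")
    if i > 0 and prices[i - 1] > 0 and abs(prices[i] / prices[i - 1] - 1) > 0.5:
        print(f"{dates[i]}: price moved more than 50% in a day -- investigate")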
AI Copilot Learning Phase (20 minutes)#
🎯 AI Learning Support - Data Cleaning Strategy
Starting Prompt: “I have stock prices with zeros and weird values. What should I do?”
🚀 Make It Better By Adding:
The actual data you’re working with
What “weird” means (too high? too low? missing?)
What you’ve already tried
Whether you need to preserve all data or can remove some
💡 Level-Up Hints:
Ask for step-by-step cleaning process
Request validation methods
Include downstream analysis needs
Ask about documentation requirements
Reciprocal Teaching Component (25 minutes)#
Partner A Teaching Points (10 min):
Explain what makes data “dirty” vs “clean”
Show how to spot a stock split in price data
Demonstrate filling missing values
Partner B Teaching Points (10 min):
Explain why outliers matter in financial data
Show how to calculate Z-scores simply
Demonstrate validating cleaned data
Together (5 min):
Find a real stock with a recent split and download its data
Identify the split date together
Collaborative Challenge (20 minutes)#
Clean Tesla’s Split Data Together
# SIMPLE DATA CLEANING CHALLENGE
# Work together to fix this data
# Tesla prices around the August 2022 split
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]
# Step 1: Calculate daily returns
print("Daily returns:")
for i in range(1, len(prices)):
    daily_return = (prices[i] - prices[i-1]) / prices[i-1]
    print(f"{dates[i]}: {daily_return * 100:.1f}%")
# Step 2: Identify the problem
# Your code here: Detect if any return is too extreme
# Step 3: Fix the split issue
# Your code here: Adjust pre-split prices
💡 AI Learning Support - Troubleshooting#
Starting Prompt: “My code shows Tesla dropped 67%. How do I fix this?”
🚀 Improvement Hints:
Specify the dates you’re analyzing
Mention you suspect a stock split
Ask for the split ratio and how to adjust
Request verification method
🎯 Your Challenge: Create a prompt that would help someone who has never seen a stock split before.
Section 4: The Financial Coaching - Your DRIVER Learning Guide#
Complete DRIVER Case Study: Cleaning a Dividend Portfolio#
Scenario: Your manager asks: “Clean our dividend portfolio data for quarterly analysis. We need accurate prices and dividends for JNJ, KO, and PEP. Make sure the data is reliable.”
D - Define & Discover#
Understanding the Problem
Data quality issues that affect dividend analysis:
Missing dividend payments
Stock splits affecting price continuity
Weekend/holiday gaps
Data vendor errors
💻 AI Learning Support - Problem Discovery
Starting Prompt: “I need to clean data for dividend stocks. What should I check?”
🚀 Enhancement Hints:
Specify which stocks and time period
Mention you need both prices AND dividends
Ask about dividend-specific quality issues
Request priority order for cleaning tasks
Design Requirements:
Handle 3 dividend stocks
Clean 1 year of daily data
Validate prices and dividends
Document all changes
R - Represent#
Data Cleaning Workflow
1. Load raw data
↓
2. Check for missing values
↓
3. Identify outliers/errors
↓
4. Fix identified issues
↓
5. Validate the fixes
↓
6. Document changes
🚀 AI Learning Support - Workflow Design
Starting Prompt: “Help me plan data cleaning steps.”
💡 Better Prompt Hints:
Specify your data types (prices, dividends, volumes)
Ask for decision points (when to remove vs. fix data)
Request validation methods for each step
Include regulatory/audit requirements
I - Implement#
Simple Data Cleaning Implementation
# DIVIDEND PORTFOLIO DATA CLEANING
# Everything step-by-step, no complex code
# Step 1: Setup our portfolio
stocks = ['JNJ', 'KO', 'PEP']
stock_names = ['Johnson & Johnson', 'Coca-Cola', 'PepsiCo']
# Step 2: Import what we need
import yfinance as yf
import pandas as pd
# Step 3: Download and clean each stock
for i in range(len(stocks)):
    ticker = stocks[i]
    name = stock_names[i]
    print(f"\n=== Cleaning {name} ({ticker}) ===")
    # Download 1 year of data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    # Check 1: Missing values
    missing_count = data['Close'].isna().sum()
    print(f"Missing values: {missing_count}")
    # Check 2: Zero or negative prices
    bad_prices = (data['Close'] <= 0).sum()
    print(f"Bad prices: {bad_prices}")
    # Check 3: Extreme daily moves (possible splits)
    data['Daily_Return'] = data['Close'].pct_change()
    extreme_moves = (abs(data['Daily_Return']) > 0.20).sum()
    print(f"Extreme moves (>20%): {extreme_moves}")
    # Simple cleaning
    # Fix 1: Fill missing prices with the previous day's close
    data['Close'] = data['Close'].ffill()
    # Fix 2: Remove any remaining bad data
    data = data[data['Close'] > 0]
    # Validation
    final_missing = data['Close'].isna().sum()
    print("\nAfter cleaning:")
    print(f"Missing values: {final_missing}")
    print(f"Data points: {len(data)}")
💻 AI Learning Support - Implementation Help
Starting Prompt: “How do I fill missing values in stock data?”
🚀 Better Prompt Elements:
Specify forward-fill vs interpolation vs removal
Ask about finance-specific considerations
Request validation after filling
Include example code with your data structure
V - Validate#
Validation Checks
# VALIDATION: Make sure cleaning worked
# Step 1: Check data completeness
print("\n=== DATA VALIDATION ===")
# For each stock, validate the cleaning
for ticker in ['JNJ', 'KO', 'PEP']:
    # Get fresh data to validate
    stock = yf.Ticker(ticker)
    clean_data = stock.history(period='1y')
    # Remove any bad prices
    clean_data = clean_data[clean_data['Close'] > 0]
    # Calculate quality metrics
    total_days = len(clean_data)
    missing = clean_data['Close'].isna().sum()
    completeness = ((total_days - missing) / total_days) * 100
    # Check price reasonableness
    min_price = clean_data['Close'].min()
    max_price = clean_data['Close'].max()
    price_ratio = max_price / min_price
    print(f"\n{ticker} Validation:")
    print(f"  Completeness: {completeness:.1f}%")
    print(f"  Price range: ${min_price:.2f} to ${max_price:.2f}")
    print(f"  Reasonable range: {'✅ Yes' if price_ratio < 3 else '❌ No'}")
🔍 AI Learning Support - Validation Strategy
Starting Prompt: “How do I know if my cleaned data is good?”
💡 Enhancement Ideas:
Ask for industry-specific validation rules
Request statistical tests for data quality
Include before/after comparisons
Ask about documentation standards
E - Evolve#
Enhancement Opportunities
Your basic cleaning can evolve to:
Automated daily data quality reports
Machine learning anomaly detection
Multi-source data reconciliation
Real-time data quality monitoring
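As a taste of the first two ideas, here is a minimal sketch of a rolling z-score check that could feed a daily quality report. It assumes `closes` is a pandas Series of daily closing prices; the 30-day window and |Z| > 4 cutoff are illustrative choices, not industry standards.
# Minimal sketch: rolling z-score anomaly flags for a daily quality report
# Assumes `closes` is a pandas Series of daily closing prices
returns = closes.pct_change()
rolling_mean = returns.rolling(30).mean()
rolling_std = returns.rolling(30).std()
z = (returns - rolling_mean) / rolling_std
anomalies = z[z.abs() > 4]
print(f"Days needing review: {len(anomalies)}")
print(anomalies.tail())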
🎯 AI Learning Support - Next Steps
Starting Prompt: “What else should I add to my data cleaning?”
🚀 Better Elements:
Describe current capabilities first
Specify scale (how many stocks?)
Ask about industry best practices
Request prioritized enhancement list
R - Reflect#
Key Insights
Most data has quality issues - always check!
Simple cleaning solves 80% of problems
Documentation is as important as cleaning
Validation prevents embarrassing errors
📋 AI Learning Support - Learning Synthesis
Starting Prompt: “What did I learn about data cleaning?”
💡 Reflection Improvements:
List specific techniques you used
Describe problems you solved
Ask about edge cases you missed
Request career development insights
Section 5: Assignment#
Scenario#
You’re a data analyst at a hedge fund. Your portfolio manager suspects data quality issues are affecting trading signals. Investigate and clean financial data to ensure accurate analysis.
Requirements#
Create a video (approximately 10-15 minutes) demonstrating:
Systematic discovery of data quality issues in a stock with known challenges
Implementation of cleaning solutions for identified problems
Validation of data integrity before and after cleaning
Impact analysis showing how bad data affects investment decisions
Execution Format#
Use your completed Jupyter notebook or Python script
Run your code cell-by-cell while explaining what each part does
Show outputs and interpret them immediately
Demonstrate before/after comparisons of your cleaning
Deliverables#
Video demonstration showing code execution and analysis
Python code file (.py or .ipynb)
Part 2: Step-by-Step Implementation (40% - 5 minutes)#
2.1 Live Data Cleaning (3 minutes)
Load messy data
Identify issues systematically
Fix each issue with clear explanation
Explain WHY each fix is appropriate
2.2 Validation Process (2 minutes)
Prove your cleaning worked
Show before/after comparisons
Run quality checks
Be ready for: “What if the data had negative prices?”
Part 3: Results & Professional Impact (30% - 3 minutes)#
3.1 Present Clean Data (1.5 minutes)
Show summary statistics
Highlight what changed
Demonstrate data is analysis-ready
Calculate a simple metric with clean vs. dirty data
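For that clean-vs-dirty comparison, here is a minimal sketch with invented numbers (the decimal error is made up for the demo): compute the same metric on both series and show the gap.
# Minimal sketch: the same metric on dirty vs. clean data (invented numbers)
dirty = [100.0, 1.02, 102.0, 103.0]    # decimal error: 102.00 recorded as 1.02
clean = [100.0, 101.0, 102.0, 103.0]   # error corrected
def avg_daily_return(prices):
    returns = [(prices[i] / prices[i - 1]) - 1 for i in range(1, len(prices))]
    return sum(returns) / len(returns)
print(f"Dirty: {avg_daily_return(dirty):.2%}")   # wildly distorted by the error
print(f"Clean: {avg_daily_return(clean):.2%}")   # roughly 1% per day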
3.2 Business Implications (1.5 minutes)
Explain how cleaning affected results
Discuss what would happen without cleaning
Reflect on scalability
Consider regulatory requirements
Assessment Criteria#
Your grade depends on demonstrating understanding:
Can you identify data issues systematically?
Do you understand WHY each cleaning step matters?
Can you validate your cleaning worked?
Can you explain business impact clearly?
Section 6: Reflect & Connect - Financial Insights Discussion#
Individual Reflection (10 minutes)#
Write your thoughts on:
What surprised you about data quality issues?
Which cleaning technique was most valuable?
How will data cleaning impact your analysis?
What edge cases worry you most?
Small Group Discussion (15 minutes)#
In groups of 3-4, discuss:
Share cleaning challenges
What issues were hardest to spot?
Which fixes were tricky?
How did you validate success?
Compare approaches
Different ways to handle missing data
When to remove vs. fix data
Documentation strategies
Real-world applications
How would this scale to 500 stocks?
What about intraday data?
International stock challenges?
Class Synthesis (15 minutes)#
Key Discussion Topics:
The Data Quality Iceberg
Visible issues (missing data, zeros)
Hidden issues (splits, dividends, errors)
Systematic detection methods
Professional Standards
Audit trail requirements
Regulatory compliance
Client trust implications
Career Applications
Entry level: Clean data manually
Senior level: Build cleaning systems
Management: Ensure data governance
Connecting to Practice#
Industry Insight: “At State Street, we process $43 trillion in assets. A single data error can impact thousands of client portfolios. That’s why we have entire teams dedicated to data quality.”
Jennifer Chen, Head of Data Quality, State Street
Key Takeaways Board#
Create class list of:
Most common data issues found
Best cleaning techniques
Validation must-haves
Documentation essentials
Section 7: Looking Ahead#
Skills Mastered#
✅ Technical Competencies:
Systematic issue detection
Simple cleaning techniques
Validation frameworks
Documentation practices
✅ Professional Skills:
Quality-first mindset
Attention to detail
Process documentation
Risk awareness
Building Bridges to Session 4#
Next session’s Time Series Analysis builds on clean data:
From Clean to Analyzed: Today we cleaned; next we find patterns
From Static to Dynamic: Today’s snapshot becomes tomorrow’s trends
From Quality to Insights: Clean data enables reliable forecasting
From Single to Series: One day’s price becomes historical analysis
Connecting Concepts#
Session 3: Clean Data  →  Session 4: Time Patterns  →  Session 5: Risk Metrics
        ↓                          ↓                           ↓
Remove errors              Find trends/cycles          Calculate volatility
Fix splits                 Seasonal patterns           Correlation analysis
Validate quality           Forecast future             Risk-adjusted returns
Preview Challenge#
Before next session, think about:
How do stock prices change over time?
What patterns repeat (daily, weekly, yearly)?
How do you measure if prices are trending?
Can past patterns predict future moves?
Professional Development Path#
Your data cleaning skills enable:
Immediate: Trust your analysis results
Next Month: Automate cleaning workflows
Six Months: Handle complex corporate actions
One Year: Design enterprise data quality systems
Preparation for Session 4#
Technical Prep:
Ensure you have 2+ years of clean data
Practice basic statistics (mean, std dev)
Review datetime handling in Python
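For the datetime refresher, a minimal sketch (assuming pandas) of the operations next session leans on:
# Minimal sketch: the datetime operations Session 4 will lean on
import pandas as pd
dates = pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04'])
prices = pd.Series([192.35, 191.80, 190.50], index=dates)
print(prices.index.day_name())      # trading-day names
print(prices.resample('W').last())  # weekly closes from daily data
print(prices.pct_change())          # daily returns, datetime-aligned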
Conceptual Prep:
Think about patterns in stock prices
Consider what drives price changes
Review basic statistics concepts
Professional Prep:
Research time series in finance
Understand trending vs. mean reversion
Consider forecast applications
Section 8: Appendix - Solutions & Implementation Guide#
Complete Solution Code#
# COMPLETE DATA CLEANING SOLUTION
# Simple enough for beginners, thorough enough for pros
# ===== PART 1: SETUP =====
import yfinance as yf
import pandas as pd
import numpy as np
# Portfolio to clean
stocks = ['JNJ', 'KO', 'PEP', 'PG', 'WMT']
results = []
# ===== PART 2: CLEAN EACH STOCK =====
print("Starting portfolio data cleaning...\n")
for ticker in stocks:
    print(f"Cleaning {ticker}...")
    # Download data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    # Count issues before cleaning
    issues_before = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    # Clean the data
    # Step 1: Fill missing values first, so forward-fill still has gaps to fill
    data['Close'] = data['Close'].ffill()
    data['Volume'] = data['Volume'].fillna(0)
    # Step 2: Remove any remaining bad data (zero or negative prices)
    data = data[data['Close'] > 0].copy()
    # Step 3: Check for extreme moves
    data['Return'] = data['Close'].pct_change()
    extreme_days = abs(data['Return']) > 0.20
    extreme_count = extreme_days.sum()
    # Count issues after cleaning
    issues_after = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    # Calculate quality score
    total_days = len(data)
    clean_days = total_days - issues_after['missing']
    quality_score = (clean_days / total_days) * 100
    # Store results
    results.append({
        'ticker': ticker,
        'total_days': total_days,
        'issues_fixed': sum(issues_before.values()),
        'extreme_moves': extreme_count,
        'quality_score': quality_score
    })
    print(f"  Fixed {sum(issues_before.values())} issues")
    print(f"  Quality score: {quality_score:.1f}%\n")
# ===== PART 3: SUMMARY REPORT =====
print("=== CLEANING SUMMARY ===")
print(f"{'Stock':<6} {'Days':<6} {'Fixed':<6} {'Quality':<8}")
print("-" * 30)
for result in results:
    print(f"{result['ticker']:<6} {result['total_days']:<6} "
          f"{result['issues_fixed']:<6} {result['quality_score']:<8.1f}%")
# Overall portfolio quality
avg_quality = sum(r['quality_score'] for r in results) / len(results)
print(f"\nPortfolio Quality: {avg_quality:.1f}%")
Common Student Mistakes#
Over-complicating code
Wrong: Creating classes and complex functions
Right: Simple step-by-step cleaning
Removing too much data
Wrong: Delete any row with issues
Right: Fix what you can, document what you remove
No validation
Wrong: Assume cleaning worked
Right: Always verify with checks
Poor documentation
Wrong: No comments or explanation
Right: Document every decision
Data Quality Checklist#
Before analysis, verify:
No missing prices in critical periods
No zero or negative prices
No impossible returns (>50% daily)
Consistent date progression
Reasonable price ranges
Volume data present
Dividends captured (if applicable)
Splits adjusted properly
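One way to operationalize the list: a minimal sketch that turns several of the items into automated pass/fail checks. It assumes `data` is a yfinance-style DataFrame with 'Close' and 'Volume' columns; the dividend and split items are omitted because they need reference data to verify.
# Minimal sketch: part of the checklist above as automated checks
# Assumes `data` is a DataFrame with 'Close' and 'Volume' columns
checks = {
    'No missing prices': data['Close'].isna().sum() == 0,
    'No zero/negative prices': (data['Close'] <= 0).sum() == 0,
    'No impossible returns (>50% daily)': (data['Close'].pct_change().abs() > 0.5).sum() == 0,
    'Dates strictly increasing': data.index.is_monotonic_increasing,
    'Volume data present': data['Volume'].notna().all(),
}
for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")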
Implementation Notes for Instructors#
Pre-Session Setup#
Prepare dataset with known issues
Include recent stock split examples
Have backup data ready
Test cleaning code
Common Issues & Solutions#
“No bad data found”: Use older data or penny stocks
“Too much bad data”: Start with blue-chip stocks
“Code too complex”: Enforce line-by-line approach
“Can’t explain decisions”: Require comments for each step
Assessment Tips#
Watch for copy-pasted solutions
Ask them to clean a new stock live
Test understanding of WHY, not just HOW
Verify they can identify issues manually
Data Files#
Note: Students should work with real messy data.
Suggested problem stocks:
TSLA - Recent splits and volatility
Penny stocks - Missing data and errors
International ADRs - Currency issues
Biotech stocks - Extreme moves
Instructor Resources:
Pre-made dirty datasets
Common error patterns
Validation templates
Grading rubrics
End of Session 3: Financial Data Cleaning & Validation
Next Session: Time Series Analysis - Finding Patterns in Clean Financial Data