Session 3: Financial Data Cleaning & Validation - Making Your Data Analysis-Ready

Learning Objectives#

By the end of this session, you will be able to:

  1. Identify and fix common data quality issues in financial datasets that cause analysis errors

  2. Clean messy stock data using simple, systematic Python techniques

  3. Handle missing values and outliers without complex programming

  4. Validate your cleaned data to ensure it’s ready for analysis

  5. Document your cleaning process for transparency and reproducibility

  6. Present your data quality improvements clearly in a professional video


Section 1: The Financial Hook#

The $440 Million Data Error#

In August 2012, Knight Capital Group lost $440 million in 45 minutes. The cause? A botched software deployment left stale configuration in its trading system, and the erroneous orders it generated went unchecked and unvalidated until the damage was done.

Here’s a simpler example you might face:

Your Analysis: “Tesla stock crashed 67% on August 25, 2022!”

Reality: Tesla had a 3-for-1 stock split. The price went from about $900 to about $300, but shareholders now held three times as many shares.

Without proper data cleaning, you’d panic sell. With cleaned data, you’d see no actual loss occurred.

Common Data Problems That Cost Money#

FINANCIAL DATA NIGHTMARES:
├── Stock splits (fake crashes)
├── Missing dividend data (undervalued returns)
├── Weekend gaps (analysis errors)
├── Wrong decimal places ($0.01 vs $1.00)
└── Outliers from data errors (not real trades)

Why This Matters for Your Career#

  • Entry-Level Analyst: Clean data manually, one stock at a time

  • Senior Analyst: Build automated cleaning systems for portfolios

  • Portfolio Manager: Trust clean data for million-dollar decisions

  • Your Goal: Learn systematic cleaning that scales with your career

Real-World Timeline#

Day 1: Download messy data
Day 2: Try analysis, get nonsense results
Day 3: Learn data cleaning (today!)
Day 4: Produce accurate analysis
Week 2: Automate for multiple stocks
Month 2: Handle complex corporate actions

Section 2: Foundational Financial Concepts & Models#

What is Data Cleaning?#

Data Cleaning: The process of detecting and correcting errors in financial data to ensure accurate analysis.

Think of it like this:

  • Raw Data = Ingredients from the store (might have bad items)

  • Data Cleaning = Washing and preparing ingredients

  • Clean Data = Ready to cook with

Types of Data Quality Issues#

1. Missing Values#

Date        AAPL Price
2024-01-01  $192.35
2024-01-02  $191.80
2024-01-03  NaN         Missing! Market was closed?
2024-01-04  $190.50
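
In pandas, a missing price shows up as NaN, and you can count and locate gaps directly. A minimal sketch reproducing the table above:

# Recreate the AAPL example, including the missing price
import pandas as pd
import numpy as np

prices = pd.Series(
    [192.35, 191.80, np.nan, 190.50],
    index=pd.to_datetime(['2024-01-01', '2024-01-02',
                          '2024-01-03', '2024-01-04']),
    name='AAPL'
)

print(prices.isna().sum())    # 1 -> one missing value
print(prices[prices.isna()])  # shows the affected date (2024-01-03)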

2. Outliers and Errors#

Date        MSFT Price
2024-02-01  $405.20
2024-02-02  $4.05       Decimal error! Should be $405.00
2024-02-03  $406.10

3. Corporate Actions#

What are Corporate Actions? Corporate actions are events initiated by a company that affect its stock price or structure. Common examples include:

  • Stock splits: Dividing shares into more pieces (like cutting a pizza into more slices)

  • Dividends: Cash payments to shareholders

  • Mergers: Two companies combining into one

  • Spin-offs: One company splitting into separate companies

These events can make prices look wrong if you don’t know they happened!

Date        TSLA Price
2022-08-24  $891.30
2022-08-25  $297.10     3:1 split, not a crash!
2022-08-26  $299.50
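
The standard fix is to divide every pre-split price by the split ratio so the series is continuous across the split date. A minimal sketch using the TSLA numbers above:

# Adjust pre-split prices for TSLA's 3-for-1 split (effective 2022-08-25)
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]
split_ratio = 3
split_index = 1  # first row already trading at the post-split price

adjusted = [price / split_ratio if i < split_index else price
            for i, price in enumerate(prices)]

for date, price in zip(dates, adjusted):
    print(f"{date}: ${price:.2f}")  # 2022-08-24 becomes $297.10, so no crash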

Mathematical Framework#

What is an Outlier? An outlier is a data point that’s extremely different from other data points. In stock prices, this could be a real event (like a crash) or a data error (like a missing decimal point).

Outlier Detection Using Standard Deviation:

The Z-score tells us how many standard deviations away from average a value is:

\[Z\text{-}score = \frac{Value - Mean}{Standard\ Deviation}\]

Understanding Z-scores:

  • Z-score = 0: The value is exactly average

  • Z-score = 1: The value is 1 standard deviation above average

  • Z-score = -2: The value is 2 standard deviations below average

  • |Z-score| > 3: The value is so extreme it’s likely an outlier (happens less than 0.3% of the time)

Example: If Apple’s average daily return is 0.1% with standard deviation 2%, and today it shows 10% return: Z-score = (10% - 0.1%) / 2% = 4.95 → Likely an outlier!

Data Completeness Score:

\[Completeness = \frac{Non\text{-}Missing\ Values}{Total\ Expected\ Values} \times 100\%\]

Return Calculation Check:

\[Daily\ Return = \frac{Price_{today} - Price_{yesterday}}{Price_{yesterday}}\]

If daily return > 50% or < -50%, investigate for splits!
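
The three checks above fit naturally into one helper. Below is a minimal sketch (the function name is illustrative; a pandas Series of closing prices is assumed):

import pandas as pd

def quality_checks(close: pd.Series) -> None:
    """Completeness, split screen, and Z-score outlier check."""
    # Completeness: share of non-missing values
    completeness = close.notna().sum() / len(close) * 100
    print(f"Completeness: {completeness:.1f}%")

    # Daily returns; |return| > 50% suggests a split or data error
    returns = close.pct_change().dropna()
    print(f"Possible splits/errors: {(returns.abs() > 0.50).sum()} day(s)")

    # Z-scores of returns; |Z| > 3 flags likely outliers
    z_scores = (returns - returns.mean()) / returns.std()
    print(f"Return outliers (|Z| > 3): {(z_scores.abs() > 3).sum()} day(s)")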

🎯 AI Learning Support - Understanding Data Quality#

Learning Goal: Develop the ability to identify data quality issues systematically.

Starting Prompt: “I have stock price data with some weird values. How do I find problems?”

🚀 Hints to Improve Your Prompt:

  • Add specific context (what stock? what time period?)

  • Describe the weird values you’re seeing

  • Mention what analysis you’re trying to do

  • Ask for a systematic approach, not just a quick fix

💡 Better Version Hints:

  • Include actual numbers from your data

  • Specify if you’re looking for splits, errors, or missing data

  • Ask for both detection AND correction methods

  • Request validation steps to confirm fixes worked

🎯 Your Challenge: Rewrite the prompt to get comprehensive help for YOUR specific data issues.


Section 3: The Financial Gym - Partner Practice & AI Copilot Learning#

Solo Warm-Up (15 minutes)#

Exercise 1: Spot the Data Problem

# Apple stock data - can you spot the issues?
# Each line has a comment explaining what we're doing

# Step 1: Create sample data with problems
dates = ['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
prices = [192.35, 0.0, 191.80, 1915.50]  # Problems hidden here!

# Step 2: Look at each price
print("Apple Daily Prices:")
for i in range(len(dates)):
    print(f"{dates[i]}: ${prices[i]}")

# What problems do you see?
# Problem 1: ________________
# Problem 2: ________________

AI Copilot Learning Phase (20 minutes)#

🎯 AI Learning Support - Data Cleaning Strategy

Starting Prompt: “I have stock prices with zeros and weird values. What should I do?”

🚀 Make It Better By Adding:

  • The actual data you’re working with

  • What “weird” means (too high? too low? missing?)

  • What you’ve already tried

  • Whether you need to preserve all data or can remove some

💡 Level-Up Hints:

  • Ask for step-by-step cleaning process

  • Request validation methods

  • Include downstream analysis needs

  • Ask about documentation requirements

Reciprocal Teaching Component (25 minutes)#

Partner A Teaching Points (10 min):

  1. Explain what makes data “dirty” vs “clean”

  2. Show how to spot a stock split in price data

  3. Demonstrate filling missing values

Partner B Teaching Points (10 min):

  1. Explain why outliers matter in financial data

  2. Show how to calculate Z-scores simply

  3. Demonstrate validating cleaned data

Together (5 min):

  • Find a real stock with a recent split and download its data

  • Identify the split date together

Collaborative Challenge (20 minutes)#

Clean Tesla’s Split Data Together

# SIMPLE DATA CLEANING CHALLENGE
# Work together to fix this data

# Tesla prices around the August 2022 split
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]

# Step 1: Calculate daily returns
print("Daily returns:")
for i in range(1, len(prices)):
    daily_return = (prices[i] - prices[i-1]) / prices[i-1]
    print(f"{dates[i]}: {daily_return * 100:.1f}%")

# Step 2: Identify the problem
# Your code here: Detect if any return is too extreme

# Step 3: Fix the split issue
# Your code here: Adjust pre-split prices

💡 AI Learning Support - Troubleshooting#

Starting Prompt: “My code shows Tesla dropped 67%. How do I fix this?”

🚀 Improvement Hints:

  • Specify the dates you’re analyzing

  • Mention you suspect a stock split

  • Ask for the split ratio and how to adjust

  • Request verification method

🎯 Your Challenge: Create a prompt that would help someone who has never seen a stock split before.


Section 4: The Financial Coaching - Your DRIVER Learning Guide#

Complete DRIVER Case Study: Cleaning a Dividend Portfolio#

Scenario: Your manager asks: “Clean our dividend portfolio data for quarterly analysis. We need accurate prices and dividends for JNJ, KO, and PEP. Make sure the data is reliable.”

D - Define & Discover#

Understanding the Problem

Data quality issues that affect dividend analysis:

  • Missing dividend payments

  • Stock splits affecting price continuity

  • Weekend/holiday gaps

  • Data vendor errors

💻 AI Learning Support - Problem Discovery

Starting Prompt: “I need to clean data for dividend stocks. What should I check?”

🚀 Enhancement Hints:

  • Specify which stocks and time period

  • Mention you need both prices AND dividends

  • Ask about dividend-specific quality issues

  • Request priority order for cleaning tasks

Design Requirements:

  • Handle 3 dividend stocks

  • Clean 1 year of daily data

  • Validate prices and dividends

  • Document all changes

R - Represent#

Data Cleaning Workflow

1. Load raw data
   
2. Check for missing values
   
3. Identify outliers/errors
   
4. Fix identified issues
   
5. Validate the fixes
   
6. Document changes
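
One way to keep these six steps honest is a single function that returns both the cleaned data and a change log, so documentation (step 6) happens automatically. A minimal sketch, assuming a pandas DataFrame with a Close column (the function name is illustrative):

import pandas as pd

def clean_close_prices(data: pd.DataFrame) -> tuple:
    """Run steps 2-6 of the workflow on a copy of the input."""
    data = data.copy()
    log = []
    log.append(f"missing values found: {data['Close'].isna().sum()}")  # step 2
    log.append(f"non-positive prices: {(data['Close'] <= 0).sum()}")   # step 3
    data['Close'] = data['Close'].ffill()                              # step 4a: fill gaps
    data = data[data['Close'] > 0]                                     # step 4b: drop bad rows
    assert data['Close'].isna().sum() == 0                             # step 5: validate
    log.append(f"rows kept after cleaning: {len(data)}")               # step 6: document
    return data, log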

🚀 AI Learning Support - Workflow Design

Starting Prompt: “Help me plan data cleaning steps.”

💡 Better Prompt Hints:

  • Specify your data types (prices, dividends, volumes)

  • Ask for decision points (when to remove vs. fix data)

  • Request validation methods for each step

  • Include regulatory/audit requirements

I - Implement#

Simple Data Cleaning Implementation

# DIVIDEND PORTFOLIO DATA CLEANING
# Everything step-by-step, no complex code

# Step 1: Setup our portfolio
stocks = ['JNJ', 'KO', 'PEP']
stock_names = ['Johnson & Johnson', 'Coca-Cola', 'PepsiCo']

# Step 2: Import what we need
import yfinance as yf
import pandas as pd

# Step 3: Download and clean each stock
for i in range(len(stocks)):
    ticker = stocks[i]
    name = stock_names[i]
    
    print(f"\n=== Cleaning {name} ({ticker}) ===")
    
    # Download 1 year of data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    
    # Check 1: Missing values
    missing_count = data['Close'].isna().sum()
    print(f"Missing values: {missing_count}")
    
    # Check 2: Zero or negative prices
    bad_prices = (data['Close'] <= 0).sum()
    print(f"Bad prices: {bad_prices}")
    
    # Check 3: Extreme daily moves (possible splits)
    data['Daily_Return'] = data['Close'].pct_change()
    extreme_moves = (abs(data['Daily_Return']) > 0.20).sum()
    print(f"Extreme moves (>20%): {extreme_moves}")
    
    # Simple cleaning
    # Fix 1: Fill missing prices with previous day
    data['Close'] = data['Close'].ffill()
    
    # Fix 2: Remove any remaining bad data
    data = data[data['Close'] > 0]
    
    # Validation
    final_missing = data['Close'].isna().sum()
    print(f"\nAfter cleaning:")
    print(f"Missing values: {final_missing}")
    print(f"Data points: {len(data)}")

💻 AI Learning Support - Implementation Help

Starting Prompt: “How do I fill missing values in stock data?”

🚀 Better Prompt Elements:

  • Specify forward-fill vs interpolation vs removal (compared in the sketch after this list)

  • Ask about finance-specific considerations

  • Request validation after filling

  • Include example code with your data structure
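
For reference, the three strategies behave quite differently on the same gap; a minimal sketch on a toy series:

import pandas as pd
import numpy as np

close = pd.Series([405.20, np.nan, 406.10])

print(close.ffill())        # forward-fill: repeat last traded price (405.20)
print(close.interpolate())  # interpolation: midpoint estimate (405.65)
print(close.dropna())       # removal: keep only observed prices

Forward-fill is a common default for daily prices because it repeats a value the market actually printed rather than inventing one.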

V - Validate#

Validation Checks

# VALIDATION: Make sure cleaning worked

# Step 1: Check data completeness
print("\n=== DATA VALIDATION ===")

# For each stock, validate the cleaning
for ticker in ['JNJ', 'KO', 'PEP']:
    # Get fresh data to validate
    stock = yf.Ticker(ticker) 
    clean_data = stock.history(period='1y')
    
    # Remove any bad prices
    clean_data = clean_data[clean_data['Close'] > 0]
    
    # Calculate quality metrics
    total_days = len(clean_data)
    missing = clean_data['Close'].isna().sum()
    completeness = ((total_days - missing) / total_days) * 100
    
    # Check price reasonableness
    min_price = clean_data['Close'].min()
    max_price = clean_data['Close'].max()
    price_ratio = max_price / min_price
    
    print(f"\n{ticker} Validation:")
    print(f"  Completeness: {completeness:.1f}%")
    print(f"  Price range: ${min_price:.2f} to ${max_price:.2f}")
    print(f"  Reasonable range: {'✅ Yes' if price_ratio < 3 else '❌ No'}")

🔍 AI Learning Support - Validation Strategy

Starting Prompt: “How do I know if my cleaned data is good?”

💡 Enhancement Ideas:

  • Ask for industry-specific validation rules

  • Request statistical tests for data quality

  • Include before/after comparisons

  • Ask about documentation standards

E - Evolve#

Enhancement Opportunities

Your basic cleaning can evolve to:

  1. Automated daily data quality reports

  2. Machine learning anomaly detection

  3. Multi-source data reconciliation

  4. Real-time data quality monitoring

🎯 AI Learning Support - Next Steps

Starting Prompt: “What else should I add to my data cleaning?”

🚀 Better Elements:

  • Describe current capabilities first

  • Specify scale (how many stocks?)

  • Ask about industry best practices

  • Request prioritized enhancement list

R - Reflect#

Key Insights

  1. Most data has quality issues - always check!

  2. Simple cleaning solves 80% of problems

  3. Documentation is as important as cleaning

  4. Validation prevents embarrassing errors

📋 AI Learning Support - Learning Synthesis

Starting Prompt: “What did I learn about data cleaning?”

💡 Reflection Improvements:

  • List specific techniques you used

  • Describe problems you solved

  • Ask about edge cases you missed

  • Request career development insights


Section 5: Assignment#

Scenario#

You’re a data analyst at a hedge fund. Your portfolio manager suspects data quality issues are affecting trading signals. Investigate and clean financial data to ensure accurate analysis.

Requirements#

Create a video (approximately 10-15 minutes) demonstrating:

  • Systematic discovery of data quality issues in a stock with known challenges

  • Implementation of cleaning solutions for identified problems

  • Validation of data integrity before and after cleaning

  • Impact analysis showing how bad data affects investment decisions

Execution Format#

  • Use your completed Jupyter notebook or Python script

  • Run your code cell-by-cell while explaining what each part does

  • Show outputs and interpret them immediately

  • Demonstrate before/after comparisons of your cleaning

Deliverables#

  1. Video demonstration showing code execution and analysis

  2. Python code file (.py or .ipynb)


Video Presentation Structure#

Part 2: Step-by-Step Implementation (40% - 5 minutes)#

2.1 Live Data Cleaning (3 minutes)

  • Load messy data

  • Identify issues systematically

  • Fix each issue with clear explanation

  • Explain WHY each fix is appropriate

2.2 Validation Process (2 minutes)

  • Prove your cleaning worked

  • Show before/after comparisons

  • Run quality checks

  • Be ready for: “What if the data had negative prices?”

Part 3: Results & Professional Impact (30% - 3 minutes)#

3.1 Present Clean Data (1.5 minutes)

  • Show summary statistics

  • Highlight what changed

  • Demonstrate data is analysis-ready

  • Calculate a simple metric with clean vs. dirty data

3.2 Business Implications (1.5 minutes)

  • Explain how cleaning affected results

  • Discuss what would happen without cleaning

  • Reflect on scalability

  • Consider regulatory requirements

Assessment Criteria#

Your grade depends on demonstrating understanding:

  • Can you identify data issues systematically?

  • Do you understand WHY each cleaning step matters?

  • Can you validate your cleaning worked?

  • Can you explain business impact clearly?


Section 6: Reflect & Connect - Financial Insights Discussion#

Individual Reflection (10 minutes)#

Write your thoughts on:

  1. What surprised you about data quality issues?

  2. Which cleaning technique was most valuable?

  3. How will data cleaning impact your analysis?

  4. What edge cases worry you most?

Small Group Discussion (15 minutes)#

In groups of 3-4, discuss:

  1. Share cleaning challenges

    • What issues were hardest to spot?

    • Which fixes were tricky?

    • How did you validate success?

  2. Compare approaches

    • Different ways to handle missing data

    • When to remove vs. fix data

    • Documentation strategies

  3. Real-world applications

    • How would this scale to 500 stocks?

    • What about intraday data?

    • International stock challenges?

Class Synthesis (15 minutes)#

Key Discussion Topics:

  1. The Data Quality Iceberg

    • Visible issues (missing data, zeros)

    • Hidden issues (splits, dividends, errors)

    • Systematic detection methods

  2. Professional Standards

    • Audit trail requirements

    • Regulatory compliance

    • Client trust implications

  3. Career Applications

    • Entry level: Clean data manually

    • Senior level: Build cleaning systems

    • Management: Ensure data governance

Connecting to Practice#

Industry Insight: “At State Street, we process $43 trillion in assets. A single data error can impact thousands of client portfolios. That’s why we have entire teams dedicated to data quality.”

- Jennifer Chen, Head of Data Quality, State Street

Key Takeaways Board#

Create class list of:

  • Most common data issues found

  • Best cleaning techniques

  • Validation must-haves

  • Documentation essentials


Section 7: Looking Ahead#

Skills Mastered#

Technical Competencies:

  • Systematic issue detection

  • Simple cleaning techniques

  • Validation frameworks

  • Documentation practices

Professional Skills:

  • Quality-first mindset

  • Attention to detail

  • Process documentation

  • Risk awareness

Building Bridges to Session 4#

Next session’s Time Series Analysis builds on clean data:

  1. From Clean to Analyzed: Today we cleaned; next we find patterns

  2. From Static to Dynamic: Today’s snapshot becomes tomorrow’s trends

  3. From Quality to Insights: Clean data enables reliable forecasting

  4. From Single to Series: One day’s price becomes historical analysis

Connecting Concepts#

Session 3: Clean Data   →   Session 4: Time Patterns   →   Session 5: Risk Metrics

Remove errors               Find trends/cycles             Calculate volatility
Fix splits                  Seasonal patterns              Correlation analysis
Validate quality            Forecast future                Risk-adjusted returns

Preview Challenge#

Before next session, think about:

  • How do stock prices change over time?

  • What patterns repeat (daily, weekly, yearly)?

  • How do you measure if prices are trending?

  • Can past patterns predict future moves?

Professional Development Path#

Your data cleaning skills enable:

  • Immediate: Trust your analysis results

  • Next Month: Automate cleaning workflows

  • Six Months: Handle complex corporate actions

  • One Year: Design enterprise data quality systems

Preparation for Session 4#

  1. Technical Prep:

    • Ensure you have 2+ years of clean data

    • Practice basic statistics (mean, std dev)

    • Review datetime handling in Python

  2. Conceptual Prep:

    • Think about patterns in stock prices

    • Consider what drives price changes

    • Review basic statistics concepts

  3. Professional Prep:

    • Research time series in finance

    • Understand trending vs. mean reversion

    • Consider forecast applications


Section 8: Appendix - Solutions & Implementation Guide#

Complete Solution Code#

# COMPLETE DATA CLEANING SOLUTION
# Simple enough for beginners, thorough enough for pros

# ===== PART 1: SETUP =====
import yfinance as yf
import pandas as pd
import numpy as np

# Portfolio to clean
stocks = ['JNJ', 'KO', 'PEP', 'PG', 'WMT']
results = []

# ===== PART 2: CLEAN EACH STOCK =====
print("Starting portfolio data cleaning...\n")

for ticker in stocks:
    print(f"Cleaning {ticker}...")
    
    # Download data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    
    # Count issues before cleaning
    issues_before = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    
    # Clean the data
    # Step 1: Fill missing values (do this before removing rows)
    data['Close'] = data['Close'].ffill()
    data['Volume'] = data['Volume'].fillna(0)
    
    # Step 2: Remove any remaining non-positive prices
    data = data[data['Close'] > 0]
    
    # Step 3: Check for extreme moves
    data['Return'] = data['Close'].pct_change()
    extreme_days = abs(data['Return']) > 0.20
    extreme_count = extreme_days.sum()
    
    # Count issues after cleaning
    issues_after = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    
    # Calculate quality score
    total_days = len(data)
    clean_days = total_days - issues_after['missing']
    quality_score = (clean_days / total_days) * 100
    
    # Store results
    results.append({
        'ticker': ticker,
        'total_days': total_days,
        'issues_fixed': sum(issues_before.values()),
        'extreme_moves': extreme_count,
        'quality_score': quality_score
    })
    
    print(f"  Fixed {sum(issues_before.values())} issues")
    print(f"  Quality score: {quality_score:.1f}%\n")

# ===== PART 3: SUMMARY REPORT =====
print("=== CLEANING SUMMARY ===")
print(f"{'Stock':<6} {'Days':<6} {'Fixed':<6} {'Quality':<8}")
print("-" * 30)

for result in results:
    print(f"{result['ticker']:<6} {result['total_days']:<6} "
          f"{result['issues_fixed']:<6} {result['quality_score']:<8.1f}%")

# Overall portfolio quality
avg_quality = sum(r['quality_score'] for r in results) / len(results)
print(f"\nPortfolio Quality: {avg_quality:.1f}%")

Common Student Mistakes#

  1. Over-complicating code

    • Wrong: Creating classes and complex functions

    • Right: Simple step-by-step cleaning

  2. Removing too much data

    • Wrong: Delete any row with issues

    • Right: Fix what you can, document what you remove

  3. No validation

    • Wrong: Assume cleaning worked

    • Right: Always verify with checks

  4. Poor documentation

    • Wrong: No comments or explanation

    • Right: Document every decision

Data Quality Checklist#

Before analysis, verify (a code sketch of these checks follows the list):

  • No missing prices in critical periods

  • No zero or negative prices

  • No impossible returns (>50% daily)

  • Consistent date progression

  • Reasonable price ranges

  • Volume data present

  • Dividends captured (if applicable)

  • Splits adjusted properly
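
The checklist translates almost line-for-line into code. A minimal sketch, assuming a yfinance-style DataFrame with Close and Volume columns (the 3x range rule mirrors the validation code earlier in this session):

import pandas as pd

def run_checklist(data: pd.DataFrame) -> dict:
    """Return pass/fail for each checkable pre-analysis item."""
    close = data['Close']
    returns = close.pct_change().dropna()
    return {
        'no_missing_prices':   close.isna().sum() == 0,
        'no_nonpositive':      (close <= 0).sum() == 0,
        'no_impossible_moves': (returns.abs() > 0.50).sum() == 0,
        'dates_increasing':    data.index.is_monotonic_increasing,
        'reasonable_range':    close.max() / close.min() < 3,
        'volume_present':      'Volume' in data and data['Volume'].notna().all(),
    }

# Example: all(run_checklist(clean_data).values()) should be True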

Implementation Notes for Instructors#

Pre-Session Setup#

  1. Prepare dataset with known issues

  2. Include recent stock split examples

  3. Have backup data ready

  4. Test cleaning code

Common Issues & Solutions#

  • “No bad data found”: Use older data or penny stocks

  • “Too much bad data”: Start with blue-chip stocks

  • “Code too complex”: Enforce line-by-line approach

  • “Can’t explain decisions”: Require comments for each step

Assessment Tips#

  • Watch for copy-pasted solutions

  • Ask them to clean a new stock live

  • Test understanding of WHY, not just HOW

  • Verify they can identify issues manually

Data Files#

Note: Students should work with real messy data.

Suggested problem stocks:

  1. TSLA - Recent splits and volatility

  2. Penny stocks - Missing data and errors

  3. International ADRs - Currency issues

  4. Biotech stocks - Extreme moves

Instructor Resources:

  • Pre-made dirty datasets

  • Common error patterns

  • Validation templates

  • Grading rubrics


End of Session 3: Financial Data Cleaning & Validation

Next Session: Time Series Analysis - Finding Patterns in Clean Financial Data