Session 3: Financial Data Cleaning & Validation - Making Your Data Analysis-Ready

Learning Objectives#

By the end of this session, you will be able to:

  1. Identify and fix common data quality issues in financial datasets that cause analysis errors

  2. Clean messy stock data using simple, systematic Python techniques

  3. Handle missing values and outliers without complex programming

  4. Validate your cleaned data to ensure it’s ready for analysis

  5. Document your cleaning process for transparency and reproducibility

  6. Present your data quality improvements clearly in a professional video


Section 1: The Financial Hook#

The $440 Million Data Error#

In August 2012, Knight Capital Group lost $440 million in 45 minutes. The cause? A botched software deployment left stale configuration in its trading system, and the erroneous orders it generated went unchecked and unvalidated until the damage was done.

Here’s a simpler example you might face:

Your Analysis: “Tesla stock crashed 67% on August 25, 2022!”

Reality: Tesla had a 3-for-1 stock split. The price went from about $900 to about $300, but shareholders now held three times as many shares.

Without proper data cleaning, you’d panic sell. With cleaned data, you’d see no actual loss occurred.

Common Data Problems That Cost Money#

FINANCIAL DATA NIGHTMARES:
├── Stock splits (fake crashes)
├── Missing dividend data (undervalued returns)
├── Weekend gaps (analysis errors)
├── Wrong decimal places ($0.01 vs $1.00)
└── Outliers from data errors (not real trades)

Why This Matters for Your Career#

  • Entry-Level Analyst: Clean data manually, one stock at a time

  • Senior Analyst: Build automated cleaning systems for portfolios

  • Portfolio Manager: Trust clean data for million-dollar decisions

  • Your Goal: Learn systematic cleaning that scales with your career

Real-World Timeline#

Day 1: Download messy data
Day 2: Try analysis, get nonsense results
Day 3: Learn data cleaning (today!)
Day 4: Produce accurate analysis
Week 2: Automate for multiple stocks
Month 2: Handle complex corporate actions

Section 2: Foundational Financial Concepts & Models#

What is Data Cleaning?#

Data Cleaning: The process of detecting and correcting errors in financial data to ensure accurate analysis.

Think of it like this:

  • Raw Data = Ingredients from the store (might have bad items)

  • Data Cleaning = Washing and preparing ingredients

  • Clean Data = Ready to cook with

Types of Data Quality Issues#

1. Missing Values#

Date        AAPL Price
2024-01-01  $192.35
2024-01-02  $191.80
2024-01-03  NaN         Missing! Market was closed?
2024-01-04  $190.50
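
In pandas, a missing price shows up as NaN, and you can count and locate gaps directly. A minimal sketch reproducing the table above:

# Recreate the AAPL example, including the missing price
import pandas as pd
import numpy as np

prices = pd.Series(
    [192.35, 191.80, np.nan, 190.50],
    index=pd.to_datetime(['2024-01-01', '2024-01-02',
                          '2024-01-03', '2024-01-04']),
    name='AAPL'
)

print(prices.isna().sum())    # 1 -> one missing value
print(prices[prices.isna()])  # shows the affected date (2024-01-03)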

2. Outliers and Errors#

Date        MSFT Price
2024-02-01  $405.20
2024-02-02  $4.05       Decimal error! Should be $405.00
2024-02-03  $406.10

3. Corporate Actions#

What are Corporate Actions? Corporate actions are events initiated by a company that affect its stock price or structure. Common examples include:

  • Stock splits: Dividing shares into more pieces (like cutting a pizza into more slices)

  • Dividends: Cash payments to shareholders

  • Mergers: Two companies combining into one

  • Spin-offs: One company splitting into separate companies

These events can make prices look wrong if you don’t know they happened!

Date        TSLA Price
2022-08-24  $891.30
2022-08-25  $297.10     3:1 split, not a crash!
2022-08-26  $299.50
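
The standard fix is to divide every pre-split price by the split ratio so the series is continuous across the split date. A minimal sketch using the TSLA numbers above:

# Adjust pre-split prices for TSLA's 3-for-1 split (effective 2022-08-25)
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]
split_ratio = 3
split_index = 1  # first row already trading at the post-split price

adjusted = [price / split_ratio if i < split_index else price
            for i, price in enumerate(prices)]

for date, price in zip(dates, adjusted):
    print(f"{date}: ${price:.2f}")  # 2022-08-24 becomes $297.10, so no crash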

Mathematical Framework#

What is an Outlier? An outlier is a data point that’s extremely different from other data points. In stock prices, this could be a real event (like a crash) or a data error (like a missing decimal point).

Outlier Detection Using Standard Deviation:

The Z-score tells us how many standard deviations away from average a value is:

\[Z\text{-}score = \frac{Value - Mean}{Standard\ Deviation}\]

Understanding Z-scores:

  • Z-score = 0: The value is exactly average

  • Z-score = 1: The value is 1 standard deviation above average

  • Z-score = -2: The value is 2 standard deviations below average

  • |Z-score| > 3: The value is so extreme it’s likely an outlier (happens less than 0.3% of the time)

Example: If Apple’s average daily return is 0.1% with standard deviation 2%, and today it shows 10% return: Z-score = (10% - 0.1%) / 2% = 4.95 → Likely an outlier!

Data Completeness Score:

\[Completeness = \frac{Non\text{-}Missing\ Values}{Total\ Expected\ Values} \times 100\%\]

Return Calculation Check:

\[Daily\ Return = \frac{Price_{today} - Price_{yesterday}}{Price_{yesterday}}\]

If daily return > 50% or < -50%, investigate for splits!
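
The three checks above fit naturally into one helper. Below is a minimal sketch (the function name is illustrative; a pandas Series of closing prices is assumed):

import pandas as pd

def quality_checks(close: pd.Series) -> None:
    """Completeness, split screen, and Z-score outlier check."""
    # Completeness: share of non-missing values
    completeness = close.notna().sum() / len(close) * 100
    print(f"Completeness: {completeness:.1f}%")

    # Daily returns; |return| > 50% suggests a split or data error
    returns = close.pct_change().dropna()
    print(f"Possible splits/errors: {(returns.abs() > 0.50).sum()} day(s)")

    # Z-scores of returns; |Z| > 3 flags likely outliers
    z_scores = (returns - returns.mean()) / returns.std()
    print(f"Return outliers (|Z| > 3): {(z_scores.abs() > 3).sum()} day(s)")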

🎯 AI Learning Support - Understanding Data Quality#

Learning Goal: Develop the ability to identify data quality issues systematically.

Starting Prompt: “I have stock price data with some weird values. How do I find problems?”

🚀 Hints to Improve Your Prompt:

  • Add specific context (what stock? what time period?)

  • Describe the weird values you’re seeing

  • Mention what analysis you’re trying to do

  • Ask for a systematic approach, not just a quick fix

💡 Better Version Hints:

  • Include actual numbers from your data

  • Specify if you’re looking for splits, errors, or missing data

  • Ask for both detection AND correction methods

  • Request validation steps to confirm fixes worked

🎯 Your Challenge: Rewrite the prompt to get comprehensive help for YOUR specific data issues.


Section 3: The Financial Gym - Partner Practice & AI Copilot Learning#

Solo Warm-Up (15 minutes)#

Exercise 1: Spot the Data Problem

# Apple stock data - can you spot the issues?
# Each line has a comment explaining what we're doing

# Step 1: Create sample data with problems
dates = ['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
prices = [192.35, 0.0, 191.80, 1915.50]  # Problems hidden here!

# Step 2: Look at each price
print("Apple Daily Prices:")
for i in range(len(dates)):
    print(f"{dates[i]}: ${prices[i]}")

# What problems do you see?
# Problem 1: ________________
# Problem 2: ________________

AI Copilot Learning Phase (20 minutes)#

🎯 AI Learning Support - Data Cleaning Strategy

Starting Prompt: “I have stock prices with zeros and weird values. What should I do?”

🚀 Make It Better By Adding:

  • The actual data you’re working with

  • What “weird” means (too high? too low? missing?)

  • What you’ve already tried

  • Whether you need to preserve all data or can remove some

💡 Level-Up Hints:

  • Ask for step-by-step cleaning process

  • Request validation methods

  • Include downstream analysis needs

  • Ask about documentation requirements

Reciprocal Teaching Component (25 minutes)#

Partner A Teaching Points (10 min):

  1. Explain what makes data “dirty” vs “clean”

  2. Show how to spot a stock split in price data

  3. Demonstrate filling missing values

Partner B Teaching Points (10 min):

  1. Explain why outliers matter in financial data

  2. Show how to calculate Z-scores simply

  3. Demonstrate validating cleaned data

Together (5 min):

  • Find a real stock with a recent split and download its data

  • Identify the split date together

Collaborative Challenge (20 minutes)#

Clean Tesla’s Split Data Together

# SIMPLE DATA CLEANING CHALLENGE
# Work together to fix this data

# Tesla prices around the August 2022 split
dates = ['2022-08-24', '2022-08-25', '2022-08-26']
prices = [891.30, 297.10, 299.50]

# Step 1: Calculate daily returns
print("Daily returns:")
for i in range(1, len(prices)):
    daily_return = (prices[i] - prices[i-1]) / prices[i-1]
    print(f"{dates[i]}: {daily_return * 100:.1f}%")

# Step 2: Identify the problem
# Your code here: Detect if any return is too extreme

# Step 3: Fix the split issue
# Your code here: Adjust pre-split prices

💡 AI Learning Support - Troubleshooting#

Starting Prompt: “My code shows Tesla dropped 67%. How do I fix this?”

🚀 Improvement Hints:

  • Specify the dates you’re analyzing

  • Mention you suspect a stock split

  • Ask for the split ratio and how to adjust

  • Request verification method

🎯 Your Challenge: Create a prompt that would help someone who has never seen a stock split before.


Section 4: The Financial Coaching - Your DRIVER Learning Guide#

Complete DRIVER Case Study: Cleaning a Dividend Portfolio#

Scenario: Your manager asks: “Clean our dividend portfolio data for quarterly analysis. We need accurate prices and dividends for JNJ, KO, and PEP. Make sure the data is reliable.”

D - Define & Discover#

Understanding the Problem

Data quality issues that affect dividend analysis:

  • Missing dividend payments

  • Stock splits affecting price continuity

  • Weekend/holiday gaps

  • Data vendor errors

💻 AI Learning Support - Problem Discovery

Starting Prompt: “I need to clean data for dividend stocks. What should I check?”

🚀 Enhancement Hints:

  • Specify which stocks and time period

  • Mention you need both prices AND dividends

  • Ask about dividend-specific quality issues

  • Request priority order for cleaning tasks

Design Requirements:

  • Handle 3 dividend stocks

  • Clean 1 year of daily data

  • Validate prices and dividends

  • Document all changes

R - Represent#

Data Cleaning Workflow

1. Load raw data
   
2. Check for missing values
   
3. Identify outliers/errors
   
4. Fix identified issues
   
5. Validate the fixes
   
6. Document changes
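
One way to keep these six steps honest is a single function that returns both the cleaned data and a change log, so documentation (step 6) happens automatically. A minimal sketch, assuming a pandas DataFrame with a Close column (the function name is illustrative):

import pandas as pd

def clean_close_prices(data: pd.DataFrame) -> tuple:
    """Run steps 2-6 of the workflow on a copy of the input."""
    data = data.copy()
    log = []
    log.append(f"missing values found: {data['Close'].isna().sum()}")  # step 2
    log.append(f"non-positive prices: {(data['Close'] <= 0).sum()}")   # step 3
    data['Close'] = data['Close'].ffill()                              # step 4a: fill gaps
    data = data[data['Close'] > 0]                                     # step 4b: drop bad rows
    assert data['Close'].isna().sum() == 0                             # step 5: validate
    log.append(f"rows kept after cleaning: {len(data)}")               # step 6: document
    return data, log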

🚀 AI Learning Support - Workflow Design

Starting Prompt: “Help me plan data cleaning steps.”

💡 Better Prompt Hints:

  • Specify your data types (prices, dividends, volumes)

  • Ask for decision points (when to remove vs. fix data)

  • Request validation methods for each step

  • Include regulatory/audit requirements

I - Implement#

Simple Data Cleaning Implementation

# DIVIDEND PORTFOLIO DATA CLEANING
# Everything step-by-step, no complex code

# Step 1: Setup our portfolio
stocks = ['JNJ', 'KO', 'PEP']
stock_names = ['Johnson & Johnson', 'Coca-Cola', 'PepsiCo']

# Step 2: Import what we need
import yfinance as yf
import pandas as pd

# Step 3: Download and clean each stock
for i in range(len(stocks)):
    ticker = stocks[i]
    name = stock_names[i]
    
    print(f"\n=== Cleaning {name} ({ticker}) ===")
    
    # Download 1 year of data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    
    # Check 1: Missing values
    missing_count = data['Close'].isna().sum()
    print(f"Missing values: {missing_count}")
    
    # Check 2: Zero or negative prices
    bad_prices = (data['Close'] <= 0).sum()
    print(f"Bad prices: {bad_prices}")
    
    # Check 3: Extreme daily moves (possible splits)
    data['Daily_Return'] = data['Close'].pct_change()
    extreme_moves = (abs(data['Daily_Return']) > 0.20).sum()
    print(f"Extreme moves (>20%): {extreme_moves}")
    
    # Simple cleaning
    # Fix 1: Fill missing prices with previous day
    data['Close'] = data['Close'].ffill()
    
    # Fix 2: Remove any remaining bad data
    data = data[data['Close'] > 0]
    
    # Validation
    final_missing = data['Close'].isna().sum()
    print(f"\nAfter cleaning:")
    print(f"Missing values: {final_missing}")
    print(f"Data points: {len(data)}")

💻 AI Learning Support - Implementation Help

Starting Prompt: “How do I fill missing values in stock data?”

🚀 Better Prompt Elements:

  • Specify forward-fill vs interpolation vs removal (compared in the sketch after this list)

  • Ask about finance-specific considerations

  • Request validation after filling

  • Include example code with your data structure
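
For reference, the three strategies behave quite differently on the same gap; a minimal sketch on a toy series:

import pandas as pd
import numpy as np

close = pd.Series([405.20, np.nan, 406.10])

print(close.ffill())        # forward-fill: repeat last traded price (405.20)
print(close.interpolate())  # interpolation: midpoint estimate (405.65)
print(close.dropna())       # removal: keep only observed prices

Forward-fill is a common default for daily prices because it repeats a value the market actually printed rather than inventing one.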

V - Validate#

Validation Checks

# VALIDATION: Make sure cleaning worked

# Step 1: Check data completeness
print("\n=== DATA VALIDATION ===")

# For each stock, validate the cleaning
for ticker in ['JNJ', 'KO', 'PEP']:
    # Get fresh data to validate
    stock = yf.Ticker(ticker) 
    clean_data = stock.history(period='1y')
    
    # Remove any bad prices
    clean_data = clean_data[clean_data['Close'] > 0]
    
    # Calculate quality metrics
    total_days = len(clean_data)
    missing = clean_data['Close'].isna().sum()
    completeness = ((total_days - missing) / total_days) * 100
    
    # Check price reasonableness
    min_price = clean_data['Close'].min()
    max_price = clean_data['Close'].max()
    price_ratio = max_price / min_price
    
    print(f"\n{ticker} Validation:")
    print(f"  Completeness: {completeness:.1f}%")
    print(f"  Price range: ${min_price:.2f} to ${max_price:.2f}")
    print(f"  Reasonable range: {'✅ Yes' if price_ratio < 3 else '❌ No'}")

🔍 AI Learning Support - Validation Strategy

Starting Prompt: “How do I know if my cleaned data is good?”

💡 Enhancement Ideas:

  • Ask for industry-specific validation rules

  • Request statistical tests for data quality

  • Include before/after comparisons

  • Ask about documentation standards

E - Evolve#

Enhancement Opportunities

Your basic cleaning can evolve to:

  1. Automated daily data quality reports

  2. Machine learning anomaly detection

  3. Multi-source data reconciliation

  4. Real-time data quality monitoring

🎯 AI Learning Support - Next Steps

Starting Prompt: “What else should I add to my data cleaning?”

🚀 Better Elements:

  • Describe current capabilities first

  • Specify scale (how many stocks?)

  • Ask about industry best practices

  • Request prioritized enhancement list

R - Reflect#

Key Insights

  1. Most data has quality issues - always check!

  2. Simple cleaning solves 80% of problems

  3. Documentation is as important as cleaning

  4. Validation prevents embarrassing errors

📋 AI Learning Support - Learning Synthesis

Starting Prompt: “What did I learn about data cleaning?”

💡 Reflection Improvements:

  • List specific techniques you used

  • Describe problems you solved

  • Ask about edge cases you missed

  • Request career development insights


Section 5: Assignment#

Scenario#

You’re a data analyst at a hedge fund. Your portfolio manager suspects data quality issues are affecting trading signals. Investigate and clean financial data to ensure accurate analysis.

Requirements#

Create a video (approximately 10-15 minutes) demonstrating:

  • Systematic discovery of data quality issues in a stock with known challenges

  • Implementation of cleaning solutions for identified problems

  • Validation of data integrity before and after cleaning

  • Impact analysis showing how bad data affects investment decisions

Execution Format#

  • Use your completed Jupyter notebook or Python script

  • Run your code cell-by-cell while explaining what each part does

  • Show outputs and interpret them immediately

  • Demonstrate before/after comparisons of your cleaning

Deliverables#

  1. Video demonstration showing code execution and analysis

  2. Python code file (.py or .ipynb)


Video Presentation Structure#

Part 2: Step-by-Step Implementation (40% - 5 minutes)#

2.1 Live Data Cleaning (3 minutes)

  • Load messy data

  • Identify issues systematically

  • Fix each issue with clear explanation

  • Explain WHY each fix is appropriate

2.2 Validation Process (2 minutes)

  • Prove your cleaning worked

  • Show before/after comparisons

  • Run quality checks

  • Be ready for: “What if the data had negative prices?”

Part 3: Results & Professional Impact (30% - 3 minutes)#

3.1 Present Clean Data (1.5 minutes)

  • Show summary statistics

  • Highlight what changed

  • Demonstrate data is analysis-ready

  • Calculate a simple metric with clean vs. dirty data

3.2 Business Implications (1.5 minutes)

  • Explain how cleaning affected results

  • Discuss what would happen without cleaning

  • Reflect on scalability

  • Consider regulatory requirements

Assessment Criteria#

Your grade depends on demonstrating understanding:

  • Can you identify data issues systematically?

  • Do you understand WHY each cleaning step matters?

  • Can you validate your cleaning worked?

  • Can you explain business impact clearly?


Section 6: Reflect & Connect - Financial Insights Discussion#

Individual Reflection (10 minutes)#

Write your thoughts on:

  1. What surprised you about data quality issues?

  2. Which cleaning technique was most valuable?

  3. How will data cleaning impact your analysis?

  4. What edge cases worry you most?

Small Group Discussion (15 minutes)#

In groups of 3-4, discuss:

  1. Share cleaning challenges

    • What issues were hardest to spot?

    • Which fixes were tricky?

    • How did you validate success?

  2. Compare approaches

    • Different ways to handle missing data

    • When to remove vs. fix data

    • Documentation strategies

  3. Real-world applications

    • How would this scale to 500 stocks?

    • What about intraday data?

    • International stock challenges?

Class Synthesis (15 minutes)#

Key Discussion Topics:

  1. The Data Quality Iceberg

    • Visible issues (missing data, zeros)

    • Hidden issues (splits, dividends, errors)

    • Systematic detection methods

  2. Professional Standards

    • Audit trail requirements

    • Regulatory compliance

    • Client trust implications

  3. Career Applications

    • Entry level: Clean data manually

    • Senior level: Build cleaning systems

    • Management: Ensure data governance

Connecting to Practice#

Industry Insight: “At State Street, we process $43 trillion in assets. A single data error can impact thousands of client portfolios. That’s why we have entire teams dedicated to data quality.”

- Jennifer Chen, Head of Data Quality, State Street

Key Takeaways Board#

Create class list of:

  • Most common data issues found

  • Best cleaning techniques

  • Validation must-haves

  • Documentation essentials


Section 7: Looking Ahead#

Skills Mastered#

Technical Competencies:

  • Systematic issue detection

  • Simple cleaning techniques

  • Validation frameworks

  • Documentation practices

Professional Skills:

  • Quality-first mindset

  • Attention to detail

  • Process documentation

  • Risk awareness

Building Bridges to Session 4#

Next session’s Time Series Analysis builds on clean data:

  1. From Clean to Analyzed: Today we cleaned; next we find patterns

  2. From Static to Dynamic: Today’s snapshot becomes tomorrow’s trends

  3. From Quality to Insights: Clean data enables reliable forecasting

  4. From Single to Series: One day’s price becomes historical analysis

Connecting Concepts#

Session 3: Clean Data   →   Session 4: Time Patterns   →   Session 5: Risk Metrics

Remove errors               Find trends/cycles             Calculate volatility
Fix splits                  Seasonal patterns              Correlation analysis
Validate quality            Forecast future                Risk-adjusted returns

Preview Challenge#

Before next session, think about:

  • How do stock prices change over time?

  • What patterns repeat (daily, weekly, yearly)?

  • How do you measure if prices are trending?

  • Can past patterns predict future moves?

Professional Development Path#

Your data cleaning skills enable:

  • Immediate: Trust your analysis results

  • Next Month: Automate cleaning workflows

  • Six Months: Handle complex corporate actions

  • One Year: Design enterprise data quality systems

Preparation for Session 4#

  1. Technical Prep:

    • Ensure you have 2+ years of clean data

    • Practice basic statistics (mean, std dev)

    • Review datetime handling in Python

  2. Conceptual Prep:

    • Think about patterns in stock prices

    • Consider what drives price changes

    • Review basic statistics concepts

  3. Professional Prep:

    • Research time series in finance

    • Understand trending vs. mean reversion

    • Consider forecast applications


Section 8: Appendix - Solutions & Implementation Guide#

Complete Solution Code#

# COMPLETE DATA CLEANING SOLUTION
# Simple enough for beginners, thorough enough for pros

# ===== PART 1: SETUP =====
import yfinance as yf
import pandas as pd
import numpy as np

# Portfolio to clean
stocks = ['JNJ', 'KO', 'PEP', 'PG', 'WMT']
results = []

# ===== PART 2: CLEAN EACH STOCK =====
print("Starting portfolio data cleaning...\n")

for ticker in stocks:
    print(f"Cleaning {ticker}...")
    
    # Download data
    stock = yf.Ticker(ticker)
    data = stock.history(period='1y')
    
    # Count issues before cleaning
    issues_before = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    
    # Clean the data
    # Step 1: Fill missing values (do this before removing rows)
    data['Close'] = data['Close'].ffill()
    data['Volume'] = data['Volume'].fillna(0)
    
    # Step 2: Remove any remaining non-positive prices
    data = data[data['Close'] > 0]
    
    # Step 3: Check for extreme moves
    data['Return'] = data['Close'].pct_change()
    extreme_days = abs(data['Return']) > 0.20
    extreme_count = extreme_days.sum()
    
    # Count issues after cleaning
    issues_after = {
        'missing': data['Close'].isna().sum(),
        'zeros': (data['Close'] == 0).sum(),
        'negative': (data['Close'] < 0).sum()
    }
    
    # Calculate quality score
    total_days = len(data)
    clean_days = total_days - issues_after['missing']
    quality_score = (clean_days / total_days) * 100
    
    # Store results
    results.append({
        'ticker': ticker,
        'total_days': total_days,
        'issues_fixed': sum(issues_before.values()),
        'extreme_moves': extreme_count,
        'quality_score': quality_score
    })
    
    print(f"  Fixed {sum(issues_before.values())} issues")
    print(f"  Quality score: {quality_score:.1f}%\n")

# ===== PART 3: SUMMARY REPORT =====
print("=== CLEANING SUMMARY ===")
print(f"{'Stock':<6} {'Days':<6} {'Fixed':<6} {'Quality':<8}")
print("-" * 30)

for result in results:
    print(f"{result['ticker']:<6} {result['total_days']:<6} "
          f"{result['issues_fixed']:<6} {result['quality_score']:<8.1f}%")

# Overall portfolio quality
avg_quality = sum(r['quality_score'] for r in results) / len(results)
print(f"\nPortfolio Quality: {avg_quality:.1f}%")

Common Student Mistakes#

  1. Over-complicating code

    • Wrong: Creating classes and complex functions

    • Right: Simple step-by-step cleaning

  2. Removing too much data

    • Wrong: Delete any row with issues

    • Right: Fix what you can, document what you remove

  3. No validation

    • Wrong: Assume cleaning worked

    • Right: Always verify with checks

  4. Poor documentation

    • Wrong: No comments or explanation

    • Right: Document every decision

Data Quality Checklist#

Before analysis, verify (a code sketch of these checks follows the list):

  • No missing prices in critical periods

  • No zero or negative prices

  • No impossible returns (>50% daily)

  • Consistent date progression

  • Reasonable price ranges

  • Volume data present

  • Dividends captured (if applicable)

  • Splits adjusted properly
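
The checklist translates almost line-for-line into code. A minimal sketch, assuming a yfinance-style DataFrame with Close and Volume columns (the 3x range rule mirrors the validation code earlier in this session):

import pandas as pd

def run_checklist(data: pd.DataFrame) -> dict:
    """Return pass/fail for each checkable pre-analysis item."""
    close = data['Close']
    returns = close.pct_change().dropna()
    return {
        'no_missing_prices':   close.isna().sum() == 0,
        'no_nonpositive':      (close <= 0).sum() == 0,
        'no_impossible_moves': (returns.abs() > 0.50).sum() == 0,
        'dates_increasing':    data.index.is_monotonic_increasing,
        'reasonable_range':    close.max() / close.min() < 3,
        'volume_present':      'Volume' in data and data['Volume'].notna().all(),
    }

# Example: all(run_checklist(clean_data).values()) should be True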

Implementation Notes for Instructors#

Pre-Session Setup#

  1. Prepare dataset with known issues

  2. Include recent stock split examples

  3. Have backup data ready

  4. Test cleaning code

Common Issues & Solutions#

  • “No bad data found”: Use older data or penny stocks

  • “Too much bad data”: Start with blue-chip stocks

  • “Code too complex”: Enforce line-by-line approach

  • “Can’t explain decisions”: Require comments for each step

Assessment Tips#

  • Watch for copy-pasted solutions

  • Ask them to clean a new stock live

  • Test understanding of WHY, not just HOW

  • Verify they can identify issues manually

Data Files#

Note: Students should work with real messy data.

Suggested problem stocks:

  1. TSLA - Recent splits and volatility

  2. Penny stocks - Missing data and errors

  3. International ADRs - Currency issues

  4. Biotech stocks - Extreme moves

Instructor Resources:

  • Pre-made dirty datasets

  • Common error patterns

  • Validation templates

  • Grading rubrics


End of Session 3: Financial Data Cleaning & Validation

Next Session: Time Series Analysis - Finding Patterns in Clean Financial Data