Week 3: Advanced Data Handling
Topics Covered:
- Advanced Pandas for Financial Data
- AI-Enhanced Data Processing
- SAS to Python Migration
- Efficient Data Processing Techniques
- Financial Time Series Operations
- Data Quality and Validation
Core Concepts and Implementation:
- AI-Powered Pandas Tools
- pandasai (PandasAI): AI-powered analysis and natural language queries for DataFrames
- dataprep: Automated data preparation
- autoviz: Automated visualization generation
- SAS to Python Migration (see the sketch after this list)
- Reading SAS (.sas7bdat) files with pandas
- SAS to Python syntax conversion
- WRDS data handling best practices
- Performance optimization for large SAS datasets
- Advanced Pandas Operations (see the sketch after this list)
- MultiIndex and hierarchical indexing
- Advanced groupby operations
- Rolling and expanding windows
- Efficient memory usage with categorical data
- Financial Data Processing (see the sketch after this list)
- Handling missing data in financial time series
- Adjusting for corporate actions
- Working with different time zones
- Managing point-in-time data
- Performance Optimization (see the sketch after this list)
- Vectorized operations
- Efficient data types
- Chunked data processing
- Using pandas.eval() for large datasets
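For the SAS migration items above, a minimal sketch of reading a SAS (.sas7bdat) extract with pandas; the file name and the ret column are placeholders to adapt to your own data.
# Example code - adapt file paths and column names to your own data
import pandas as pd
# Read a SAS dataset directly into a DataFrame
crsp = pd.read_sas("crsp_monthly.sas7bdat", format="sas7bdat", encoding="latin-1")
# For large WRDS extracts, read in chunks to limit memory usage
reader = pd.read_sas("crsp_monthly.sas7bdat", format="sas7bdat",
                     encoding="latin-1", chunksize=100_000)
parts = [chunk.dropna(subset=["ret"]) for chunk in reader]  # clean each chunk as it arrives
crsp = pd.concat(parts, ignore_index=True)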
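For the advanced pandas operations, a sketch of hierarchical indexing and per-ticker rolling windows on a small synthetic panel; the column names are illustrative only.
# Example code - replace the synthetic panel with your own data
import numpy as np
import pandas as pd
dates = pd.bdate_range("2024-01-01", periods=60)
df = pd.DataFrame({
    "date": np.tile(dates, 2),
    "ticker": ["AAA"] * 60 + ["BBB"] * 60,
    "ret": np.random.default_rng(0).normal(0, 0.01, 120),
})
# MultiIndex (hierarchical) view: one row per (date, ticker)
panel = df.set_index(["date", "ticker"]).sort_index()
# 21-day rolling volatility computed separately for each ticker
vol_21d = (df.sort_values("date")
             .groupby("ticker")["ret"]
             .rolling(21)
             .std())
# Expanding (since-inception) mean return per ticker
mean_exp = df.sort_values("date").groupby("ticker")["ret"].expanding().mean()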
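For the financial data processing items, a sketch of gap handling and time-zone conversion on a daily price series; the values are made up for illustration.
# Example code - illustrative values only
import numpy as np
import pandas as pd
idx = pd.bdate_range("2024-01-02", periods=8)
px = pd.Series([100.0, np.nan, 101.5, 102.0, np.nan, 103.2, 104.0, 105.1], index=idx)
# Reindex to a full business-day calendar, then forward-fill prices
# (forward-fill prices, not returns; recompute returns after filling)
full_idx = pd.bdate_range(px.index.min(), px.index.max())
px_clean = px.reindex(full_idx).ffill()
# Attach the exchange time zone, then convert to UTC when aligning across markets
px_ny = px_clean.tz_localize("America/New_York")
px_utc = px_ny.tz_convert("UTC")
# Simple returns from the cleaned price series
rets = px_clean.pct_change().dropna()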
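For the performance items, a sketch of dtype and evaluation techniques on a synthetic trade table; column names and sizes are illustrative.
# Example code - synthetic data for illustration
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
n = 1_000_000
trades = pd.DataFrame({
    "ticker": rng.choice(["AAA", "BBB", "CCC"], size=n),
    "prc": rng.uniform(10, 100, size=n),
    "vol": rng.integers(100, 10_000, size=n),
})
# Categorical dtype sharply reduces memory for repeated string identifiers
trades["ticker"] = trades["ticker"].astype("category")
print(trades.memory_usage(deep=True))
# DataFrame.eval()/pandas.eval() avoid large temporary arrays on big frames
trades = trades.eval("dollar_vol = prc * vol")
# Chunked processing for files too large to load at once (hypothetical file name)
# totals = sum(chunk["vol"].sum() for chunk in pd.read_csv("trades.csv", chunksize=500_000))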
AI-Enhanced Data Analysis Setup:
- LLM Options
- Ollama (Free, provided on department GPU server)
- OpenAI API (Student purchase required, ~$5-20/month)
- Anthropic API (Student purchase required, pricing varies)
- AI-Powered Tools
- pandasai (PandasAI): natural language DataFrame operations; supports multiple LLM backends, including Ollama
- dataprep & autoviz: Automated analysis tools (see the example under Using AI Tools below)
Getting Started:
- Ollama Setup
- Access provided on department GPU server
- See Ollama Model Library for available models and usage
- Follow the official documentation for model commands
- Optional Commercial APIs
# If you choose to purchase API access:
# 1. Create accounts at openai.com or anthropic.com
# 2. Purchase credits (student discounts may be available)
# 3. Create a .env file (never commit this!)
OPENAI_API_KEY=your_purchased_key
ANTHROPIC_API_KEY=your_purchased_key
- Install Required Packages
pip install python-dotenv pandasai dataprep autoviz
Using AI Tools:
Important Note: The code examples below are for demonstration purposes only. They illustrate the general approach but are not production-ready. You will need to:
- Debug and adapt the code to your specific use case
- Handle errors and edge cases
- Test with your actual data structure
- Refer to the latest documentation as APIs may change
# Example code - requires debugging and adaptation
# Using Ollama (Available to all students)
from pandasai import SmartDataframe
from pandasai.llm import Ollama
llm_local = Ollama(model="llama2") # See Ollama docs for available models
df_local = SmartDataframe(your_dataframe, config={'llm': llm_local}) # Replace your_dataframe
result_local = df_local.chat('Generate summary statistics')
# If you've purchased API access:
from dotenv import load_dotenv
import os
load_dotenv()
# OpenAI example (if purchased)
from pandasai.llm import OpenAI
llm = OpenAI(api_token=os.getenv('OPENAI_API_KEY'))
df = SmartDataframe(your_dataframe, config={'llm': llm}) # Replace your_dataframe
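The automated tools listed earlier (dataprep, autoviz) follow a similar pattern. A rough sketch, assuming current package APIs and reusing the your_dataframe placeholder:
# Example code - requires debugging and adaptation
from dataprep.eda import create_report
from autoviz.AutoViz_Class import AutoViz_Class
report = create_report(your_dataframe)  # Replace your_dataframe; builds an HTML profiling report
report.save("eda_report.html")
AV = AutoViz_Class()
AV.AutoViz(filename="", dfte=your_dataframe, verbose=0)  # Automated chart generation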
Note: These examples assume certain package versions and configurations. Always check the current documentation and be prepared to debug integration issues.
Model Comparison:
- Ollama (Provided)
- Free access via department GPU server
- Good for initial development and testing
- Suitable for most course assignments
- No usage limits or costs
- Commercial APIs (Optional)
- Higher accuracy but requires payment
- OpenAI: Strong general performance (~$5-20/month)
- Anthropic: Detailed analysis (pricing varies)
- Consider for advanced projects or research
Weekly Assignment:
Due: End of Week 3
Tasks:
- Data Analysis Setup
- Install pandasai and related packages
- Configure Ollama access
- Test basic functionality
- Financial Data Analysis (starter sketch after this list)
- Load and clean sample financial data
- Perform basic statistical analysis
- Create time series visualizations
- AI-Enhanced Analysis
- Use Ollama for data exploration
- Generate automated insights
- Compare with traditional analysis
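A minimal starting point for the financial data analysis task, assuming a CSV of daily prices with a date column; the file and column names are placeholders.
# Starter sketch - adapt to the sample data provided with the assignment
import pandas as pd
import matplotlib.pyplot as plt
prices = pd.read_csv("sample_prices.csv", parse_dates=["date"], index_col="date")
prices = prices.sort_index().ffill()        # basic cleaning: order by date, fill gaps
rets = prices.pct_change().dropna()         # simple daily returns
print(rets.describe())                      # basic statistical analysis
(1 + rets).cumprod().plot(title="Growth of $1")  # time series visualization
plt.show()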
Submit: As instructed in the weekly assignment
Week 3 Projects:
- Market Analysis (Using Ollama)
- Build data processing pipeline
- Implement natural language queries
- Generate automated reports
- Optional: Compare with commercial API results
- Data Processing Pipeline
- Automate data cleaning with AI assistance
- Create interactive analysis system
- Implement quality checks (see the sketch after this list)
- Generate comprehensive reports
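For the quality-check step, a sketch of simple validation rules on a long-format price/return panel; the column names (date, ticker, prc, ret) and thresholds are assumptions to adapt.
# Example code - adjust columns and thresholds to your data
import pandas as pd
def run_quality_checks(df: pd.DataFrame) -> dict:
    """Return simple data-quality flags for a long-format price/return panel."""
    return {
        "missing_values": df.isna().sum().to_dict(),                      # per-column NaN counts
        "duplicate_keys": int(df.duplicated(subset=["date", "ticker"]).sum()),
        "nonpositive_prices": int((df["prc"] <= 0).sum()),
        "extreme_returns": int((df["ret"].abs() > 1.0).sum()),            # flag |return| > 100%
    }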
Note: All course assignments can be completed using the provided Ollama setup, though results will not be perfect. Commercial APIs are optional; exploring them is encouraged but left to each student's discretion.
Best Practices:
- Resource Management
- Check GPU server status before running jobs
- Use batch processing for large datasets
- Monitor GPU memory usage
- If Using Commercial APIs
- Monitor usage costs carefully
- Use .env files for API keys
- Never commit API keys to version control
Additional Resources:
Check the department's GPU server status page for Ollama availability and usage guidelines. For those interested in commercial APIs, compare pricing and features before purchasing.