DualLens Analytics — RAG-Powered Investment Analysis
Explore DualLens Analytics, a production-grade investment analysis platform that merges two complementary lenses — quantitative financial metrics and qualitative AI-strategy insights — into a single interactive dashboard. By combining real-time Yahoo Finance data with a Retrieval-Augmented Generation (RAG) pipeline grounded in company PDF reports, the system enables investors and analysts to make more informed assessments of a company's financial health and AI readiness.
Project Overview
DualLens Analytics delivers a dual-lens view of public companies by fusing two data streams: real-time stock prices and financial metrics from Yahoo Finance, and AI-strategy insights extracted from curated company PDF reports through a RAG pipeline powered by LangChain and ChromaDB. The result is an interactive four-tab Streamlit dashboard where users can explore financial trends, ask natural-language questions about AI initiatives, compare companies side-by-side, and view composite investment rankings — all grounded in a persisted vector knowledge base with 36+ automated tests and Docker-ready deployment.
Technology Stack & Architecture
Core Technologies & Architectural Decisions
- RAG Pipeline: LangChain orchestration with OpenAI `gpt-4o-mini` for generation and `text-embedding-ada-002` for semantic embeddings
- Vector Store: ChromaDB with persistent storage, load-or-build logic for instant restarts, and cosine-similarity retrieval
- Financial Data Engine: Yahoo Finance (`yfinance`) providing real-time stock prices, key metrics, and historical performance data
- Interactive Frontend: Streamlit 4-tab layout with Plotly interactive charts, radar comparisons, and chat-style Q&A interface
- Configuration Management: Hydra / OmegaConf for structured YAML config with `.env` secrets management via python-dotenv
- Production Deployment: Multi-stage Docker build, Docker Compose with health checks, NGINX reverse proxy with HTTPS
- Code Quality: Poetry packaging, Ruff linting/formatting, pytest (36+ tests), MkDocs + mkdocs-material documentation
RAG Architecture Innovation
- Intelligent Document Pipeline: Automated ZIP extraction → PDF parsing → recursive character splitting with tiktoken-based chunking (1000 tokens, 200 overlap)
- Persistent Vector Store: Load-or-build pattern that skips re-indexing on subsequent runs, reducing startup time from minutes to milliseconds
- LLM-as-Judge Evaluation: Automated groundedness and relevance scoring (1–5) with justification for every RAG response
- Composite Investment Ranking: Novel scoring algorithm blending quantitative financial metrics with LLM-assessed AI-strategy strength
- Context-Grounded Generation: System prompts enforce answers strictly from retrieved context, preventing hallucination on financial topics
- Multi-Company Analysis: Simultaneous tracking and comparison of GOOGL, MSFT, IBM, NVDA, and AMZN across both lenses
Interactive Dashboard & Features
Four Specialized Analysis Tabs
The Streamlit dashboard provides four distinct analytical perspectives, each combining data from multiple sources:
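The tab structure maps naturally onto Streamlit's `st.tabs` API. A minimal layout sketch is shown below; the tab labels come from the sections that follow, while the `st.header` placeholders stand in for the project's actual tab renderers:

```python
# Minimal Streamlit layout sketch; the per-tab content here is a placeholder,
# not the project's actual rendering code.
import streamlit as st

st.set_page_config(page_title="DualLens Analytics", layout="wide")

tab_dash, tab_qa, tab_compare, tab_rank = st.tabs(
    ["📊 Dashboard", "🤖 AI Q&A", "⚖️ Compare", "🏆 Rankings"]
)

with tab_dash:
    st.header("Financial Overview")     # price trends, key metrics, comparisons
with tab_qa:
    st.header("RAG-Powered Q&A")        # chat-style questions over the PDF knowledge base
with tab_compare:
    st.header("Side-by-Side Analysis")  # radar charts and AI-initiative summaries
with tab_rank:
    st.header("Composite Rankings")     # blended quantitative + qualitative scores
```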
📊 Dashboard — Financial Overview
- Stock Price Trends: Interactive Plotly time-series charts with configurable date ranges and multi-company overlays
- Key Financial Metrics: Real-time Market Cap, P/E Ratio, Dividend Yield, Beta, and Total Revenue displayed in sortable tables
- Visual Comparisons: Bar charts and radar diagrams for quick multi-company financial health assessment
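As a sketch of the financial-data lens feeding this tab, the listed metrics can be pulled with `yfinance` roughly as follows; the field names come from Yahoo's `info` payload and may change between yfinance versions:

```python
# Illustrative yfinance fetch for one ticker; not the project's actual data layer.
import yfinance as yf

ticker = yf.Ticker("GOOGL")

history = ticker.history(period="1y")  # daily OHLCV for the price-trend chart
info = ticker.info                     # key-metric snapshot

metrics = {
    "Market Cap": info.get("marketCap"),
    "P/E Ratio": info.get("trailingPE"),
    "Dividend Yield": info.get("dividendYield"),
    "Beta": info.get("beta"),
    "Total Revenue": info.get("totalRevenue"),
}
print(metrics)
```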
🤖 AI Q&A — RAG-Powered Intelligence
- Natural-Language Queries: Ask questions like "What is Google's approach to generative AI?" and receive context-grounded answers
- Source Transparency: Every answer shows the retrieved PDF excerpts used for generation, enabling verification
- Evaluation Scores: Automatic groundedness and relevance ratings via LLM-as-Judge for answer quality assurance
⚖️ Compare — Side-by-Side Analysis
- Financial Radar Charts: Multi-dimensional comparison of two companies across key financial metrics
- AI-Initiative Summaries: LLM-generated comparative summaries of each company's AI strategy and investments
- Dual-Lens Verdict: Combined quantitative + qualitative assessment for informed decision-making
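The radar comparison can be drawn with Plotly's `Scatterpolar` traces. A minimal two-company sketch with made-up, pre-normalized scores (not real data):

```python
# Radar-chart sketch with illustrative normalized scores in [0, 1].
import plotly.graph_objects as go

axes = ["Market Cap", "P/E", "Dividend Yield", "Beta", "Revenue Growth"]

fig = go.Figure()
fig.add_trace(go.Scatterpolar(r=[0.9, 0.6, 0.1, 0.5, 0.7], theta=axes,
                              fill="toself", name="GOOGL"))
fig.add_trace(go.Scatterpolar(r=[0.8, 0.7, 0.3, 0.4, 0.6], theta=axes,
                              fill="toself", name="MSFT"))
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])), showlegend=True)
fig.show()  # in the dashboard this would be st.plotly_chart(fig)
```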
🏆 Rankings — Composite Scoring
- Blended Investment Score: Proprietary algorithm combining financial health metrics with AI-strategy assessment
- LLM-Powered Justification: Each ranking includes a detailed explanation of scoring factors
- Dynamic Re-ranking: Scores update with latest financial data on every session
Technical Excellence & Engineering
RAG Pipeline Architecture
The core intelligence layer implements a five-stage pipeline optimized for financial document analysis:
- Document Ingestion: Automated extraction of company AI-initiative PDF reports from ZIP archive with intelligent directory discovery
- Chunking Strategy: `RecursiveCharacterTextSplitter` with tiktoken (cl100k_base) encoding — 1000-token chunks with 200-token overlap preserving cross-boundary context
- Embedding & Indexing: OpenAI `text-embedding-ada-002` vectors stored in ChromaDB with persistent SQLite-backed storage at `data/chroma_db/`
- Similarity Retrieval: Cosine-similarity search returning top-k (configurable, default 10) most relevant document chunks per query
- Grounded Generation: Context-injected prompts sent to `gpt-4o-mini` with strict instructions to answer only from retrieved passages
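Under the assumptions above (LangChain with OpenAI models and a persistent ChromaDB store), the five stages might look roughly like this. Module paths assume recent langchain-community, langchain-openai, and langchain-text-splitters releases, and the file paths are hypothetical rather than taken from the project source:

```python
# Illustrative end-to-end sketch of the five pipeline stages described above.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

question = "What is Google's approach to generative AI?"

# 1. Ingestion: load one extracted report (the project walks a whole directory).
docs = PyPDFLoader("data/reports/googl_ai_initiatives.pdf").load()

# 2. Chunking: token-aware splits, 1000 tokens with 200-token overlap.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(docs)

# 3. Embedding & indexing into a persistent ChromaDB collection.
store = Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-ada-002"),
    persist_directory="data/chroma_db",
)

# 4. Retrieval: similarity search returning the top-k relevant chunks.
retrieved = store.as_retriever(search_kwargs={"k": 10}).invoke(question)

# 5. Grounded generation: answer strictly from the retrieved passages.
context = "\n\n".join(doc.page_content for doc in retrieved)
prompt = (
    "Answer only from the context below; if the context does not contain the "
    f"answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
)
answer = ChatOpenAI(model="gpt-4o-mini", temperature=0.0).invoke(prompt).content
print(answer)
```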
Production-Grade Infrastructure
- Multi-Stage Docker Build: Poetry dependency installation in builder stage, slim Python 3.12 runtime image (~1.2 GB with ML libraries)
- Docker Compose Orchestration: Single-command deployment with `.env` injection, named volumes for data persistence, and container health checks
- NGINX Reverse Proxy: WebSocket-aware proxy configuration with Let's Encrypt HTTPS for production domains
- CI/CD Pipeline: GitHub Actions workflow for automated MkDocs build and GitHub Pages deployment on every push to `main`
- Structured Configuration: Hydra/OmegaConf YAML config with CLI overrides for LLM parameters, chunking, retrieval, and company selection
Code Quality & Testing
- 36+ Automated Tests: Comprehensive pytest suite covering config loading, data processing, evaluation logic, financial calculations, and report generation
- Ruff Lint & Format: Zero-warning codebase with enforced import sorting, type annotation checks, and best-practice rules
- Google-Style Docstrings: Complete API documentation on all public functions and classes, auto-generated into MkDocs via mkdocstrings
- Makefile Automation: 11+ developer workflow commands — `make check` runs lint, format-check, and tests in a single pass
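For flavor, a small parametrized pytest case in the spirit of the financial-calculation tests described above; the `pe_ratio` helper is a hypothetical stand-in, not the project's actual API:

```python
# Illustrative pytest case; pe_ratio() is a hypothetical helper for this sketch.
import pytest

def pe_ratio(price: float, eps: float) -> float | None:
    """Trailing P/E; undefined (None) for non-positive earnings."""
    return None if eps <= 0 else price / eps

@pytest.mark.parametrize(
    ("price", "eps", "expected"),
    [(150.0, 5.0, 30.0), (150.0, 0.0, None), (150.0, -2.0, None)],
)
def test_pe_ratio(price, eps, expected):
    assert pe_ratio(price, eps) == expected
```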
Flexible Configuration & Tuning
Every aspect of the pipeline is configurable through Hydra YAML, enabling rapid experimentation without code changes:
| Parameter | Config Key | Default | Effect |
|---|---|---|---|
| Chunk size | chunking.chunk_size | 1000 | Larger = more context per chunk |
| Chunk overlap | chunking.chunk_overlap | 200 | Higher = less info loss at boundaries |
| Top-k results | retriever.k | 10 | More docs = richer context but noisier |
| Temperature | llm.temperature | 0.0 | Lower = more deterministic answers |
| Max tokens | llm.max_tokens | 5000 | Cap on answer length |
| LLM model | llm.model | gpt-4o-mini | Swap to gpt-4o, Claude, etc. |
| Companies | companies | GOOGL, MSFT, IBM, NVDA, AMZN | Add or remove tickers |
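The same defaults can be expressed and overridden with OmegaConf. A minimal sketch follows; the project's actual YAML layout and Hydra entry point may differ, and in Hydra the dotted overrides would come straight from the command line:

```python
# Config sketch mirroring the defaults in the table above (layout is assumed).
from omegaconf import OmegaConf

base = OmegaConf.create("""
llm:
  model: gpt-4o-mini
  temperature: 0.0
  max_tokens: 5000
chunking:
  chunk_size: 1000
  chunk_overlap: 200
retriever:
  k: 10
companies: [GOOGL, MSFT, IBM, NVDA, AMZN]
""")

# Hydra-style overrides, e.g. `python app.py retriever.k=5 llm.temperature=0.2`.
overrides = OmegaConf.from_dotlist(["retriever.k=5", "llm.temperature=0.2"])
cfg = OmegaConf.merge(base, overrides)

print(cfg.retriever.k, cfg.llm.temperature)  # 5 0.2
```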
Engineering Challenges & Solutions
Persistent Vector Store with Load-or-Build Pattern
Challenge: Re-embedding hundreds of PDF pages on every application restart wasted time and API tokens. Solution: Engineered a collection_exists() check that detects an existing ChromaDB collection and loads it instantly, only triggering the full embed pipeline on first run or after explicit cache invalidation.
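A minimal sketch of the pattern using the chromadb client directly; the collection name and the way chunks arrive are illustrative, not the project's actual code:

```python
# Load-or-build sketch against a persistent ChromaDB store.
import os
import chromadb
from chromadb.utils import embedding_functions

PERSIST_DIR = "data/chroma_db"
COLLECTION = "ai_initiative_reports"  # illustrative collection name

client = chromadb.PersistentClient(path=PERSIST_DIR)
embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-ada-002"
)

def collection_exists(name: str) -> bool:
    """True if a persisted collection with this name is already on disk."""
    try:
        client.get_collection(name)
        return True
    except Exception:
        return False

if collection_exists(COLLECTION):
    # Fast path: reuse the persisted index (milliseconds instead of minutes).
    store = client.get_collection(COLLECTION, embedding_function=embed_fn)
else:
    # Slow path: run the full chunk -> embed -> index pipeline once, then persist.
    store = client.create_collection(COLLECTION, embedding_function=embed_fn)
    chunks = ["...chunked PDF text..."]  # produced by the document pipeline
    store.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)
```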
Dual-Lens Composite Scoring
Challenge: Combining fundamentally different data types — numerical financial metrics and unstructured AI-strategy text — into a single investment ranking. Solution: Designed a hybrid scoring algorithm where financial metrics are normalized and weighted, then blended with LLM-assessed AI-strategy scores to produce a unified ranking with transparent justification.
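One way such a blend can work, as a hedged sketch: min-max normalize each financial metric, average them, and mix the result with an LLM-assessed AI-strategy score on a fixed weighting. All weights, bounds, and inputs below are illustrative, not the project's actual formula:

```python
# Composite-score sketch; weights and inputs are illustrative only.
def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalize into [0, 1]; degenerate ranges map to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def composite_score(financial: dict[str, float],
                    bounds: dict[str, tuple[float, float]],
                    ai_strategy_score: float,  # LLM-assessed, already on [0, 1]
                    financial_weight: float = 0.6) -> float:
    """Blend averaged, normalized financial metrics with an AI-strategy score."""
    norm = [normalize(v, *bounds[k]) for k, v in financial.items()]
    financial_score = sum(norm) / len(norm)
    return financial_weight * financial_score + (1 - financial_weight) * ai_strategy_score

score = composite_score(
    financial={"market_cap": 2.1e12, "pe_ratio": 28.0, "revenue_growth": 0.11},
    bounds={"market_cap": (1e11, 3e12), "pe_ratio": (10.0, 60.0),
            "revenue_growth": (0.0, 0.3)},
    ai_strategy_score=0.8,
)
print(round(score, 3))
```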
Grounded Generation & Hallucination Prevention
Challenge: LLMs naturally generate plausible-sounding but fabricated financial data. Solution: Implemented strict context-grounding via system prompts that instruct the model to answer only from retrieved passages, paired with an LLM-as-Judge evaluation layer that scores groundedness and relevance for every response.
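A minimal sketch of such a judge call; the rubric wording, JSON shape, and parsing are assumptions rather than the project's actual prompts, and production code would enforce structured output more strictly:

```python
# LLM-as-Judge sketch: score groundedness and relevance of a RAG answer on 1-5.
import json
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

def evaluate(question: str, context: str, answer: str) -> dict:
    prompt = (
        "You are a strict evaluator. Rate the ANSWER on two 1-5 scales:\n"
        "- groundedness: is every claim supported by the CONTEXT?\n"
        "- relevance: does it address the QUESTION?\n"
        'Reply as JSON: {"groundedness": int, "relevance": int, '
        '"justification": str}.\n\n'
        f"QUESTION: {question}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    return json.loads(judge.invoke(prompt).content)

print(evaluate(
    "What is Google's approach to generative AI?",
    "Excerpt from the retrieved PDF chunks...",
    "Google focuses on integrating generative models across its product suite.",
))
```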
Production Docker Deployment
Challenge: Complex dependency tree (tiktoken, ChromaDB native libraries, PyPDF) requiring careful build management. Solution: Multi-stage Docker build with Poetry dependency resolution in an isolated builder stage, producing a clean slim runtime image with named volumes for persistent data across container restarts.
Scalability Vision & Future Roadmap
Planned Enhancements
- Extended Data Sources: Integration with SEC EDGAR filings, earnings call transcripts, and analyst reports for deeper qualitative analysis
- Real-Time Streaming: WebSocket-based live price feeds replacing periodic Yahoo Finance polling for intraday analysis
- Multi-Modal RAG: Chart and table extraction from PDF reports using vision models for richer document understanding
- User Portfolio Tracking: Personalized watchlists with alerting when AI-strategy scores change significantly
- Advanced Evaluation: A/B testing framework for prompt variations with automated metric tracking across retrieval strategies
- Federated Knowledge Base: Multi-tenant vector store supporting custom document uploads per user or organization
Impact & Technical Excellence
DualLens Analytics demonstrates end-to-end AI engineering mastery — from RAG pipeline design and LLM orchestration to production Docker deployment and automated documentation. The project bridges the gap between quantitative financial analysis and qualitative AI-strategy assessment, delivering a tool that provides genuinely novel insights unavailable from either data source alone.
The system showcases production-grade software engineering practices: comprehensive test coverage, type-safe configuration, structured logging, CI/CD automation, and security-conscious secrets management. Every architectural decision — from the persistent vector store to the multi-stage Docker build — reflects real-world deployment considerations for enterprise AI applications.
Key Technical Metrics: 36+ automated tests, zero lint warnings, sub-second vector store loads on subsequent runs, configurable pipeline with 7+ tunable parameters, and single-command Docker deployment with health monitoring. This project exemplifies the intersection of financial domain expertise and modern AI engineering.
Technical Resources & Documentation
Comprehensive technical implementation showcasing production-grade RAG architecture and investment analysis system design.