DualLens Analytics — RAG-Powered Investment Analysis

Explore DualLens Analytics, a production-grade investment analysis platform that merges two complementary lenses — quantitative financial metrics and qualitative AI-strategy insights — into a single interactive dashboard. By combining real-time Yahoo Finance data with a Retrieval-Augmented Generation (RAG) pipeline grounded in company PDF reports, the system enables investors and analysts to make more informed assessments of a company's financial health and AI readiness.

Project Overview

DualLens Analytics delivers a dual-lens view of public companies by fusing two data streams: real-time stock prices and financial metrics from Yahoo Finance, and AI-strategy insights extracted from curated company PDF reports through a RAG pipeline powered by LangChain and ChromaDB. The result is an interactive four-tab Streamlit dashboard where users can explore financial trends, ask natural-language questions about AI initiatives, compare companies side-by-side, and view composite investment rankings — all grounded in a persisted vector knowledge base with 36+ automated tests and Docker-ready deployment.

Technology Stack & Architecture

Core Technologies & Architectural Decisions

RAG Pipeline: LangChain orchestration with OpenAI gpt-4o-mini for generation and text-embedding-ada-002 for semantic embeddings
Vector Store: ChromaDB with persistent storage, load-or-build logic for instant restarts, and cosine-similarity retrieval
Financial Data Engine: Yahoo Finance (yfinance) providing real-time stock prices, key metrics, and historical performance data
Interactive Frontend: Streamlit 4-tab layout with Plotly interactive charts, radar comparisons, and chat-style Q&A interface
Configuration Management: Hydra / OmegaConf for structured YAML config with .env secrets management via python-dotenv
Production Deployment: Multi-stage Docker build, Docker Compose with health checks, NGINX reverse proxy with HTTPS
Code Quality: Poetry packaging, Ruff linting/formatting, pytest (36+ tests), MkDocs + mkdocs-material documentation

RAG Architecture Innovation

Intelligent Document Pipeline: Automated ZIP extraction → PDF parsing → recursive character splitting with tiktoken-based chunking (1000 tokens, 200 overlap)
Persistent Vector Store: Load-or-build pattern that skips re-indexing on subsequent runs, reducing startup time from minutes to milliseconds
LLM-as-Judge Evaluation: Automated groundedness and relevance scoring (1–5) with justification for every RAG response
Composite Investment Ranking: Novel scoring algorithm blending quantitative financial metrics with LLM-assessed AI-strategy strength
Context-Grounded Generation: System prompts enforce answers strictly from retrieved context, preventing hallucination on financial topics
Multi-Company Analysis: Simultaneous tracking and comparison of GOOGL, MSFT, IBM, NVDA, and AMZN across both lenses

Interactive Dashboard & Features

Four Specialized Analysis Tabs

The Streamlit dashboard provides four distinct analytical perspectives, each combining data from multiple sources:

📊 Dashboard — Financial Overview

Stock Price Trends: Interactive Plotly time-series charts with configurable date ranges and multi-company overlays
Key Financial Metrics: Real-time Market Cap, P/E Ratio, Dividend Yield, Beta, and Total Revenue displayed in sortable tables
Visual Comparisons: Bar charts and radar diagrams for quick multi-company financial health assessment

🤖 AI Q&A — RAG-Powered Intelligence

Natural-Language Queries: Ask questions like "What is Google's approach to generative AI?" and receive context-grounded answers
Source Transparency: Every answer shows the retrieved PDF excerpts used for generation, enabling verification
Evaluation Scores: Automatic groundedness and relevance ratings via LLM-as-Judge for answer quality assurance

⚖️ Compare — Side-by-Side Analysis

Financial Radar Charts: Multi-dimensional comparison of two companies across key financial metrics
AI-Initiative Summaries: LLM-generated comparative summaries of each company's AI strategy and investments
Dual-Lens Verdict: Combined quantitative + qualitative assessment for informed decision-making

🏆 Rankings — Composite Scoring

Blended Investment Score: Proprietary algorithm combining financial health metrics with AI-strategy assessment
LLM-Powered Justification: Each ranking includes a detailed explanation of scoring factors
Dynamic Re-ranking: Scores update with latest financial data on every session

Technical Excellence & Engineering

RAG Pipeline Architecture

The core intelligence layer implements a five-stage pipeline optimized for financial document analysis:

Document Ingestion: Automated extraction of company AI-initiative PDF reports from ZIP archive with intelligent directory discovery
Chunking Strategy: RecursiveCharacterTextSplitter with tiktoken (cl100k_base) encoding — 1000-token chunks with 200-token overlap preserving cross-boundary context
Embedding & Indexing: OpenAI text-embedding-ada-002 vectors stored in ChromaDB with persistent SQLite-backed storage at data/chroma_db/
Similarity Retrieval: Cosine-similarity search returning top-k (configurable, default 10) most relevant document chunks per query
Grounded Generation: Context-injected prompts sent to gpt-4o-mini with strict instructions to answer only from retrieved passages

Production-Grade Infrastructure

Multi-Stage Docker Build: Poetry dependency installation in builder stage, slim Python 3.12 runtime image (~1.2 GB with ML libraries)
Docker Compose Orchestration: Single-command deployment with .env injection, named volumes for data persistence, and container health checks
NGINX Reverse Proxy: WebSocket-aware proxy configuration with Let's Encrypt HTTPS for production domains
CI/CD Pipeline: GitHub Actions workflow for automated MkDocs build and GitHub Pages deployment on every push to main
Structured Configuration: Hydra/OmegaConf YAML config with CLI overrides for LLM parameters, chunking, retrieval, and company selection

Code Quality & Testing

36+ Automated Tests: Comprehensive pytest suite covering config loading, data processing, evaluation logic, financial calculations, and report generation
Ruff Lint & Format: Zero-warning codebase with enforced import sorting, type annotation checks, and best-practice rules
Google-Style Docstrings: Complete API documentation on all public functions and classes, auto-generated into MkDocs via mkdocstrings
Makefile Automation: 11+ developer workflow commands — make check runs lint, format-check, and tests in a single pass

Flexible Configuration & Tuning

Every aspect of the pipeline is configurable through Hydra YAML, enabling rapid experimentation without code changes:

Parameter	Config Key	Default	Effect
Chunk size	`chunking.chunk_size`	1000	Larger = more context per chunk
Chunk overlap	`chunking.chunk_overlap`	200	Higher = less info loss at boundaries
Top-k results	`retriever.k`	10	More docs = richer context but noisier
Temperature	`llm.temperature`	0.0	Lower = more deterministic answers
Max tokens	`llm.max_tokens`	5000	Cap on answer length
LLM model	`llm.model`	gpt-4o-mini	Swap to gpt-4o, Claude, etc.
Companies	`companies`	GOOGL, MSFT, IBM, NVDA, AMZN	Add or remove tickers

Engineering Challenges & Solutions

Persistent Vector Store with Load-or-Build Pattern

Challenge: Re-embedding hundreds of PDF pages on every application restart wasted time and API tokens. Solution: Engineered a collection_exists() check that detects an existing ChromaDB collection and loads it instantly, only triggering the full embed pipeline on first run or after explicit cache invalidation.

Dual-Lens Composite Scoring

Challenge: Combining fundamentally different data types — numerical financial metrics and unstructured AI-strategy text — into a single investment ranking. Solution: Designed a hybrid scoring algorithm where financial metrics are normalized and weighted, then blended with LLM-assessed AI-strategy scores to produce a unified ranking with transparent justification.

Grounded Generation & Hallucination Prevention

Challenge: LLMs naturally generate plausible-sounding but fabricated financial data. Solution: Implemented strict context-grounding via system prompts that instruct the model to answer only from retrieved passages, paired with an LLM-as-Judge evaluation layer that scores groundedness and relevance for every response.

Production Docker Deployment

Challenge: Complex dependency tree (tiktoken, ChromaDB native libraries, PyPDF) requiring careful build management. Solution: Multi-stage Docker build with Poetry dependency resolution in an isolated builder stage, producing a clean slim runtime image with named volumes for persistent data across container restarts.

Scalability Vision & Future Roadmap

Planned Enhancements

Extended Data Sources: Integration with SEC EDGAR filings, earnings call transcripts, and analyst reports for deeper qualitative analysis
Real-Time Streaming: WebSocket-based live price feeds replacing periodic Yahoo Finance polling for intraday analysis
Multi-Modal RAG: Chart and table extraction from PDF reports using vision models for richer document understanding
User Portfolio Tracking: Personalized watchlists with alerting when AI-strategy scores change significantly
Advanced Evaluation: A/B testing framework for prompt variations with automated metric tracking across retrieval strategies
Federated Knowledge Base: Multi-tenant vector store supporting custom document uploads per user or organization

Impact & Technical Excellence

DualLens Analytics demonstrates end-to-end AI engineering mastery — from RAG pipeline design and LLM orchestration to production Docker deployment and automated documentation. The project bridges the gap between quantitative financial analysis and qualitative AI-strategy assessment, delivering a tool that provides genuinely novel insights unavailable from either data source alone.

The system showcases production-grade software engineering practices: comprehensive test coverage, type-safe configuration, structured logging, CI/CD automation, and security-conscious secrets management. Every architectural decision — from the persistent vector store to the multi-stage Docker build — reflects real-world deployment considerations for enterprise AI applications.

Key Technical Metrics: 36+ automated tests, zero lint warnings, sub-second vector store loads on subsequent runs, configurable pipeline with 7+ tunable parameters, and single-command Docker deployment with health monitoring. This project exemplifies the intersection of financial domain expertise and modern AI engineering.

Technical Resources & Documentation

Comprehensive technical implementation showcasing production-grade RAG architecture and investment analysis system design:

Live Demo View Source Code Technical Documentation