DualLens Analytics — RAG-Powered Investment Analysis

DualLens Analytics - RAG-Powered Investment Analysis by Ali Zarreh

Explore DualLens Analytics, a production-grade investment analysis platform that merges two complementary lenses — quantitative financial metrics and qualitative AI-strategy insights — into a single interactive dashboard. By combining real-time Yahoo Finance data with a Retrieval-Augmented Generation (RAG) pipeline grounded in company PDF reports, the system enables investors and analysts to make more informed assessments of a company's financial health and AI readiness.

Project Overview

DualLens Analytics delivers a dual-lens view of public companies by fusing two data streams: real-time stock prices and financial metrics from Yahoo Finance, and AI-strategy insights extracted from curated company PDF reports through a RAG pipeline powered by LangChain and ChromaDB. The result is an interactive four-tab Streamlit dashboard where users can explore financial trends, ask natural-language questions about AI initiatives, compare companies side-by-side, and view composite investment rankings — all grounded in a persisted vector knowledge base with 36+ automated tests and Docker-ready deployment.

Technology Stack & Architecture

Core Technologies & Architectural Decisions

  • RAG Pipeline: LangChain orchestration with OpenAI gpt-4o-mini for generation and text-embedding-ada-002 for semantic embeddings
  • Vector Store: ChromaDB with persistent storage, load-or-build logic for instant restarts, and cosine-similarity retrieval
  • Financial Data Engine: Yahoo Finance (yfinance) providing real-time stock prices, key metrics, and historical performance data
  • Interactive Frontend: Streamlit 4-tab layout with Plotly interactive charts, radar comparisons, and chat-style Q&A interface
  • Configuration Management: Hydra / OmegaConf for structured YAML config with .env secrets management via python-dotenv
  • Production Deployment: Multi-stage Docker build, Docker Compose with health checks, NGINX reverse proxy with HTTPS
  • Code Quality: Poetry packaging, Ruff linting/formatting, pytest (36+ tests), MkDocs + mkdocs-material documentation

RAG Architecture Innovation

  • Intelligent Document Pipeline: Automated ZIP extraction → PDF parsing → recursive character splitting with tiktoken-based chunking (1000 tokens, 200 overlap)
  • Persistent Vector Store: Load-or-build pattern that skips re-indexing on subsequent runs, reducing startup time from minutes to milliseconds
  • LLM-as-Judge Evaluation: Automated groundedness and relevance scoring (1–5) with justification for every RAG response
  • Composite Investment Ranking: Novel scoring algorithm blending quantitative financial metrics with LLM-assessed AI-strategy strength
  • Context-Grounded Generation: System prompts enforce answers strictly from retrieved context, preventing hallucination on financial topics
  • Multi-Company Analysis: Simultaneous tracking and comparison of GOOGL, MSFT, IBM, NVDA, and AMZN across both lenses

Interactive Dashboard & Features

Four Specialized Analysis Tabs

The Streamlit dashboard provides four distinct analytical perspectives, each combining data from multiple sources:

📊 Dashboard — Financial Overview

  • Stock Price Trends: Interactive Plotly time-series charts with configurable date ranges and multi-company overlays
  • Key Financial Metrics: Real-time Market Cap, P/E Ratio, Dividend Yield, Beta, and Total Revenue displayed in sortable tables
  • Visual Comparisons: Bar charts and radar diagrams for quick multi-company financial health assessment

🤖 AI Q&A — RAG-Powered Intelligence

  • Natural-Language Queries: Ask questions like "What is Google's approach to generative AI?" and receive context-grounded answers
  • Source Transparency: Every answer shows the retrieved PDF excerpts used for generation, enabling verification
  • Evaluation Scores: Automatic groundedness and relevance ratings via LLM-as-Judge for answer quality assurance

⚖️ Compare — Side-by-Side Analysis

  • Financial Radar Charts: Multi-dimensional comparison of two companies across key financial metrics
  • AI-Initiative Summaries: LLM-generated comparative summaries of each company's AI strategy and investments
  • Dual-Lens Verdict: Combined quantitative + qualitative assessment for informed decision-making

🏆 Rankings — Composite Scoring

  • Blended Investment Score: Proprietary algorithm combining financial health metrics with AI-strategy assessment
  • LLM-Powered Justification: Each ranking includes a detailed explanation of scoring factors
  • Dynamic Re-ranking: Scores update with latest financial data on every session

Technical Excellence & Engineering

RAG Pipeline Architecture

The core intelligence layer implements a five-stage pipeline optimized for financial document analysis:

  • Document Ingestion: Automated extraction of company AI-initiative PDF reports from ZIP archive with intelligent directory discovery
  • Chunking Strategy: RecursiveCharacterTextSplitter with tiktoken (cl100k_base) encoding — 1000-token chunks with 200-token overlap preserving cross-boundary context
  • Embedding & Indexing: OpenAI text-embedding-ada-002 vectors stored in ChromaDB with persistent SQLite-backed storage at data/chroma_db/
  • Similarity Retrieval: Cosine-similarity search returning top-k (configurable, default 10) most relevant document chunks per query
  • Grounded Generation: Context-injected prompts sent to gpt-4o-mini with strict instructions to answer only from retrieved passages

Production-Grade Infrastructure

  • Multi-Stage Docker Build: Poetry dependency installation in builder stage, slim Python 3.12 runtime image (~1.2 GB with ML libraries)
  • Docker Compose Orchestration: Single-command deployment with .env injection, named volumes for data persistence, and container health checks
  • NGINX Reverse Proxy: WebSocket-aware proxy configuration with Let's Encrypt HTTPS for production domains
  • CI/CD Pipeline: GitHub Actions workflow for automated MkDocs build and GitHub Pages deployment on every push to main
  • Structured Configuration: Hydra/OmegaConf YAML config with CLI overrides for LLM parameters, chunking, retrieval, and company selection

Code Quality & Testing

  • 36+ Automated Tests: Comprehensive pytest suite covering config loading, data processing, evaluation logic, financial calculations, and report generation
  • Ruff Lint & Format: Zero-warning codebase with enforced import sorting, type annotation checks, and best-practice rules
  • Google-Style Docstrings: Complete API documentation on all public functions and classes, auto-generated into MkDocs via mkdocstrings
  • Makefile Automation: 11+ developer workflow commands — make check runs lint, format-check, and tests in a single pass

Flexible Configuration & Tuning

Every aspect of the pipeline is configurable through Hydra YAML, enabling rapid experimentation without code changes:

ParameterConfig KeyDefaultEffect
Chunk sizechunking.chunk_size1000Larger = more context per chunk
Chunk overlapchunking.chunk_overlap200Higher = less info loss at boundaries
Top-k resultsretriever.k10More docs = richer context but noisier
Temperaturellm.temperature0.0Lower = more deterministic answers
Max tokensllm.max_tokens5000Cap on answer length
LLM modelllm.modelgpt-4o-miniSwap to gpt-4o, Claude, etc.
CompaniescompaniesGOOGL, MSFT, IBM, NVDA, AMZNAdd or remove tickers

Engineering Challenges & Solutions

Persistent Vector Store with Load-or-Build Pattern

Challenge: Re-embedding hundreds of PDF pages on every application restart wasted time and API tokens. Solution: Engineered a collection_exists() check that detects an existing ChromaDB collection and loads it instantly, only triggering the full embed pipeline on first run or after explicit cache invalidation.

Dual-Lens Composite Scoring

Challenge: Combining fundamentally different data types — numerical financial metrics and unstructured AI-strategy text — into a single investment ranking. Solution: Designed a hybrid scoring algorithm where financial metrics are normalized and weighted, then blended with LLM-assessed AI-strategy scores to produce a unified ranking with transparent justification.

Grounded Generation & Hallucination Prevention

Challenge: LLMs naturally generate plausible-sounding but fabricated financial data. Solution: Implemented strict context-grounding via system prompts that instruct the model to answer only from retrieved passages, paired with an LLM-as-Judge evaluation layer that scores groundedness and relevance for every response.

Production Docker Deployment

Challenge: Complex dependency tree (tiktoken, ChromaDB native libraries, PyPDF) requiring careful build management. Solution: Multi-stage Docker build with Poetry dependency resolution in an isolated builder stage, producing a clean slim runtime image with named volumes for persistent data across container restarts.

Scalability Vision & Future Roadmap

Planned Enhancements

  • Extended Data Sources: Integration with SEC EDGAR filings, earnings call transcripts, and analyst reports for deeper qualitative analysis
  • Real-Time Streaming: WebSocket-based live price feeds replacing periodic Yahoo Finance polling for intraday analysis
  • Multi-Modal RAG: Chart and table extraction from PDF reports using vision models for richer document understanding
  • User Portfolio Tracking: Personalized watchlists with alerting when AI-strategy scores change significantly
  • Advanced Evaluation: A/B testing framework for prompt variations with automated metric tracking across retrieval strategies
  • Federated Knowledge Base: Multi-tenant vector store supporting custom document uploads per user or organization

Impact & Technical Excellence

DualLens Analytics demonstrates end-to-end AI engineering mastery — from RAG pipeline design and LLM orchestration to production Docker deployment and automated documentation. The project bridges the gap between quantitative financial analysis and qualitative AI-strategy assessment, delivering a tool that provides genuinely novel insights unavailable from either data source alone.

The system showcases production-grade software engineering practices: comprehensive test coverage, type-safe configuration, structured logging, CI/CD automation, and security-conscious secrets management. Every architectural decision — from the persistent vector store to the multi-stage Docker build — reflects real-world deployment considerations for enterprise AI applications.

Key Technical Metrics: 36+ automated tests, zero lint warnings, sub-second vector store loads on subsequent runs, configurable pipeline with 7+ tunable parameters, and single-command Docker deployment with health monitoring. This project exemplifies the intersection of financial domain expertise and modern AI engineering.

Technical Resources & Documentation

Comprehensive technical implementation showcasing production-grade RAG architecture and investment analysis system design:

Live Demo View Source Code Technical Documentation