AutoTrader

An ML-Powered Stock Prediction & Trading System

Personal Project · 2024–Present · Live in Production

MONEYSIGNALS.US → | DAILY PREDICTIONS →

The Story

AutoTrader started the way a lot of side projects do: with a question I couldn't let go of. I'd been working with recommendation systems at my day job, and it struck me that the core problem—predicting what a person will want next based on noisy, incomplete signals—isn't that different from predicting where a stock will move next based on noisy, incomplete market data.

So I started small. A few tickers, a basic feature set, a model that ran on my M3 laptop. But as I dug in, the scope grew naturally. A single ticker needed multi-timeframe analysis. Multi-timeframe analysis needed richer features. Richer features needed a real data pipeline. A real data pipeline needed cloud infrastructure. And before long, I was building a system that ingests data for 600+ tickers every night, engineers 500+ features from eight distinct sources, trains over 1,800 models, and delivers ranked predictions to subscribers before the opening bell.

Every component was designed and built by me from scratch. It runs autonomously on a multi-cloud setup (GCP + Azure) for about $235/month, and it's become the most technically satisfying project I've worked on—a place where I get to combine ML modeling, data engineering, infrastructure design, and product thinking all in one system.

1,800+ Trained Models
500+ Engineered Features
600+ Tickers Covered
~$235/mo Total Infrastructure Cost

System Architecture

The system is split across two GCP virtual machines, coordinated through Google Cloud Storage and PostgreSQL. The separation isn't arbitrary: feature engineering is I/O-bound (lots of API calls and database writes), while model training is CPU-bound (lots of number crunching). Putting them on different VMs means I can right-size each machine's resources without overpaying for either workload.

Everything is orchestrated by cron jobs that hand off data downstream in sequence. There are no manual steps in the daily workflow—from raw market data to delivered email predictions, the system runs end-to-end without intervention.

Daily Workflow (EST)

12:10 AM
Data Collection
VM2 pulls OHLCV data for 600+ tickers via the EODHD API, computes 500+ features per ticker, and writes everything to PostgreSQL and GCS. Takes about 45–60 minutes.
3:00 AM (Sat)
Comprehensive Training
VM3 retrains all ~1,800 models with Optuna hyperparameter optimization and walk-forward validation. Incremental training (100–200 models) runs on weekdays.
5:00 AM
Inference
VM3 loads every active model and generates predictions for the upcoming trading day. Each prediction combines a directional call with a magnitude estimate and a confidence score.
5:30 AM
Email Delivery
Tiered emails go out to subscribers with ranked predictions, market sentiment context, and analysis reports—all before the 9:30 AM open.

Infrastructure

VM2: Data & Execution

2 vCPU, 8 GB RAM · ~$49/mo


Data collection & features
Inference & email delivery
Trading execution (Alpaca)
LLM event signals (Claude)

VM3: Training (GCP)

4 vCPU, 32 GB RAM · ~$50–80/mo


Dual model training (XGBoost)
Optuna hyperparameter search
Preemptible (auto-recovery)
GCS model sync

PostgreSQL + GCS

2 vCPU, 8 GB · ~$52/mo


TimescaleDB (market data)
Model registry & predictions
GCS model artifact storage
PgBouncer connection pooling

Azure: Parallel Training

Parallel training node


Morning/evening/night sessions
Weekly & monthly models
Distributed lock coordination
SSH tunnel to GCP PostgreSQL

Data flow: EODHD API ($80/mo) → VM2 (collect & engineer) → PostgreSQL (store) → VM3 + Azure (train) → GCS (models) → VM2 (predict & trade) → Subscribers (email)

Pipeline Details


Data Collection & Feature Engineering

This is where the raw ingredients come from. Every night, the pipeline pulls fresh market data from the EODHD API for all S&P 500 constituents plus 184 ETFs, then transforms that data into a rich set of 500+ engineered features spanning technical, fundamental, sentiment, behavioral, and alternative data dimensions.

How It Works

  • Ingest OHLCV market data across multiple timeframes for all tracked tickers
  • Prioritize high-liquidity names to ensure freshest data for major positions
  • Store to PostgreSQL with GCS redundancy
  • Run the feature computation pipeline to generate 500+ features per ticker across eight signal-family categories
  • Collect multi-source sentiment and alternative data signals
  • Compute proprietary training labels designed to capture directional intent rather than simple close-to-close returns
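To give a flavor of what the feature pipeline computes, here is a minimal sketch of three common technical features in plain Python. The names, window, and formulas are illustrative stand-ins, not the production feature set:

```python
import statistics

def compute_basic_features(closes, window=20):
    """Illustrative slice of a feature pipeline: momentum, volatility,
    and SMA gap from a list of daily closes. Hypothetical names; the
    real pipeline computes 500+ features across eight families."""
    if len(closes) < window + 1:
        raise ValueError("not enough history for the requested window")
    # Daily simple returns from consecutive closes
    returns = [b / a - 1.0 for a, b in zip(closes[:-1], closes[1:])]
    sma = sum(closes[-window:]) / window
    return {
        "momentum_20d": closes[-1] / closes[-window] - 1.0,    # % move over window
        "volatility_20d": statistics.stdev(returns[-window:]), # stdev of daily returns
        "sma_gap_20d": closes[-1] / sma - 1.0,                 # distance from moving average
    }
```

In production, features like these would be computed per ticker per timeframe and written to PostgreSQL alongside the raw OHLCV rows.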

Feature Sources (8 Families, 500+ Features)

Technical Indicators
Trend, momentum, volatility, volume, and pattern-based signals across multiple timeframes
Fundamental Data
Valuation metrics, earnings estimates, and corporate event signals
Cross-Asset & Sector Signals
Inter-market relationships, sector rotation dynamics, and risk regime indicators
Price Microstructure
Higher-order derivatives of price dynamics and structural pattern recognition
Behavioral Economics
Cognitive bias indicators: anchoring, disposition effect, herding intensity, and loss aversion asymmetries
Alternative Data & Sentiment
Multi-source sentiment aggregation, social trend analysis, and event-driven signals with lagged impact modeling
Statistical & Regime Features
Mean reversion signals, market phase detection, and volatility regime classification
Proprietary Composite Signals
Calibrated multi-factor combinations derived from ongoing research into market microstructure

Model Training

The core insight behind the training architecture is that direction and magnitude are fundamentally different prediction tasks and benefit from being modeled separately. Every ticker/timeframe combination gets two XGBoost models: a classifier that predicts whether the stock goes up or down, and a regressor that predicts by how much.

Dual Model Architecture

Training two models per ticker lets each be optimized for what it's best at:

  • Direction Model: Predicts bullish or bearish. Optimized on classification accuracy. Trained on filtered data that removes noise days where direction is essentially random.
  • Magnitude Model: Predicts expected move size. Optimized on directional accuracy. Trained on the full dataset to capture the complete distribution of outcomes.

At inference time, the two predictions are combined into a single calibrated confidence score that captures both conviction and expected size of the move. Post-hoc calibration ensures the confidence values reflect true accuracy rates.
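One way such a combination might look, as a simplified sketch — the weights and magnitude cap below are illustrative, not the production calibration:

```python
def combine_signals(p_up, expected_move_pct):
    """Fold a classifier probability and a regressor magnitude estimate
    into a signed, confidence-ranked score. Weighting is a hypothetical
    example, not the calibrated production formula."""
    direction = 1 if p_up >= 0.5 else -1
    conviction = abs(p_up - 0.5) * 2                     # 0 = coin flip, 1 = certain
    magnitude = min(abs(expected_move_pct) / 5.0, 1.0)   # cap credit at a 5% move
    confidence = 0.7 * conviction + 0.3 * magnitude
    return direction, confidence
```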

Training Process

  • Priority queue: Models queued by strategy (worst-performing first) so training time goes where it has the most impact
  • Data loading: Features and labels pulled from PostgreSQL with GCS fallback
  • Noise filtering: Low-movement days removed for direction model training to focus on meaningful signals
  • Walk-forward validation: Expanding-window folds that respect temporal ordering (no future data leakage)
  • Hyperparameter optimization: Automated search across model parameters using Bayesian optimization
  • Evaluation: Multiple accuracy metrics tracked per fold including directional accuracy
  • Lifecycle management: Top model versions retained per ticker/timeframe; older versions pruned automatically

Comprehensive training (all ~1,800 models) runs every Saturday and takes 2–4 hours. Incremental training (100–200 models) runs on weekdays in 20–60 minutes, focusing on new tickers and underperformers.
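The worst-performing-first queue from the training steps above can be sketched with a plain min-heap. The keys, accuracy values, and tuple shape are illustrative:

```python
import heapq

def build_training_queue(model_stats):
    """Yield models worst-performing first, so limited training time goes
    where it has the most impact. model_stats maps a hypothetical
    (ticker, timeframe) key to a recent accuracy score."""
    heap = [(acc, key) for key, acc in model_stats.items()]
    heapq.heapify(heap)  # min-heap: lowest accuracy pops first
    while heap:
        acc, key = heapq.heappop(heap)
        yield key, acc

# Illustrative usage: the lowest-accuracy model trains first
stats = {("AAPL", "daily"): 0.61, ("XYZ", "daily"): 0.48, ("SPY", "weekly"): 0.55}
order = [key for key, _ in build_training_queue(stats)]
```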

Inference & Prediction

Every weekday morning at 5:00 AM, the inference pipeline loads all active models and generates a prediction for each ticker/timeframe pair. The output is a ranked list of the day's highest-confidence predictions, ready for delivery.

How It Works

  • Query the model registry for all active models (status = active)
  • For each ticker/timeframe: load the classifier and regressor from GCS
  • Load the most recent features for the current prediction date
  • Generate a direction prediction (bullish/bearish) with probability
  • Generate a magnitude prediction (% expected move)
  • Combine into a single confidence-ranked score
  • Store all predictions in PostgreSQL and upload a snapshot to GCS
  • Rank by confidence and split into top bullish and top bearish lists
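The final ranking step above reduces to a sort and a split. A minimal sketch, with hypothetical dict keys:

```python
def rank_and_split(predictions, top_n=10):
    """Rank predictions by confidence (descending), then split into
    top bullish and top bearish lists. Each prediction is a dict with
    illustrative keys: ticker, direction, confidence."""
    ranked = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    bullish = [p for p in ranked if p["direction"] == "bullish"][:top_n]
    bearish = [p for p in ranked if p["direction"] == "bearish"][:top_n]
    return bullish, bearish
```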

Current Production Scale

937 Predictions Generated Daily
358 Tickers with Active Models
~32/min Prediction Throughput

Email Delivery & Subscriptions

The delivery system takes predictions and wraps them in context: market sentiment, economic calendar events, and analysis reports. Subscribers receive content matched to their tier, delivered as polished HTML emails with optional attachments.

Delivery Workflow

  • Validate PostgreSQL tunnel connectivity (auto-start if needed)
  • Check data freshness via TradingDayValidator—trigger a sync if data is stale
  • Generate or load analysis reports for the current trading day
  • Load predictions from PostgreSQL
  • Collect market context: Put/Call ratio, Fear & Greed index, social sentiment (ApeWisdom), Forex Factory economic calendar
  • Load subscriber list and filter by tier
  • Render tier-specific HTML emails with appropriate attachments
  • Send via SMTP with a lock file to prevent duplicate sends
  • SMS notification to admin on success or failure
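The duplicate-send guard in the workflow above can be sketched with an atomic lock file: O_CREAT|O_EXCL fails if the file already exists, so two overlapping runs can't both send. The path and helper names are illustrative:

```python
import errno
import os

def acquire_send_lock(path="/tmp/autotrader_email.lock"):
    """Atomically create a lock file; returns True if this run got the
    lock. The path is a hypothetical example."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())  # record owner PID for debugging
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False  # another run already holds the lock
        raise

def release_send_lock(path="/tmp/autotrader_email.lock"):
    """Remove the lock file; missing file is fine (idempotent)."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```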

See it live: Browse the Daily Updates page for real examples of the Basic tier email output, published every trading day.

Subscription Tiers

Content scales with tier—everyone gets predictions, but the depth of analysis and number of picks increases as you move up.

  • Basic: SPY-only predictions · F&G, headlines
  • Premium: top 50 predictions · PCR, social, congress, full news
  • Professional: all 600+ predictions · LLM synthesis, entity tracker, alt-data · CSV + heatmaps
  • Secret: all 600+ predictions · Sonnet synthesis, raw model data · CSV + heatmaps + API

Design Decisions

A system like this involves hundreds of small choices. Here are the ones that shaped the architecture most significantly—and the reasoning behind each.

Why dual models instead of one?

Early on I tried a single model that predicted signed returns directly. It was mediocre at both direction and magnitude. Splitting the problem into a classifier ("which way?") and a regressor ("how far?") lets each model focus on what it does best. The classifier trains on filtered data with noise days removed; the regressor sees the full distribution. The combined signal is stronger than either alone.

Why walk-forward validation?

Standard K-fold cross-validation would happily train on Thursday's data and then test on Tuesday's. In financial data, that's cheating—any time-series pattern, regime change, or structural break gets leaked across the fold boundary. Walk-forward validation with expanding windows respects temporal ordering, which means the performance estimates I get are realistic rather than flattering.
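An expanding-window split generator, as a minimal sketch — the equal-block sizing is illustrative, and the production folds are more involved:

```python
def walk_forward_folds(n_samples, n_folds=4):
    """Expanding-window walk-forward splits for time-ordered data.
    Fold k trains on every sample before its test block, so the model
    never sees the future."""
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * block))
        # Last fold absorbs any remainder samples
        test_end = (k + 1) * block if k < n_folds else n_samples
        test_idx = list(range(k * block, test_end))
        yield train_idx, test_idx
```

Each successive fold trains on a strictly larger prefix of history, mirroring how the model would actually have been retrained over time.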

Why custom training labels instead of simple returns?

Simple close-to-close returns miss the intraday story—a stock can gap up 2% then sell off all day. The training labels are designed to capture where actual trading conviction lies, producing better signal for the models even if they're noisier to compute.

Why separate VMs?

Feature engineering spends most of its time waiting on API responses and writing to databases (I/O-bound). Model training spends most of its time in XGBoost's gradient computations (CPU-bound). Running both on a single VM would mean paying for 16 GB of RAM during data collection when I only need 8, or paying for beefy CPUs during the data pipeline when they'd sit idle. The multi-VM split lets me right-size each workload—GCP handles data pipelines and a dedicated PostgreSQL instance, while an Azure VM provides parallel training capacity.

Why filter noise days for the direction model?

On days when a stock barely moves, predicting "up" or "down" is essentially a coin flip—and training on coin flips adds noise without signal. Filtering low-movement days lets the direction model focus on days with actual directional commitment, while the magnitude model still sees the full distribution.
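In code, noise-day filtering reduces to dropping rows whose next-day move is below a threshold before building direction labels. The 0.5% threshold and row shape below are illustrative, not the production values (which would be tuned per ticker):

```python
def filter_noise_days(rows, min_abs_return=0.005):
    """Drop low-movement days before direction-model training.
    rows: list of (features, next_day_return) pairs — a hypothetical
    shape. Returns (features, binary_direction_label) pairs."""
    return [(features, 1 if ret > 0 else 0)
            for features, ret in rows
            if abs(ret) >= min_abs_return]
```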

Why build everything from scratch?

Partly because I wanted to understand every piece of the system at a level that using off-the-shelf solutions wouldn't give me. But also because the constraints of a personal project—tight budget, single maintainer, zero tolerance for pager fatigue—reward simplicity. Cron jobs, PostgreSQL, and GCS are boring, well-understood technologies. That's the point. I'd rather spend my engineering time on feature research and model architecture than debugging Kubernetes manifests.

Tech Stack

Machine Learning

XGBoost Optuna Scikit-learn TA-Lib FAISS Pandas NumPy SciPy

Data Sources

Ticker & Price Data Options Flow Market Sentiment Social Sentiment Economic Calendar Investor Surveys

Infrastructure

GCP Compute Engine Google Cloud Storage PostgreSQL SQLite Cron SSH Tunnels

Trading & Delivery

Alpaca API SMTP / Gmail SMS Alerts Stripe

Languages & Tools

Python Bash SQL Git Selenium

Subscribe

AutoTrader delivers ML-driven market predictions to your inbox every trading day before the opening bell. 1,800+ models, 600+ tickers, 500+ features — fully autonomous.

All paid tiers include a 7-day free preview of the Basic tier so you can see the system in action.

Email Subscriptions

Basic
Free
Daily market snapshot
  • SPY prediction
  • Fear & Greed index
  • Key events for tomorrow
  • 2 business news headlines
  • Market sentiment snapshot
Professional
$99/mo
Full signal access
  • All 600+ ticker predictions
  • All timeframes (daily, weekly, monthly)
  • LLM market synthesis & narrative
  • S&P 500 premium predictions
  • GICS heatmap attachments
  • CSV data exports
  • Full momentum table with AUC
  • Everything in Premium
Subscribe

API Access

Programmatic access to AutoTrader's predictions, features, and model data. Built for quants, algo traders, and fintech developers. View API docs →

API Starter
$49/mo
For side projects & exploration
  • Core prediction endpoints
  • 1,000 API calls / day
  • Daily predictions (JSON)
  • Market sentiment data
  • Standard rate limiting
Get Started
API Pro
$199/mo
For algo traders & small funds
  • All 18 endpoints
  • 10,000 API calls / day
  • Raw model predictions & features
  • Feature importance data
  • Historical prediction archive
  • CSV & JSON exports
Get Started
Enterprise
Custom
For teams & institutions
  • Unlimited API calls
  • Dedicated support & SLA
  • Custom endpoints & integrations
  • Bulk historical data access
  • Architecture licensing available
Contact

What's Next

AutoTrader is a living system—it runs in production daily, but it's also my primary playground for exploring new ideas. A few things on the roadmap:

  • Ensemble methods: Exploring how to combine predictions across timeframes (daily, weekly, monthly) into a single multi-horizon signal, weighted by each model's recent accuracy.
  • Transformer-based models: The current XGBoost approach works well on tabular features, but I'm curious whether attention mechanisms over raw price sequences could capture patterns that hand-engineered features miss.
  • Portfolio optimization: Moving beyond individual ticker predictions to portfolio-level allocation—factoring in correlation, sector exposure, and risk constraints.
  • Real-time inference: Currently predictions run once daily. Exploring whether intraday feature updates and streaming inference could capture opportunities that the overnight pipeline misses.