Group Project — S&P 500 Stock Predictor

Overview

A COMP208 group project that builds a prediction system for the S&P 500 index from 30 years of historical data, downloaded from Yahoo Finance with the yfinance library.
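As a rough illustration of the data pull, the sketch below computes a 30-year date window and passes it to `yfinance.download`. The helper names `history_window` and `fetch_index_history` are hypothetical, and using the `^GSPC` ticker with daily bars is an assumption rather than a confirmed project setting.

```python
from datetime import date


def history_window(end: date, years: int = 30) -> tuple[str, str]:
    """Return (start, end) ISO dates spanning `years` years back from `end`."""
    try:
        start = end.replace(year=end.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap start year.
        start = end.replace(year=end.year - years, day=28)
    return start.isoformat(), end.isoformat()


def fetch_index_history(end: date, ticker: str = "^GSPC"):
    """Download daily bars for the index (assumed ticker and interval)."""
    # Imported lazily so the date helper can be used without yfinance installed.
    import yfinance as yf

    start, stop = history_window(end)
    # yfinance treats `end` as exclusive, so the final day itself is not returned.
    return yf.download(ticker, start=start, end=stop, interval="1d")
```

Calling `fetch_index_history(date.today())` needs network access and returns a pandas DataFrame of price columns.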

Approach

We run predictions in four progressively larger stages to assess how much constituent-level detail is worth the added computational cost:

S&P 500 index alone (no constituent stocks)
Top 10 constituent stocks
Top 50 constituent stocks
All 500 constituent stocks
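The four stages above could be captured as a small configuration table. This is a minimal sketch: the names `STAGES` and `universe_for` are hypothetical, and it assumes the constituents arrive already ranked, since the ranking rule (index weight, for example) is not specified here.

```python
# Hypothetical stage table: how many constituent stocks accompany the index.
# None means "use every constituent supplied".
STAGES = {
    "index_only": 0,
    "top_10": 10,
    "top_50": 50,
    "all_500": None,
}


def universe_for(stage: str, ranked_tickers: list[str]) -> list[str]:
    """Return the constituents for a stage; `ranked_tickers` is assumed pre-sorted."""
    limit = STAGES[stage]
    if limit is None:
        return list(ranked_tickers)
    return list(ranked_tickers[:limit])
```

Keeping the stage sizes in one place means the same training code can run unchanged for every stage.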

Each stage is tested with three algorithms: K-Nearest Neighbours, Linear Regression, and Long Short-Term Memory (LSTM) networks. Comparing results across stages shows where diminishing returns set in.
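All three algorithms need the price series turned into supervised examples first. One common way, sketched below, is a sliding window: the previous few values become the features and the next value becomes the target. The name `make_windows` and the default lookback of 5 are illustrative assumptions, not the project's settings.

```python
def make_windows(
    series: list[float], lookback: int = 5
) -> tuple[list[list[float]], list[float]]:
    """Turn a series into (features, targets) using a sliding window."""
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    features, targets = [], []
    for end in range(lookback, len(series)):
        # The `lookback` values before position `end` predict the value at `end`.
        features.append(series[end - lookback:end])
        targets.append(series[end])
    return features, targets
```

The resulting lists can be passed to the `fit(X, y)` method of scikit-learn's `KNeighborsRegressor` or `LinearRegression`, or converted to a tensor for a PyTorch LSTM.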

During evaluation, the most recent 3–5 years of data are held out entirely for testing. Once the approach is chosen, the final model will be retrained on all available data, including the held-out years, and used to predict weekly index values for the coming months.
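Because this is time-series data, the hold-out has to be chronological rather than random, so that no future information leaks into training. A minimal sketch, assuming rows are `(date, value)` pairs; the function name `chronological_split` and the 5-year default are illustrative.

```python
from datetime import date


def chronological_split(rows: list[tuple[date, float]], test_years: int = 5):
    """Split (date, value) rows so the most recent `test_years` form the test set."""
    if not rows:
        return [], []
    ordered = sorted(rows, key=lambda row: row[0])
    last_day = ordered[-1][0]
    try:
        cutoff = last_day.replace(year=last_day.year - test_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap cutoff year.
        cutoff = last_day.replace(year=last_day.year - test_years, day=28)
    train = [row for row in ordered if row[0] <= cutoff]
    test = [row for row in ordered if row[0] > cutoff]
    return train, test
```

Every training row is dated on or before the cutoff and every test row after it, so the model is always evaluated on data later than anything it was trained on.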

Database Design

The full dataset is nearly 800 MiB, so database efficiency is a primary concern. We are testing schemas normalised from 1NF through 6NF and benchmarking read and write performance at each normal form to find the right balance between storage efficiency and query speed.
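The benchmarking idea can be sketched as a small timing harness. To stay self-contained, this example swaps in Python's built-in `sqlite3` with an in-memory database in place of MySQL, so the absolute timings will not transfer. The two schemas, the table names, and the synthetic rows are all hypothetical placeholders, not the project's actual design or data.

```python
import sqlite3
import time

# Hypothetical schemas: one flat table versus a ticker lookup plus price table.
FLAT = ["CREATE TABLE price (ticker TEXT, day TEXT, close REAL)"]
NORMALISED = [
    "CREATE TABLE ticker (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE)",
    "CREATE TABLE price (ticker_id INTEGER REFERENCES ticker(id), day TEXT, close REAL)",
]


def insert_flat(conn, rows):
    conn.executemany("INSERT INTO price VALUES (?, ?, ?)", rows)


def insert_normalised(conn, rows):
    symbols = sorted({ticker for ticker, _, _ in rows})
    conn.executemany("INSERT INTO ticker (symbol) VALUES (?)", [(s,) for s in symbols])
    ids = dict(conn.execute("SELECT symbol, id FROM ticker").fetchall())
    conn.executemany(
        "INSERT INTO price VALUES (?, ?, ?)",
        [(ids[ticker], day, close) for ticker, day, close in rows],
    )


def benchmark(schema, insert, query, rows):
    """Time one bulk write and one read against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    for statement in schema:
        conn.execute(statement)
    start = time.perf_counter()
    insert(conn, rows)
    conn.commit()
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    rows_read = len(conn.execute(query).fetchall())
    read_s = time.perf_counter() - start
    conn.close()
    return {"write_s": write_s, "read_s": read_s, "rows_read": rows_read}


# Synthetic placeholder rows: 10 made-up tickers with 100 days each.
rows = [(f"T{i % 10}", f"day{i // 10}", float(i)) for i in range(1000)]

flat = benchmark(
    FLAT, insert_flat,
    "SELECT ticker, day, close FROM price WHERE ticker = 'T3'", rows,
)
normalised = benchmark(
    NORMALISED, insert_normalised,
    "SELECT t.symbol, p.day, p.close FROM price p "
    "JOIN ticker t ON t.id = p.ticker_id WHERE t.symbol = 'T3'", rows,
)
```

Both runs should read the same number of rows, which is a useful sanity check that the schemas hold equivalent data before their timings are compared. A real comparison would repeat each measurement several times and run against the MySQL server itself.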

Deliverable

Results will be presented on a hosted website showing:

Hindcast predictions across the 30-year window
Live forward predictions once the final model is trained

Tech Stack

Python · yfinance · MySQL · scikit-learn · PyTorch (LSTM)