MATH 5671 – Financial Data Mining and Big Data Analytics
Summer 2 & Spring – Open to undergraduate students
Instructor: Prof. Do
Office: MONT 130
Office hours: by appointment (Webex only)
Email: cuong.do@engineer.uconn.edu, cuong.do@math.uconn.edu
Office phone: 860.486.7132
Description
The financial industry in particular, and most companies in general, have been accumulating data for years and mining it to drive their financial decisions. Data sets are now extremely large and are expected to keep growing exponentially, which makes them prohibitive for traditional machine learning and data mining methods. Mining data is said to be more valuable than mining oil.
This course introduces standard machine learning and data mining algorithms with financial applications and prepares students to work with large data sets. In the first part, students learn pre-processing; supervised learning algorithms such as logistic regression, naïve Bayes, k-nearest neighbors, decision trees, neural networks, and support vector machines (SVM); and unsupervised learning algorithms such as k-means clustering, agglomerative clustering, and association rule mining. Building on neural networks, the course gives an overview of convolutional neural networks (CNN) and typical architectures, recurrent neural networks (RNN), unidirectional and bidirectional long short-term memory (LSTM) networks, unidirectional and bidirectional gated recurrent unit (GRU) networks, and neural transfer learning. Recommender systems are introduced, including content-based filtering and collaborative filtering. Reinforcement learning is also introduced. Students learn and practice manipulating data using resilient distributed datasets (RDDs) and data frames, and modeling using MLlib, on Google Cloud Platform (GCP). A brief introduction to Hive/Pig for data analysis is provided.
The course uses external educational materials such as books, code, videos, and websites to support teaching and accelerate student learning. Students are expected to spend a significant amount of time digesting the assigned materials.
Prerequisite
MATH 5670 is highly recommended.
Software
Programming language: Python, used with Jupyter Notebook or Google Colaboratory.
Packages: numpy, scipy, pandas, imblearn, tensorflow, keras, stable_baselines, gym, pyspark
Cloud: Students learn to use GCP products designed for Big Data, such as Dataproc, BigQuery, and BigQuery ML.
Textbook
A textbook is not required. The following books are recommended for hands-on practice. They are available to read at https://learning.oreilly.com using UConn credentials. You need to be on the campus network or, from off campus, connected through vpn.uconn.edu.
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron
https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch01.html
- Data Analytics with Spark Using Python, First edition, by Jeffrey Aven
https://learning.oreilly.com/library/view/data-analytics-with/9780134844855/
- Dive Into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
Teaching material (slides, instruction, code, and data) will be posted on HuskyCT at https://lms.uconn.edu.
Reference Books
- P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC, 2006
- R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd Edition, Wiley-Interscience, 2001
- S. Theodoridis, K. Koutroumbas, Pattern Recognition, 4th Edition
- J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques
- D. T. Larose, C. D. Larose, Discovering Knowledge in Data: An Introduction to Data Mining
- G. Dougherty, Pattern Recognition and Classification: An Introduction
- B. Kovalerchuk, E. Vityaev, Data Mining in Finance: Advances in Relational and Hybrid Methods (The Springer International Series in Engineering and Computer Science)
- S. Chakrabarti, Data Mining: Know It All
- Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for Machine Learning
- Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, Dive into Deep Learning
- François Chollet, Deep Learning with Python
- Mohammed Guller, Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis
Grading
Grades are based on class participation, group assignments, and a group project, with the following distribution:
Class participation    10%
Group assignments      40%
Group project          50%
                      ----
Total                 100%
Students can view projects from previous semesters at http://datascience.uconn.edu/
Note: The instructor reserves the right to make changes to the syllabus as needed.
If there is any change, you will be notified in class or at your UConn e-mail address.