Regular Teaching

MATH 5671 – Financial Data Mining and Big Data Analytics

Summer 2 & Spring – Open to undergraduate students

Instructor: Prof. Do

Office: MONT 130

Office hours: by appointment (Webex only)

Email: cuong.do@engineer.uconn.edu, cuong.do@math.uconn.edu

Office phone: 860.486.7132

Description

The financial industry in particular, and most companies in general, have been accumulating data for years and mining it to drive financial decisions. Data sets are now extremely large and are expected to keep growing exponentially, to the point where they become prohibitive for traditional machine learning and data mining methods. Mining data is said to be more valuable than mining oil.

This course introduces standard machine learning and data mining algorithms with financial applications and prepares students to work with large data sets. In the first part, students learn pre-processing; supervised learning algorithms such as logistic regression, naive Bayes, k-nearest neighbors, decision trees, neural networks, and support vector machines (SVM); and unsupervised learning methods such as k-means clustering, agglomerative clustering, and association rule mining. Building on neural networks, the course gives an overview of convolutional neural networks (CNN) and typical architectures, recurrent neural networks (RNN), unidirectional and bidirectional long short-term memory (LSTM) networks, unidirectional and bidirectional gated recurrent unit (GRU) networks, and neural transfer learning. Recommender systems are introduced, including content-based filtering and collaborative filtering. Reinforcement learning is also introduced. Students learn and practice manipulating data using resilient distributed datasets (RDDs) and data frames, and modeling using MLlib on Google Cloud Platform (GCP). A brief introduction to Hive and Pig for data analysis is also provided.
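
To give a flavor of the supervised learning methods listed above, here is a minimal sketch of k-nearest neighbors classification in plain Python. The toy data (two made-up borrower features with "repaid"/"default" labels) and the function name knn_predict are illustrative assumptions and are not taken from the course material.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (debt-to-income ratio, credit utilization)
train_X = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
           (0.70, 0.80), (0.80, 0.90), (0.75, 0.85)]
train_y = ["repaid", "repaid", "repaid", "default", "default", "default"]

print(knn_predict(train_X, train_y, (0.12, 0.22)))  # repaid
print(knn_predict(train_X, train_y, (0.78, 0.82)))  # default
```

The sketch uses an odd k with two classes so that the vote cannot tie; with real data, features would typically be scaled first so that no single feature dominates the distance.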

The course uses external educational materials such as books, code, videos, and websites to support teaching and accelerate student learning. Students are expected to spend a significant amount of time digesting the assigned materials.

Prerequisite

MATH 5670 is highly recommended.

Software

Programming language: Python, used with Jupyter Notebook or Google Colaboratory.

Packages: numpy, scipy, pandas, imblearn, tensorflow, keras, stable_baselines, gym, pyspark

Cloud: Students learn to use GCP products for Big Data, such as Dataproc, BigQuery, and BigQuery ML.

Textbook

A textbook is not required. The following books are recommended for hands-on practice. They are available to read at https://learning.oreilly.com using UConn credentials. You need to be on the campus network or, from off campus, connected through vpn.uconn.edu.

  1. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron

https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch01.html

  2. Data Analytics with Spark Using Python, First Edition, by Jeffrey Aven

https://learning.oreilly.com/library/view/data-analytics-with/9780134844855/

  3. Dive into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola

Teaching material (slides, instruction, code, and data) will be posted on HuskyCT at https://lms.uconn.edu.

Reference Books

  1. P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining
  2. C. M. Bishop, Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC, 2006
  3. R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd Edition, Wiley-Interscience, 2001
  4. S. Theodoridis, K. Koutroumbas, Pattern Recognition, 4th Edition
  5. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques
  6. D. T. Larose, C. D. Larose, Discovering Knowledge in Data: An Introduction to Data Mining
  7. G. Dougherty, Pattern Recognition and Classification: An Introduction
  8. B. Kovalerchuk, E. Vityaev, Data Mining in Finance: Advances in Relational and Hybrid Methods (The Springer International Series in Engineering and Computer Science)
  9. S. Chakrabarti, Data Mining: Know It All
  10. A. Aldo Faisal, Cheng Soon Ong, Marc Peter Deisenroth, Mathematics for Machine Learning
  11. Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, Dive into Deep Learning
  12. François Chollet, Deep Learning with Python
  13. Mohammed Guller, Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis

Grading

Grades are based on class participation, group assignments, and a group project, with the following distribution:

Class participation     10%
Group assignments       40%
Group project           50%
-----------------------------
Total                  100%

Students can browse projects from previous semesters at http://datascience.uconn.edu/

Note: The instructor reserves the right to make changes to the syllabus as needed. If there are any changes, you will be notified in class or through your UConn e-mail address.