Machine learning & Big Data Workshops

Machine learning & Big Data Workshop series

Tentatively at the Stamford campus

Description

The need for data mining, especially big data mining, has grown dramatically in recent years. The financial industry has been accumulating large amounts of data over many years. The workshops introduce and explain the fundamentals of the data mining and machine learning techniques used in the financial industry. Each student is assumed to have access to MATLAB/Octave, R, and Python for modeling, a DBMS of choice, and an installed Cloudera distribution with Pig, Hive, and Mahout during the workshops.

Session I: Understanding machine learning and data mining algorithms

MATLAB implementation of univariate linear regression, multiple linear regression, polynomial regression, logistic regression, neural networks, and decision trees.

The session covers MATLAB programming with heavy use of array operations. Students are assumed to have at least basic MATLAB skills. Students write MATLAB code to implement the algorithms from scratch, so that they fully understand the underlying math, turn it into working code, and apply it to financial data for insights.
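As a rough illustration of what "from scratch" means here, the following is a minimal sketch of univariate linear regression fitted by batch gradient descent. It is written in Python rather than MATLAB only to keep it dependency-free. The function name, learning rate, iteration count, and toy data are all assumptions made for this example and are not part of the workshop material.

```python
def fit_univariate_linear_regression(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y ~ intercept + slope * x by batch gradient descent on mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        # Residuals of the current fit on every training point.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Gradients of (1 / 2n) * sum(errors^2) with respect to each parameter.
        grad_intercept = sum(errors) / n
        grad_slope = sum(e * x for e, x in zip(errors, xs)) / n
        # Step both parameters against their gradients.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Made-up toy data lying exactly on the line y = 1 + 2x (not real financial data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_univariate_linear_regression(xs, ys)
print(round(intercept, 4), round(slope, 4))
```

On real data the inputs would normally be scaled first, since the learning rate that converges depends on the range of x.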

Session II: Modeling using commercial and open source software

The session introduces the toolboxes and packages in MATLAB, R, and Python that are heavily used for data mining in the industry. Students learn how to use the modeling functions provided by MATLAB, R, and Python for financial data mining.

Session III: Relational Database Management Systems (Choices are MySQL, Oracle, or MSSQL)

The session discusses the data objects managed by relational database management systems and the relationships between them. It introduces data engineering techniques for pre-processing data and preparing it for data mining tasks. Students practice each of these concepts by writing simple SQL statements and MATLAB code that queries and handles data.
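The idea of related data objects queried from code can be sketched as follows. This example uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, Oracle, or MSSQL, and Python in place of MATLAB. The table names, columns, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite database: an assumption that stands in for the session's DBMS of choice.
conn = sqlite3.connect(":memory:")

# Two related data objects: each trade references the account that made it.
conn.executescript("""
CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY,
    owner      TEXT NOT NULL
);
CREATE TABLE trades (
    trade_id   INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts(account_id),
    amount     REAL NOT NULL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [(1, 1, 100.0), (2, 1, 250.0), (3, 2, 75.0)],
)

# Join the tables and aggregate per owner, a typical pre-processing step before modeling.
rows = conn.execute("""
SELECT a.owner, COUNT(*) AS n_trades, SUM(t.amount) AS total_amount
FROM accounts AS a
JOIN trades AS t ON t.account_id = a.account_id
GROUP BY a.owner
ORDER BY a.owner
""").fetchall()
conn.close()

for owner, n_trades, total_amount in rows:
    print(owner, n_trades, total_amount)
```

The same SELECT statement should carry over to the other systems largely unchanged. The connection code would differ with each system's driver.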

Session IV: Big Data in Finance, Part I

The session teaches students basic Linux and HDFS commands. It discusses data mining and machine learning tasks at large scale using Mahout. Students write shell scripts to run Mahout applications. The examples introduced are Random Forests, Naïve Bayes, K-means Clustering, and a User-based Recommender.
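To give a feel for one of the listed algorithms, here is a minimal single-machine sketch of K-means clustering (Lloyd's algorithm) in plain Python. It is independent of Mahout and does not reflect Mahout's implementation or command-line interface. The function names, sample points, and starting centroids are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_to_clusters(points, centroids):
    """Group each point under the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = assign_to_clusters(points, centroids)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Final assignment so the returned clusters match the returned centroids.
    return centroids, assign_to_clusters(points, centroids)


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

A distributed implementation has to spread the assignment and update steps across many machines. This sketch only shows the logic of a single pass on one machine.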

Session V: Big Data in Finance, Part II

The session teaches students data processing tools for large data sets, such as Apache Pig and Apache Hive. Students practice tasks such as adding files, exploring and transforming data using Apache Pig, splitting and joining data, performing analytics on big data using the DataFu Pig library, and writing simple HiveQL queries. An introductory-level overview of NoSQL with HBase is provided as well.