Big Data / Analytics
The latest addition to the Advanced Computing platform is the Big Data/Analytics core. The core provides not just the infrastructure for Big Data programs but also the expertise in Analytics and Machine Learning needed to address real-world problems. The Big Data Analytics Cluster (BDAC) service has over 256 cores and close to 75 TB of big data storage. Several ETL platforms (including Kettle) are used to stage data from the W.A.D.E. storage cloud onto BDAC.
BDAC is designed around the HDFS and MapReduce frameworks and is presented through industry-standard tools such as the NoSQL database HBase, Sqoop, Flume, and Spark. It is heavily invested in Python and R bindings: PySpark, SparkR, and other machine-learning-focused engines are incorporated into BDAC, giving UM researchers the most flexible Big Data system possible.
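For readers new to the MapReduce model mentioned above, the following is a minimal sketch of the pattern in plain Python, using a word count as the example. It is an illustration only and is not BDAC code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input; real jobs would read from distributed storage.
    sample = ["big data storage", "big data analytics"]
    print(word_count(sample))
```

On a Spark cluster the same computation is typically expressed with the RDD operations `flatMap`, `map`, and `reduceByKey`, with the framework handling the shuffle across nodes.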
Along with infrastructure, the Big Data/Analytics core also features subject matter expertise in Analytics for all data sets. Focused on non-traditional analysis techniques and Machine Learning, core staff are currently working on projects ranging from sentiment analysis with genetic algorithms to medical billing informatics and the URIDE data analytics platform.
The Big Data/Analytics core offers a range of specialized consulting services to UM’s research community in large-scale data collection and storage, data processing pipeline development, data mining and machine learning, big data search, and presentation layer development. We aim to improve and optimize business processes through data-driven decision making.
Big Data/Analytics core projects:
- HPC cluster status and job statistics data collection and processing for performance, utilization, and efficiency research (dataset of more than 10 million records)
- Hadoop cluster creation and management automation in OpenStack Cloud, and high-availability development
- Pentaho Kettle Hadoop cluster integration
- collaboration with the Business School on a text mining analysis of Amazon product reviews to build a model that could improve future sales
- UHealth diagnosis and patient demographic data analysis to discover patterns among different elements of UHealth clinic datasets (dataset of more than 5 million records)
- analysis of hospital EDI dataset to improve the accuracy of insurance claims