FACILITIES, EQUIPMENT, AND OTHER RESOURCES
In 2007, UM made a strategic investment in computational science by founding the Center for Computational Science (CCS). It was created to catalyze transdisciplinary research in science and engineering using cutting-edge software and hardware and, more importantly, world-class expertise. CCS is organized around eight focus areas:
Physical Sciences and Engineering: Led by Dr. Benjamin Kirtman, this program is designed to support research and development involving oceanic, atmospheric, and climatic issues as well as problems related to fluid dynamics, solid structures, and materials.
Computational Biology and Bioinformatics: This program supports projects involving genomics, biological data management, molecular sequence and structures, marine ecology, biodiversity and population dynamics. Dr. Sunil Rao is this program’s lead.
Data Mining: This program, headed by Dr. Mitsunori Ogihara, focuses on developing methods for understanding large data sets, including data clustering and association rule mining. It lends expertise in large-scale nonstatistical data analysis across many disciplines.
Social Systems Informatics: This program focuses on the interactions among the distinct components of social systems, using tools such as data mining, data storage, and data processing. Dr. C. Hendricks Brown leads this program.
Visualization: The visualization program aims to develop and improve core technologies for comprehensive computational modeling, simulation, analysis, and visualization of natural and synthetic phenomena. Our researchers are focused on developing visualization tools for GIS, Biomedicine, Epidemiology, and Geophysical applications. Dr. Gecheng Zha is the interim leader for this program.
Computational Chemistry: The computational chemistry program includes a range of problems at the interface of chemistry, biology, modeling, data mining, electronic structure modeling, dynamics and kinetics. Dr. Stephan C. Schurer is the interim leader for this program.
Software Engineering: Led by Christopher Mader, the software engineering core provides expertise in software system design, development, integration, and implementation. It provides services through two teams, software engineering and project management. The software engineering team focuses on delivering technical content, while the project management team provides leadership and coordination for systems projects.
High Performance Computing (HPC): A primary feature of the center, this focus area is also at the heart of a university-wide resource whose mission is to provide the academic and research community at large with comprehensive HPC resources, ranging from hardware infrastructure to expertise in designing and implementing leading-edge HPC and advanced storage solutions. The HPC program has a special interest in research and development of storage-based systems and processes. Joel Zysman leads this focus area.
All programs report to Dr. Nicholas Tsinoremas, CCS’ founding director, who reports directly to Dr. Thomas LeBlanc, UM provost. CCS currently supports over 1,200 university users who represent all disciplines noted above as well as communities in music, economics, and finance.
CCS’ focus on data-driven science and engineering has enabled it to work on data services not only at the campus level, but also at the university level and within the State of Florida at partner and affiliate locations. The University of Miami has four main campuses and several distant locations, which are served by CCS for data and computing requirements. Many CCS members and associated faculty are already using national CI resources through XSEDE/TeraGrid, along with large-scale computations at NCAR, NOAA/GFDL, DOE/ORNL, and NASA.
CCS provides access to several HPC systems (see the following “Facilities” section) and several large data pools that support a diverse set of scientific communities. The center currently operates a four-tier storage infrastructure designed specifically to accommodate data-driven scientific workflows from sensors/instruments to the desktop of the investigators and, as a broader impact, to science-based decision makers. CCS owns and centrally manages the facilities required to support a broad user base. This centralization of resources gives CCS great flexibility to approach large-scale research projects while still providing computer resources to all university faculty members, their students, and group members.
The Center for Computational Science (CCS) holds offices on three campuses at the University of Miami: Coral Gables, the Miller School of Medicine, and the Rosenstiel School of Marine and Atmospheric Science. Each office is equipped with dual-processor workstations and essential software applications. CCS also has two dedicated conference rooms and communication technology to interact with advisors (phone, web, and video conferencing).
High Performance Computing (HPC):
UM maintains one of the largest centralized academic cyberinfrastructures in the country, with numerous assets.
Current High Performance Computing (HPC) systems: The CCS HPC core has been in continuous operation for the past five years. Over that time the core has grown from no HPC cyberinfrastructure to a regional high-performance computing environment that currently supports more than 1,200 users, 220 TFlops of computational power, and more than 3 petabytes of disk storage. The center’s latest system acquisition, an IBM iDataPlex system, was ranked number 389 on the November 2012 Top 500 Supercomputer Sites list.
At present, CCS operates Ares, an IBM p5-575 cluster consisting of 576 cores, and Pegasus, a Linux Xeon cluster consisting of 5,500 cores. This IBM blade system is composed of both Intel and AMD processors with over 19 TB of RAM. The center also operates Chiron, an 896-core IBM Power5+ SMP cluster. Pegasus MKII is the new IBM iDataPlex purpose-built HPC system. It is composed of an Intel MIC/Phi system, consisting of forty 61-core coprocessor nodes, plus standard Sandy Bridge nodes (a total of 5,500 cores). The storage architecture for all systems is designed to provide high-speed access for both computation and presentation to client systems. CCS also maintains approximately 1,000 cores of IBM Power5+ AIX-based systems and 16 Solaris systems running an Oracle environment for its relational databases.
CCS offers an integrated storage environment for both structured (relational) and unstructured (flat file) data. These systems are specifically tuned for CCS’ data type and application requirements, whether they are serial access or highly parallelized. Each investigator or group has access to its own area and can present their data through a service-oriented architecture (SOA) model. Researchers can share their data via access control lists (ACLs), which ensure data integrity and security while allowing flexibility for collaboration.
CCS offers structured data services through the most common relational database systems, including Oracle, MySQL, and PostgreSQL. Investigators and project teams can access their space through SOA and utilize their resources with the support of an integrated backend infrastructure.
The CCS storage environment is built as a four-tier solution:
Tier 1: (High-performance scratch, defined as requiring >160 Gb/sec up to 400 Gb/sec throughput or from 80K IOPS up to 200K IOPS) The HPC systems use a combination of high-performance NAS (BlueArc/HDS Titan 2 and Isilon) and FDR InfiniBand-attached (GPFS) Data Direct Networks GridScaler 12KE storage for high-throughput or large-scale parallel processing. CCS currently maintains approximately 700 TB of raw storage for scratch. It also maintains 100 TB of raw storage for the RDBMS systems, including Oracle, MySQL, and PostgreSQL.
Tier 2: (Mid-range performance, less than 10 Gb/sec throughput or between 10,000 and 80,000 IOPS) CCS utilizes SAN-connected GPFS data stores for mid-range work. It currently has more than 2.5 PB of Tier 2 space spread across 12 different NSDs. Most interactive and low-thread-count analysis takes place on this tier. CCS also utilizes this tier for presentation and visualization.
Tier 3: (Low-performance disk) CCS maintains more than 1 PB of Tier 3 storage. A disk-based disaster recovery system is designed using large pools of dense SATA storage in a combination of disk mirroring and TSM backup.
Tier 4: (Low-performance tape-based disaster recovery) For critical data with long retention times, the center operates a Sun StorageTek tape library running IBM’s TSM. The current occupancy is slightly over 2 PB of LTO-4 tapes utilizing compression.
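As an illustration only, the performance bands quoted for Tier 1 and Tier 2 can be read as a simple lookup from a workload's stated needs to a tier. The sketch below is not CCS tooling: the function name `suggest_tier`, its parameters, and the rule that the more demanding metric wins are assumptions made for this example. The numeric thresholds are the ones given in the tier descriptions, and requirements the text does not assign to any tier are reported as "unspecified".

```python
from typing import Optional

# Thresholds taken from the tier descriptions (Gb/sec and IOPS).
TIER1_MIN_GBPS = 160.0   # Tier 1: more than 160 Gb/sec
TIER1_MIN_IOPS = 80_000  # Tier 1: from 80K IOPS
TIER2_MAX_GBPS = 10.0    # Tier 2: less than 10 Gb/sec
TIER2_MIN_IOPS = 10_000  # Tier 2: between 10,000 and 80,000 IOPS


def suggest_tier(throughput_gbps: Optional[float] = None,
                 iops: Optional[int] = None) -> str:
    """Hypothetical helper: map a workload's needs to a storage tier."""
    # Assumption: if either metric reaches the Tier 1 range, Tier 1 wins.
    if throughput_gbps is not None and throughput_gbps > TIER1_MIN_GBPS:
        return "Tier 1"
    if iops is not None and iops >= TIER1_MIN_IOPS:
        return "Tier 1"
    # Otherwise, Tier 2 if either metric falls in the mid-range band.
    if iops is not None and TIER2_MIN_IOPS <= iops < TIER1_MIN_IOPS:
        return "Tier 2"
    if throughput_gbps is not None and throughput_gbps < TIER2_MAX_GBPS:
        return "Tier 2"
    # Nothing matched: the requirement falls in a range the text does
    # not assign to a tier (for example, 10 to 160 Gb/sec).
    return "unspecified"


if __name__ == "__main__":
    print(suggest_tier(throughput_gbps=200))  # parallel scratch workload
    print(suggest_tier(iops=50_000))          # interactive analysis
    print(suggest_tier(throughput_gbps=50))   # band not covered by the text
```

Tiers 3 and 4 are omitted because they are described by purpose (disaster recovery and long retention) rather than by performance figures.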
HPC Core Expertise:
The group has in-depth experience in various scientific research areas, with extensive experience in parallelizing or distributing codes written in Fortran, C, Java, Perl, Python, and R. The HPC team is active in contributing to open source software efforts, including R, Python, the Linux kernel, Torque, Maui, XFS, and GFS. The team also specializes in scheduling software (LSF) to optimize the efficiency of the HPC systems and adapt codes to the CCS environment. The HPC core also has a great deal of expertise in parallelizing code using both MPI and OpenMP, depending on the programming paradigm. CCS has contributed several parallelization efforts back to the community in projects such as R, WRF, and HYCOM.
The core specializes in implementing and porting open source codes to CCS’ environment and often contributes changes back to the community. CCS currently supports more than 300 applications and optimized libraries on its computing environment. The core personnel are experts in implementing and designing solutions in the three different variants of Unix. CCS also maintains industry research partnerships with IBM, Schrödinger, OpenEye, and DDN.
HPC users have a complete software suite at their fingertips, including standard scientific libraries and numerous optimized libraries and algorithms tuned for the computing environment. All programs and algorithms are implemented in 64-bit mode in order to address large-memory problems, and compatible 32-bit libraries and algorithms are also offered. In addition, the LSF grid scheduling process maximizes the efficiency of the computational resources. Increased efficiency translates into faster execution of programs, which gives researchers faster access to more resources. By utilizing the full suite of LSF tools, we are able to support both batch and interactive workloads while still retaining workload management features.
For more details about our HPC infrastructure, please visit http://www.ccs.miami.edu/hpc
CCS’s Bioinformatics Program was established to conduct research and offer services and training in the management and analysis of biological and medical/health record data. Our mission is to spearhead bioinformatics capacity at the University of Miami for all biological and medical applications. This includes data management, data mining, and data analysis capacities. We aim to achieve this mission through infrastructure, education, and expertise. In particular, we are providing an online portal for bioinformatics databases and web tools, and offering a number of data analysis services. We are concomitantly leading educational and training initiatives in bioinformatics analysis, and nourishing these activities with high impact bioinformatics research.
iBIS – UM’s online Bioinformatics Integrated Services portal
iBIS is a bioinformatics portal that includes links to genomic databases, protein structure databases, and clinical genetics databases, as well as numerous software tools for the analysis of gene expression, gene regulation, signaling and metabolic pathways, genomics, proteomics, and systems biology. In addition, the portal allows access to a suite of locally available tools and databases that we maintain on CCS’ HPC servers. Access to most components in the portal is freely available to anyone with a University of Miami login name and password. The portal also offers online tutorials for the major databases and web tools, and the CCS Bioinformatics team provides regular training workshops for new users.
Bioinformatics Data Analysis
We provide data analysis training and expertise at three levels (consulting, preliminary data generation, and full collaboration), based on the time and complexity of the service requested. The analyses are undertaken by skilled analysts and overseen by experienced faculty. We have been working mostly with microarray data and next generation sequencing data, and our analytical services include, but are not limited to, the following:
• gene expression analysis for transcriptome profiling and/or gene regulatory network building,
• prognostics and/or diagnostic biomarker discovery,
• microRNA target analysis,
• copy number variant analysis; in this context we are testing the few existing algorithms and developing new ones for accurate and unambiguous discovery of copy number variation in the human genome,
• genome or transcriptome assembly from next generation sequencing data, and its visualization,
• SNP functionality analysis,
• other projects include merging or correlating data from various data types for a holistic view of a particular pathway or disease process.
Advanced Data Mining
The Center’s Data Mining Research Group provides advanced data mining expertise and capabilities to further explore high dimensional data. The following are examples of the expertise areas covered by our faculty.
• Classification, which appears in essentially every subject area that involves collection of data of different types, such as disease diagnosis based on clinical and laboratory data. Methods include regression (linear and logistic), artificial neural nets (ANN), k-nearest neighbors (KNN), support vector machines (SVM), Bayesian networks, decision trees, and others.
• Clustering, which partitions the input data points into groups of mutually similar points, such that data points from different groups are not similar. Methods include k-means, hierarchical clustering, and self-organizing maps (SOM), and are often accompanied by space decomposition methods that offer low-dimensional representations of high-dimensional data spaces. Methods of space decomposition include principal component analysis (PCA), independent component analysis (ICA), multidimensional scaling (MDS), Isomap, and manifold learning. Advanced topics in clustering include multifold clustering, graphical models, and semi-supervised clustering.
• Association data mining, which finds frequent combinations of attributes in databases of categorical attributes. The frequent combinations can be then used to develop prediction of categorical values.
• Analysis of sequential data, which involves mostly biological sequences and includes such diverse topics as extraction of common patterns in genomic sequences for motif discovery, sequence comparison for haplotype analysis, alignment of sequences, and phylogeny reconstruction.
• Text mining, particularly in terms of extracting information from published papers, thus transforming documents to vectors of relatively low dimension to enable the use of data mining methods mentioned above.
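To make one of the areas above concrete, the following minimal sketch illustrates association data mining: it counts how often combinations of categorical attributes occur together and derives the confidence of a simple rule. It uses brute-force enumeration of small itemsets rather than a production algorithm such as Apriori, and it is not the group's own code. The function names, the toy records, and the 0.5 support threshold are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of records containing them) >= min_support."""
    n = len(transactions)
    counts = Counter()
    for record in transactions:
        items = sorted(set(record))  # canonical order, duplicates removed
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


def rule_confidence(itemsets, antecedent, consequent):
    """Confidence of antecedent -> consequent:
    support(antecedent and consequent) / support(antecedent)."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return itemsets[both] / itemsets[tuple(sorted(antecedent))]


# Toy categorical records (hypothetical attribute labels).
transactions = [
    ["a", "b"],
    ["a", "b", "c"],
    ["b"],
    ["a", "c"],
]

itemsets = frequent_itemsets(transactions, min_support=0.5)

if __name__ == "__main__":
    for combo, support in sorted(itemsets.items()):
        print(combo, support)
    print("confidence c -> a:", rule_confidence(itemsets, ["c"], ["a"]))
```

A rule with high confidence, such as "c implies a" in the toy data, is the kind of frequent combination that can then be used to predict a categorical value.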
The Visualization program conducts both theoretical and applied research in the general areas of Machine Vision and Learning, and specifically in (i) computer vision and image processing, (ii) machine learning, (iii) biomedical image analysis, and (iv) computational biology and neuroscience. The goal is to provide expertise in this area to develop novel, fully automated methods that offer robustness, accuracy, and computational efficiency. The program works towards finding better solutions to existing open problems in the above areas, as well as exploring different scientific fields where our research can provide useful interpretation, quantification, and modeling.
CCS has a sophisticated cheminformatics and compute infrastructure with a significant level of support from the institution. CCS facilitates scientific interactions and enables efficient research using informatics and computational approaches. A variety of departments and centers at the university use high-content and high-throughput screening approaches, including The Miami Project to Cure Paralysis, the Diabetes Research Institute, the Cancer Center, Bascom Palmer Eye Institute, the Dept. of Surgery, and the John P. Hussman Institute for Human Genomics.
Cheminformatics and computational chemistry tools – running on HPC Linux cluster and high performance application server
• SciTegic Pipeline Pilot (visual work-flow-based programming environment (data pipelining); broad cheminformatics, reporting / visualization, modeling capabilities; integration of applications, databases, algorithms, data)
• Leadscope Enterprise (integrated cheminformatics data mining and visualization environment; unique chemical perception (~27K custom keys; user extensions); various algorithms, HTS analysis, SAR / R-group analysis, data modeling)
• ChemAxon tools and applications (cheminformatics software applications and tools; wide variety of cheminformatics functionality)
• Spotfire (highly interactive visualization and data analysis environment, various statistical algorithms with chemical structure visualization, HTS and SAR analysis)
• OpenEye ROCS, FRED, OMEGA, EON, etc. implemented on Linux cluster (suite of powerful applications and tool kits for high-throughput 3D manipulation of chemical structures, modeling of shape, electrostatics, protein-ligand interactions and various other aspects of structure- and ligand-based design; also includes powerful cheminformatics 2D structure tools)
• Schrödinger Glide, Prime, MacroModel, and various other tools implemented on Linux cluster (powerful state of the art docking, protein modeling and structure prediction tools and visualization)
• Desmond implemented on Linux Cluster (powerful state of the art explicit solvent molecular dynamics)
• TIP workgroup (powerful environment for global analysis of protein structures, binding sites, binding interactions; implemented automated homology modeling, binding site prediction, structure and site comparison for amplification of known protein structure space)