Access to the Vault storage service is controlled through several methods, including multifactor authentication. All users are required to use Yubikey™ 4 hardware USB keys to log on to the Vault secure storage service. Vault also requires IP whitelisting for access through either the on-campus research network or campus-based VPN services.
The following boilerplate describes the equipment, facilities, and other resources available at CCS and may be used for grant preparation:
Facilities, Equipment, and Other Resources
CCS systems are colocated at the Verizon Terremark NAP of the Americas (NOTA or NAP). The NAP Datacenter in Miami is a 750,000-square-foot, purpose-built Tier IV facility with N+2 14-megawatt power and cooling infrastructure. The equipment floors start at 32 feet above sea level, and the roof slope, assisted by 18 rooftop drains, was designed to drain floodwater in excess of 100-year-storm intensity. The architecture was designed to withstand a Category 5 hurricane, with approximately 19 million pounds of concrete roof ballast and 7-inch-thick steel-reinforced concrete exterior panels. The building is also outside the FEMA 500-year designated flood zone. The NAP uses a dry pipe fire-suppression system to minimize the risk of damage from leaks.
The NAP has a centrally located Command Center staffed 24×7 by security personnel and supported by security sensors. To connect the University of Miami with the NOTA Datacenter, UM has invested in a Dense Wavelength Division Multiplexing (DWDM) optical ring for all of its campuses. UM CCS Advanced Computing resources occupy a discrete, secure wavelength on the ring, which provides a distinct 10 Gigabit HPC network to all UM campuses and facilities.
Given the University of Miami’s past experience, including several hurricanes and other natural disasters, we anticipate no service interruptions due to facilities issues. The NAP was designed and constructed for resilient operations. UM has gone through several hurricanes, power outages, and other severe weather crises without any loss of power or connectivity to the NAP. The NAP maintains its own generators with a flywheel power crossover system. This ensures that power is not interrupted when the switch is made to auxiliary power. The NAP maintains a two-week fuel supply (at 100% utilization) and is on the primary list for fuel replacement due to its importance as a data-serving facility.
In addition to hosting the University of Miami’s computing infrastructure, the NAP of the Americas is home to assets of US SOUTHCOM, Amazon, eBay, and several telecommunications companies. The NAP at Miami hosts 97% of the network traffic between the US and Central/South America. The NAP is also the local access point for Florida LambdaRail (FLR), which is gated to Internet2 (I2) to provide full support to the I2 Innovation Platform. The NAP also provides TLD information to the DNS infrastructure and is the local peering point for all networks in the area.
The University of Miami has made the NAP its primary data center, occupying a very significant footprint on the third floor. Currently all UM CCS resources, clusters, storage, and backup systems run from this facility, which serves all major UM campuses.
The Center for Computational Science (CCS) holds offices on all three campuses of the University of Miami: the Miller School of Medicine, the Rosenstiel School of Marine and Atmospheric Science, and the Coral Gables campus, where its main operations are located at the Gables One Tower and the Ungar Building.
Each location is equipped with dual-processor workstations and essential software applications. CCS has three dedicated conference rooms and communication technology to interact with advisors (phone, web-, and video-conferencing), plus a Visualization Lab with 2D and 3D display walls (located at the Ungar Building).
The University of Miami’s supercomputer, Pegasus, is a 350-node Lenovo cluster. Each node has two 8-core Intel Sandy Bridge E5-2670 (2.6 GHz) processors and 32 GB of 1600 MHz RAM (2 GB/core), for a total of over 160 TFlops. Connected with an FDR InfiniBand fabric, Pegasus was purpose-built for the style of data processing performed in biomedical research and analytics. In contrast with traditional supercomputers, where data flows along the slowest communication network (Ethernet), Pegasus was built on the principle that data needs to be on the fastest fabric possible. By using the low-latency, high-bandwidth IB fabric for data, Pegasus can access all three storage tiers (SSD, 15K RPM SAS, 7.2K NL-SAS) at high speed.
Unlike traditional HPC storage, the 150 TB /scratch filesystem is optimized for small random reads and writes, and can support over 125,000 sustained IOPS and 20 Gb/sec throughput at a 4 KB file size. Composed of over 500 15K RPM SAS disks, /scratch is ideal for the extremely demanding IO requirements of biomedical workloads.
For instances where even /scratch is not fast enough, Pegasus now has access to over 8 TB of burst buffer space clocked at over 1,000,000 IOPS. This buffer space gives biomedical researchers a good place for large file manipulation and transformation.
Along with the 350 nodes in the general processing queue, all researchers also have access to the 20 large-memory nodes in the bigmem queue. With access to the entire suite of software available on Pegasus, the bigmem queue provides large memory (256 GB) for work where parallelization is not an option. With 20 cores each, the bigmem servers provide an SMP-like environment well suited to biomedical research.
As many modern analysis tools require interaction, Pegasus allows ssh and graphical (GUI) access to programs using LSF. Tools ranging from MATLAB to KNIME and SAS to R are available to researchers in the interactive queue with full-speed access to /scratch and the W.A.D.E. storage cloud.
Pegasus is built from the following hardware and software components:
- Five racks of iDataPlex in iDataPlex racks
- One Standard Enterprise Rack for Networking and Management
- iDataPlex dx360 M4:
  - Qty (2) Intel Sandy Bridge E5-2670 (2.6 GHz)
  - 32 GB 1600 MHz RAM (2 GB/core)
  - Mellanox ConnectX-3 Single-Port FDR
- Mellanox FDR MSX6036
- DDN SFA 12k:
  - Qty (12) 3TB 7.2K RPM SATA (RAID 6 in 8+2)
  - Qty (360) 600GB 15K SAS (RAID 6 in 8+2)
  - Qty (10) 400GB MLC SSD (RAID 1 Pairs)
- xCAT 2.7.x
- Platform LSF
- RHEL 6.2 for Login/Management Nodes
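As a rough illustration of what the RAID layouts in the list above imply for usable space, the following Python sketch computes usable capacity for RAID 6 groups of 8 data plus 2 parity disks and for RAID 1 mirrored pairs. The function names are hypothetical, sizes are treated as decimal gigabytes, and filesystem overhead and hot spares are ignored, so the results are upper bounds rather than reported figures.

```python
def raid6_usable_gb(disk_count, disk_gb, data_disks=8, parity_disks=2):
    """Usable capacity of RAID 6 groups with `data_disks` + `parity_disks` members."""
    group_size = data_disks + parity_disks
    if disk_count % group_size:
        raise ValueError("disk count is not a whole number of RAID groups")
    groups = disk_count // group_size
    # Only the data disks in each group contribute usable space.
    return groups * data_disks * disk_gb


def raid1_usable_gb(disk_count, disk_gb):
    """Usable capacity of mirrored pairs: half of the raw space."""
    if disk_count % 2:
        raise ValueError("RAID 1 pairs need an even number of disks")
    return (disk_count // 2) * disk_gb


# Disk counts and sizes taken from the component list above.
sas_usable = raid6_usable_gb(360, 600)  # 36 groups of 8+2
ssd_usable = raid1_usable_gb(10, 400)   # 5 mirrored pairs

print(f"15K SAS tier: {sas_usable / 1000:.1f} TB usable of {360 * 600 / 1000:.1f} TB raw")
print(f"SSD tier: {ssd_usable / 1000:.1f} TB usable of {10 * 400 / 1000:.1f} TB raw")
```

With an 8+2 layout, 80% of the raw space in each group holds data; mirrored pairs yield 50%.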
Pegasus’ CPU Workhorse — The IBM iDataPlex dx360M4
| Component | Specification |
|---|---|
| Compute Nodes | 350 dx360 M4 Compute Nodes |
| Processor | Two 8-core Intel Sandy Bridge, 2.6 GHz scalar, 2.33 GHz* AVX |
| Memory | 32 GiB (2 GiB/core) using eight 4 GB 1600 MHz DDR3 DIMMs |
| Clustering Network | One FDR InfiniBand HCA |
| Management Network | Gigabit Ethernet NIC connected to the cluster management VLANs. IMM access shared through the eth0 port |
W.A.D.E. STORAGE CLOUD (Worldwide Advanced Data Environment)
At the heart of the Advanced Computing data services is the W.A.D.E. storage cloud, which currently provides over 6 PB of active data to the University of Miami research community, ranging from small spreadsheets in sports medicine research to multi-terabyte, high-resolution image files and NGS datasets.
An upgrade to the W.A.D.E. storage cloud is coming soon.
The W.A.D.E. storage cloud is composed of four DDN storage clusters running the GPFS filesystem. The combination of IBM’s industrial strength filesystem and DDN’s high performance hardware gives researchers at UM the flexibility to process data on Pegasus and share that data with anyone, anywhere.
By utilizing several file service gateways, researchers can share large data sets securely on campus between Mac, Windows, and Linux operating systems. Data can also be presented outside of the University of Miami in several high-performance fashions. In addition to the common protocols of SCP and SFTP, we also provide high-speed parallel access through bbcp and Aspera. You can even share your data using standard web access (httpd) through our integrated web and cloud client service.
All access to W.A.D.E. is provided through UM’s 10 Gb/sec Research Network internally and the UM Science DMZ externally. All Internet traffic flows through either the Science DMZ’s 10 Gb/sec I2 link through Florida LambdaRail or the Research Network’s 1 Gb/sec commercial Internet connection.
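To put the link speeds above in perspective, here is a minimal Python sketch that estimates the ideal time to move a dataset over a 10 Gb/sec or 1 Gb/sec link. The helper name and the 1 TB dataset size are illustrative assumptions, and the estimate assumes the full line rate with no protocol overhead or contention, so real transfers will take longer.

```python
def transfer_seconds(size_bytes, link_gbps):
    """Ideal transfer time at full line rate, ignoring protocol overhead."""
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 10**12  # decimal terabyte; an illustrative dataset size

for label, gbps in (("10 Gb/sec link", 10), ("1 Gb/sec link", 1)):
    minutes = transfer_seconds(ONE_TB, gbps) / 60
    print(f"{label}: about {minutes:.1f} minutes per TB at best")
```

At best, the tenfold difference in link speed means a transfer that takes minutes on the faster link takes hours on the slower one.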
VAULT SECURE STORAGE
The Vault secure storage service is designed to address the ongoing challenge of storing Limited Research Datasets. Built on enterprise-quality hardware with 24×7 support, Vault provides CTSI-approved researchers access to over 150 TB of usable redundant storage (300 TB raw). All data is encrypted according to U.S. Federal Information Processing Standards (FIPS). At rest, data is encrypted using AES encryption with 128-bit keys. In motion, all transfers are encrypted using FIPS 140-2 compliant AES with 256-bit keys. All data is encrypted and decrypted automatically on access.
The CCS Visualization Lab is a tool for all University of Miami students and faculty to present graphical and performance-intensive 2D and 3D simulations. With a direct connection to all CCS resources, the Viz Lab is well suited to high-performance parallel visualization, data exploration, and other advanced 2D and 3D simulations. First-time use of this space requires an orientation with the CCS support team. For more information about orientation and reservations, visit the Visualization Lab page.
With the 2D display wall, users can present their work at large scale while examining details at a granular level. The 2D display is composed of ten 55-inch thin-bezel Planar LCD panels spanning 22 ft, forming an ultra-wide 21-megapixel display that supports a resolution of 9600×2160.
The 3D display wall supports stereoscopic 3D, for users looking to captivate audiences with something a little more eye-popping or simply looking to add depth to their work. It is composed of four 46-inch ultra-thin LCD Planar panels and supports resolutions up to 5120×2880.
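The resolution figures for the two display walls can be checked with a little arithmetic. The Python sketch below computes total megapixels and the per-panel resolution each wall would have under an assumed grid layout. The 5×2 and 2×2 panel arrangements are assumptions for illustration, since the text gives only panel counts, and the function names are hypothetical.

```python
def wall_megapixels(width_px, height_px):
    """Total pixel count of a display wall, in megapixels."""
    return width_px * height_px / 1_000_000


def per_panel_resolution(width_px, height_px, cols, rows):
    """Resolution of each panel if the wall is an even cols x rows grid."""
    if width_px % cols or height_px % rows:
        raise ValueError("wall resolution does not divide evenly into the grid")
    return width_px // cols, height_px // rows


# Wall resolutions from the text; grid layouts are assumed, not documented.
print(wall_megapixels(9600, 2160))             # 2D wall, ten panels
print(per_panel_resolution(9600, 2160, 5, 2))  # assumed 5 columns x 2 rows
print(wall_megapixels(5120, 2880))             # 3D wall, four panels
print(per_panel_resolution(5120, 2880, 2, 2))  # assumed 2 columns x 2 rows
```

Under these assumed layouts, the 2D wall works out to 1920×1080 per panel and about 20.7 megapixels in total, consistent with the rounded 21-megapixel figure.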
SPS SECURE PROCESSING SERVICE – beta
Our most secure data processing offering is SPS, currently in beta. SPS is designed for secure access to extremely sensitive data sets including PHI. In addition to the security protocols used in the Vault data services, SPS requires additional administrative action for the certified placement and/or destruction of data. Advanced Computing staff (all CITI trained and IRB approved) act as data managers for several federal agencies including NSF, NIH, DoL, DoD, and VA projects.
Once our staff has loaded and secured your data, you can remotely access one of the SPS servers (either Windows or Linux), which has access to the most common data analytic tools including R, SAS, MATLAB, and Python. Additional tools are available on request. SPS provides 50 TB of highly secure redundant storage (100 TB raw).
HPC Core Expertise
The HPC team has in-depth experience in various scientific research areas with extensive experience in parallelizing or distributing codes written in Fortran, C, Java, Perl, Python and R. The team is active in contributing to Open Source software efforts including: R, Python, the Linux Kernel, Torque, Maui, XFS and GFS. The team also specializes in scheduling software (LSF) to optimize the efficiency of the HPC systems and adapt codes to the CCS environment. The HPC core has expertise in parallelizing code using both MPI and OpenMP depending on the programming paradigm. CCS has contributed several parallelization efforts back to the community in projects such as R, WRF, and HYCOM.
The core specializes in implementing and porting open source codes to CCS’ environment and often contributes changes back to the community. CCS currently supports more than 300 applications and optimized libraries on its computing environment. The core personnel are experts in implementing and designing solutions in the three different variants of Unix. CCS also maintains industry research partnerships with IBM, Schrodinger, Open Eye, and DDN.
Software on the Pegasus Cluster
CCS continually updates applications, compilers, system libraries, etc. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, CCS uses the modules utility. At login, modules commands set up a basic environment for the default compilers, tools, and libraries, including the $PATH, $MANPATH, and $LD_LIBRARY_PATH environment variables. Available software modules can be viewed on the CCS Portal Software Modules page, including description, version, and update date.
CCS’ Computational Biology and Bioinformatics Program (CBBP) was established to conduct research and offer services and training in the management and analysis of biological and medical/health record data. The program’s mission is to spearhead bioinformatics capacity at the University of Miami for all biological and medical applications. This includes data management, data mining, and data analysis capacities. The CBBP aims to achieve this mission through infrastructure, education, and expertise. In particular, CBBP provides an online portal for bioinformatics databases and web tools, and offers a number of data analysis services. CBBP is also leading educational and training initiatives in bioinformatics analysis, and supporting these activities with high-impact bioinformatics research.
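One way to see what the modules setup has done to a session is to list the entries of the environment variables named above. The following Python sketch does only that; it does not call the modules utility itself, and the helper name is hypothetical.

```python
import os


def path_entries(name, env=None):
    """Return the non-empty entries of a PATH-style environment variable."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    # PATH-style variables separate entries with os.pathsep (":" on Linux).
    return [entry for entry in value.split(os.pathsep) if entry]


for variable in ("PATH", "MANPATH", "LD_LIBRARY_PATH"):
    entries = path_entries(variable)
    print(f"{variable}: {len(entries)} entries")
    for entry in entries:
        print(f"  {entry}")
```

Running it before and after loading a module shows which directories that module added to each variable.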
Computational Biology & Bioinformatics
The team provides data analysis training and expertise at three levels (consulting, preliminary data generation, and full collaboration), based on the time and complexity of the service requested. The analyses are undertaken by skilled analysts and overseen by experienced faculty. The group has been working mostly with microarray data and next generation sequencing data, and the analytical services include, but are not limited to, the following:
- gene expression analysis for transcriptome profiling and/or gene regulatory network building
- prognostics and/or diagnostic biomarker discovery
- microRNA target analysis
- copy number variant analysis; in this context we are testing the few existing algorithms and developing new ones for accurate and unambiguous discovery of copy number variation in the human genome
- genome or transcriptome assembly from next generation sequencing data, and its visualization
- SNP functionality analysis
- other projects include merging or correlating data from various data types for a holistic view of a particular pathway or disease process.
Big Data Analytics & Data Mining
CCS’ Big Data Analytics & Data Mining research group provides advanced data mining expertise and capabilities to further explore high-dimensional data. The following are examples of the expertise areas covered by our team:
- Classification, which appears in essentially every subject area that involves collection of data of different types, such as disease diagnosis based on clinical and laboratory data. Methods include regression (linear and logistic), artificial neural nets (ANN), k-nearest neighbors (KNN), support vector machines (SVM), Bayesian networks, decision trees, and others.
- Clustering, which is used to partition the input data points into mutually similar groupings, such that data points from different groups are not similar. Methods include k-means, hierarchical clustering, and self-organizing maps (SOM), and are often accompanied by space decomposition methods that offer low-dimensional representations of a high-dimensional data space. Methods of space decomposition include principal component analysis (PCA), independent component analysis (ICA), multidimensional scaling (MDS), Isomap, and manifold learning. Advanced topics in clustering include multifold clustering, graphical models, and semi-supervised clustering.
- Association data mining, which finds frequent combinations of attributes in databases of categorical attributes. The frequent combinations can be then used to develop prediction of categorical values.
- Analysis of sequential data, which mostly involves biological sequences and includes such diverse topics as extraction of common patterns in genomic sequences for motif discovery, sequence comparison for haplotype analysis, alignment of sequences, and phylogeny reconstruction.
- Text mining, particularly in terms of extracting information from published papers, thus transforming documents to vectors of relatively low dimension to enable the use of data mining methods mentioned above.
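As a small illustration of one classification method named in the list above, the following Python sketch implements k-nearest-neighbor voting using only the standard library. The data points and labels are made up for the example, and this is a teaching sketch, not the group's own tooling.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    `train` is a list of (features, label) pairs; `query` is a feature tuple.
    """
    if not 0 < k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-feature measurements with hypothetical class labels.
training_data = [
    ((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((0.8, 1.0), "A"),
    ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.9, 3.0), "B"),
]

print(knn_predict(training_data, (1.1, 1.0)))
print(knn_predict(training_data, (3.0, 3.0)))
```

An odd k with two classes avoids tied votes; in practice, features are usually scaled first so that no single measurement dominates the distance.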
The Visualization program conducts both theoretical and applied research in the general areas of Machine Vision and Learning, and specifically in:
- Computer Vision and Image Processing
- Machine Learning
- Biomedical Image Analysis
- Computational Biology and Neuroscience
The goal is to provide expertise in this area to develop novel fully automated methods that can provide robustness, accuracy and computational efficiency. The program works towards finding better solutions to existing open problems in the above areas, as well as exploring different scientific fields where our research can provide useful interpretation, quantification and modeling.
CCS has a sophisticated cheminformatics and compute infrastructure with a significant level of support from the institution. CCS facilitates scientific interactions and enables efficient research using informatics and computational approaches. A variety of departments and centers at the University use high-content and high-throughput screening approaches, including The Miami Project to Cure Paralysis, the Diabetes Research Institute, the Sylvester Comprehensive Cancer Care Center, Bascom Palmer Eye Institute, the Department of Surgery, and the John P. Hussman Institute for Human Genomics.
Cheminformatics and computational chemistry tools, running on the HPC Linux cluster and a high-performance application server, include:
- SciTegic Pipeline Pilot: visual workflow-based programming environment (data pipelining); broad cheminformatics, reporting/visualization, and modeling capabilities; integration of applications, databases, algorithms, and data.
- Leadscope Enterprise: integrated cheminformatics data mining and visualization environment; unique chemical perception (~27K custom keys; user extensions); various algorithms, HTS analysis, SAR/R-group analysis, data modeling.
- ChemAxon Tools and Applications: cheminformatics software applications and tools; wide variety of cheminformatics functionality.
- Spotfire: highly interactive visualization and data analysis environment, various statistical algorithms with chemical structure visualization, HTS and SAR analysis.
- OpenEye ROCS, FRED, OMEGA, EON, etc., implemented on the Linux cluster: suite of powerful applications and toolkits for high-throughput 3D manipulation of chemical structures, modeling of shape, electrostatics, protein-ligand interactions, and various other aspects of structure- and ligand-based design; also includes powerful cheminformatics 2D structure tools.
- Schrödinger Glide, Prime, MacroModel, and various other tools, implemented on the Linux cluster: powerful state-of-the-art docking, protein modeling, and structure prediction tools and visualization.
- Desmond, implemented on the Linux cluster: powerful state-of-the-art explicit-solvent molecular dynamics.
- TIP workgroup: powerful environment for global analysis of protein structures, binding sites, and binding interactions; implements automated homology modeling, binding site prediction, and structure and site comparison for amplification of known protein structure space.