Publications

Automated Transcription Service: Enabling Transcription at Scale for Social Science Research

Abstract: "Many social scientists collect and analyze qualitative data, often originating in audio recordings drawn from interviews, focus groups, and other face-to-face settings. A longstanding challenge for these researchers has been generating a text transcript of these audio files for use in analysis. Recent advances in artificial intelligence (AI) have transformative potential, but concerns about quality, cost, language support, security, and usability represent a significant barrier to individual researchers. The automated transcription service (ATS) is a completely serverless cloud solution that enables a managed service for social science researchers. ATS provides an easy-to-use service that converts audio in multiple languages into…

Date: Jul 17, 2024

Type: Conference Paper

A Model for Managing Multi-Tenant Research Databases

The Research Database Complex (RDC) at Indiana University offers a cost-effective, secure, and reliable database-as-a-service (DBaaS) for researchers, alleviating the challenges and expenses of managing relational databases. This paper explores the history of RDC and its evolution through collaboration between IU’s Research Technologies and Enterprise Systems.

Date: Jul 21-25, 2024

Type: Conference Paper

The Coded Language of Empire: Digital History, Archival Deep-Dives and US Imperialism in Cuba's Third War of Independence

Kalani Craig, Arlene J. Díaz, and David Kloster (Indiana Univ. Bloomington) develop and deploy what they term Mixed-Method Approaches to Collaborative History (MMATCH) that blends more traditional close readings with digital tools including computational text analysis to explore the language of empire and the struggles for Cuban independence from 1895 to 1898 from both American and Cuban perspectives. They also reflect on what it means to undertake a collaborative historical research project with nontraditional methods, foregrounding the importance of overlapping interpretative dialogue with each other around sources and methods for successfully realizing their project. Jo Guldi (Emory Univ.) in “Text Mining for Historical Analysis”…

Date: Jun 2024

Type: Journal Article

Evaluating Return on Investment for Cyberinfrastructure Using the International Integrated Reporting <IR> Framework

This paper investigates the return on investment (ROI) in cyberinfrastructure (CI) facilities and services by comparing the value of end products created to the cost of operations. We assessed the cost of a US CI facility called XSEDE and the value of the end products created using this facility, categorizing end products according to the International Integrated Reporting Framework. The US federal government invested approximately $0.3B in operating the XSEDE ecosystem from 2016–2022. The estimated value of end products facilitated by XSEDE ranges from around $4.7B to $22.7B or more. Credit for the majority of these end products is shared among various contributors, including the XSEDE ecosystem. Granting the XSEDE ecosystem a seemingly…

Date: May 2024

Type: Journal Article

Scholarly Data Share 2.0: Granular Access to Research Data

The Scholarly Data Share (SDS) is a lightweight web interface that facilitates access to large, curated research datasets in long-term storage. The first version, SDS 1.0, facilitated sharing public datasets without access restrictions. The new version, SDS 2.0, provides controlled access to datasets at various stages in the research data life cycle. This update enables granular and customizable access control for a variety of research domains and use cases. In this paper, we discuss the features, implementation, and use cases of SDS 2.0, and outline our plans for future enhancements to the service.

Date: Jul 2023

Type: Conference Paper

Parallel Software for Million-scale Exact Kernel Regression

We present the design and the implementation of a kernel principal component regression software that handles training datasets with a million or more observations. Kernel regressions are nonlinear and interpretable models that have wide downstream applications, and are shown to have a close connection to deep learning. Nevertheless, the exact regression of large-scale kernel models using currently available software has been notoriously difficult because it is both compute and memory intensive and it requires extensive tuning of hyperparameters.

Date: Jun 2023

Type: Journal Article

Use of Accounting Concepts to Study Research: Return on Investment in XSEDE

This paper uses accounting concepts—particularly the concept of Return on Investment (ROI)—to reveal the quantitative value of scientific research pertaining to a major US cyberinfrastructure project (XSEDE—the eXtreme Science and Engineering Discovery Environment). XSEDE provides operational and support services for advanced information technology systems, cloud systems, and supercomputers supporting non-classified US research, with an average budget for XSEDE of US$20M+ per year over the period studied (2014–2021). To assess the financial effectiveness of these services, we calculated a proxy for ROI, and converted quantitative measures of XSEDE service delivery into financial values using costs for service from the US marketplace. We…

Date: Feb 2023

Type: Journal Article

HPC Data Analysis Pipeline for Neuronal Cluster Detection

Obtaining neural clusters from data sets collected over different developmental stages poses a computational challenge that is complicated by the number of data sets, clustering methods, and hyperparameters. We used the MATLAB parallel toolkit to parallelize the execution of the hyperparameter sweeps and developed a workflow for parallelizing the data processing. We present a run-time performance comparison of the workflow for two clustering methods on the Stampede2 supercomputer. Our study explored the performance of MATLAB implementations of the K-means and Louvain algorithms for cluster detection, using covariance and cosine similarity matrices, and investigated hyperparameter settings for each algorithm.

Date: Jul 2022

Type: Poster

Cyberinfrastructure value: a survey on perceived importance and usage

The research landscape in science and engineering is heavily reliant on computation and data storage. The intensity of computation required for many research projects illustrates the importance of the availability of high performance computing (HPC) resources and services. This paper summarizes the results of a recent study among principal investigators that attempts to measure the impact of the cyberinfrastructure resources allocated by the XSEDE (eXtreme Science and Engineering Discovery Environment) project to various research activities across the United States. Critical findings from this paper include: a majority of respondents report that the XSEDE environment is important or very important in completing their funded work, and…

Date: Jul 2022

Type: Conference Paper

Institutional Value of a Nobel Prize

The Nobel Prize is awarded each year to individuals who have conferred the greatest benefit to humankind in Physics, Chemistry, Medicine, Economics, Literature, and Peace, and is considered by many to be the most prestigious recognition for one’s body of work. Receiving a Nobel Prize confers a sense of financial independence and significant prestige, vaulting its recipients to global prominence. Apart from the prize money (approximately US$1,145,000), a Nobel laureate can expect to benefit in a number of ways, including increased success in securing grants, wider adoption and promulgation of one’s theories and ideas, increased professional and academic opportunities, and, in some cases, a measure of celebrity. A Nobel laureate’s affiliated…

Date: Jul 2022

Type: Conference Paper

Return on Investment in Research Cyberinfrastructure: State of the Art

“What is the Return On Investment (ROI) for a cyberinfrastructure system or service?” seems like a natural question to ask. Existing literature shows strong evidence of good return on investment in cyberinfrastructure. This paper summarizes key points from historical studies of ROI in cyberinfrastructure for the US research community. In so doing, we can draw new conclusions based on existing studies. A wide variety of studies show that many types of important “returns” increase in response to more investment in or use of advanced cyberinfrastructure facilities. Published analyses show a positive (>1) ROI for investment in cyberinfrastructure by higher education institutions and federal funding agencies.

Date: Jul 2022

Type: Conference Paper

Metrics of Financial Effectiveness: Return On Investment in XSEDE...

This paper explores the financial effectiveness of a national advanced computing support organization within the United States (US) called the eXtreme Science and Engineering Discovery Environment (XSEDE). XSEDE was funded by the National Science Foundation (NSF) in 2011 to manage delivery of advanced computing support to researchers in the US working on non-classified research. In this paper, we describe the methodologies employed to calculate the return on investment (ROI) for governmental expenditures on XSEDE and present a lower bound on the US government’s ROI for XSEDE from 2014 to 2020. For each year of the XSEDE project considered, XSEDE delivered measurable value to the US that exceeded the cost incurred by the Federal Government…

Date: Jul 2022

Type: Conference Paper

Data Management Workflows in Interdisciplinary Highly Collaborative Research

Data curation is an important aspect in research projects. Effective data management is critical for data curation, and it not only contributes to the success of projects but makes research outputs findable, accessible, interoperable and reusable. We have examined interdisciplinary highly collaborative research (IHCR) practices in selected projects to propose data management workflows. This synopsis of work in progress discusses one of these workflows that helps locate information when there are multiple collaborators, and the digital assets are spread across multiple storage systems and institutions.

Date: Jul 2022

Type: Poster

Scholarly Data Share: A Model for Sharing Big Data in Academic Research

The Scholarly Data Share (SDS) is a lightweight web interface that facilitates access to large, curated research datasets stored in a tape archive. SDS addresses the common needs of research teams working with and managing large and complex datasets, and the associated storage. In this paper, we describe the development of the SDS and the implementation of an instance to provide access to a large collection of geospatial datasets.

Date: Jul 2022

Type: Conference Paper

CADRE: A Cloud-Based Data Service for Big Bibliographic Data

Large bibliographic data sets hold the promise of revolutionizing the scientific enterprise when combined with state-of-the-science computational capabilities. Providing high-quality data services for large network datasets such as the Microsoft Academic Graph, which contains more than two billion citation links, poses significant difficulties for universities. Data systems based on the property graph model are capable of delivering efficient graph query services for large networks. However, real-life queries often combine multiple types of data models. To satisfy the needs of different user groups, we developed and deployed a cloud-based data system consisting of scalable graph and text-indexed query engines. For non-expert users, the…

Date: Aug 2021

Type: Conference Paper

Service Provisioning through High Level, Complexity Hiding Interfaces

Over the past decade, cyberinfrastructure communities like XSEDE have substantially fostered and enriched knowledge discovery by scholars, researchers, and engineers from a variety of domains by enabling access to advanced computing systems, where continuing support for classic packages and parallel computing frameworks (e.g., MPI and OpenMP) has been well established. However, with the rise of the "Big Data" era, there is an ever-increasing demand from the user community to run sophisticated, state-of-the-art distributed frameworks that handle various data-related tasks. Examples include Hadoop and Spark for data processing and analytics, Cassandra and Redis for scalable on-disk and in-memory data stores, Apache Airflow for distributed…

Date: Dec 2020

Type: Conference Paper

FutureWater Indiana: A science gateway for spatio-temporal modeling of water in Wabash basin

In this manuscript, we describe the FutureWater Science Gateway, which simulates regional watersheds spatially and temporally to derive hydrological changes due to changes in critical effectors such as climate, land use and management, and soil conditions. We also discuss the gateway design, creation, and production deployment and how the resulting data is organized and explored. The FutureWater gateway is built based on the Apache Airavata gateway middleware framework and hosted under the SciGaP project at Indiana University. The gateway provides an integrated infrastructure for simulations based on parallelized Soil and Water Assessment Tool (SWAT) and SWAT-MODFLOW software execution on Extreme Science and Engineering Discovery…

Date: Jul 2020

Type: Conference Paper

Parallelized Topological Relaxation Algorithm

Geometric problems of interest to mathematical visualization applications involve changing structures, such as the moves that transform one knot into an equivalent knot. In this paper, we describe mathematical entities (curves and surfaces) as link-node graphs, and make use of energy-driven relaxation algorithms to optimize their geometric shapes by moving knots and surfaces to their simplified equivalence. Furthermore, we design and configure parallel functional units in the relaxation algorithms to accelerate the computation these mathematical deformations require. Results show that we can achieve significant performance optimization via the proposed threading model and level of parallelization.

Date: Dec 2019

Type: Conference Paper

A Lightweight Framework for Research Data Management

We describe a framework for managing live research data involving two major components. First, a system for the scalable scheduling and execution of automated policies for moving, organizing, and archiving data. Second, a system for managing metadata to facilitate curation and discovery with minimal change to existing workflows. Our approach is guided by four main principles: 1) to be non-invasive and to allow for easy integration into existing workflows and computing environments; 2) to be built on established, cloud-aware, open-source tools; 3) to be easily extensible and configurable, and thus, adaptable to different academic disciplines; and 4) to integrate with and take advantage of infrastructure and services available on academic…

Date: Jul 2019

Type: Conference Paper

A Computational Notebook Approach to Large-scale Text Analysis

Large-scale text analysis algorithms are important to many fields as they interrogate reams of textual data to extract evidence, correlations, and trends not readily discoverable by a human reader. Unfortunately, there is often an expertise mismatch between computational researchers who have the technical and programming skills necessary to develop workflows at scale and domain scholars who have knowledge of the literary, historical, scientific, or social factors that can affect data as it is manipulated. Our work focuses on the use of scalable computational notebooks as a model to bridge the accessibility gap for domain scholars, putting the power of HPC resources directly in the hands of the researchers who have scholarly questions. The…

Date: Jul 2018

Type: Conference Paper

A High Performance Photogrammetry for Academic Research

Photogrammetry is the process of computationally extracting a three-dimensional surface model from a set of two-dimensional photographs of an object or environment. It is used to build models of everything from terrains to statues to ancient artifacts. In the past, the computational process was done on powerful PCs and could take weeks for large datasets. Even relatively small objects often required many hours of compute time to stitch together. With the availability of parallel processing options in the latest release of state-of-the-art photogrammetry software, it is possible to leverage the power of high performance computing systems on large datasets. In this paper we present a particular implementation of a high performance…

Date: Jul 2018

Type: Conference Paper