DATA SCIENCE
Scientific progress increasingly depends on the ability to manage, analyze, and interpret data at scales that strain conventional methods. INSTAR's data science research addresses statistical rigor, pipeline reproducibility, and the infrastructure choices that determine whether a finding holds under scrutiny — questions that matter to grant reviewers, peer reviewers, and federal program officers evaluating research credibility.
Statistical Learning
Extracting defensible conclusions from noisy, heterogeneous scientific data requires more than off-the-shelf machine learning. INSTAR examines statistical learning methodology — Bayesian inference, causal discovery, and nonparametric estimation — with attention to uncertainty quantification and the conditions under which findings generalize. Reproducibility is a first-class concern: we study how methodological choices propagate into published results.
Data Engineering
Scientific data collection now routinely outpaces the infrastructure capacity of individual laboratories. INSTAR researches scalable pipeline design, distributed query architectures, and storage-compute tradeoffs for large-scale scientific datasets — including the provenance and version-control mechanisms that allow pipelines to be audited, reproduced, and extended by independent researchers.
Machine Learning Pipelines
INSTAR studies the design of end-to-end ML systems, from feature engineering, model selection, and hyperparameter search through post-deployment monitoring. A particular interest is making these pipelines accessible and interpretable to domain scientists who are not machine learning specialists. Reducing the barrier between domain knowledge and computational capability is a public-benefit research goal with broad implications across health, energy, and environmental science.
Data Visualization
High-dimensional scientific data resists standard display. INSTAR explores visual analytics approaches that support genuine exploratory analysis rather than presentation-only graphics — interactive representations that let researchers interrogate structure, identify outliers, and formulate hypotheses across domains including genomics, earth observation, and materials characterization.
Natural Language Analytics
The scientific literature grows faster than any researcher can track. INSTAR investigates information extraction, entity recognition, and relationship mining across scientific text corpora — with the goal of surfacing emerging research directions, mapping interdisciplinary connections, and identifying knowledge gaps that human review alone would miss. The approach complements INSTAR's interdisciplinary consortium model.
Geospatial Analytics
Geospatial data — satellite imagery, sensor networks, GPS telemetry, and remotely sensed environmental measurements — is now available at scales and resolutions that create both analytical opportunity and methodological challenge. INSTAR examines spatial statistical methods, multi-source data fusion, and temporal analysis approaches applicable to environmental monitoring, public health, and land-use research. Researchers interested in this intersection are encouraged to explore the INSTAR Fellowship at /fellowship/.
Grounded in Open Data
INSTAR's data science research is built on public datasets that span government statistics, economic indicators, and machine learning benchmarks. Grounding our methods in open, well-documented sources allows our findings to be independently replicated and our pipelines to be audited by peer researchers.
Data.gov
The U.S. federal open data catalog provides public-domain datasets across health, environment, finance, and infrastructure — a primary source for testing data engineering pipelines and statistical learning methods in real-world conditions.
U.S. Census Data
American Community Survey and decennial census microdata support INSTAR's geospatial analytics and social indicator research, providing large-scale population datasets, collected at regular intervals, for spatial and temporal modeling.
FRED (St. Louis Fed)
Federal Reserve Economic Data provides over 800,000 U.S. and international time series used by INSTAR to benchmark econometric forecasting models and validate data pipeline reproducibility at scale.
UCI ML Repository
The UC Irvine Machine Learning Repository provides canonical benchmark datasets for evaluating statistical learning algorithms across classification, regression, and clustering tasks in INSTAR research programs.
For Researchers
Join the INSTAR Fellowship
The INSTAR Fellowship is an open citizen-scientist program with no minimum degree requirement; selection is based on fit with our research culture. The program offers structured mentorship, interdisciplinary scope, and the freedom to pursue hard problems.