Recognizing trends in school and school district performance helps school administrators identify and benchmark areas for improvement. In practice, however, little data is gathered or analyzed. Data often sits in silos and cannot be shared comprehensively with a public audience. Modern Data Science approaches to visualizing data and predicting school performance have not been broadly adopted. Limited time to review school data and gaps in assessment literacy are also key barriers.

These factors limit our ability to introduce research-based interventions that improve the quality of educational systems and boost student outcomes. A comprehensive data-driven approach is especially important in periods of educational reform. Recent whitepapers from the Brookings Institution, UNESCO, the Center for Data Innovation and the US Department of Education describe the benefits and challenges of moving towards data-driven educational systems.


We are building an open source dataset of key school metrics for the 20 largest School Districts in the United States (n ~ 400 high schools). We use several Data Science tools for understanding big data (correlation matrices, histograms, standard deviations, percentiles, trend lines, etc.) and follow this analysis with proof-of-concept Machine Learning models (various forms of regression) trained on this dataset to evaluate the predictive power of key metrics. We use Data Science to mine the data and surface trends that are too messy and scattered to see with the naked eye.
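As a rough illustration of the exploratory tools named above, the sketch below computes summary statistics (mean, standard deviation, quartiles) and a Pearson correlation matrix using only the Python standard library. The column names and all numbers are made-up placeholders, not values from the project's dataset, and the helper functions (`pearson`, `summarize`, `correlation_matrix`) are hypothetical names chosen for this example.

```python
import math
import statistics

# Hypothetical, made-up values for five schools; NOT real district data.
schools = {
    "student_teacher_ratio": [14.0, 16.0, 18.0, 20.0, 22.0],
    "free_reduced_lunch_pct": [20.0, 35.0, 50.0, 65.0, 80.0],
    "graduation_pct": [95.0, 91.0, 88.0, 84.0, 80.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarize(values):
    """Mean, sample standard deviation, and quartile cut points."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def correlation_matrix(table):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


summary = {name: summarize(values) for name, values in schools.items()}
corr = correlation_matrix(schools)

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
for name, row in corr.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

With real data, a dataframe library would likely be more convenient; the standard library is used here only to keep the sketch dependency-free.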

The following quantitative variables will be fed into our Machine Learning models: (a) Total Student Enrollment; (b) Total Expenditure per Pupil [USD]; (c) Student to Teacher Ratio; (d) Free and Reduced Lunch %; (e) Students in AP Classes %; (f) Number of AP Classes offered; (g-i) Students exceeding state proficiency standards in English %, Math % and Science %; (j) Attendance %; (k) Graduation %; (l) Dropout %; and (m) Average Teacher Salary [USD].
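One simple way to compare the predictive power of individual variables is to fit a one-variable least-squares line for each predictor and compare the resulting R² values. The sketch below does this with the Python standard library. It is a stand-in for the "various forms of regression" mentioned above, not the project's actual models: the choice of Graduation % as the target, the two predictors, and all numbers are assumptions made up for illustration, and `fit_line` and `r_squared` are hypothetical helper names.

```python
import statistics

# Hypothetical, made-up values for six schools; NOT real district data.
predictors = {
    "attendance_pct": [88.0, 90.0, 92.0, 94.0, 96.0, 98.0],
    "expenditure_per_pupil_usd": [9000.0, 15000.0, 11000.0, 10000.0, 14000.0, 12000.0],
}
# Assumed target variable for this sketch.
graduation_pct = [78.0, 82.0, 85.0, 88.0, 93.0, 95.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line (in-sample)."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Fit one line per predictor and record how well it explains the target.
scores = {}
for name, xs in predictors.items():
    slope, intercept = fit_line(xs, graduation_pct)
    scores[name] = r_squared(xs, graduation_pct, slope, intercept)

# Predictors ordered from most to least explanatory on this toy data.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

These R² values are computed on the same rows used for fitting, so they describe association in the sample rather than out-of-sample accuracy; a held-out set or cross-validation would be needed for the latter.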

This analysis will be simplified into a whitepaper and interactive webpage to communicate this information to both professional and lay audiences. This data analysis will be made available online for free. Expected launch date: February 2019.

Sequence of major upcoming projects:
  • Exploratory Data Analysis (EDA)
  • Meta-analysis of the largest 20 School Districts in the United States
  • Begin constructing Machine Learning models (ML)
  • Launch of open-access publication and other community tools
  • School Success Data Network: Identify characteristic features of prominent successful school models


This grassroots initiative aims to use the power of big data to examine which variables correlate most strongly with school performance and to display their respective predictive powers. These educational data science models can help point to future educational research and policy directions that merit more detailed study. This tool can be helpful for school superintendents, principals, educational researchers, policymakers and other groups.

This is purely a correlative research study intended to promote model-building; we do not make any statements on causation. Further research is needed to evaluate the efficacy of specific interventions and how improvements in these metrics can drive better student outcomes.

This initiative aims to serve as a foundational ‘first step’ that illustrates the value of open educational datasets and big data in education, and that advocates for a collaborative model of educational research.
