Interested in setting up DataHub for your class?
DataHub creates on-demand, cloud-based Jupyter notebook and RStudio servers, which form the technical infrastructure for Data 8 and related courses.
The main DataHub deployment is at datahub.berkeley.edu. In addition, several other hub archetypes serve the diverse instructional needs of Berkeley instructors.
Jupyter Notebook Examples
Notebook Example from City Planning 88
Introductory notebook in Python exploring the concept of node-based networks in the connector course City Planning 88, taught by Marta Gonzalez.
Notebook Example from Political Science 3
Introductory notebook in R exploring the question of whether politicians racially discriminate against their constituents, from the Introduction to Empirical Analysis and Quantitative Methods course taught by David Broockman.
Notebook Example from Biology 1B
Notebook in Python analyzing data collected from the North and South Forks of Strawberry Creek in the Evolution, Ecology, and Organizational Diversity course.
Notebook Example from Ethnic Studies 21AC
Notebook in Python analyzing incarceration trends and the impacts of prison realignment in California, part of the Ethnic Studies modules taught by Victoria E Robinson.
Hub Archetypes
DataHub
DataHub provides standard computing infrastructure to many foundational courses across diverse disciplines. Instructors who want to run Jupyter-based workflows use DataHub. It also provides Python package management and storage solutions that cater to the instructional requirements of many introductory data science courses.
R Hub
R Hub provides standard computing infrastructure to instructors using R-based tools (RStudio IDE, Jupyter R). R Hub is widely used by instructors teaching quantitative social science courses. Fun fact: the infrastructure team at Berkeley made a major contribution to the JupyterHub ecosystem by adding RStudio to the standard offering, which improved access for instructors who teach with R.
Biology Hub
Biology Hub is a compute-intensive infrastructure tailored to the needs of instructors in Biology and Genomics. The hub provides additional compute to support complex data science use cases that require large datasets. For example, it supports compute-intensive workflows for analyzing large genome sequencing datasets.
Stat 159 Hub
Stat 159 Hub is an innovation hub tailored to the needs of the Stat 159 course taught by Fernando Perez. One of the objectives for this hub is to make it a "home away from home" for students enrolled in the course. Students use the hub like their local setup and make use of some of the advanced DataHub use cases, including a remote Linux desktop environment, secure access to GitHub, Dropbox-like file sharing, real-time collaboration, and real-time file sharing.
DataHub Principles
Inclusion
DataHub is built with the principle of inclusion in mind. Any instructor, irrespective of their domain, can expose their students to data science workflows using DataHub.
Accessibility
DataHub completely removes the dependency on a student's local desktop configuration for running data science workflows. DataHub provides the required infrastructure, including storage and compute, in an equitable manner for all students.
Open Source
DataHub is built with an open-source ethos in mind. DataHub is completely free of cost, and no licensing is required for instructors or students to access the infrastructure. In addition, the team behind DataHub has a strong connection with the open-source ecosystem, including the Jupyter ecosystem.
Scalability
DataHub was initially piloted in Spring 2017 in a small Data 8 classroom of 50+ students. As of the start of the Spring 2022 semester, DataHub supports 1,500+ students enrolled in Data 8. The infrastructure's ability to handle the growth in Data 8 is a testament to its scalability.
DataHub Metrics
4000+
daily active users
13,000+
monthly active users
200+
cumulative years spent using DataHub since 2019
$5
cloud costs per student per semester across all hubs
UC Berkeley Usage of DataHub
35
109
16127
in the 2023-2024 Academic Year