UC Berkeley's New Data Science Newsletter #1

May 5th, 2020

Introduction

If you are receiving this newsletter, you have at some point contacted UC Berkeley’s Division of Computing, Data Science, and Society. In an effort to build a more informed data science community, we have decided to publish a quarterly newsletter informing teachers and professors of pedagogical opportunities and techniques in data science. We hope this is a fun space to engage with data science and learn a thing or two along the way!

Please share this newsletter with your colleagues! The goal is to spread the word on the amazing work that instructors across the country are doing to advance data science education for all undergraduates.

Upcoming Summer Workshop
New Tech Guide
Otter Autograding
School Spotlights
- The University of Illinois Urbana-Champaign
- Boise State University
Join Us On Slack

Upcoming Summer Workshop

We will be hosting our third annual National Workshop on Data Science Education this summer. Due to the current COVID-19 pandemic, we have decided to host this summer’s workshop online. Participation in the workshop is free for attendees.

This workshop will take place from June 22 to 26, 2020 and is organized by UC Berkeley's Division of Computing, Data Science, and Society with support from Microsoft and the West Big Data Innovation Hub.

The workshop will consist of a combination of pre-recorded talks, interactive workshops, and Q&A sessions. We will publish a set of videos covering material that the in-person workshop would have included. We will also share supplementary materials on topics such as Jupyter infrastructure, Data 8 teaching guides, Data 100, and Data Science Modules and Connectors. Additionally, experts will answer questions during AMA office hours, and we will host a series of live webinars featuring panel discussions on topics that involve multiple stakeholders.

We hope that this format allows you to participate in the workshop at your own pace, while still being able to participate in live sessions and interact with other data science educators. Full event information will be available soon.

Conference topics include:

What it is like to develop and run an introductory foundation of data science course
Development of the affiliated connectors and modules in social, physical, and professional fields
Shared experiences of previous participants in implementing data science courses
Logistics of deploying a JupyterHub
Automatic grading solutions for JupyterHub pedagogy materials

Please fill out the application at this link if you would like to participate in the workshop. We will be sending out registration and login information at a later date. Conference updates will be posted on the website. Feel free to contact us at ds-help@berkeley.edu if you have any questions.

The New Tech Guide Is Here

We have a new tech guide for educators creating data science courses, adapted from the previous zero-to-data-8 guide! This online resource serves as a guide for professors and instructors who wish to adopt a data science classroom environment. Its suggestions are based on UC Berkeley’s experience teaching introductory data science courses (such as Data 8) on campus since 2015. For pedagogical and curriculum considerations, there is a separate guide here.

This guide is written with all levels of technical understanding in mind, so it should be useful even if you are new to creating a computational course.

Otter Autograding

Berkeley’s Latest Autograder

The Infrastructure team of the Data Science Education Program at UC Berkeley is excited to announce the beta release of Otter Grader v1.0.0, a lightweight autograding Python package built for courses with Jupyter Notebook environments. The open-source package enables instructors to autograde student Jupyter Notebook submissions in batch, locally on their own machine. In addition, Otter can be configured to work with Gradescope’s proprietary autograding service to automatically grade student submissions via Gradescope’s infrastructure.

Autograding assignments using Otter takes a few simple steps. Instructors first collect student submissions through any LMS. After specifying setup requirements, data files, and test files in the ok test format, they can grade submissions in batch from their own machine’s command-line interface. Otter outputs a CSV file with the grades for each student, broken down by test, as well as PDFs of the manually graded questions in each student’s notebook. For instructors using Gradescope as their LMS, Otter generates a zip file that enables Gradescope to autograde each student’s submission on Gradescope’s own infrastructure. Typically, grading a class of 30 students takes less than 3 minutes because Otter parallelizes grading.
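To make the batch workflow more concrete, here is a minimal Python sketch of the general idea: score several submissions in parallel and write one CSV row per student, with one column per test. It is not Otter's implementation or its command-line interface. The names `TESTS`, `grade_submission`, and `grade_batch`, and the representation of a submission as a dictionary of answers, are assumptions made purely for illustration.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tests: each takes a student's answers and returns True or False.
TESTS = {
    "q1": lambda answers: answers.get("q1") == 4,
    "q2": lambda answers: answers.get("q2") == "mean",
}


def grade_submission(item):
    """Score one (student, answers) pair; returns a row with one column per test."""
    student, answers = item
    row = {"student": student}
    for name, check in TESTS.items():
        row[name] = 1 if check(answers) else 0
    row["total"] = sum(row[name] for name in TESTS)
    return row


def grade_batch(submissions, max_workers=4):
    """Grade all submissions in parallel and return the results as CSV text."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so rows come out sorted by student.
        rows = list(pool.map(grade_submission, sorted(submissions.items())))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["student", *TESTS, "total"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up submissions, standing in for files collected from an LMS.
submissions = {
    "alice": {"q1": 4, "q2": "mean"},
    "bob": {"q1": 5, "q2": "mean"},
}
report = grade_batch(submissions)
print(report)
```

In a real course the per-student work would be executing a notebook rather than checking a dictionary, which is where running submissions in parallel is likely to save the most time.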

Existing autograders such as okpy or gofer grader require active server overhead for full functionality. One advantage of Otter Grader is that it removes the need to set up and maintain a live grading server, making it easily adoptable for smaller classes where having a “live” autograder is unnecessary. “Otter was developed to reduce the burden on instructors who want to teach data science without worrying about the logistical challenges associated with maintaining a live server,” said Chris Pyles, the lead developer for Otter. Pyles also developed a test generator to help instructors generate correctly-formatted ok tests for Otter.
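For readers who have not seen an ok test, the sketch below shows roughly what one looks like: a Python file that defines a single `test` dictionary whose cases are doctest snippets. The layout follows the OkPy-style format as we understand it, so check the field names against the Otter documentation before relying on them. The `run_ok_test` helper is hypothetical and is not part of Otter; it only uses Python's standard `doctest` module to show how such cases could be checked against a student's code.

```python
import doctest

# An ok-style test file defines one dictionary named `test`.
# Field names here are assumed from the OkPy format; verify against the docs.
test = {
    "name": "q1",
    "points": 1,
    "suites": [
        {
            "cases": [
                {"code": ">>> square(3)\n9\n", "hidden": False, "locked": False},
                {"code": ">>> square(0)\n0\n", "hidden": False, "locked": False},
            ],
            "scored": True,
            "setup": "",
            "teardown": "",
            "type": "doctest",
        }
    ],
}


def run_ok_test(test, student_globals):
    """Hypothetical helper: run each doctest case and return (passed, total)."""
    parser = doctest.DocTestParser()
    passed = total = 0
    for suite in test["suites"]:
        for index, case in enumerate(suite["cases"]):
            # Each case gets its own copy of the student's names and its own runner.
            case_test = parser.get_doctest(
                case["code"], dict(student_globals), f"{test['name']}_{index}", "<ok-test>", 0
            )
            runner = doctest.DocTestRunner(verbose=False)
            result = runner.run(case_test, out=lambda text: None)  # silence failure reports
            total += 1
            if result.failed == 0:
                passed += 1
    return passed, total


def square(x):
    """Stand-in for a function a student would define in their notebook."""
    return x * x


print(run_ok_test(test, {"square": square}))
```

A submission that doubled its input instead of squaring it would pass only the `square(0)` case, which is the kind of per-test breakdown an autograder can report back.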

Otter Grader v1.0.0 also comes with a modified version of jassign. Jassign is a tool for generating Jupyter notebook assignments efficiently. Course instructors generally author assignments as Jupyter notebooks by creating a master notebook that contains setup code, questions, solutions, and tests to validate those solutions. Jassign then automatically prepares such an assignment to be distributed to students and later scored using the Otter grading framework discussed above.

For institutions that prefer to run their own autograding server, the Infrastructure team is currently developing an optional server component to Otter Grader, called Otter Service, to be released in the next few months. Paired with Otter Grader, Otter Service will allow instructors to better customize their grading environments on the server, which is valuable when scaling to class sizes of hundreds of students. Students in classes using Otter Service will be able to submit their work for autograding directly within their notebooks, instead of submitting their work through an LMS.

Numerous data science classes within Berkeley, as well as data science classes at other institutions, have already adopted Otter. Ian Castro, lead instructor for UC Berkeley’s Introduction to Data Science for Graduate Students (CP 298) notes that,

“As the instructor of a course with a small staff, Otter has been a great resource. It has significantly reduced the workload for us with its autograder and functionality with Gradescope. Students are also able to obtain instantaneous feedback on their code, which gives them the ability to debug errors on their own, which is incredibly helpful in a remote learning environment where students may not necessarily be able to receive timely feedback from instructors due to varying timezones, schedules, or other issues.”

CP 298, Data 88 (Economic Models), and IEOR 135 (Applied Data Science with Venture Applications) were some of the first beta testers for Otter on the UC Berkeley campus. The beta testing demonstrated Otter’s success with classes of varying sizes: CP 298 and Data 88 were classes of fewer than 50 students, and IEOR 135 was a class of over 200 students.

More detailed information about Otter can be found in the documentation and in the GitHub repository. The Infrastructure team has set up the Otter Grader Slack and the email ds-infra@berkeley.edu to field questions from those looking to adopt it.

School Spotlights

Check out the latest schools we’ve worked with and how they’ve implemented data science programs:

Data Science Education at University of Illinois Urbana-Champaign

An Example in Using GitHub Instead of JupyterHub

Stat 107: Data Science Discovery is the University of Illinois Urbana-Champaign’s (UIUC) foundational data science course. Led by Professor Wade Fagen-Ulmschneider and Professor Karle Flanagan, Stat 107 is designed with no prerequisites with the goal that any student at UIUC is able to gain a comprehensive introduction to the “next BIG thing at Illinois”.

Development of the course officially began in Fall 2018, shortly after UIUC attended the National Workshop in Data Science Education in Summer 2018 and learned about Data 8: Foundations of Data Science at Berkeley. UIUC shared Berkeley’s enthusiasm in offering an introductory course in data science that was accessible at a large scale. The pilot offering of Stat 107 was in Spring 2019 with 20 students from 20 different majors, with a massive growth to 300 students in its Fall 2019 offering, coinciding with its offering as a general education requirement.

While Stat 107 is modeled on Data 8 in terms of curriculum, its infrastructure reflects a different philosophy. Python programming in the class is based on the pandas package rather than Berkeley’s datascience package, and local deployment of notebooks through GitHub is favored over JupyterHub as a means of distributing assignments. Traditionally, classes that adopt the datascience package use it throughout the course with the aim of flattening the perceived steepness of the programming learning curve, a concern that stems from there being no formal prerequisites aside from high school mathematics. UIUC, however, considers introducing students to pandas from the start a more suitable way to bring industry-relevant experience to the class. In the instructors’ experience, students struggle with the learning curve only for the first two weeks, during which they receive close attention and support from the course staff.

The emphasis on industry-relevant tools and skills also shows in the use of GitHub: the instructors give students starter code for pulling Jupyter notebooks from the course repository, which students use throughout the semester, and explain the theory behind that code in the second half of the semester. Furthermore, all exams are open-book, open-Google, and open-resource in general, to mirror the workflows and collaboration common in industry.

UIUC is currently working towards a fully established data science program. Existing related programs include the B.S. in Statistics & CS degree and the CS+X degree, which allows students to specialize in one of 10+ concentrations in diverse fields such as advertising, chemistry, or music. UIUC hopes to kick off its data science specific programs by offering a minor in the near future. Over the next 3-5 years, UIUC also plans on expanding its 4 connector courses, which weave together core concepts and approaches from Data 8 with complementary ideas or areas like psychology, cognitive science, and business. These courses come with Stat 100: Statistics, an introductory statistics course, as a prerequisite. UIUC plans to include Stat 107 and Stat 100 in their data science degree curriculum, which will also include courses in statistics, computer science, mathematics, data ethics, and information.

Data Science Education at Boise State University

Implementing a Flipped-Classroom Model at a 4-year University

One of the earliest adopters of Berkeley’s Data 8: Foundations of Data Science is Boise State University. Professor Casey Kennington, an Assistant Professor in the Department of Computer Science at Boise State University, was inspired by Data 8 to start Computer Science 133: Foundations of Data Science in Fall 2018. Beginning as a special topics course, the first iteration had about 30 students. Now in its fourth iteration, the three-unit course has about 40 students and is expected to grow to around 100, which would make CS 133 one of the most popular classes at Boise State University in the near future.

To prepare for the first iteration of the course, Professor Kennington adopted course materials offered by Berkeley and worked with his institution’s IT team to set up the infrastructure needed. CS 133 uses Berkeley’s OkPy for autograding and uses Boise State University’s research server to host student Jupyter Notebook environments.

CS 133 implements a flipped-classroom model, in which students learn from Data 8’s “Computational and Inferential Thinking” textbook at home and attend class to discuss topics and participate in lectures. One might expect that reading the textbook at home would leave students confused about certain topics. However, Professor Kennington says that when students come to class, most of the questions they ask are about Python syntax, not the concepts. This suggests that the textbook is sufficient for students to learn the material on their own. There are still occasional lectures in which Professor Kennington discusses specific ideas within particular areas, such as statistics or ethics.

The course has gained widespread interest across the university. Other quantitative departments have noticed how relevant and useful the computational skills taught in CS 133 are for their own majors, and they encourage their students to take the course to gain these skills. Physics majors, for example, are now required to take CS 133.

Computer science classes generally have a gender imbalance. In CS 133, about 40-50% of the students identify as women, much higher than the average of 15-20% women-identifying students in computer science classes. Currently, approximately one-third of the students in the course are computer science majors, another third are physics majors, and the last third come from other majors, such as business, biology, and chemistry.

The primary focus of both Data 8 and CS 133 is to introduce computational skills to students without any previous coding or statistics background. People learning to code for the first time are often intimidated by how much coding and statistics they need to learn. To alleviate this fear, instructors for both Data 8 and CS 133 use the datascience package, a Python library that is pedagogically and syntactically easier to understand than pandas. Professor Kennington believes that starting students off with the datascience package helps build a foundation for computing that eases the transition to pandas later in college coursework.

CS 133 is just one instance of Boise State University’s undergraduate data science efforts. Once students take CS 133, they are encouraged to take Data-LA 320 Principles of Data Science, which is similar to Berkeley’s Data 100. Data-LA 320 Principles of Data Science teaches pandas, natural language processing, supervised classification, and data cleaning and collection. The University also offers a Data Science for Liberal Arts Certificate and a Data Science for Liberal Arts Minor. The minor offers more advanced data science courses, such as time series and social network analysis.

Join Us on Slack!

If you haven’t already, please join our growing Data Science Education Slack Community of Practice at this invite link. If you would like to hear more about our efforts in furthering data science education, feel free to email us at ds-help@berkeley.edu. Additionally, if you would like to contribute to future newsletters or have general comments about this one, please email us at the same address.

The Data Science Education Community Newsletter
