Caselets

 



What are Caselets?

Caselets (i.e., bite-sized case studies) are self-paced, case-based learning tools that bridge the gap between traditional, resource-heavy teaching methods and scalable, diverse learning in data science. They give both novice and experienced students broader exposure to real-world problem-solving scenarios, accelerating the development of adaptive expertise that would otherwise take years of professional practice to attain (Chen and Dubrawski 2018).

Caselets are broken up into three parts:

Problem Context: Describes a scenario in which the student assumes the role of a data scientist tasked with building a model to solve a specific problem. The problem context highlights the motivation and use case for applying a model to develop a solution.

Data Summary: A detailed description of the dataset’s attributes, provided as the basis for developing a simulated AI model.

Caselet Questions: Multiple-choice questions designed to guide and assess student understanding of key concepts, including data preprocessing, feature engineering, experiment design, model selection, model diagnosis, problem formulation, model configuration, data comprehension, and model deployment.
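
To make this three-part structure concrete, the sketch below shows one way a caselet could be represented programmatically. It is a minimal illustration only: the field names, the scenario text, and the sample question are hypothetical and are not drawn from the actual CISAAD caselets.

    # Minimal sketch of a caselet as a Python data structure.
    # All field names and the sample question are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Question:
        prompt: str
        choices: list[str]
        answer_index: int  # position of the correct choice

    @dataclass
    class Caselet:
        problem_context: str   # scenario the student steps into
        data_summary: str      # description of the dataset's attributes
        questions: list[Question] = field(default_factory=list)

    example = Caselet(
        problem_context="You are a data scientist asked to build a model "
                        "that flags deepfake audio clips.",
        data_summary="Labeled audio clips with attributes such as sample "
                     "rate, duration, and speaker ID.",
        questions=[Question(
            prompt="Which data split should stay untouched until the "
                   "final evaluation?",
            choices=["Training set", "Validation set", "Test set"],
            answer_index=2,
        )],
    )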

 

CISAAD Caselets

This website provides the following caselets for download:

Download here: Caselet 1

Download here: Caselet 2

Developing Data Science Caselets for Audio Deepfake Detection

Advancements in deepfake detection research have produced innovative techniques, yet educational resources remain scarce. To address this gap, we developed educational caselets (concise case studies) for the CISAAD project that teach data science students how to apply deepfake detection methods. This section outlines our process for creating the caselets, breaking the work into eleven key decision points central to model development.

 

Caselet Design Process

Each caselet includes a Problem Context, where students take on the role of a data scientist tasked with solving a deepfake detection problem, and a Data Summary, providing detailed descriptions of the dataset. Additionally, Caselet Questions guide students through key steps in model development.

 

Key Decision Points in Caselet Development

The development process mirrors a typical machine learning workflow, broken into the following eleven key decision points (a code sketch tying several of these steps together appears after the list):

  1. Data Collection: Surveyed the research literature and identified 18 relevant datasets.
  2. Data Preprocessing: Cleaned and prepared the datasets for model development.
  3. Model Selection: Selected appropriate models based on the caselet objectives (e.g., LFCC or EDLF-based models).
  4. Feature Selection: Chose relevant audio features to enhance deepfake detection accuracy.
  5. Split Dataset: Divided the data into training and test sets.
  6. Training Set: Used the training set to develop machine learning models.
  7. Model Training: Trained models to differentiate between real and fake audio.
  8. Validation Set: Tuned models using a validation set to prevent overfitting.
  9. Model Evaluation: Evaluated model performance using metrics like accuracy and F1 score.
  10. Hyperparameter Tuning: Adjusted hyperparameters to optimize model performance.
  11. Test Set & Final Evaluation: Assessed model performance on an unseen test set to ensure real-world applicability.
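
As referenced above, the sketch below ties several of these decision points together, from feature selection (step 4) through the final test-set evaluation (step 11). It is an illustrative sketch rather than the CISAAD implementation: it assumes torchaudio's LFCC transform and scikit-learn, substitutes random waveforms for real and fake audio clips, and uses an arbitrary random-forest classifier and hyperparameter grid purely for demonstration.

    # Illustrative workflow sketch (steps 4-11), not the CISAAD code.
    # Assumes torchaudio and scikit-learn; random waveforms stand in
    # for real and fake audio clips.
    import numpy as np
    import torch
    import torchaudio
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Feature selection: LFCCs mean-pooled over time into fixed vectors.
    lfcc = torchaudio.transforms.LFCC(sample_rate=16000, n_lfcc=20)

    def extract_features(waveform):
        coeffs = lfcc(waveform)              # (channel, n_lfcc, time)
        return coeffs.mean(dim=-1).flatten().numpy()

    # Placeholder data: 200 one-second clips, half labeled fake.
    gen = torch.Generator().manual_seed(0)
    clips = [torch.randn(1, 16000, generator=gen) for _ in range(200)]
    y = np.array([0] * 100 + [1] * 100)      # 0 = real, 1 = fake
    X = np.stack([extract_features(c) for c in clips])

    # Split dataset: hold out an unseen test set for the final step.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Model training, validation, and hyperparameter tuning together:
    # GridSearchCV cross-validates internal train/validation splits
    # while searching the (arbitrary) hyperparameter grid below.
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        scoring="f1",
        cv=3)
    search.fit(X_trainval, y_trainval)

    # Test set & final evaluation on data the model has never seen.
    y_pred = search.best_estimator_.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print("F1 score:", f1_score(y_test, y_pred))

Note that GridSearchCV folds the validation-set and hyperparameter-tuning steps (8 and 10) into one cross-validated search, a common way to realize those decision points in practice; the caselets treat them as separate decisions so students can reason about each explicitly.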

 

Conclusion

By structuring the caselets around these eleven decision points, we offer students a hands-on, practical path from novice problem solving to adaptive expertise, enabling them to tackle novel challenges efficiently and flexibly. Our next steps involve piloting the caselets with students and ensuring these resources continue to evolve alongside deepfake detection technology.

 

References

Lujie Chen and Artur Dubrawski. 2018. Accelerated apprenticeship: teaching data science problem solving skills at scale. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale (L@S ’18). Association for Computing Machinery, New York, NY, USA, Article 39, 1–4. https://doi.org/10.1145/3231644.3231697

 

Acknowledgments

We gratefully acknowledge the National Science Foundation, which supported this work through Awards #2118285 and #2346473.

We thank the UMBC Data Scholars Program, in partnership with the MData Lab, for their support; the program managers, Dr. Vandana Janeja and Dr. Christine Mallinson; our mentors, Mrs. Noshaba Nasir Bhalli and Dr. Karen Chen; and our supportive student colleagues, including Kiffy Nwosu, Chloe Evered, and Whitney Fils-Aime.

 
