Data science 2

This is the homepage for cs/stat 387, spring 2021. You can find all the information about the course here.

  • Syllabus: please read it. It’s not great if you ask me a question and the answer is on the syllabus.
  • Lecture slides / notes
    • Lecture 1
    • Lecture 2
    • Lecture 3
    • Lecture 4
    • Lecture 5
    • Lecture 6
    • Lecture 7. Additionally, read the following three papers for Thursday: Bingham, Eli, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. “Pyro: Deep universal probabilistic programming.” The Journal of Machine Learning Research 20, no. 1 (2019): 973-978; Pfeffer, Avi. “Figaro: An object-oriented probabilistic programming language.” Charles River Analytics Technical Report 137 (2009): 96; and Cusumano-Towner, Marco F., Feras A. Saad, Alexander K. Lew, and Vikash K. Mansinghka. “Gen: a general-purpose probabilistic programming system with programmable inference.” In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 221-236. 2019. We will also discuss midterm / final projects today.
    • Lecture 8. Additionally, please read the following three papers by next Thursday: Kschischang, Frank R., Brendan J. Frey, and H-A. Loeliger. “Factor graphs and the sum-product algorithm.” IEEE Transactions on information theory 47, no. 2 (2001): 498-519; Ranganath, Rajesh, Sean Gerrish, and David Blei. “Black box variational inference.” In Artificial intelligence and statistics, pp. 814-822. PMLR, 2014; and Neklyudov, Kirill, Max Welling, Evgenii Egorov, and Dmitry Vetrov. “Involutive MCMC: a unifying framework.” In International Conference on Machine Learning, pp. 7273-7282. PMLR, 2020.
  • Reading list: I am nowhere near organized enough to know when in the semester we will read these papers, but I’m pretty sure that we will read them at some point. This list is not guaranteed to be in any order.
    • Kschischang, Frank R., Brendan J. Frey, and H-A. Loeliger. “Factor graphs and the sum-product algorithm.” IEEE Transactions on information theory 47, no. 2 (2001): 498-519.
    • Ranganath, Rajesh, Sean Gerrish, and David Blei. “Black box variational inference.” In Artificial intelligence and statistics, pp. 814-822. PMLR, 2014.
    • Hoffman, Matthew D., and Andrew Gelman. “The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.” J. Mach. Learn. Res. 15, no. 1 (2014): 1593-1623.
    • Neklyudov, Kirill, Max Welling, Evgenii Egorov, and Dmitry Vetrov. “Involutive MCMC: a unifying framework.” In International Conference on Machine Learning, pp. 7273-7282. PMLR, 2020.
    • Russo, Daniel, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. “A tutorial on Thompson sampling.” arXiv preprint arXiv:1707.02038 (2017).
    • Letham, Benjamin, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. “Constrained Bayesian optimization with noisy experiments.” Bayesian Analysis 14, no. 2 (2019): 495-519.
    • Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. “Structure discovery in nonparametric regression through compositional kernel search.” In International Conference on Machine Learning, pp. 1166-1174. PMLR, 2013.
    • Le, Tuan Anh, Atilim Gunes Baydin, and Frank Wood. “Inference compilation and universal probabilistic programming.” In Artificial Intelligence and Statistics, pp. 1338-1348. PMLR, 2017.
    • Bingham, Eli, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. “Pyro: Deep universal probabilistic programming.” The Journal of Machine Learning Research 20, no. 1 (2019): 973-978.
    • Pfeffer, Avi. “Figaro: An object-oriented probabilistic programming language.” Charles River Analytics Technical Report 137 (2009): 96.
    • Cusumano-Towner, Marco F., Feras A. Saad, Alexander K. Lew, and Vikash K. Mansinghka. “Gen: a general-purpose probabilistic programming system with programmable inference.” In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 221-236. 2019.
    • Dutta, Saikat, August Shi, Rutvik Choudhary, Zhekun Zhang, Aryaman Jain, and Sasa Misailovic. “Detecting flaky tests in probabilistic and machine learning applications.” In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 211-224. 2020.
    • Merkel, Dirk. “Docker: lightweight Linux containers for consistent development and deployment.” Linux Journal 2014, no. 239 (2014): 2.
    • Nüst, Daniel, Vanessa Sochat, Ben Marwick, Stephen J. Eglen, Tim Head, Tony Hirst, and Benjamin D. Evans. “Ten simple rules for writing Dockerfiles for reproducible data science.” PLoS Computational Biology 16, no. 11 (2020): e1008316.
    • Peng, Roger D. “Reproducible research in computational science.” Science 334, no. 6060 (2011): 1226-1227.
    • Yu, Bin, and Karl Kumbier. “Veridical data science.” Proceedings of the National Academy of Sciences 117, no. 8 (2020): 3920-3929.
    • Stodden, Victoria. “The data science life cycle: a disciplined approach to advancing data science as a science.” Communications of the ACM 63, no. 7 (2020): 58-66.
    • Stonebraker, Michael, and Lawrence A. Rowe. “The design of Postgres.” ACM Sigmod Record 15, no. 2 (1986): 340-355.
    • Borthakur, Dhruba. “HDFS architecture guide.” Hadoop Apache Project 53, no. 1-13 (2008): 2.
    • Harter, Tyler, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. “Analysis of HDFS Under HBase: A Facebook Messages Case Study.” In 12th USENIX Conference on File and Storage Technologies (FAST 14), pp. 199-212. 2014.
    • Győrödi, Cornelia, Robert Győrödi, George Pecherle, and Andrada Olah. “A comparative study: MongoDB vs. MySQL.” In 2015 13th International Conference on Engineering of Modern Electric Systems (EMES), pp. 1-6. IEEE, 2015.
    • Manyam, Ganiraju, Michelle A. Payton, Jack A. Roth, Lynne V. Abruzzo, and Kevin R. Coombes. “Relax with CouchDB—Into the non-relational DBMS era of bioinformatics.” Genomics 100, no. 1 (2012): 1-7.
    • Yang, Fangjin, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. “Druid: A real-time analytical data store.” In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 157-168. 2014.
    • Wilson, Greg. “Software carpentry: getting scientists to write better code by making them more productive.” Computing in Science & Engineering 8, no. 6 (2006): 66-69.
    • Kim, Miryung, Thomas Zimmermann, Robert DeLine, and Andrew Begel. “The emerging role of data scientists on software development teams.” In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 96-107. IEEE, 2016.
    • Wilson, Greg, Dhavide A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven HD Haddock et al. “Best practices for scientific computing.” PLoS Biol 12, no. 1 (2014): e1001745.
    • Raith, Florian, Ingo Richter, and Robert Lindermeier. “How project-management-tools are used in agile practice: benefits, drawbacks and potentials.” In Proceedings of the 21st International Database Engineering & Applications Symposium, pp. 30-39. 2017.