Ten Research Challenge Areas in Data Science
Keywords:
artificial intelligence, causal reasoning, computing systems, data life cycle, deep learning, ethics, machine learning, privacy, trustworthiness
Abstract
To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. We preface our enumeration with meta-questions about whether data science is a discipline. We then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.
License
Copyright (c) IJSRST

This work is licensed under a Creative Commons Attribution 4.0 International License.