Ten Research Challenge Areas in Data Science

Jayesh Dattatray Kharat; Paurnima Kawale; Rashmi Ashtagi

doi:10.32628/IJSRST2291315

Authors

Jayesh Dattatray Kharat, Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India
Paurnima Kawale, Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India
Rashmi Ashtagi, Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India

Keywords

artificial intelligence, causal reasoning, computing systems, data life cycle, deep learning, ethics, machine learning, privacy, trustworthiness

Abstract

To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. We preface our enumeration with meta-questions about whether data science is a discipline. We then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.
