Ten Research Challenge Areas in Data Science

Authors

  • Jayesh Dattatray Kharat  Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India
  • Paurnima Kawale  Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India
  • Rashmi Ashtagi  Department of Computer Engineering, Zeal College of Engineering and Research, Pune, Maharashtra, India

Keywords:

artificial intelligence, causal reasoning, computing systems, data life cycle, deep learning, ethics, machine learning, privacy, trustworthiness

Abstract

To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. We preface our enumeration with meta-questions about whether data science is a discipline. We then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.


Published

2022-03-30

Section

Research Articles

How to Cite

[1]
Jayesh Dattatray Kharat, Paurnima Kawale, Rashmi Ashtagi, "Ten Research Challenge Areas in Data Science," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 9, Issue 2, pp. 462-469, March-April 2022.