Automatic differentiation
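Forward-mode AD is the easiest variant to see end to end: every value carries its derivative, and each arithmetic op propagates both. A toy sketch in plain Python (the `Dual` class and `sin` helper are illustrative names, not any library's API):

```python
import math

class Dual:
    """A value plus its derivative w.r.t. one chosen input (a dual number)."""
    def __init__(self, val, dot=0.0):
        self.val = val  # primal value
        self.dot = dot  # tangent: derivative carried alongside the value

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    # chain rule: d/dt sin(u) = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# differentiate f(x) = x*sin(x) + 3x at x = 2 by seeding dot = 1
x = Dual(2.0, 1.0)
y = x * sin(x) + 3 * x
print(y.val, y.dot)  # f(2) and f'(2) = sin(2) + 2*cos(2) + 3
```

Reverse mode (what backprop and PyTorch autograd implement) instead records the computation and sweeps derivatives backward, which is the cheaper direction when there are many inputs and few outputs; see the forward-vs-reverse link under Resources.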
Backpropagation
- nielsen's book, chap. 2: http://neuralnetworksanddeeplearning.com/chap2.html
- blog: https://adeveloperdiary.com/data-science/machine-learning/understand-and-implement-the-backpropagation-algorithm-from-scratch-in-python/
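In the spirit of the two links above, a minimal from-scratch sketch (NumPy; the architecture and hyperparameters here are my own illustrative choices, not taken from either tutorial): forward pass, chain-rule backward pass, gradient step.

```python
# Toy backprop for a 2-layer tanh network on a scalar regression task.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # 64 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1      # linear target to fit

W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

for step in range(500):
    # forward pass
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    pred = (a1 @ W2 + b2).ravel()
    loss = np.mean((pred - y) ** 2)

    # backward pass: apply the chain rule layer by layer
    dpred = 2 * (pred - y) / len(y)           # dL/dpred
    dW2 = a1.T @ dpred[:, None]               # dL/dW2
    db2 = dpred.sum(keepdims=True)
    da1 = dpred[:, None] @ W2.T               # dL/da1
    dz1 = da1 * (1 - a1 ** 2)                 # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # plain gradient-descent update
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.05 * g

print(f"final MSE: {loss:.4f}")
```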
Neural networks
- numpy matmul: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
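The matmul docs matter more than they look: `np.matmul` (the `@` operator) does the ordinary matrix product on 2-D inputs but also broadcasts over leading batch dimensions, which is what makes batched network code work. A quick demo:

```python
import numpy as np

A = np.ones((2, 3))
B = np.ones((3, 4))
print(np.matmul(A, B).shape)      # (2, 4): ordinary matrix product

# matmul broadcasts over leading "batch" dimensions:
batched = np.matmul(np.ones((10, 2, 3)), np.ones((10, 3, 4)))
print(batched.shape)              # (10, 2, 4)

# a 1-D second operand is treated as a vector; the result drops that axis:
v = np.ones(3)
print(np.matmul(A, v).shape)      # (2,)
```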
Pytorch
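The core idea the linked tutorials (under Resources) build on is reverse-mode autograd: mark tensors with `requires_grad=True`, build a scalar loss, call `.backward()`, read `.grad`. A minimal sketch (assumes `torch` is installed):

```python
import torch

# requires_grad=True makes PyTorch record operations on these tensors
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

loss = ((w * x).sum() - 4.0) ** 2   # scalar output
loss.backward()                     # reverse-mode AD through the recorded graph

print(x.grad)  # dloss/dx = 2 * ((w*x).sum() - 4) * w
print(w.grad)  # dloss/dw = 2 * ((w*x).sum() - 4) * x
```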
Resources
Top
- machine learning book: https://arxiv.org/abs/1901.05639
- deep learning book: https://www.deeplearningbook.org
- reinforcement learning book: http://www.incompleteideas.net/book/the-book-2nd.html
Other
- fast ai: https://course.fast.ai/
- forward mode vs reverse mode: https://math.stackexchange.com/questions/2195377/reverse-mode-differentiation-vs-forward-mode-differentiation-where-are-the-be
- deeplearning.ai: https://community.deeplearning.ai/latest
- karpathy yt: https://www.youtube.com/()/videos
- umar yt: https://www.youtube.com/()
- nn and dl book: http://neuralnetworksanddeeplearning.com/index.html
- practical dl for coders: https://course.fast.ai/
- matrix calculus needed for dl: https://explained.ai/matrix-calculus/
- matrix calculus for ml and beyond: https://ocw.mit.edu/courses/18-s096-matrix-calculus-for-machine-learning-and-beyond-january-iap-2023/pages/lecture-notes/
- matrix calculus: https://staff.fnwi.uva.nl/r.vandenboomgaard/MachineLearning/LectureNotes/Math/vectorderivatives.html
- ml notes (rein van den boomgaard): https://staff.fnwi.uva.nl/r.vandenboomgaard/MachineLearning/index.html
- pytorch autograd gentle intro: https://docs.pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
- build a simple pytorch autograd computational graph: https://www.datascienceweekly.org/tutorials/make-a-simple-pytorch-autograd-computational-graph
- pytorch tutorials: https://docs.pytorch.org/tutorials/
- learn pytorch: https://www.learnpytorch.io/00_pytorch_fundamentals/
- triton: https://triton-lang.org/main/getting-started/tutorials/index.html
- nvidia dl: https://developer.nvidia.com/deep-learning
- pytorch distributed: https://learnopencv.com/distributed-parallel-training-pytorch-multi-gpu-setup/
- numpy broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html (see the sketch after this list)
- pytorch broadcasting: https://docs.pytorch.org/docs/stable/notes/broadcasting.html
- pytorch autograd: https://docs.pytorch.org/docs/stable/notes/autograd.html
- tensor streaming multiprocessor groq: https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper2022_ASoftwareDefinedTensorStreamingMultiprocessorForLargeScaleMachineLearning-1.pdf
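A sketch for the broadcasting links above (NumPy shown; PyTorch follows the same rules): shapes align from the trailing dimensions, size-1 axes stretch, and anything else raises.

```python
import numpy as np

# (3, 1) against (4,) -> (3, 4): the size-1 axis is stretched, no copies made
col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                 # shape (4,)
table = col * 10 + row             # shape (3, 4)
print(table)

# incompatible trailing dimensions raise an error:
try:
    np.ones((3, 2)) + np.ones(3)   # 2 vs 3 cannot broadcast
except ValueError as e:
    print("broadcast error:", e)
```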
Research papers
The list at https://punkx.org/jackdoe/30.html that was (supposedly) given to John Carmack by Ilya Sutskever, plus other papers:
- (Shojaee et al., 2025)
- (Nasr et al., 2023)
- (Schmidt, 2019)
- (Borgeaud et al., 2021)
- (Gao et al., 2020)
- (Zhai et al., 2021)
- (Chukewad et al., 2020)
- (Perez et al., 2017)
- (Brohan et al., 2022)
- (Kendall et al., 2018)
- (Kirsch & Schmidhuber, 2022)
- (Smith et al., 2022)
- (Hafner et al., 2023)
- (Gozalo-Brizuela & Garrido-Merchan, 2023)
- (Millidge et al., 2020)
- (Ouyang et al., 2022)
- (Vaswani et al., 2017)
- (Hafner et al., 2020)
- (Hinton, 2022)
- (Solomitckii et al., 2016)
- (Hu et al., 2021)
- (Geng et al., 2023)
- (Dosovitskiy et al., 2021)
- (Birhane & McGann, 2024)
- (Muennighoff et al., 2025)
- (DeepSeek-AI et al., 2025)
- (Zhou et al., 2023)
- (Cho, 2015)
- (Goldberg, 2015)
- (Fein-Ashley, 2025)
- (Gandhi et al., 2025)
- (Wilson, 2025)
- (Zhu et al., 2025)
- (Singh et al., 2025)
- (Darlow et al., 2025)
- (Zhao et al., 2025)
- (Jaghouar et al., 2025)
- (Laban et al., 2025)
- (Jha et al., 2025)
References
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S. & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A.F., Ippolito, D., Choquette-Choo, C.A., Wallace, E., Tramèr, F. & Lee, K. (2023). Scalable Extraction of Training Data from (Production) Language Models. http://arxiv.org/abs/2311.17035
Schmidt, R.M. (2019). Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. http://arxiv.org/abs/1912.05911
Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Driessche, G.v.d., Lespiau, J., Damoc, B., Clark, A., Casas, D.d.L., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., Paganini, M., Irving, G., Vinyals, O., Osindero, S., Simonyan, K., Rae, J.W., Elsen, E. & Sifre, L. (2021). Improving language models by retrieving from trillions of tokens. http://arxiv.org/abs/2112.04426
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S. & Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. http://arxiv.org/abs/2101.00027
Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R. & Susskind, J. (2021). An Attention Free Transformer. http://arxiv.org/abs/2105.14103
Chukewad, Y.M., James, J., Singh, A. & Fuller, S. (2020). RoboFly: An insect-sized robot with simplified fabrication that is capable of flight, ground, and water surface locomotion. http://arxiv.org/abs/2001.02320
Perez, E., Strub, F., de Vries, H., Dumoulin, V. & Courville, A. (2017). FiLM: Visual Reasoning with a General Conditioning Layer. http://arxiv.org/abs/1709.07871
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N.J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J., Perez, E., Pertsch, K., Quiambao, J., Rao, K., Ryoo, M., Salazar, G., Sanketi, P., Sayed, K., Singh, J., Sontakke, S., Stone, A., Tan, C., Tran, H., Vanhoucke, V., Vega, S., Vuong, Q., Xia, F., Xiao, T., Xu, P., Xu, S., Yu, T. & Zitkovich, B. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. http://arxiv.org/abs/2212.06817
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J., Lam, V., Bewley, A. & Shah, A. (2018). Learning to Drive in a Day. http://arxiv.org/abs/1807.00412
Kirsch, L. & Schmidhuber, J. (2022). Meta Learning Backpropagation And Improving It. http://arxiv.org/abs/2012.14905
Smith, L., Kostrikov, I. & Levine, S. (2022). A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning. http://arxiv.org/abs/2208.07860
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. http://arxiv.org/abs/2301.04104
Gozalo-Brizuela, R. & Garrido-Merchan, E.C. (2023). ChatGPT is not all you need. A State of the Art Review of large Generative AI models. http://arxiv.org/abs/2301.04655
Millidge, B., Tschantz, A. & Buckley, C.L. (2020). Predictive Coding Approximates Backprop along Arbitrary Computation Graphs. http://arxiv.org/abs/2006.04182
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J. & Lowe, R. (2022). Training language models to follow instructions with human feedback. http://arxiv.org/abs/2203.02155
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. & Polosukhin, I. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762
Hafner, D., Lillicrap, T., Ba, J. & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. http://arxiv.org/abs/1912.01603
Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. http://arxiv.org/abs/2212.13345
Solomitckii, D., Li, Q.C., Balercia, T., da Silva, C.R.C.M., Talwar, S., Andreev, S. & Koucheryavy, Y. (2016). Characterizing the Impact of Diffuse Scattering in Urban Millimeter-Wave Deployments. #
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. http://arxiv.org/abs/2106.09685
Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S. & Song, D. (2023). Koala: A Dialogue Model for Academic Research. http://bair.berkeley.edu/blog/2023/04/03/koala/
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. http://arxiv.org/abs/2010.11929
Birhane, A. & McGann, M. (2024). Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency. http://arxiv.org/abs/2407.08790
Muennighoff, N., Yang, Z., Shi, W., Li, X.L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E. & Hashimoto, T. (2025). s1: Simple test-time scaling. http://arxiv.org/abs/2501.19393
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z.F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J., Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai, J.L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Zhao, L., Wang, L., Zhang, L., Xu, L., Xia, L., Zhang, M., Zhang, M., Tang, M., Li, M., Wang, M., Li, M., Tian, N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q., Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R.J., Jin, R.L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang, S., Yu, S., Zhou, S., Pan, S., Li, S.S., Zhou, S., Wu, S., Ye, S., Yun, T., Pei, T., Sun, T., Wang, T., Zeng, W., Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Xiao, W.L., An, W., Liu, X., Wang, X., Chen, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X., Li, X., Su, X., Lin, X., Li, X.Q., Jin, X., Shen, X., Chen, X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X., Shan, X., Li, Y.K., Wang, Y.Q., Wei, Y.X., Zhang, Y., Xu, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Yu, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Zhu, Y.X., Xu, Y., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Tang, Y., Zha, Y., Yan, Y., Ren, Z.Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Ma, Z., Yan, Z., Wu, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Pan, Z., Huang, Z., Xu, Z., Zhang, Z. & Zhang, Z. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. http://arxiv.org/abs/2501.12948
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L. & Levy, O. (2023). LIMA: Less Is More for Alignment. http://arxiv.org/abs/2305.11206
Cho, K. (2015). Natural Language Understanding with Distributed Representation. http://arxiv.org/abs/1511.07916
Goldberg, Y. (2015). A Primer on Neural Network Models for Natural Language Processing. http://arxiv.org/abs/1510.00726
Fein-Ashley, J. (2025). The FFT Strikes Back: An Efficient Alternative to Self-Attention. http://arxiv.org/abs/2502.18394
Gandhi, K., Chakravarthy, A., Singh, A., Lile, N. & Goodman, N.D. (2025). Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. http://arxiv.org/abs/2503.01307
Wilson, A.G. (2025). Deep Learning is Not So Mysterious or Different. http://arxiv.org/abs/2503.02113
Zhu, J., Chen, X., He, K., LeCun, Y. & Liu, Z. (2025). Transformers without Normalization. http://arxiv.org/abs/2503.10622
Singh, S., Nan, Y., Wang, A., D'Souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N., Ermis, B., Fadaee, M. & Hooker, S. (2025). The Leaderboard Illusion. http://arxiv.org/abs/2504.20879
Darlow, L., Regan, C., Risi, S., Seely, J. & Jones, L. (2025). Continuous Thought Machines. http://arxiv.org/abs/2505.05522
Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Yue, Y., Lin, M., Wang, S., Wu, Q., Zheng, Z. & Huang, G. (2025). Absolute Zero: Reinforced Self-play Reasoning with Zero Data. http://arxiv.org/abs/2505.03335
Jaghouar, S., Mattern, J., Ong, J.M., Straube, J., Basra, M., Pazdera, A., Ferrante, M.D., Thaman, K., Gabriel, F., Obeid, F., Erdem, K., Keiblinger, M. & Hagemann, J. (2025). INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning. #
Laban, P., Hayashi, H., Zhou, Y. & Neville, J. (2025). LLMs Get Lost In Multi-Turn Conversation. http://arxiv.org/abs/2505.06120
Jha, R., Zhang, C., Shmatikov, V. & Morris, J.X. (2025). Harnessing the Universal Geometry of Embeddings. http://arxiv.org/abs/2505.12540