Automatic differentiation
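Forward-mode AD is the easiest variant to see end to end: every value carries its derivative, and each arithmetic op propagates both. A toy sketch in plain Python (the `Dual` class and `sin` helper are illustrative names, not any library's API):

```python
import math

class Dual:
    """A value plus its derivative w.r.t. one chosen input (a dual number)."""
    def __init__(self, val, dot=0.0):
        self.val = val  # primal value
        self.dot = dot  # tangent: derivative carried alongside the value

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    # chain rule: d/dt sin(u) = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# differentiate f(x) = x*sin(x) + 3x at x = 2 by seeding dot = 1
x = Dual(2.0, 1.0)
y = x * sin(x) + 3 * x
print(y.val, y.dot)  # f(2) and f'(2) = sin(2) + 2*cos(2) + 3
```

Reverse mode (what backprop and PyTorch autograd implement) instead records the computation and sweeps derivatives backward, which is the cheaper direction when there are many inputs and few outputs; see the forward-vs-reverse link under Resources.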
Backpropagation
- nielsen's book, chap. 2: http://neuralnetworksanddeeplearning.com/chap2.html
- blog: https://adeveloperdiary.com/data-science/machine-learning/understand-and-implement-the-backpropagation-algorithm-from-scratch-in-python/
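In the spirit of the two links above, a minimal from-scratch sketch (NumPy; the architecture and hyperparameters here are my own illustrative choices, not taken from either tutorial): forward pass, chain-rule backward pass, gradient step.

```python
# Toy backprop for a 2-layer tanh network on a scalar regression task.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # 64 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1      # linear target to fit

W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

for step in range(500):
    # forward pass
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    pred = (a1 @ W2 + b2).ravel()
    loss = np.mean((pred - y) ** 2)

    # backward pass: apply the chain rule layer by layer
    dpred = 2 * (pred - y) / len(y)           # dL/dpred
    dW2 = a1.T @ dpred[:, None]               # dL/dW2
    db2 = dpred.sum(keepdims=True)
    da1 = dpred[:, None] @ W2.T               # dL/da1
    dz1 = da1 * (1 - a1 ** 2)                 # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # plain gradient-descent update
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.05 * g

print(f"final MSE: {loss:.4f}")
```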
Neural networks
- numpy matmul: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
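The matmul docs matter more than they look: `np.matmul` (the `@` operator) does the ordinary matrix product on 2-D inputs but also broadcasts over leading batch dimensions, which is what makes batched network code work. A quick demo:

```python
import numpy as np

A = np.ones((2, 3))
B = np.ones((3, 4))
print(np.matmul(A, B).shape)      # (2, 4): ordinary matrix product

# matmul broadcasts over leading "batch" dimensions:
batched = np.matmul(np.ones((10, 2, 3)), np.ones((10, 3, 4)))
print(batched.shape)              # (10, 2, 4)

# a 1-D second operand is treated as a vector; the result drops that axis:
v = np.ones(3)
print(np.matmul(A, v).shape)      # (2,)
```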
Pytorch
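The core idea the linked tutorials (under Resources) build on is reverse-mode autograd: mark tensors with `requires_grad=True`, build a scalar loss, call `.backward()`, read `.grad`. A minimal sketch (assumes `torch` is installed):

```python
import torch

# requires_grad=True makes PyTorch record operations on these tensors
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

loss = ((w * x).sum() - 4.0) ** 2   # scalar output
loss.backward()                     # reverse-mode AD through the recorded graph

print(x.grad)  # dloss/dx = 2 * ((w*x).sum() - 4) * w
print(w.grad)  # dloss/dw = 2 * ((w*x).sum() - 4) * x
```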
Resources
Top
- machine learning book: https://arxiv.org/abs/1901.05639
- deep learning book: https://www.deeplearningbook.org
- reinforcement learning book: http://www.incompleteideas.net/book/the-book-2nd.html
Other
- fast ai: https://course.fast.ai/
- forward mode vs reverse mode: https://math.stackexchange.com/questions/2195377/reverse-mode-differentiation-vs-forward-mode-differentiation-where-are-the-be
- deeplearning.ai: https://community.deeplearning.ai/latest
- karpathy yt: https://www.youtube.com/()/videos
- umar yt: https://www.youtube.com/()
- nn and dl book: http://neuralnetworksanddeeplearning.com/index.html
- practical dl for coders: https://course.fast.ai/
- matrix calculus needed for dl: https://explained.ai/matrix-calculus/
- matrix calculus for ml and beyond: https://ocw.mit.edu/courses/18-s096-matrix-calculus-for-machine-learning-and-beyond-january-iap-2023/pages/lecture-notes/
- matrix calculus: https://staff.fnwi.uva.nl/r.vandenboomgaard/MachineLearning/LectureNotes/Math/vectorderivatives.html
- ml notes (rein van den boomgaard): https://staff.fnwi.uva.nl/r.vandenboomgaard/MachineLearning/index.html
- pytorch autograd gentle intro: https://docs.pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
- build a simple pytorch autograd computational graph: https://www.datascienceweekly.org/tutorials/make-a-simple-pytorch-autograd-computational-graph
- pytorch tutorials: https://docs.pytorch.org/tutorials/
- learn pytorch: https://www.learnpytorch.io/00_pytorch_fundamentals/
- triton: https://triton-lang.org/main/getting-started/tutorials/index.html
- nvidia dl: https://developer.nvidia.com/deep-learning
- pytorch distributed: https://learnopencv.com/distributed-parallel-training-pytorch-multi-gpu-setup/
- numpy broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html (see the sketch after this list)
- pytorch broadcasting: https://docs.pytorch.org/docs/stable/notes/broadcasting.html
- pytorch autograd: https://docs.pytorch.org/docs/stable/notes/autograd.html
- tensor streaming multiprocessor groq: https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper2022_ASoftwareDefinedTensorStreamingMultiprocessorForLargeScaleMachineLearning-1.pdf
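A sketch for the broadcasting links above (NumPy shown; PyTorch follows the same rules): shapes align from the trailing dimensions, size-1 axes stretch, and anything else raises.

```python
import numpy as np

# (3, 1) against (4,) -> (3, 4): the size-1 axis is stretched, no copies made
col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                 # shape (4,)
table = col * 10 + row             # shape (3, 4)
print(table)

# incompatible trailing dimensions raise an error:
try:
    np.ones((3, 2)) + np.ones(3)   # 2 vs 3 cannot broadcast
except ValueError as e:
    print("broadcast error:", e)
```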
Research papers
The list at https://punkx.org/jackdoe/30.html that was (supposedly) given to John Carmack by Ilya Sutskever, plus other papers:
- (Shojaee et al., 2025)
- (Nasr et al., 2023)
- (Schmidt, 2019)
- (Borgeaud et al., 2021)
- (Gao et al., 2020)
- (Zhai et al., 2021)
- (Chukewad et al., 2020)
- (Perez et al., 2017)
- (Brohan et al., 2022)
- (Kendall et al., 2018)
- (Kirsch & Schmidhuber, 2022)
- (Smith et al., 2022)
- (Hafner et al., 2023)
- (Gozalo-Brizuela & Garrido-Merchan, 2023)
- (Millidge et al., 2020)
- (Ouyang et al., 2022)
- (Vaswani et al., 2017)
- (Hafner et al., 2020)
- (Hinton, 2022)
- (Solomitckii et al., 2016)
- (Hu et al., 2021)
- (Geng et al., 2023)
- (Dosovitskiy et al., 2021)
- (Birhane & McGann, 2024)
- (Muennighoff et al., 2025)
- (DeepSeek-AI et al., 2025)
- (Zhou et al., 2023)
- (Cho, 2015)
- (Goldberg, 2015)
- (Fein-Ashley, 2025)
- (Gandhi et al., 2025)
- (Wilson, 2025)
- (Zhu et al., 2025)
- (Singh et al., 2025)
- (Darlow et al., 2025)
- (Zhao et al., 2025)
- (Jaghouar et al., 2025)
- (Laban et al., 2025)
- (Jha et al., 2025)
References
Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S. & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A.F., Ippolito, D., Choquette-Choo, C.A., Wallace, E., Tramèr, F. & Lee, K. (2023). Scalable Extraction of Training Data from (Production) Language Models. http://arxiv.org/abs/2311.17035
Schmidt, R.M. (2019). Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. http://arxiv.org/abs/1912.05911
Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Driessche, G.v.d., Lespiau, J., Damoc, B., Clark, A., Casas, D.d.L., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., Paganini, M., Irving, G., Vinyals, O., Osindero, S., Simonyan, K., Rae, J.W., Elsen, E. & Sifre, L. (2021). Improving language models by retrieving from trillions of tokens. http://arxiv.org/abs/2112.04426
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S. & Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. http://arxiv.org/abs/2101.00027
Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R. & Susskind, J. (2021). An Attention Free Transformer. http://arxiv.org/abs/2105.14103
Chukewad, Y.M., James, J., Singh, A. & Fuller, S. (2020). RoboFly: An insect-sized robot with simplified fabrication that is capable of flight, ground, and water surface locomotion. http://arxiv.org/abs/2001.02320
Perez, E., Strub, F., de Vries, H., Dumoulin, V. & Courville, A. (2017). FiLM: Visual Reasoning with a General Conditioning Layer. http://arxiv.org/abs/1709.07871
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N.J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J., Perez, E., Pertsch, K., Quiambao, J., Rao, K., Ryoo, M., Salazar, G., Sanketi, P., Sayed, K., Singh, J., Sontakke, S., Stone, A., Tan, C., Tran, H., Vanhoucke, V., Vega, S., Vuong, Q., Xia, F., Xiao, T., Xu, P., Xu, S., Yu, T. & Zitkovich, B. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. http://arxiv.org/abs/2212.06817
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J., Lam, V., Bewley, A. & Shah, A. (2018). Learning to Drive in a Day. http://arxiv.org/abs/1807.00412
Kirsch, L. & Schmidhuber, J. (2022). Meta Learning Backpropagation And Improving It. http://arxiv.org/abs/2012.14905
Smith, L., Kostrikov, I. & Levine, S. (2022). A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning. http://arxiv.org/abs/2208.07860
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. http://arxiv.org/abs/2301.04104
Gozalo-Brizuela, R. & Garrido-Merchan, E.C. (2023). ChatGPT is not all you need. A State of the Art Review of large Generative AI models. http://arxiv.org/abs/2301.04655
Millidge, B., Tschantz, A. & Buckley, C.L. (2020). Predictive Coding Approximates Backprop along Arbitrary Computation Graphs. http://arxiv.org/abs/2006.04182
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J. & Lowe, R. (2022). Training language models to follow instructions with human feedback. http://arxiv.org/abs/2203.02155
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. & Polosukhin, I. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762
Hafner, D., Lillicrap, T., Ba, J. & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. http://arxiv.org/abs/1912.01603
Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. http://arxiv.org/abs/2212.13345
Solomitckii, D., Li, Q.C., Balercia, T., da Silva, C.R.C.M., Talwar, S., Andreev, S. & Koucheryavy, Y. (2016). Characterizing the Impact of Diffuse Scattering in Urban Millimeter-Wave Deployments. #
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. http://arxiv.org/abs/2106.09685
Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S. & Song, D. (2023). Koala: A Dialogue Model for Academic Research. http://bair.berkeley.edu/blog/2023/04/03/koala/
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. http://arxiv.org/abs/2010.11929
Birhane, A. & McGann, M. (2024). Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency. http://arxiv.org/abs/2407.08790
Muennighoff, N., Yang, Z., Shi, W., Li, X.L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E. & Hashimoto, T. (2025). s1: Simple test-time scaling. http://arxiv.org/abs/2501.19393
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z.F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J., Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai, J.L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Zhao, L., Wang, L., Zhang, L., Xu, L., Xia, L., Zhang, M., Zhang, M., Tang, M., Li, M., Wang, M., Li, M., Tian, N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q., Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R.J., Jin, R.L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang, S., Yu, S., Zhou, S., Pan, S., Li, S.S., Zhou, S., Wu, S., Ye, S., Yun, T., Pei, T., Sun, T., Wang, T., Zeng, W., Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Xiao, W.L., An, W., Liu, X., Wang, X., Chen, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X., Li, X., Su, X., Lin, X., Li, X.Q., Jin, X., Shen, X., Chen, X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X., Shan, X., Li, Y.K., Wang, Y.Q., Wei, Y.X., Zhang, Y., Xu, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Yu, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Zhu, Y.X., Xu, Y., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Tang, Y., Zha, Y., Yan, Y., Ren, Z.Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Ma, Z., Yan, Z., Wu, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Pan, Z., Huang, Z., Xu, Z., Zhang, Z. & Zhang, Z. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. http://arxiv.org/abs/2501.12948
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L. & Levy, O. (2023). LIMA: Less Is More for Alignment. http://arxiv.org/abs/2305.11206
Cho, K. (2015). Natural Language Understanding with Distributed Representation. http://arxiv.org/abs/1511.07916
Goldberg, Y. (2015). A Primer on Neural Network Models for Natural Language Processing. http://arxiv.org/abs/1510.00726
Fein-Ashley, J. (2025). The FFT Strikes Back: An Efficient Alternative to Self-Attention. http://arxiv.org/abs/2502.18394
Gandhi, K., Chakravarthy, A., Singh, A., Lile, N. & Goodman, N.D. (2025). Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. http://arxiv.org/abs/2503.01307
Wilson, A.G. (2025). Deep Learning is Not So Mysterious or Different. http://arxiv.org/abs/2503.02113
Zhu, J., Chen, X., He, K., LeCun, Y. & Liu, Z. (2025). Transformers without Normalization. http://arxiv.org/abs/2503.10622
Singh, S., Nan, Y., Wang, A., D'Souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N., Ermis, B., Fadaee, M. & Hooker, S. (2025). The Leaderboard Illusion. http://arxiv.org/abs/2504.20879
Darlow, L., Regan, C., Risi, S., Seely, J. & Jones, L. (2025). Continuous Thought Machines. http://arxiv.org/abs/2505.05522
Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Yue, Y., Lin, M., Wang, S., Wu, Q., Zheng, Z. & Huang, G. (2025). Absolute Zero: Reinforced Self-play Reasoning with Zero Data. http://arxiv.org/abs/2505.03335
Jaghouar, S., Mattern, J., Ong, J.M., Straube, J., Basra, M., Pazdera, A., Ferrante, M.D., Thaman, K., Gabriel, F., Obeid, F., Erdem, K., Keiblinger, M. & Hagemann, J. (2025). INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning. #
Laban, P., Hayashi, H., Zhou, Y. & Neville, J. (2025). LLMs Get Lost In Multi-Turn Conversation. http://arxiv.org/abs/2505.06120
Jha, R., Zhang, C., Shmatikov, V. & Morris, J.X. (2025). Harnessing the Universal Geometry of Embeddings. http://arxiv.org/abs/2505.12540