LLMs, for the win

Mriganka Nath
5 min read · Sep 29, 2023

You would have to be living under a rock not to have heard of or used ChatGPT. It has taken the world by storm with responses that are very good, though not always reliable. The goal of this article is to give an intuition for LLMs (Large Language Models), the foundation of applications like ChatGPT, without the intricate details. We are at the dawn of Generative Artificial Intelligence (GenAI), and I believe everyone should have a basic understanding of how LLMs work so that they can create innovative and fun applications. Towards the end, I will also provide some additional resources for those who want to dive deeper or start building their own projects.

Also, from now on let's say Machine Learning (ML) rather than AI for the terms coming our way. AI is for people who don’t know that ML exists, but now that you do, you’re already a #MachineLearningEnthusiast, so feel free to add that to your LinkedIn bio 😉.

DATA!! It is the most crucial thing for any ML model. LLMs are trained for many language-related tasks like translation, generation, sentiment analysis, etc. They need textual data, and the internet is filled with it. LLMs such as the one behind ChatGPT, or others like GPT-3, Falcon and Llama, are trained on text scraped from a large part of the internet, and this isn't just blogs or articles; it can be any text, from code on GitHub to NSFW topics. LLMs are what we feed them: if we feed them the hate of Twitter, they will generate hateful content; if we feed them spiritual books, they will generate spiritual text, and so on. Hence, we have to be careful about the data. Some researchers even say LLMs just memorize the data like a parrot. (You can see for yourself that ChatGPT is not that accurate with maths.)

LLMs are based on an architecture called “Transformer” (figure below), which we will discuss later. Transformers are incredibly data-hungry, requiring a massive amount of data to learn patterns and generate answers to our queries. Moreover, the data found in articles, books and similar sources comes in diverse formats and may contain noise, so proper preprocessing is crucial before feeding it to our hungry LLMs. One important preprocessing step is ‘tokenization’, where we break the text into tokens (words or pieces of words) before feeding them to our Transformer model.

For example,

Sentence: “Shine on you crazy diamond”.

Tokenized Words:
1. “Shine”
2. “on”
3. “you”
4. “crazy”
5. “diamond”
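
The split above can be sketched in a few lines of Python. This is only a minimal illustration, assuming plain whitespace splitting; the `tokenize` and `build_vocab` helpers and the tiny `vocab` are made up for this example. Real LLM tokenizers usually split text into subword pieces, so their output would look different.

```python
def tokenize(sentence):
    """Split a sentence into word tokens on whitespace (a toy tokenizer)."""
    return sentence.split()


def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


tokens = tokenize("Shine on you crazy diamond")
vocab = build_vocab(tokens)              # hypothetical vocabulary built from one sentence
token_ids = [vocab[t] for t in tokens]   # a model works with ids like these, not raw strings

print(tokens)
print(token_ids)
```

The integer ids matter because the model never works with the strings themselves; each id is later used to look up a vector for that token.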

Very large Transformer models are trained on 1 trillion tokens or more; 1 trillion tokens is roughly 15 million books!

Figure: A very basic overview of the Transformer architecture, from https://jalammar.github.io/illustrated-transformer/

The ‘Transformer’ is an architecture that takes tokens as input and produces an ‘embedding’ for each token. Embeddings contain semantic information about the text. An embedding can be thought of as a mapping from a token or word to a vector, that is, a list of numbers written as [a, b, c, …]. For example, the sentence ‘Cogito ergo sum’ could be represented by:

1. Cogito -> [0.3, 1.6, -0.4, 0.5]
2. ergo -> [0.2, 0.3, -1.8, 0.6]
3. sum -> [0.1, 0.4, -0.4, 1.7]

Each word/token is represented by 4 numbers, hence we can say the embedding dimension is 4 here. (Here I have just written random numbers.)
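
In code, this mapping can be pictured as a simple lookup table. The sketch below reuses the made-up numbers from the example; the `embedding_table` dictionary and the `embed` helper are hypothetical names for illustration, not how a real model stores its embeddings.

```python
# Toy embedding table with the made-up numbers from the example above.
embedding_table = {
    "Cogito": [0.3, 1.6, -0.4, 0.5],
    "ergo": [0.2, 0.3, -1.8, 0.6],
    "sum": [0.1, 0.4, -0.4, 1.7],
}


def embed(tokens):
    """Look up the embedding vector of each token."""
    return [embedding_table[token] for token in tokens]


vectors = embed("Cogito ergo sum".split())
embedding_dim = len(vectors[0])  # every token maps to the same number of values

print(embedding_dim)  # 4
```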

These embeddings are not fixed in advance; they start as random values, like the ones in this example, and are adjusted during training. As the Transformer model is fed data, it begins to uncover patterns, and the embedding of each token is updated to reflect them. An interesting aspect of embeddings is that if two words are highly similar, such as ‘mobile’ and ‘phone’, their embeddings will also be similar.
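
One common way to measure how similar two embeddings are is cosine similarity, which is close to 1 when two vectors point in nearly the same direction. The sketch below assumes hand-picked vectors for ‘mobile’, ‘phone’ and an unrelated word, ‘banana’; none of these numbers come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, chosen so that 'mobile' and 'phone' point the same way.
embeddings = {
    "mobile": [0.9, 0.1, 0.8, 0.2],
    "phone": [0.8, 0.2, 0.9, 0.1],
    "banana": [-0.7, 0.9, -0.2, 0.6],
}

sim_close = cosine_similarity(embeddings["mobile"], embeddings["phone"])
sim_far = cosine_similarity(embeddings["mobile"], embeddings["banana"])

print(sim_close > sim_far)  # True
```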

The Transformer architecture (as in the above diagram) has two blocks: the encoder and the decoder. The encoder produces embeddings, whereas the decoder takes input from the encoder and turns it back into text! (Many LLMs, including the GPT family, use only the decoder block.)

So how is the Transformer so good at understanding text? It achieves this through the “Attention” mechanism. At a very high level, the attention mechanism helps the model focus on specific parts of the input, and on how they relate to the output, in order to make accurate predictions. For instance, when translating ‘Amor Fati’ to another language, the words may not keep the same order (perhaps ‘Fati’ becomes the first word in the translation), or the translated sentence may require additional words. Attention helps to align the words correctly.
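
To make this a little more concrete, here is a minimal sketch of scaled dot-product attention, the formulation used in the original Transformer, written in plain Python. The vectors for ‘Amor’, ‘Fati’ and the output word are made up for illustration, and the sketch skips the learned projection matrices and multiple attention heads that a real model has.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for query in queries:
        # Score each key against the query, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        output = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up vectors: two source tokens and one query for the word being produced.
source_tokens = ["Amor", "Fati"]
source_vectors = [[1.0, 0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0, 1.0]]
output_query = [[0.0, 1.0, 0.0, 1.0]]

outputs, weights = attention(output_query, source_vectors, source_vectors)
for token, weight in zip(source_tokens, weights[0]):
    print(f"{token}: {weight:.2f}")
```

With these toy numbers most of the weight lands on ‘Fati’, which is the sense in which the model "focuses" on one input word while producing an output word.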

Language models mark just the beginning of a vast area of applications. I believe chatbots are merely one of the many uses we are going to see. The Transformer architecture has also proven effective on image data, where such models are called Vision Transformers, enabling machines to understand images, a success highlighted in the recent update of ChatGPT. This area of research, which focuses on making machines understand various data domains such as text, images, video and audio, is called Multimodal Machine Learning. It’s the field I am currently working in.

So to sum it up, LLMs are the backbone of technologies like ChatGPT. Many more steps, which are a bit more technical, are involved in making LLMs behave like the chatbots we see, but once you grasp the main logic behind LLMs, you can start delving into more advanced topics.

And next time you are talking with ChatGPT, don’t be surprised by how it responds; underneath, it’s a Transformer model attempting to predict the next words by paying attention to your request.

Further readings

Feature photo generated using https://stablediffusionweb.com/#demo with prompt “A person trying to learn about LLMs in the Alps in spring season,
There are many words on the Laptop monitor that is in front, but is most curious about the word LLM.”


