NEW Subquadratic LLM: Hype or Game Changer?

Show notes

Dive DEEP with my courses: https://academind.com/courses Or get the all-access membership: https://academind.com/membership Announcement post: https://subq.ai/how-ssa-makes-long-context-practical

Website: https://maximilian-schwarzmueller.com/

Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/

Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses

Show transcript

00:00:00: A couple of hours ago there was a pretty big announcement or some pretty big hype.

00:00:06: We don't know yet, and I definitely wouldn't rule out the pointless-hype part, but if it's true, it's indeed a big announcement, because Alexander Wedden, whom I didn't know and you probably didn't either, announced subq, which stands for subquadratic: a major breakthrough in LLM intelligence.

00:00:29: And what he announced here is a brand-new type of large language model that excels at long-context tasks without losing the intelligence, at least that's what he claims.

00:00:46: Intelligence in quotes, because the models are just generating tokens, but that is what gives them their intelligence.

00:00:52: So without losing the intelligence you're used to from current frontier models like Opus or GPT.

00:01:01: Now, what he mentions in the announcement post on X, and there also is an announcement blog post with more technical details at which we'll have a look, because we'll dive deep in this episode.

00:01:12: Five percent of what Opus costs.

00:01:30: He also promises that their initial model will have a twelve-million-token context window, which, just to put the number into perspective, means you can fit entire codebases, huge codebases, in that context window!

00:01:45: You can fit multiple large legal documents in there, and that's why models like this, if they exist, could be super useful and totally game-changing.

00:01:58: No other way of putting it.

00:02:00: If they work. We don't have a lot of details yet, I'll get back to that.

00:02:03: But if they work, that of course means that all these workarounds we're using right now, like sub-agents, RAG and so on, which are all workarounds around

00:02:14: the problem that the model only sees a small part of the thing it should see.

00:02:21: So if you're working on a codebase, existing frontier models, depending on the size of your codebase, can't see the entire codebase; they cannot load it.

00:02:30: So if you're asking it to change something, you have to hope that the model finds the relevant parts in your codebase to make that change.

00:02:40: That becomes more and more of a problem the bigger the codebase or the amount of documents you want the model to work on.

00:02:48: So if you have a model that can reliably use a twelve-million-token context window with good quality, that naturally would be a game changer.

00:02:59: Speaking of game-changing, we'll dive deep in this video, and I dive even deeper in all my courses.

00:03:06: So if you're interested in practically using tools like Claude Code, Codex, other AI tools, or coding.

00:03:14: Or the combination of all that.

00:03:16: then my courses may be worth a look.

00:03:18: They're practical, they're hands-on, they're in-depth, and you can get the individual courses or the membership, which gives you access to all the courses for one monthly or annual price. Links below.

00:03:31: So let's dive in a bit deeper now, and as mentioned, there is an announcement blog post with some technical details, but not a lot.

00:03:39: To be very clear here.

00:03:41: There's a lot of information missing, and we also don't have a lot of benchmarks.

00:03:46: Specifically, they only published three benchmarks: the RULER benchmark, which tests retrieval and reasoning behaviors beyond simple needle lookup, including multi-hop retrieval, aggregation, variable tracking and selective filtering.

00:04:01: So that is a benchmark which, in the end, is all about a model finding multiple pieces of relevant information in a relatively big context window.

00:04:10: One hundred twenty-eight thousand tokens, so not a super large context window, not nearly close to the twelve million they promised, but also not just five K or so.

00:04:22: This is a benchmark that tests how well a model can find and piece together different parts from a more or less large context window or document base.

00:04:33: And here their model is on the same level as Opus four point six in that post.

00:04:38: They also mentioned another benchmark, the MRCR v2 benchmark, which is all about long-context retrieval tasks,

00:04:47: where their model is in the range, as they stated, of Opus four point six, though it's, yeah, it's in the range if you look at all the other results here.

00:04:57: But it's definitely worse, which of course is interesting, since their entire thing is long-context retrieval.

00:05:07: But then again, you could also argue that for super-long context window use cases other models aren't usable at all, whilst theirs might still give very good results, which may be better than nothing, and their models can definitely improve over time.

00:05:28: So I wouldn't take this as a super bad sign for the initial model.

00:05:33: It's just something worth noting, and of course it is also worth noting that it's far better than Gemini three point one Pro or Opus four point seven in that table.

00:05:45: And they also released one benchmark which I found interesting, which was about coding-related tasks.

00:05:50: Now... I will say that all these benchmarks, I'm not a huge fan of them. We all know they can kind of be gamed, many of them at least. Models

00:06:02: can deliberately or inadvertently be fine-tuned

00:06:06: or optimized to perform well in benchmarks.

00:06:10: We had plenty of such cases in the past, but still, they give us something to look at, and I find this software engineering benchmark here interesting, because here we can see that their model is pretty much in the range of the Opus models.

00:06:29: And that, of course, shows it's not just able to find information in long context windows, lots of documents, big codebases, but also to do something useful with it, so it can generate meaningful, good code

00:06:45: as a result of its intelligence and of the data it is able to retrieve in these long context windows, so to say.

00:06:53: So it's not just about retrieving, it's also about doing useful stuff, and it seems to be good there.

00:06:58: But as mentioned, that is about it.

00:07:01: We got no other deep dives or technical details.

00:07:05: There is no model card yet, and therefore all we have is a description, essentially, of how their model uses sparse attention instead of dense attention to make these long-context tasks work, or to make the model work efficiently in long-context-window scenarios, and how it achieves its speed-up and cost efficiency.

00:07:36: So let's take a look at dense versus sparse attention.

00:07:43: Now dense attention is what you have in the current frontier models.

00:07:48: So your GPT five point five, Opus four point seven, all the other models.

00:07:52: These are all dense models, which essentially means that for every new token, let's say token D, in order to generate that token, all other tokens have to be evaluated, and the connections between these tokens have to be evaluated.

00:08:10: Because the entire idea in large language models is that you derive a future token, which could be an entire word or part of one, based on what came before that token.

00:08:21: So if, for example, the text so far is "A contract can be terminated at any", then the next word thereafter is what you want to predict.

00:08:33: You may have asked a model, hey when can I terminate my contract?

00:08:37: And you may have fed that contract as a PDF document or as plain text into your prompt as well.

00:08:44: So the prompt in front of this sentence, which the model is generating as an output, is your question and then maybe some other context, the contract for example, right?

00:08:58: That is how we currently use models.

00:09:00: And in order to produce this token here, and in order to produce each token that came in front of it, the model basically had a look at the entire conversation, all the tokens in there, so that's your question and any additional context you put in there, and it split that into multiple tokens and then combined all these tokens, or calculated weights in the end, based on all the combinations of the prior tokens.

00:09:29: So for example, if that were our entire conversation, obviously deliberately short, it's an example, then this is how it would have been split up into tokens for the GPT-5 models, for example.
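
For illustration, here's a rough way to inspect such a split yourself, using the open tiktoken library as a stand-in; the exact tokenizer any given frontier model uses may differ, and the printed pieces below are only an example.

# Tokenize a short prompt and show the individual pieces (illustrative sketch).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # a tokenizer used by recent OpenAI models
text = "When can I terminate my contract?"
token_ids = enc.encode(text)                # the model sees these integer IDs
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)   # roughly something like ['When', ' can', ' I', ' terminate', ' my', ' contract', '?']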

00:09:44: So some tokens are just a word or a word with a blank in front, and in order to generate that next token, all previous tokens are combined with each other.

00:09:59: To understand the meaning!

00:10:01: In the end.

00:10:02: Because of course a question mark has a very different meaning and implication for a future token depending on what came in front of it.

00:10:12: So that question mark is combined with all previous tokens, and it's the combination of these combinations that is then used to derive this final token.

00:10:21: That's, at a high level, how you can think about dense attention.
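
To make that concrete, here is a minimal, purely illustrative NumPy sketch of causal dense self-attention. It shows the all-pairs score matrix that makes the cost quadratic; it is not how any specific frontier model is implemented.

# Minimal sketch of causal dense self-attention: every new token attends to ALL
# previous tokens, which is what makes the cost grow quadratically with length.
import numpy as np

def dense_causal_attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of query/key/value vectors, one row per token
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # (seq_len, seq_len): every token vs. every token
    mask = np.triu(np.ones_like(scores), k=1)   # forbid attending to future tokens
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all visible positions
    return weights @ V                               # weighted mix of all previous tokens

# toy example: 6 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
out = dense_causal_attention(x, x, x)   # the O(n^2) score matrix is the bottleneck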

00:10:31: Very inefficient, but it's kind of the best we have right now, at least when it comes to intelligence and quality.

00:10:39: But it is quadratic, because it's n times n, which means in order to derive a new token you need all previous tokens.

00:10:49: There are optimization mechanisms like KV caching, which in the end caches the intermediate results, the key and value vectors, that have already been calculated.

00:10:59: So for a new token you don't need to recalculate all the previous combinations from scratch, but you still calculate its attention by comparing it with all the previously cached entries.

00:11:12: So you still have a quadratic situation here.
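
A minimal sketch of the KV-caching idea, as an assumption about the general technique rather than any vendor's implementation: work per new token becomes linear in the current length, but the total over a whole generation is still quadratic.

# Keys and values of earlier tokens are stored so each new token only computes its
# own query against the cached entries.
import numpy as np

class KVCache:
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        # store this token's key/value so later tokens never recompute them
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def attend(self, q):
        # the new token's query is still compared against ALL cached keys
        scores = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values

rng = np.random.default_rng(1)
d = 8
cache = KVCache(d)
for step in range(6):                    # autoregressive generation, one token at a time
    k, v, q = rng.normal(size=(3, d))
    cache.append(k, v)
    context_vector = cache.attend(q)     # O(current_length) per step -> O(n^2) overall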

00:11:15: And that of course is inefficient and slow, which is why the frontier models we have right now are very compute-hungry and slow, especially when you get into the higher context window areas, and why there are pretty strict context window size limits, because since it's quadratic, a twelve-million-token context window size is pretty much impossible

00:11:42: to compute.

00:11:43: It would take forever, and compute time is just one dimension; the memory that must be reserved is another one.

00:11:50: So that's how dense models work in a nutshell and what their limitations are.

00:11:55: Now, the opposite, or an alternative approach, that is used by that new model, the subq model that was announced yesterday, is to use sparse attention.

00:12:06: Now, how does sparse attention work?

00:12:09: The idea with sparse attention is that in order to calculate a new token, you don't look at all the previous tokens.

00:12:18: You don't have the combinations of all your previous tokens, but just of a few selected tokens.

00:12:23: So, for example, if you want to derive the token D here, you may be looking at B and C but not at A. Now, of course, the big question then is: how do you decide which previous tokens to look at, or which previous tokens are interesting for producing that new token?
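
Here is a minimal sketch of that idea, purely illustrative and not subq's actual mechanism: attention is computed only over a selected subset of earlier positions, so the work per token depends on the size of that subset, not on the full sequence length.

# Sparse attention over a hand-picked subset of earlier tokens.
import numpy as np

def sparse_attention(q, K, V, selected):
    # q: (d,) query of the new token; K, V: (seq_len, d);
    # selected: indices of the previous tokens we actually attend to
    Ks, Vs = K[selected], V[selected]
    scores = Ks @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ Vs

rng = np.random.default_rng(2)
K = rng.normal(size=(3, 8))    # tokens A, B, C already processed
V = rng.normal(size=(3, 8))
q = rng.normal(size=8)         # query for the token D we are about to produce
out = sparse_attention(q, K, V, selected=[1, 2])   # look at B and C, skip A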

00:12:42: And there are different approaches, because this model is not the first sparse attention model, but sparse attention models haven't really taken off and have had serious limitations.

00:12:57: For example, one way is to use a local window approach.

00:13:01: Now what does that mean?

00:13:02: That means, in order to produce a new token, let's say token number five, the fifth token in the sequence,

00:13:11: we take a look at just the two tokens before it.

00:13:16: So tokens three and four, for example.

00:13:19: So you have a sliding window of tokens, and you always just take a look at the few tokens directly in front of the token you're about to generate.
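
As a tiny sketch of that selection rule, under the assumption of a window of two tokens:

# Sliding-window selection: each new token only sees the last `window` positions.
def local_window_indices(position, window=2):
    # indices of the tokens a sliding-window model would look at for this position
    return list(range(max(0, position - window), position))

# token number five (index 4) only sees tokens three and four (indices 2 and 3):
print(local_window_indices(4, window=2))   # -> [2, 3]; everything earlier is ignored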

00:13:27: Now, as you can imagine, this has some serious limitations, because if I'm only looking at the last few tokens, if I for example wonder when a contract can be terminated,

00:13:39: then that next token

00:13:49: that's about to be predicted has no idea of what came before in the context.

00:13:54: So that's not useful.

00:13:55: You can have an unlimited context window size with this approach, but all that context doesn't matter, so that's the obvious limitation.

00:14:03: Another approach is a so-called global token approach.

00:14:07: Here the idea is that you have a global summary token, so, on a high level,

00:14:13: you can think of this as special tokens coming at the beginning of the token sequence, inserted by the model, which summarize the tokens after them, and then, for predicting the next token, that global token is taken into account.
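
A minimal sketch of that idea, again an assumption about the general technique: one summary vector stands in for the whole earlier context, and new tokens attend to that summary plus a small local window instead of everything.

# Global-token attention: one summary entry plus a short local window.
import numpy as np

def global_token_attention(q, K, V, summary_k, summary_v, window=2):
    # combine one global summary entry with the last `window` local tokens
    position = K.shape[0]
    local = list(range(max(0, position - window), position))
    Ks = np.vstack([summary_k[None, :], K[local]])
    Vs = np.vstack([summary_v[None, :], V[local]])
    scores = Ks @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ Vs

rng = np.random.default_rng(3)
K, V = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))   # 100 earlier tokens
summary_k, summary_v = K.mean(axis=0), V.mean(axis=0)         # crude stand-in for a learned summary
q = rng.normal(size=8)
out = global_token_attention(q, K, V, summary_k, summary_v)   # attends to 3 entries, not 100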

00:14:33: Now that may work very well if we go back to this example here with the legal text, which you have passed into a model in your prompt.

00:14:41: If that summary that was generated for your conversation includes the contract termination terms,

00:14:50: then of course this next token can be predicted based on that summary.

00:14:55: But if you're unlucky, well, then you're out of luck and you're back to the state where the information is totally missing.

00:15:05: So a global token approach can work.

00:15:08: But of course, the longer your context window gets,

00:15:12: the more generic the summary gets.

00:15:14: I mean that's easy to imagine.

00:15:15: If you have like a hundred-page PDF document and you were to summarize that in a sentence or two, it would be very unspecific, right?

00:15:23: So of course predicting the next token based on that summary won't really work.

00:15:29: Now, another approach would be to use a router, which means you have like an extra neural network.

00:15:38: So you have two models, essentially: your large language model, and then you have an extra routing model.

00:15:44: And that routing model takes a look at the prompt by the user, or at the context of the next token to be generated, and routes it to the other tokens it deems relevant.
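
A minimal, hypothetical sketch of that routing idea; the scoring network here is just a single linear layer for illustration and is not subq's published design.

# Router-based selection: a small, cheap scorer picks the top-k earlier tokens,
# and full attention is then computed only over those.
import numpy as np

rng = np.random.default_rng(4)
d = 8
W_router = rng.normal(size=(d, d)) * 0.1   # the "extra" routing network (here: one linear layer)

def route_top_k(q, K, k=4):
    # cheap relevance score for every earlier token, then keep only the best k
    scores = K @ (W_router @ q)
    return np.argsort(scores)[-k:]

K = rng.normal(size=(200, d))              # 200 earlier tokens
V = rng.normal(size=(200, d))
q = rng.normal(size=d)
selected = route_top_k(q, K, k=4)          # the router decides which tokens matter
# full attention is then computed only over `selected` (see the sparse_attention sketch above)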

00:16:00: But now that of course means you have a routing model, which somehow either goes back into the quadratic attention area itself or is very unspecific, and you're relying on that.

00:16:15: So again, you're either going back to the quadratic complexity and you're not gaining much compared to a dense model, or you don't do that and you'll have some quality loss because the router isn't really good.

00:16:29: So just as with the summary approach, you would be hoping that the router activates the right tokens for predicting the next token.

00:16:40: And that is why sparse attention hasn't really taken off thus far, because all these different approaches have meaningful tradeoffs, and to this point there hasn't been a sparse-attention model that produced quality comparable with current frontier dense models and was able to act over a big context window.

00:17:08: And they promise to change this with their new model.

00:17:13: In that announcement blog post, they mention that their model does content-dependent selection: for each query,

00:17:21: the model selects which parts of the sequence are worth attending to and computes attention exactly over those positions.

00:17:28: So in the end we're kind of back to this routing approach, but they promise here, they mention, that their mechanism seems to be very efficient at activating the right tokens.

00:17:45: They mention that dense attention assumes every pair might matter, so it evaluates all of them; in practice, almost none do.

00:17:52: SSA, which stands for Subquadratic Selective Attention, which is their approach, removes that assumption.

00:17:58: It does not approximate attention!

00:18:03: That is their approach.

00:18:08: They're doing content-dependent routing to activate the right tokens, or use the right tokens, for predicting the next token, and that's what gives them their efficiency boost!
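
How that might fit together, as a rough sketch only, since the actual selection mechanism is not published: the selection depends on the content of the query, and the attention over the selected positions is exact softmax attention, just computed over far fewer positions.

# Content-dependent selection followed by exact attention over the selected positions.
import numpy as np

def ssa_like_step(q, K, V, select_fn, k=16):
    selected = select_fn(q, K, k)              # content-dependent: depends on the query and the keys
    Ks, Vs = K[selected], V[selected]
    scores = Ks @ q / np.sqrt(q.shape[-1])     # exact attention math, just over fewer positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ Vs                        # cost ~ O(k), not O(seq_len)

# hypothetical selector for demonstration: pick the k keys most similar to the query
def toy_selector(q, K, k):
    return np.argsort(K @ q)[-k:]

rng = np.random.default_rng(5)
K, V = rng.normal(size=(500, 8)), rng.normal(size=(500, 8))
q = rng.normal(size=8)
out = ssa_like_step(q, K, V, toy_selector, k=16)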

00:18:19: And we have yet to see how well this actually works, because as mentioned, we've got a very limited set of benchmarks here, and not a lot of other, or rather no other, benchmarks.

00:18:32: We have no model card, we have no details on how exactly their content-dependent selection works, and therefore we have a lot of question marks here!

00:18:43: And one thing is clear: AI is obviously a useful tool, and I use it every day; you probably use it every day.

00:18:56: And tools like Codex or Claude Code are very useful... I have no doubt about that.

00:19:05: But we also learned that we are in an industry with a lot of hype!

00:19:10: We're in the transition period.

00:19:11: Everything's changing right now, and not all promises will get realized, will materialize into something actually useful.

00:19:27: I mean, take the models by Meta, for example, which were dense models.

00:19:33: The Llama 4 models had amazing benchmark numbers but weren't that great.

00:19:39: So there are a lot of hyped-up examples, and that's just one example, of course.

00:19:44: There are many examples out there.

00:19:48: It's definitely worth being cautious, but they will publish these models, and you can apply for early access right now.

00:19:56: I did, but I didn't get access

00:19:58: yet. If these models do live up to their promises, if they are useful and intelligent across large context window sizes,

00:20:09: that of course would change a lot.

00:20:12: That would help with the compute constraints we have right now, because there is not even close to enough compute out there in the world.

00:20:20: We need way more data centers, chips, electricity, and everything.

00:20:24: So having a model that is way more efficient would help with that.

00:20:31: Well, maybe we will then use it that much more that the problem stays the same, but still, it would definitely enable more use right now, and of course it would unlock brand-new use cases.

00:20:42: It'd make it possible to simply shove an entire codebase in there and act on that, so all these workarounds we're using right now would go away!

00:20:51: We wouldn't necessarily need sub-agents or RAG systems if that worked.

00:20:59: But that's a "would", of course... if it lives up to the big promises they're making.

00:21:06: If it does, they definitely founded a billion, multi-billion, or trillion-dollar company there.
