APIs are locking up - thanks to Gen AI

Show notes

Website: https://maximilian-schwarzmueller.com/

Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/

Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses

Show transcript

00:00:00: There's kind of a trend, if you wanna call it like

00:00:03: this, or definitely something we see

00:00:07: become more common over the last one or two years, which is

00:00:11: not really surprising, but nonetheless, something I wanna talk

00:00:14: about. The trend of APIs of

00:00:18: websites or web services becoming more and

00:00:22: more locked down or private or more

00:00:26: expensive to use, whatever you wanna call it.

00:00:28: And the most recent example, which is the reason why I'm creating this video

00:00:32: here, is the Reddit API,

00:00:35: because, uh, two days ago, there's been a post by the,

00:00:39: the Reddit team in the Reddit development forum where they

00:00:43: essentially announced that the usage of their API now

00:00:47: needs approval. So there is an approval

00:00:50: process for using the API for

00:00:54: responsible use to support responsible

00:00:58: builders, and I'll get back to that and what that means, but it's

00:01:02: kind of in line, if you wanna call it like this, with what

00:01:05: Twitter, X, did two years ago already.

00:01:09: They made their API really

00:01:13: expensive to use, at least at scale.

00:01:16: So if you wanna interact with Twitter, with X, programmatically,

00:01:20: if you wanna build yet another social media scheduling tool and you wanna

00:01:24: support X, well, that could get expensive

00:01:28: depending on how you build it, uh, because the, the free

00:01:32: usage is quite limited. You can, for example, read

00:01:36: 100 posts per month and write 500

00:01:40: posts per month, which might be more than enough

00:01:44: for your own little tool that you're building for

00:01:47: yourself, but if you are building a SaaS product

00:01:51: on top of the X API, that will not suffice, so you'll have to

00:01:55: pay, but even the basic tier might not be enough in that

00:01:58: case. It might be, it might not be. And the pro tier might

00:02:02: still not be enough. Now chances are, it may be enough, but it

00:02:06: is also, uh, quite expensive.
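For illustration, interacting with X programmatically inside that free write quota could look roughly like this minimal sketch using the tweepy library. The credentials are placeholders and the quota numbers are just the ones mentioned in this episode; check the current X API docs and pricing before relying on any of this.

import tweepy  # third-party client library for the X (Twitter) API

# Placeholder credentials from the X developer portal.
client = tweepy.Client(
    consumer_key="YOUR_CONSUMER_KEY",
    consumer_secret="YOUR_CONSUMER_SECRET",
    access_token="YOUR_ACCESS_TOKEN",
    access_token_secret="YOUR_ACCESS_TOKEN_SECRET",
)

# Create a single post; on the free tier, writes are capped
# (roughly 500 posts per month, as mentioned in this episode).
response = client.create_tweet(text="Scheduled post from my own little tool")
print(response.data["id"])

For a personal tool that posts a few times a day this stays within the free tier; a SaaS scheduler posting on behalf of many users would blow past it quickly, which is exactly the point made above.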

00:02:10: And now for Reddit, there, as I

00:02:14: mentioned, i- it's not about paying or about, uh, the

00:02:18: price they're asking, but it's about an approval

00:02:22: process so that not every application can start using their

00:02:26: API. And the question of course is, why are companies

00:02:30: doing that? Well, there are a couple of reasons and

00:02:34: one big important reason. Obviously, you could say, why would they

00:02:38: not do it? Why would they give you access to their data for

00:02:41: free? And you could argue, well, because in the

00:02:44: past, before AI, they may have

00:02:48: benefited from doing so. Because if people can build

00:02:52: products on top of, let's say, X,

00:02:56: if I can build a social media scheduling

00:02:59: application, that might be in Twitter's and X's

00:03:03: interest because more posts on X could

00:03:07: mean more engagement, uh, more people reading and

00:03:10: interacting with those posts, so that might not be too

00:03:14: bad. And there is a reason why you can

00:03:17: write more than you can read. You

00:03:21: could think that you should be able to read more than write because

00:03:25: writes are more expensive for their database, for their infrastructure,

00:03:28: but it's the opposite. They allow you to write more than you can

00:03:32: read. And just as a side note, X also

00:03:36: has a new program which they're testing, it's a pilot right

00:03:39: now, uh, where they, um, want to give

00:03:43: you, uh, pay per use access to their API.

00:03:46: But it stays the same. You have to pay to use it and it can get

00:03:50: expensive. Now, why are companies doing that?

00:03:53: Well, the big answer, of course, is AI, or

00:03:56: specifically, of course, gen AI. Because

00:04:00: with the rise of gen AI, it has become

00:04:04: clear that all that data which these companies own, all

00:04:08: these Reddit posts, all the posts on X,

00:04:12: that is a valuable resource because

00:04:16: those gen AI models, of

00:04:19: course, need data in their training

00:04:23: or for their training process. Data is the

00:04:27: most important thing there because, as we all know, ChatGPT

00:04:31: or the GPT models were trained essentially on all the

00:04:34: publicly available data you could find on the internet.

00:04:39: Um, and still, these models will need vast amounts of

00:04:42: data for their training. Nowadays, of course, there is the entire concept or

00:04:46: idea of using synthetic data as well as real

00:04:50: data for the training process, and to my understanding, that

00:04:54: seems to work quite well, though we'll see if that maybe still is

00:04:58: a problem and there is like a, a ceiling due to the limited

00:05:02: data that's available because the entire data on the internet has already been

00:05:06: consumed, so now you're just generating more

00:05:09: from that knowledge that was gathered from that

00:05:13: internet data, so there might be a ceiling there.

00:05:16: It's not entirely clear yet. Um, but anyways,

00:05:19: data is super important and of course there's still new data being

00:05:23: generated. Now more data than ever is generated by AI though, to be

00:05:27: fair, so that is synthetic data in the end,

00:05:31: lots of data is being generated, uh, including some data by humans

00:05:35: on X and Reddit every day, and of course,

00:05:39: those platforms don't want to give away that data for free

00:05:43: anymore. They did in the past because they didn't see what was

00:05:47: coming with, uh, large language models and, um,

00:05:51: now of course they want to protect their data because a site

00:05:54: like X of course sits on lots

00:05:59: of data, lots of valuable posts, at least to some degree,

00:06:03: let's be honest. Most of the posts are total BS, but there are at least

00:06:06: some decent posts there, and they're definitely valuable in the sense

00:06:10: of being valuable for training. And those sites don't

00:06:14: wanna give that data away for free anymore, which is why

00:06:18: they're locking it down. There also is a reason why we

00:06:22: see more and more web scraping

00:06:25: businesses, uh, coming up almost every day because now

00:06:29: with large language models, even if we ignore the training

00:06:32: part, many of the applications that we wanna build

00:06:36: with the help of large language models or on top of large language models

00:06:40: will need access to recent data. If you're building a

00:06:43: smart chatbot and you're using OpenAI's models under the

00:06:47: hood, you probably wanna pull in some recent data to make that

00:06:51: chatbot more useful. You wanna add web search, you

00:06:55: wanna be able to have your chatbot answer

00:06:58: questions related to the most recent posts on X.

00:07:02: So you wanna pull that data into your application and

00:07:06: then enrich your chat history and the prompts you send to

00:07:10: the large language model with that data that then hopefully

00:07:14: allows the model to generate a better answer, uh, for what the user

00:07:17: asked. And that's why these sites are

00:07:21: kind of locking down their APIs to make it harder to get

00:07:25: access to the data because in the past they gave it away for free.
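As a rough sketch of that prompt-enrichment pattern: the fetch_recent_posts helper below is hypothetical, standing in for whatever API or crawling source you are actually allowed to use, and the model name is just an example; the call itself follows the openai Python SDK's chat completions interface.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_recent_posts(topic: str) -> list[str]:
    # Hypothetical helper: in a real app this would call an API you
    # have access to, or a web search / crawling service.
    return ["Example recent post about " + topic]

def answer_with_recent_data(question: str, topic: str) -> str:
    recent = "\n".join(fetch_recent_posts(topic))
    # Enrich the prompt with freshly fetched data so the model can
    # answer questions about things that happened after its training.
    messages = [
        {"role": "system", "content": "Answer using the provided recent posts."},
        {"role": "user", "content": f"Recent posts:\n{recent}\n\nQuestion: {question}"},
    ]
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return completion.choices[0].message.content

print(answer_with_recent_data("What are people saying about the Reddit API?", "Reddit API"))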

00:07:28: They don't wanna do that anymore. Obviously, there still are ways to

00:07:32: get that data, as I mentioned. There's a plethora of web

00:07:36: crawling companies and not all of these companies, uh,

00:07:40: respect the fact that certain sites don't want to get

00:07:43: crawled. Now I actually did a livestream on

00:07:47: the topic of building our own web crawler, um,

00:07:51: a while back, and I'll, uh, provide a link to that

00:07:55: episode. You can watch the full, uh, livestream episode, uh,

00:07:59: below this episode, of course. So you can build your own

00:08:03: crawler and I did that with Crawl4AI.

00:08:05: And essentially what you're building there is, um, an

00:08:09: application that spins up a browser and

00:08:13: simulates being a user and visiting a website to then

00:08:16: extract that website content, to extract the

00:08:20: rendered HTML content and so on. Um, that is

00:08:24: how you can build and use a crawler, as I showed in the livestream.
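The core of that idea, using Crawl4AI's async crawler, looks roughly like this. This is a minimal sketch, not the exact code from the livestream; see the Crawl4AI docs for the current API surface.

import asyncio

from crawl4ai import AsyncWebCrawler  # pip install crawl4ai

async def main() -> None:
    # Spins up a (headless) browser, loads the page like a real user
    # would, and returns the rendered content.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://maximilian-schwarzmueller.com/")
        print(result.markdown)  # the extracted page content as Markdown

asyncio.run(main())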

00:08:27: I just, um, built it to crawl my own website, to make

00:08:31: that clear. So I, uh, did not start crawling X

00:08:35: there because web scraping and crawling is

00:08:39: kind of a gray zone. And, uh, there are

00:08:43: many sites that clearly state in their terms that they do

00:08:47: not allow web crawling. So you are violating those terms if you

00:08:51: do, which is why sites like Firecrawl, for

00:08:54: example, won't crawl X links. If I

00:08:59: take an X link and I try to scrape

00:09:03: that, uh, I'll get an error that this is not

00:09:06: supported. Or actually, here it doesn't even start, as

00:09:10: it seems; in the past, I did get an error.

00:09:13: So there are sites that don't allow that.
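For context, an attempt like that would look roughly like the sketch below. It assumes the FirecrawlApp client and scrape_url method from the firecrawl-py package; the exact method names and the error behavior for X links may differ between SDK versions, so treat this purely as an illustration.

from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")  # placeholder key

try:
    # An x.com link is rejected because Firecrawl does not
    # support crawling X, as described in this episode.
    result = app.scrape_url("https://x.com/maxedapps")
    print(result)
except Exception as error:
    print("Scrape failed:", error)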

00:09:15: There probably also are sites that do, and you can

00:09:19: definitely build your own crawler that doesn't care

00:09:23: about any of that and extracts any content from any

00:09:27: site you want. Now I will say, of course, that many

00:09:31: sites also try to implement some technical hurdles that make it

00:09:35: harder to crawl them, but in the end if you really want to, you can

00:09:38: get around pretty much all of them.

00:09:41: It might not be legal, it might be violating their terms, but

00:09:45: it is possible. Because, and that takes us back

00:09:49: to the beginning, to the main topic, because of course all that data

00:09:53: is super valuable. However, this does have

00:09:57: I believe a real downside or an

00:10:01: implication that's not great for us as

00:10:04: developers. Because I totally get that these sites don't wanna give

00:10:08: away access to their data, and just as a side note,

00:10:11: it's kind of not their data. It's the data of the users using the

00:10:15: site, but that's a whole different story.

00:10:18: But I get that they don't wanna give away access to this data.

00:10:21: The problem of course is that, kind of as

00:10:25: an additional casualty, we as developers are limited

00:10:29: in what we can build, uh, with those APIs.

00:10:32: Sure, you might get approval for the Reddit API if you're building

00:10:36: something they're happy with. But of course you also might

00:10:40: not get that approval. And for X you have

00:10:44: to pay quite a bit of money depending on what you're

00:10:47: building, even if what you're building has nothing to do with

00:10:51: extracting that data and using it for model

00:10:55: training or anything like that. It limits the

00:10:59: amount of useful stuff we can build on top of other

00:11:02: services and sites. And that of course in turn, um,

00:11:06: might also hurt those sites because some useful products from which they might

00:11:10: benefit then maybe won't get built.

00:11:13: But of course I guess that is a price they're happy to pay

00:11:17: because either they'll get paid by these

00:11:21: people that build products on top of them, or at least they prevent that

00:11:24: data extraction. So yeah, uh, I expect

00:11:28: that we will see more sites and

00:11:32: services, uh, locking down their APIs.

00:11:36: I think we'll see more sites becoming pretty

00:11:39: protective about their data, pretty aggressive

00:11:42: against crawlers, which is absolutely their

00:11:46: right. Um, but I think that might also hurt

00:11:50: us as developers because it kind of limits the, the stuff we

00:11:54: can build around other popular services and

00:11:57: sites. These are my two cents. What do you think about this

00:12:01: topic?
