APIs are locking up - thanks to Gen AI

Show notes

Website: https://maximilian-schwarzmueller.com/

Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/

Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses

Show transcript

00:00:00: There's kind of a trend, if you wanna call it like

00:00:03: this, or definitely something we see

00:00:07: become more common over the last one or two years, which is

00:00:11: not really surprising, but nonetheless, something I wanna talk

00:00:14: about. The trend of APIs of

00:00:18: websites or web services becoming more and

00:00:22: more locked down or private or more

00:00:26: expensive to use, whatever you wanna call it.

00:00:28: And the most recent example, which is the reason why I'm creating this video

00:00:32: here, is the Reddit API,

00:00:35: because, uh, two days ago, there's been a post by the,

00:00:39: the Reddit team in the Reddit development forum where they

00:00:43: essentially announced that the usage of their API now

00:00:47: needs approval. So there is an approval

00:00:50: process for using the API for

00:00:54: responsible use to support responsible

00:00:58: builders, and I'll get back to that and what that means, but it's

00:01:02: kind of in line, if you wanna call it like this, with what

00:01:05: Twitter, X, did two years ago already.

00:01:09: They made their API really

00:01:13: expensive to use, at least at scale.

00:01:16: So if you wanna interact with Twitter, with X, programmatically,

00:01:20: if you wanna build yet another social media scheduling tool and you wanna

00:01:24: support X, well, that could get expensive

00:01:28: depending on how you build it, uh, because the, the free

00:01:32: usage is quite limited. You can, for example, read

00:01:36: 100 posts per month and write 500

00:01:40: posts per month, which might be more than enough

00:01:44: for your own little tool that you're building for

00:01:47: yourself, but if you are building a SaaS product

00:01:51: on top of the X API, that will not suffice, so you'll have to

00:01:55: pay, but even the basic tier might not be enough in that

00:01:58: case. It might be, it might not be. And the pro tier might

00:02:02: still not be enough. Now chances are, it may be enough, but it

00:02:06: is also, uh, quite expensive.
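For illustration, interacting with X programmatically inside that free write quota could look roughly like this minimal sketch using the tweepy library. The credentials are placeholders and the quota numbers are just the ones mentioned in this episode; check the current X API docs and pricing before relying on any of this.

import tweepy  # third-party client library for the X (Twitter) API

# Placeholder credentials from the X developer portal.
client = tweepy.Client(
    consumer_key="YOUR_CONSUMER_KEY",
    consumer_secret="YOUR_CONSUMER_SECRET",
    access_token="YOUR_ACCESS_TOKEN",
    access_token_secret="YOUR_ACCESS_TOKEN_SECRET",
)

# Create a single post; on the free tier, writes are capped
# (roughly 500 posts per month, as mentioned in this episode).
response = client.create_tweet(text="Scheduled post from my own little tool")
print(response.data["id"])

For a personal tool that posts a few times a day this stays within the free tier; a SaaS scheduler posting on behalf of many users would blow past it quickly, which is exactly the point made above.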

00:02:10: And now for Reddit, there, as I

00:02:14: mentioned, i- it's not about paying or about, uh, the

00:02:18: price they're asking, but it's about an approval

00:02:22: process so that not every application can start using their

00:02:26: API. And the question of course is, why are companies

00:02:30: doing that? Well, there are a couple of reasons and

00:02:34: one big important reason. Obviously, you could say, why would they

00:02:38: not do it? Why would they give you access to their data for

00:02:41: free? And you could argue, well, because in the

00:02:44: past, before AI, they may have

00:02:48: benefited from doing so. Because if people can build

00:02:52: products on top of, let's say, X,

00:02:56: if I can build a social media scheduling

00:02:59: application, that might be in Twitter's and X's

00:03:03: interest because more posts on X could

00:03:07: mean more engagement, uh, more people reading and

00:03:10: interacting with those posts, so that might not be too

00:03:14: bad. And there is a reason why you can

00:03:17: write more than you can read. You

00:03:21: could think that you should be able to read more than write because

00:03:25: writes are more expensive for their database, for their infrastructure,

00:03:28: but it's the opposite. They allow you to write more than you can

00:03:32: read. And just as a side note, X also

00:03:36: has a new program which they're testing, it's a pilot right

00:03:39: now, uh, where they, um, want to give

00:03:43: you, uh, pay per use access to their API.

00:03:46: But it stays the same. You have to pay to use it and it can get

00:03:50: expensive. Now, why are companies doing that?

00:03:53: Well, the big answer, of course, is AI, or

00:03:56: specifically, of course, gen AI. Because

00:04:00: with the rise of gen AI, it has become

00:04:04: clear that all that data which these companies own, all

00:04:08: these Reddit posts, all the posts on X,

00:04:12: that is a valuable resource because

00:04:16: those gen AI models, of

00:04:19: course, need data in their training

00:04:23: or for their training process. Data is the

00:04:27: most important thing there because, as we all know, ChatGPT

00:04:31: or the GPT models were trained essentially on all the

00:04:34: publicly available data you could find on the internet.

00:04:39: Um, and still, these models will need vast amounts of

00:04:42: data for their training. Nowadays, of course, there is the entire concept or

00:04:46: idea of using synthetic data as well as real

00:04:50: data for the training process, and to my understanding, that

00:04:54: seems to work quite well, though we'll see if that maybe still is

00:04:58: a problem and there is like a, a ceiling due to the limited

00:05:02: data that's available because the entire data on the internet has already been

00:05:06: consumed, so now you're just generating more

00:05:09: from that knowledge that was gathered from that

00:05:13: internet data, so there might be a ceiling there.

00:05:16: It's not entirely clear yet. Um, but anyways,

00:05:19: data is super important and of course there's still new data being

00:05:23: generated. Now more data than ever is generated by AI though, to be

00:05:27: fair, so that is synthetic data in the end,

00:05:31: lots of data is being generated, uh, including some data by humans

00:05:35: on X and Reddit every day, and of course,

00:05:39: those platforms don't want to give away that data for free

00:05:43: anymore. They did in the past because they didn't see what was

00:05:47: coming with, uh, large language models and, um,

00:05:51: now of course they want to protect their data because a site

00:05:54: like X of course sits on lots

00:05:59: of data, lots of valuable posts, at least to some degree,

00:06:03: let's be honest. Most of the posts are total BS, but there are at least

00:06:06: some decent posts there, and they're definitely valuable in the sense

00:06:10: of being valuable for training. And those sites don't

00:06:14: wanna give that data away for free anymore, which is why

00:06:18: they're locking it down. There also is a reason why we

00:06:22: see more and more web scraping

00:06:25: businesses, uh, coming up almost every day because now

00:06:29: with large language models, even if we ignore the training

00:06:32: part, many of the applications that we wanna build

00:06:36: with the help of large language models or on top of large language models

00:06:40: will need access to recent data. If you're building a

00:06:43: smart chatbot and you're using OpenAI's models under the

00:06:47: hood, you probably wanna pull in some recent data to make that

00:06:51: chatbot more useful. You wanna add web search, you

00:06:55: wanna be able to have your chatbot answer

00:06:58: questions related to the most recent posts on X.

00:07:02: So you wanna pull that data into your application and

00:07:06: then enrich your chat history and the prompts you send to

00:07:10: the large language model with that data that then hopefully

00:07:14: allows the model to generate a better answer, uh, for what the user

00:07:17: asked. And that's why these sites are

00:07:21: kind of locking down their APIs to make it harder to get

00:07:25: access to the data because in the past they gave it away for free.
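As a rough sketch of that prompt-enrichment pattern: the fetch_recent_posts helper below is hypothetical, standing in for whatever API or crawling source you are actually allowed to use, and the model name is just an example; the call itself follows the openai Python SDK's chat completions interface.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_recent_posts(topic: str) -> list[str]:
    # Hypothetical helper: in a real app this would call an API you
    # have access to, or a web search / crawling service.
    return ["Example recent post about " + topic]

def answer_with_recent_data(question: str, topic: str) -> str:
    recent = "\n".join(fetch_recent_posts(topic))
    # Enrich the prompt with freshly fetched data so the model can
    # answer questions about things that happened after its training.
    messages = [
        {"role": "system", "content": "Answer using the provided recent posts."},
        {"role": "user", "content": f"Recent posts:\n{recent}\n\nQuestion: {question}"},
    ]
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return completion.choices[0].message.content

print(answer_with_recent_data("What are people saying about the Reddit API?", "Reddit API"))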

00:07:28: They don't wanna do that anymore. Obviously, there still are ways to

00:07:32: get that data, as I mentioned. There's a plethora of web

00:07:36: crawling companies and not all of these companies, uh,

00:07:40: respect the fact that certain sites don't want to get

00:07:43: crawled. Now I actually did a livestream on

00:07:47: the topic of building our own web crawler, um,

00:07:51: a while back, and I'll, uh, provide a link to that

00:07:55: episode. You can watch the full, uh, livestream episode, uh,

00:07:59: below this episode, of course. So you can build your own

00:08:03: crawler and I did that with Crawl4AI.

00:08:05: And essentially what you're building there is, um, an

00:08:09: application that spins up a browser and

00:08:13: simulates being a user and visiting a website to then

00:08:16: extract that website content, to extract the

00:08:20: rendered HTML content and so on. Um, that is

00:08:24: how you can build and use a crawler, as I showed in the livestream.
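The core of that idea, using Crawl4AI's async crawler, looks roughly like this. This is a minimal sketch, not the exact code from the livestream; see the Crawl4AI docs for the current API surface.

import asyncio

from crawl4ai import AsyncWebCrawler  # pip install crawl4ai

async def main() -> None:
    # Spins up a (headless) browser, loads the page like a real user
    # would, and returns the rendered content.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://maximilian-schwarzmueller.com/")
        print(result.markdown)  # the extracted page content as Markdown

asyncio.run(main())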

00:08:27: I just, um, built it to crawl my own website, to make

00:08:31: that clear. So I, uh, did not start crawling X

00:08:35: there because web scraping and crawling is

00:08:39: kind of a gray zone. And, uh, there are

00:08:43: many sites that clearly state in their terms that they do

00:08:47: not allow web crawling. So you are violating those terms if you

00:08:51: do, which is why sites like Firecrawl, for

00:08:54: example, won't crawl X links. If I

00:08:59: take an X link and I try to scrape

00:09:03: that, uh, I'll get an error that this is not

00:09:06: supported. Or actually, here it doesn't even start, as

00:09:10: it seems; in the past, I did get an error.

00:09:13: So there are sites that don't allow that.
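For context, an attempt like that would look roughly like the sketch below. It assumes the FirecrawlApp client and scrape_url method from the firecrawl-py package; the exact method names and the error behavior for X links may differ between SDK versions, so treat this purely as an illustration.

from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")  # placeholder key

try:
    # An x.com link is rejected because Firecrawl does not
    # support crawling X, as described in this episode.
    result = app.scrape_url("https://x.com/maxedapps")
    print(result)
except Exception as error:
    print("Scrape failed:", error)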

00:09:15: There probably also are sites that do, and you can

00:09:19: definitely build your own crawler that doesn't care

00:09:23: about any of that and extracts any content from any

00:09:27: site you want. Now I will say, of course, that many

00:09:31: sites also try to implement some technical hurdles that make it

00:09:35: harder to crawl them, but in the end if you really want to, you can

00:09:38: get around pretty much all of them.

00:09:41: It might not be legal, it might be violating their terms, but

00:09:45: it is possible. Because, and that takes us back

00:09:49: to the beginning, to the main topic, because of course all that data

00:09:53: is super valuable. However, this does have

00:09:57: I believe a real downside or an

00:10:01: implication that's not great for us as

00:10:04: developers. Because I totally get that these sites don't wanna give

00:10:08: away access to their data, and just as a side note,

00:10:11: it's kind of not their data. It's the data of the users using the

00:10:15: site, but that's a whole different story.

00:10:18: But I get that they don't wanna give away access to this data.

00:10:21: The problem of course is that, kind of as

00:10:25: an additional casualty, we as developers are limited

00:10:29: in what we can build, uh, with those APIs.

00:10:32: Sure, you might get approval for the Reddit API if you're building

00:10:36: something they're happy with. But of course you also might

00:10:40: not get that approval. And for X you have

00:10:44: to pay quite a bit of money depending on what you're

00:10:47: building, even if what you're building has nothing to do with

00:10:51: extracting that data and using it for model

00:10:55: training or anything like that. It limits the

00:10:59: amount of useful stuff we can build on top of other

00:11:02: services and sites. And that of course in turn, um,

00:11:06: might also hurt those sites because some useful products from which they might

00:11:10: benefit then maybe won't get built.

00:11:13: But of course I guess that is a price they're happy to pay

00:11:17: because either they'll get paid by these

00:11:21: people that build products on top of them, or at least they prevent that

00:11:24: data extraction. So yeah, uh, I expect

00:11:28: that we will see more sites and

00:11:32: services, uh, locking down their APIs.

00:11:36: I think we'll see more sites becoming pretty

00:11:39: protective about their data, pretty aggressive

00:11:42: against crawlers, which is absolutely their

00:11:46: right. Um, but I think that might also hurt

00:11:50: us as developers because it kind of limits the, the stuff we

00:11:54: can build around other popular services and

00:11:57: sites. These are my two cents. What do you think about this

00:12:01: topic?
