APIs are locking up - thanks to Gen AI
Show notes
Website: https://maximilian-schwarzmueller.com/
Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/
Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses
Show transcript
00:00:00: There's kind of a trend, if you wanna call it like
00:00:03: this, or definitely something we see
00:00:07: become more common over the last one or two years, which is
00:00:11: not really surprising, but nonetheless, something I wanna talk
00:00:14: about. The trend of APIs of
00:00:18: websites or web services becoming more and
00:00:22: more locked down or private or more
00:00:26: expensive to use, whatever you wanna call it.
00:00:28: And the most recent example, which is the reason why I'm creating this video
00:00:32: here, is the Reddit API,
00:00:35: because, two days ago, there was a post by
00:00:39: the Reddit team in the Reddit development forum where they
00:00:43: essentially announced that the usage of their API now
00:00:47: needs approval. So there is an approval
00:00:50: process for using the API for
00:00:54: responsible use to support responsible
00:00:58: builders, and I'll get back to that and what that means, but it's
00:01:02: kind of in line, if you wanna call it like this, with what
00:01:05: Twitter, X, did two years ago already.
00:01:09: They made their API really
00:01:13: expensive to use, at least at scale.
00:01:16: So if you wanna interact with Twitter, with X, programmatically,
00:01:20: if you wanna build yet another social media scheduling tool and you wanna
00:01:24: support X, well, that could get expensive
00:01:28: depending on how you build it, uh, because the, the free
00:01:32: usage is quite limited. You can, for example, read
00:01:36: 100 posts per month and write 500
00:01:40: posts per month, which might be more than enough
00:01:44: for your own little tool that you're building for
00:01:47: yourself, but if you are building a SaaS product
00:01:51: on top of the X API, that will not suffice, so you'll have to
00:01:55: pay. But even the basic tier might not be enough in your
00:01:58: case. It might be, it might not be. And the pro tier might
00:02:02: still not be enough. Now chances are, it may be enough, but it
00:02:06: is also quite expensive.
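Monthly caps like the free tier's 100 reads and 500 writes mean a tool built on top of such an API has to budget its own calls. A minimal sketch of a client-side quota tracker; the limit numbers come from the free-tier figures mentioned above, but the class and its methods are invented for illustration and are not part of any X SDK:

```python
# Minimal client-side quota tracker for a monthly-capped API.
# The 100-reads / 500-writes defaults match the free-tier limits
# discussed above; everything else is illustrative.

class QuotaExceeded(Exception):
    pass

class MonthlyQuota:
    def __init__(self, read_limit=100, write_limit=500):
        self.limits = {"read": read_limit, "write": write_limit}
        self.used = {"read": 0, "write": 0}

    def spend(self, kind):
        """Record one API call of the given kind, or raise if over budget."""
        if self.used[kind] >= self.limits[kind]:
            raise QuotaExceeded(f"monthly {kind} limit of {self.limits[kind]} reached")
        self.used[kind] += 1

    def remaining(self, kind):
        """How many calls of this kind are left this month."""
        return self.limits[kind] - self.used[kind]

quota = MonthlyQuota()
quota.spend("read")
print(quota.remaining("read"))   # 99
print(quota.remaining("write"))  # 500
```

In practice you would persist the counters and reset them each billing month, but the point stands: on a capped tier, staying under the limit becomes your application's problem.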
00:02:10: And now for Reddit, as I
00:02:14: mentioned, it's not about paying or about the
00:02:18: price they're asking, but it's about an approval
00:02:22: process so that not every application can start using their
00:02:26: API. And the question of course is, why are companies
00:02:30: doing that? Well, there are a couple of reasons and
00:02:34: one big, important reason. Obviously, you could say, why would they
00:02:38: not do it? Why would they give you access to their data for
00:02:41: free? And you could argue, well, because in the
00:02:44: past, before AI, they may have
00:02:48: benefited from doing so. Because if people can build
00:02:52: products on top of, let's say, X,
00:02:56: if I can build a social media scheduling
00:02:59: application, that might be in Twitter's and X's
00:03:03: interest because more posts on X could
00:03:07: mean more engagement, uh, more people reading and
00:03:10: interacting with those posts, so that might not be too
00:03:14: bad. And there is a reason why you can
00:03:17: write more than you can read. You
00:03:21: might think that you should be able to read more than write, because
00:03:25: writes are more expensive for their database, for their infrastructure,
00:03:28: but it's the opposite: they allow you to write more than you can
00:03:32: read. And just as a side note, X also
00:03:36: has a new program which they're testing, it's a pilot right
00:03:39: now, where they want to give
00:03:43: you pay-per-use access to their API.
00:03:46: But it stays the same. You have to pay to use it and it can get
00:03:50: expensive. Now, why are companies doing that?
00:03:53: Well, the big answer, of course, is AI, or
00:03:56: specifically, of course, gen AI. Because
00:04:00: with the rise of gen AI, it has become
00:04:04: clear that all that data which these companies own, all
00:04:08: these Reddit posts, all the posts on X,
00:04:12: that is a valuable resource because
00:04:16: those gen AI models, of
00:04:19: course, need data in their training
00:04:23: or for their training process. Data is the
00:04:27: most important thing there because as we all know, ChatGPT
00:04:31: or the GPT models were trained essentially on all the
00:04:34: publicly available data you could find on the internet.
00:04:39: Um, and still, these models will need vast amounts of
00:04:42: data for their training. Nowadays, of course, there is the entire concept or
00:04:46: idea of using synthetic data as well as real
00:04:50: data for the training process, and to my understanding, that
00:04:54: seems to work quite well, though we'll see if that maybe still is
00:04:58: a problem and there is like a, a ceiling due to the limited
00:05:02: data that's available because the entire data in the internet has already been
00:05:06: consumed, so now you're just generating more
00:05:09: from that knowledge that was gathered from that
00:05:13: internet data, so there might be a ceiling there.
00:05:16: It's not entirely clear yet. Um, but anyways,
00:05:19: data is super important and of course there's still new data being
00:05:23: created. More data than ever is generated by AI though, to be
00:05:27: fair, so that is synthetic data in the end,
00:05:31: lots of data is being generated, uh, including some data by humans
00:05:35: on X and Reddit every day, and of course,
00:05:39: those platforms don't want to give away that data for free
00:05:43: anymore. They did in the past because we didn't see what was
00:05:47: coming with large language models and, um,
00:05:51: now of course they want to protect their data because a site
00:05:54: like X of course sits on lots
00:05:59: of data, lots of valuable posts, at least to some degree,
00:06:03: let's be honest. Most of the posts are total BS, but there are at least
00:06:06: some decent posts there, and they're definitely valuable in the sense
00:06:10: of being valuable for training. And those sites don't
00:06:14: wanna give that data away for free anymore, which is why
00:06:18: they're locking it down. There also is a reason why we
00:06:22: see more and more web scraping
00:06:25: businesses, uh, coming up almost every day because now
00:06:29: with large language models, even if we ignore the training
00:06:32: part, many of the applications that we wanna build
00:06:36: with help of large language models or on top of large language models
00:06:40: will need access to recent data. If you're building a
00:06:43: smart chatbot and you're using OpenAI's models under the
00:06:47: hood, you probably wanna pull in some recent data to make your
00:06:51: chatbot more useful. You wanna add web search, you
00:06:55: wanna be able to have your chatbot answer
00:06:58: questions related to the most recent posts on X.
00:07:02: So you wanna pull that data into your application and
00:07:06: then enrich your chat history and the prompts you send to
00:07:10: the large language model with that data that then hopefully
00:07:14: allows the model to generate a better answer, uh, for what the user
00:07:17: asked. And that's why these sites are
00:07:21: kind of locking down their APIs to make it harder to get
00:07:25: access to the data because in the past they gave it away for free.
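The enrichment step just described, pulling recent posts into your application and adding them to the prompt, is structurally simple. A minimal sketch; the function name and prompt layout are invented for illustration, and the actual model call is left out:

```python
# Sketch of enriching an LLM prompt with recently fetched posts,
# as described above. Names and prompt layout are illustrative,
# not any specific provider's API.

def build_enriched_prompt(user_question, recent_posts, max_posts=5):
    """Prepend recent platform posts as context before the user's question."""
    context = "\n".join(f"- {post}" for post in recent_posts[:max_posts])
    return (
        "Answer using the recent posts below where relevant.\n"
        "Recent posts:\n"
        f"{context}\n\n"
        f"Question: {user_question}"
    )

posts = ["New API pricing announced", "Approval process now required"]
prompt = build_enriched_prompt("What changed about the API?", posts)
# `prompt` would then be sent to the model, e.g. as a chat message.
print(prompt.splitlines()[0])
```

The hard part isn't the string assembly, it's getting `recent_posts` in the first place, which is exactly what the API lockdown makes difficult.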
00:07:28: They don't wanna do that anymore. Obviously, there still are ways to
00:07:32: get that data, as I mentioned. There's a plethora of web
00:07:36: crawling companies and not all of these companies, uh,
00:07:40: respect the fact that certain sites don't want to get
00:07:43: crawled. Now, I actually did a livestream on
00:07:47: the topic of building our own web crawler, um,
00:07:51: a while back, and I'll, uh, provide a link to that
00:07:55: episode. You can watch the full, uh, livestream episode, uh,
00:07:59: below this episode, of course. So you can build your own
00:08:03: crawler and I did that with Crawl4AI.
00:08:05: And essentially what you're building there is, um, an
00:08:09: application that spins up a browser and
00:08:13: simulates being a user and visiting a website to then
00:08:16: extract that website content, to extract the
00:08:20: rendered HTML content and so on. Um, that is
00:08:24: how you can build and use a crawler in the livestream.
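In that setup, Crawl4AI handles spinning up the browser and rendering the page; the extraction step itself, pulling the visible text out of the HTML, can be sketched with Python's standard library. This is a simplified stand-in for the idea, not Crawl4AI's actual implementation, and a real crawler also needs the JavaScript rendering the headless browser provides:

```python
# Extracting visible text from an HTML page with the stdlib.
# This sketches only the extraction step; fetching and rendering
# (what the headless browser does) are out of scope here.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # tags whose contents aren't page text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a script/style tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<html><body><h1>Hello</h1><script>var x=1;</script><p>World</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Hello World
```

Crawling then just means repeating this per page and following the links you find, subject to whatever the site allows.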
00:08:27: I just built it to crawl my own website, to make
00:08:31: that clear. So I did not start crawling X
00:08:35: there because web scraping and crawling is
00:08:39: kind of a gray zone. And, uh, there are
00:08:43: many sites that clearly state in their terms that they do
00:08:47: not allow web crawling. So you are violating those terms if you
00:08:51: do, which is why sites like Firecrawl, for
00:08:54: example, won't crawl X links. If I
00:08:59: take an X link and I try to scrape
00:09:03: that, uh, I'll get an error that this is not
00:09:06: supported. Or actually, here it doesn't even start, as
00:09:10: it seems; in the past, I did get an error.
00:09:13: So there are sites that don't allow that.
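Whether a site allows crawling is typically declared in its robots.txt file, alongside its terms of service, and a well-behaved crawler checks that first. Python's standard library ships a parser for this; the rules below are an invented example rather than any real site's file:

```python
# Checking crawl permissions against robots.txt rules with the stdlib.
# The rules below are made up; a real crawler would fetch the site's
# actual /robots.txt instead of parsing a literal string.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # mark the rules as fetched so can_fetch() answers for real

print(rp.can_fetch("MyCrawler", "https://example.com/posts"))     # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x")) # False
```

Of course, robots.txt is purely advisory: nothing technically stops a crawler from ignoring it, which is exactly the gray zone being described here.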
00:09:15: There probably also are sites that do, and you can
00:09:19: definitely build your own crawler that doesn't care
00:09:23: about any of that and extracts any content from any
00:09:27: site you wanna extract. Now I will say, of course, that many
00:09:31: sites also try to implement some technical hurdles that make it
00:09:35: harder to crawl them, but in the end if you really want to, you can
00:09:38: get around pretty much all of them.
00:09:41: It might not be legal, it might be violating their terms, but
00:09:45: it is possible. Because, and that takes us back
00:09:49: to the beginning, to the main topic, because of course all that data
00:09:53: is super valuable. However, this does have,
00:09:57: I believe, a real downside, an
00:10:01: implication that's not great for us as
00:10:04: developers. Because I totally get that these sites don't wanna give
00:10:08: away access to their data, and just as a side note,
00:10:11: it's kind of not their data. It's the data of the users using the
00:10:15: site, but that's a whole different story.
00:10:18: But I get that they don't wanna give away access to this data.
00:10:21: The problem of course is that, as
00:10:25: an additional casualty, we as developers are limited
00:10:29: in what we can build, uh, with those APIs.
00:10:32: Sure, you might get approval for the Reddit API if you're building
00:10:36: something they're happy with. But of course you also might
00:10:40: not get that approval. And for X you have
00:10:44: to pay quite a bit of money depending on what you're
00:10:47: building, even if what you're building has nothing to do with
00:10:51: extracting that data and using it for model
00:10:55: training or anything like that. It limits the
00:10:59: amount of useful stuff we can build on top of other
00:11:02: services and sites. And that of course in turn, um,
00:11:06: might also hurt those sites because some useful products from which they might
00:11:10: benefit then maybe won't get built.
00:11:13: But of course I guess that is a price they're happy to pay
00:11:17: because either they'll get paid by these
00:11:21: people that build products on top of them, or at least they prevent that
00:11:24: data extraction. So yeah, uh, I expect
00:11:28: that we will see more sites and
00:11:32: services, uh, locking down their APIs.
00:11:36: I think we'll see more sites becoming pretty
00:11:39: protective about their data, pretty aggressive
00:11:42: against crawlers, which is absolutely their right,
00:11:46: to be fair. But I think that might also hurt
00:11:50: us as developers because it kind of limits the stuff we
00:11:54: can build around other popular services and
00:11:57: sites. These are my two cents. What do you think about this
00:12:01: topic?