Microeconomics of Technical Costing
A framework for deciding between managed services, VPS, and bare metal for your infrastructure
I'm not sure where I'll take this yet, but I'll review my journey so far. I've tended to use services like AWS, Heroku, LogDNA, and many many many others to get my applications up and running. And while I'm ostensibly a very technical person, I've never really taken the time to think too critically about my strategy for doing so. That is, until today.
The Scraping Problem
I'm working on an app, and for this app I need to scrape a lot of websites, and I need to scrape them all very often, like every 15 minutes. So, I need to scrape them concurrently. On top of that, these websites need to be rendered before they can be scraped. It's not possible to simply request their HTML pages from their domains and get back something that can be quickly scraped. That is to say, I need to render each page using a Chromium headless browser, and then I need to subsequently scrape the rendered HTML.
Actually, as I'm writing this, I'm realizing this is not actually true - I could actually scrape these sites using the responses I get from simple GET requests without any subsequent page rendering. The same cannot be said for some later operations I need to perform, but it's something I should seriously consider doing to save many, many resources (and avoid arousing unnecessary suspicion from their developers).
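For what it's worth, the no-browser version can be sketched with nothing but Python's standard library. This is a minimal sketch, not what the app actually runs: `fetch_html`, `extract_links`, the User-Agent string, and the choice to pull out links (rather than whatever the app really needs) are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse HTML that has already been fetched; no browser involved."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_html(url, timeout=10):
    """Plain GET request. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like `extract_links(fetch_html(url))` would then replace the whole launch-Chromium-and-render step for any site whose raw HTML already contains the data.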
Anyway, assuming rendering is actually necessary, each Chromium instance I launch for this occupies 150 to 200 megabytes of memory. And since I'm running my little server as a "basic" Heroku dyno, which only comes with 512 megabytes of RAM, I can only run this for about 3 or 4 sites at a time. Very limiting.
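One way to live within a memory ceiling like this is to cap how many scrapes run at once. Here's a minimal sketch using `asyncio.Semaphore`. It assumes a hypothetical `scrape_one` coroutine that launches the browser and scrapes a single site, and the default limit of 3 just mirrors my rough estimate above.

```python
import asyncio


async def scrape_all(urls, scrape_one, max_concurrent=3):
    """Run scrape_one(url) for every URL, at most max_concurrent at a time.

    scrape_one is a hypothetical coroutine supplied by the caller.
    Results come back in the same order as urls.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # Only max_concurrent tasks can be inside this block at once,
        # so at most that many browser instances exist at any moment.
        async with semaphore:
            return await scrape_one(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

This keeps memory bounded, but it doesn't make the work any faster: 100 sites at 3 at a time is still 34 batches.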
I started by investigating bigger dynos. At 10x the price I'd get more memory, but still not enough to really rip through these sites, especially on a regular basis, and these processes would hog all of it. I looked into using AWS Lambda. Holy smokes, that would be expensive: almost $2000 a month to run this operation every 15 minutes for 100 sites at a time. Finally, I just asked ChatGPT to recommend the lowest-cost way to do this and it suggested the obvious - running a Virtual Private Server (VPS) or renting a bare metal server. Comparing costs, I realized I could achieve my goal for under $50/month and still have a whole lot of room left over for other tasks.
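To put those two numbers side by side, it helps to turn them into a cost per scrape. This is a rough back-of-the-envelope sketch, assuming a 30-day month and treating "almost $2000" and "under $50" as upper bounds. The function names are just for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_month(sites, interval_minutes, days=30):
    """Number of individual site scrapes in a month (days=30 is an assumption)."""
    return sites * (MINUTES_PER_DAY // interval_minutes) * days


def cost_per_run(monthly_cost, runs):
    """Average cost of a single scrape, in dollars."""
    return monthly_cost / runs


runs = runs_per_month(sites=100, interval_minutes=15)
lambda_per_run = cost_per_run(2000, runs)  # upper bound from the Lambda estimate
vps_per_run = cost_per_run(50, runs)       # upper bound from the VPS estimate

print(f"{runs:,} scrapes/month")
print(f"Lambda: ${lambda_per_run:.5f} per scrape")
print(f"VPS:    ${vps_per_run:.5f} per scrape")
print(f"Ratio:  {lambda_per_run / vps_per_run:.0f}x")
```

Under those assumptions that's 288,000 scrapes a month, at most about $0.0069 each on Lambda versus about $0.00017 each on a VPS, or roughly a 40x gap.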
Why Not Managed Everything?
Now, this is an obvious sort of example, but I've never really needed to spend all that much on infra, so I've tended to not really worry about it. It's under $50/month, so who really cares? The right questions are often ONLY born of necessity. But this analysis raised the question - why not do everything via a VPS or rented bare metal server?
Of course, it depends on the thing, but for instance why not host my API server with a VPS? Why not my email server? Database server? Why not all my files, logs, backups, and other assets?
I can start to think of a lot of answers. Namely, it's a pain in the ass to set up. For most things, paying under $20 a month is worth the premium to literally not have to think about it at all. But once you get past that, it makes sense to start poking.
Aside from the pain of setting it up, there are other reasons:
- For logs, certain logging platforms just don't cost that much and make it very very easy to search through the logs.
- For assets, it doesn't cost much to host them, and then they can be available via a CDN. But, how much faster does the CDN really make things? And is it worth it?
- For email, things like using Google Workspace aren't even really worth it for the mail, but that's always been pretty clear. Maybe it is and I'm missing something. But I feel like Google Workspace really becomes valuable when you start considering the collaboration features for Sheets, Slides, Docs, Calendar, and the like. It's more about administration and app usage in the modern cloud app stack for business.
Then there's security - the big question. And while I realize this, I'm not even that smart on security. Sure, I know how to set up SSL, certificates, web tokens, hashed passwords, all that. But what does using a managed server get me over managing my own? What net new responsibility do I now bear if I set up NGINX as a reverse proxy versus just using something like Heroku? And I love Cloudflare, absolutely love it, and I know it's great for security, but I don't really leverage any of those features.
This is all to say, I love building great products, and that's always been my emphasis more than cost. But now that I'm thinking about what I'll have to spend my money on and who I'll have to hire, it seems possible with LLMs to not really have to hire anyone if you're smart about it. Keep your costs down, keep systems manageable, and be relentless. That combination gets you where you want to be. Of course you need to have redundancies and don't want to be a bottleneck for your org, but the same philosophy applies if you're hiring other people to run your org or serve as redundancies for yourself.
The rest of this post works through those questions until we have a general philosophy for answering them, including where the inflection points are: where the logic breaks down and a different approach is warranted.
Proxies and VPNs
Now, say you wanted to (somewhat) anonymously visit these websites via your web server. Makes sense to use a proxy. Assume you need 100 proxy addresses.
Bad Options
Multiple VPS Approach
You could set up a bunch of VPSs, each with its own IP address, then route all requests through them. This would be very costly though; with EC2, the minimum you'd pay is about $15/month per server. So that would be about $1500/month.
You could lighten the load with Spot instances (an instance that uses spare EC2 capacity that is available for less than the On-Demand price), but that would still come out to about $5/server/month, or about $500/month total. Plus data transfer costs are non-negligible, about $250/month total, but it doesn't matter because we're already way over budget.
Finally, it's worth mentioning on the VPS route that you could also use AWS Lightsail, which gives you a 512 MB instance for $3.50/month. 100 IP addresses would then cost you about $350/month.
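The arithmetic for the three VPS variants, laid out in one place. The per-server prices and the $250 data transfer figure are the rough numbers quoted above, not current AWS pricing. I've attached the data transfer estimate only to the Spot row because that's where it came up, though it would presumably apply to on-demand EC2 as well.

```python
PROXIES = 100


def monthly_total(per_server, servers=PROXIES, fixed=0.0):
    """Monthly cost for one-server-per-IP, plus any fixed monthly extras."""
    return per_server * servers + fixed


options = {
    "EC2 on-demand": monthly_total(15),
    "EC2 Spot": monthly_total(5),
    "EC2 Spot + data transfer": monthly_total(5, fixed=250),
    "Lightsail": monthly_total(3.50),
}

# Print cheapest first.
for name, total in sorted(options.items(), key=lambda item: item[1]):
    print(f"{name:<26} ${total:>8,.2f}/month")
```

Even the cheapest row, Lightsail at $350/month, is a lot to pay for what amounts to 100 IP addresses.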
NAT Gateway + AWS Global Accelerator
Also an option, but it sucks: you could use a NAT Gateway + AWS Global Accelerator. That involves leaning heavily on AWS, which is highly undesirable for so, so many reasons, not least that it may seem inexpensive until you discover you've overlooked one of a thousand possible things and get a huge bill, after wasting a bunch of time setting a bunch of things up.
Good Options
Third-Party Proxy Services
Alternatively, you could use a third-party proxy service. ProxyMesh charges $50/month to use 100 IP addresses per day.
Serverless Proxy Providers
Also a good approach is using a Serverless Proxy Provider such as ScraperAPI: for $49/month you get a subscription that is "Ideal for small projects or personal use. Scrape 100,000 URLs or 3,000 heavy protected URLs."
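A quota like "100,000 URLs" is worth checking against the actual workload before committing. Here's a rough sketch using the scraping cadence from earlier in the post (100 sites, every 15 minutes). It assumes a 30-day month and that every scrape goes through the service and counts as exactly one URL, which I haven't verified against the provider's billing rules.

```python
# Figures quoted in the post.
SITES = 100
INTERVAL_MINUTES = 15
PLAN_PRICE = 49.0
PLAN_URLS = 100_000

DAYS_PER_MONTH = 30  # assumption

# Scrapes of a single site per month at this cadence.
runs_per_site = (24 * 60 // INTERVAL_MINUTES) * DAYS_PER_MONTH
# Total requests if every scrape of every site uses the service.
requests_needed = SITES * runs_per_site
# Effective price of one URL on this plan.
price_per_url = PLAN_PRICE / PLAN_URLS
# How many sites the plan covers at the full 15-minute cadence.
max_sites_on_plan = PLAN_URLS // runs_per_site

print(f"Requests needed: {requests_needed:,}/month")
print(f"Plan quota:      {PLAN_URLS:,}/month")
print(f"Price per URL:   ${price_per_url:.5f}")
print(f"Sites covered at this cadence: {max_sites_on_plan}")
```

On these assumptions the full workload would be 288,000 requests a month, so this tier would cover about 34 sites at the 15-minute cadence. That makes it a better fit for the subset of sites that really need the extra help than for everything.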
What's the Difference?
Third-party proxy services focus on providing a pool of IP addresses, which you can use to route your web requests. They usually offer residential, data center, or mobile IPs and rotate them to avoid getting blocked by websites. Proxying is essentially all they do: you send HTTP/HTTPS requests through different IP addresses to hide your identity or bypass geo-restrictions.
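In practice, using a proxy pool looks something like the following. This is a minimal sketch with Python's standard library. The proxy URLs are placeholders, and a real provider may hand you a single rotating endpoint instead of a list, so treat the list-cycling here as an assumption.

```python
import itertools
from urllib.request import ProxyHandler, build_opener

# Placeholder proxy addresses; a real provider supplies these.
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]


def rotating_openers(proxy_urls):
    """Yield (proxy_url, opener) pairs forever, cycling through the pool."""
    for proxy_url in itertools.cycle(proxy_urls):
        handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
        yield proxy_url, build_opener(handler)


def fetch_via_next_proxy(openers, url, timeout=10):
    """Send one GET request through the next proxy in the rotation."""
    proxy_url, opener = next(openers)
    with opener.open(url, timeout=timeout) as response:
        return proxy_url, response.read()
```

You'd create the generator once with `rotating_openers(PROXY_URLS)` and pass it to every `fetch_via_next_proxy` call, so consecutive requests leave from different addresses.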
Serverless proxy providers go beyond just providing proxy IPs. They also offer a full service that handles many additional challenges associated with making web requests. These providers typically manage IP rotation and solve complex issues like CAPTCHAs, retries, rate limits, JavaScript rendering, and anti-bot mechanisms.
When to use which:
- If you just need to rotate IPs to avoid rate limiting, access geo-blocked content, or mask your IP: Use a third-party proxy service. These are cost-effective and provide large proxy pools without extra features. They work well for standard HTTP/HTTPS requests where you need to rotate proxies or access content from different locations.
- If you need to handle complex requests, avoid bot protection, handle CAPTCHAs, or scrape dynamic pages with JavaScript rendering: Use a serverless proxy provider. These are better for scraping tasks, automating requests, or accessing pages protected by advanced bot detection systems. They simplify the process by offering automatic retries, CAPTCHA solving, and JavaScript support.
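To make the distinction concrete, here's roughly the kind of glue you end up writing yourself on top of a plain proxy pool, and which a serverless provider would absorb for you: retrying a failed request through a different proxy with exponential backoff. This is a sketch under assumptions. `fetch` is a hypothetical function you supply that takes a URL and a proxy, and the attempt count and delays are arbitrary. It doesn't touch CAPTCHAs or JavaScript rendering at all.

```python
import itertools
import time


def fetch_with_retries(url, fetch, proxies, max_attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, proxy), switching proxy and backing off after each failure.

    fetch is a caller-supplied (hypothetical) function. Network errors from
    urllib, including URLError and timeouts, are subclasses of OSError.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    proxy_cycle = itertools.cycle(proxies)
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except OSError as error:
            last_error = error
            # Wait 1x, 2x, 4x, ... base_delay, but not after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

If this kind of code starts growing branches for CAPTCHAs, block pages, and rendering, that's probably the point where the serverless provider's price starts to look reasonable.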