An opinionated collection of tips for technical founders. There are lots of guides around "how to find product market fit," how to hire, and other aspects of being a technical founder. And on the technical front, there are many tutorials for beginners creating "hello world" apps, and lots of advice on how to deal with massive scale and huge existing systems. But there are fewer guides on how to get a greenfield codebase up and running as a solid foundation you can build a team and product around.
Hopefully, this will be useful for non-technical founders too - you'll get a flavor for the work that goes into building software at a professional level that is more than "prioritizing tickets and coding features."
It's mostly geared towards services, but this applies to rich client applications (desktop, mobile, and embedded) as well, especially if they have to talk to a backend service.
I'll also recommend some vendors, libraries, and services here, most of which I've personally used.
How to get up and running without accruing a bunch of foundational debt.
Do you have expertise in building, running, and debugging your technology choices? This may be the most important near term question - your initial momentum has ripple effects for the rest of the organization's existence. Don't start with a technology stack that your founding team is unfamiliar with.
Is this a bad choice for the solution you're building? There's usually not a single right answer, but there are a lot of wrong ones. E.g., "we'll write our web app in C++" is most likely a wrong answer.
Can you hire for this stack? Will you need to? How is the industry trending?
Your choice of data store is very important - data tends to grow and accumulate over time, and unlike other technology choices, it's hard to transition to a new tech by running it side-by-side, because you often need all your data in a single place. Two straightforward schools of thought are 1) use a proven, mature, multi-purpose store correctly, and it will reliably solve many problems for a long time, or 2) pick a bleeding edge store uniquely suited for your use case and hope that the productivity/performance benefits swamp the overhead of dealing with immature database technology. The right answer is context-specific, but commonly #1 is a default good answer.
It's generally a bad idea to pick a database for its scalability. Database scalability is an important long term consideration, but most solutions are already very scalable with varying degrees of effort -- as long as the schema is designed correctly. Performance is not scalability, but it can be a reasonable substitute: on modern hardware, careful application and schema design will usually get you enough performance that you don't have to think about scale-out for a long time.
As with programming language and platform decisions, expertise trumps the "optimal" choice. You will benefit far more from having an expert on your chosen database than picking a "scalable" database you have less expertise in. Remember that Google Ads, Uber, and Facebook were quietly being built on top of MySQL when "big data" was becoming part of tech vernacular.
General advice that applies to most data stores is: store less data. This helps you in multiple dimensions (reducing pressure on caches, reducing expensive high-performance storage requirements, fitting more indexes in memory, etc). Data needed only for "offline" analysis should generally be kept out of OLTP databases. For append-only use cases (like time series or logs) where older data is accessed less frequently, consider a time-based partitioning scheme to enable easier archiving or scale-out later.
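A time-based partitioning scheme can be as simple as routing rows to a table named after their period. This is a hedged sketch: the `events` base name and monthly granularity are illustrative assumptions, and real partitioning would use your database's native support (e.g., declarative partitioning in PostgreSQL) rather than hand-rolled table names.

```python
from datetime import datetime, timezone

def partition_name(base: str, ts: datetime) -> str:
    """Name of the monthly partition a row with timestamp `ts` belongs in.

    Monthly granularity is an assumption; choose day/week/month based on
    data volume and how you plan to archive.
    """
    return f"{base}_{ts.year:04d}{ts.month:02d}"

# Rows land in e.g. events_202403, so retiring a month of old data later is
# a cheap DROP/DETACH of one partition instead of a giant DELETE.
march = partition_name("events", datetime(2024, 3, 15, tzinfo=timezone.utc))
```

The payoff comes later: archiving or scaling out becomes an operation on whole partitions instead of row-by-row surgery on one giant table.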
Over time you will likely accrete "auxiliary" stores for specialized use cases, such as fulltext search or caching. For every store you introduce, decide upfront whether it is a critical source-of-truth, or an auxiliary store that can regenerate its data from somewhere else. Find ways to prevent developers from committing critical data to auxiliary stores - using naming conventions, documentation, code comments, etc. Then be vigilant anyway, because these details get lost over time and "obvious" ideas aren't obvious to everyone. This will save an enormous amount of money and headache later.
Resist the urge to take the more technically impressive approach when it is of dubious benefit.
Whatever you do, don't use a microservices architecture. You ain't gonna need it.
If you're making a web app, most likely you don't need to build a Single Page App. Creating dynamic web applications with just pure server-side rendered HTML is approximately 5 times less effort than a comparable SPA, and mostly nobody will notice. The best part? You can still mount React.js components and such wherever you need highly interactive and complex UI with lots of state. Just don't make the mistake of defaulting to this for your entire app and paying the tax forever. Managing an API with a single internal consumer is just volunteering for needless pain.
Decide on your version control system. Unless you have a very specific compelling reason to do otherwise, use git: most developers these days are trained on git (unless they work somewhere with a proprietary internal system like Google, Microsoft, IBM, etc - but in any case the concepts employed will have parallels in git). Linux and Windows (e.g., millions of files in a single repo) are versioned with git. I promise it supports what you need.
Create a policy for how new libraries, databases, etc. are added to this environment. If there are specific licenses that aren't allowed (e.g., GPL in some commercial contexts), document them now so they don't get built in as foundational dependencies of your app.
Make sure your environment is documented, versioned, and reproducible. You will likely want to use some container technology (like Docker) to get these reproducible build environments; this may also help with the deployment story. Your ideal state is "run this one-liner to install Docker, then use this one-liner to bootstrap everything in Docker." Test this on a new machine, repeatedly if necessary: ideally anyone should be able to copy paste a script on a machine out of the box and have a working dev environment. To develop these setup scripts, I recommend creating a VM, and snapshotting it so you can restore it to a clean "just before my setup script" state.
Create a seed file to prepopulate any database data needed to boot the app. Check this into source control and reference it from your setup steps.
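A seed script can be a few lines of checked-in code. This is a minimal sketch using SQLite for illustration; the `plans` table and its rows are hypothetical placeholders - the point is that seed data is code, versioned, and safe to run repeatedly.

```python
import sqlite3

# Hypothetical seed data: placeholders for whatever your app needs to boot.
SEED_PLANS = [("free", 0), ("pro", 2900)]

def seed(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS plans (name TEXT PRIMARY KEY, price_cents INTEGER)"
    )
    # INSERT OR IGNORE makes re-running the seed a no-op instead of an error.
    conn.executemany(
        "INSERT OR IGNORE INTO plans (name, price_cents) VALUES (?, ?)", SEED_PLANS
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
seed(conn)
seed(conn)  # idempotent: running twice leaves the same two rows
```

Idempotency is the key property: anyone bootstrapping a dev environment (or a CI run) can execute the seed without first checking whether it already ran.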
Consider what IDEs will be supported. You might want to check in configuration files for those IDEs to enforce certain norms (e.g., tabs vs spaces, semicolon policy, etc). Set up linter configurations, document what licenses are needed, and record all of this in the README.
Consider starting on a cloud IDE like GitHub Codespaces, AWS Cloud9, or Google Cloud Shell. A cloud IDE will save you hours of toil and debugging broken setups, "works on my machine" problems, and general reproducibility issues. Unfortunately there are still many use cases where the limitations of cloud IDEs prevent widespread adoption (applications that depend heavily on subdomains come to mind).
In general, minimize dependencies on an external service to boot up. Service outages or lack of internet access will break your ability to start up your dev environment (taking a dependency on a cloud IDE is a calculated risk in this regard). Also note that network dependencies to start your dev environment may imply you need it for your production service to boot as well.
Use a dotenv file (unversioned) to store secrets. Version your configuration files and don't store configs with your secrets. Configs are like "in development, the web server starts 5 threads." If you store all your configs alongside your secrets in unversioned dotenv files, synchronizing configs between developers becomes a nightmare.
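One way to keep that separation honest is to make versioned config plain data in the repo and pull only secrets from the environment. This is a sketch under assumptions - the config keys and the `DATABASE_URL` variable name are illustrative, not a prescribed layout.

```python
import os

# Versioned config: lives with the code, safe to commit, no secrets inside.
CONFIG = {
    "development": {"web_threads": 5, "log_level": "debug"},
    "production": {"web_threads": 32, "log_level": "info"},
}

# Secrets come only from the environment - populated by the unversioned
# dotenv file in dev (simulated here) or the platform's secret store in prod.
os.environ.setdefault("DATABASE_URL", "postgres://localhost/app_dev")

def get_config(env: str) -> dict:
    cfg = dict(CONFIG[env])
    # A KeyError here is deliberate: failing loudly at boot beats running
    # without credentials and failing mysteriously later.
    cfg["database_url"] = os.environ["DATABASE_URL"]
    return cfg
```

With this split, changing "5 threads" to "8 threads" is a normal reviewed commit, while rotating a database password never touches version control.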
You probably aren't using a "network perimeter model" so hopefully you don't have a need to set up a corporate network, VPN access, etc.
Use Cloudflare to host your DNS. You can use the AWS or Azure or Google-specific services too, but Cloudflare is the best, and you likely want them in the mix even if you use one of the others. Cloudflare comes integrated with a CDN, reverse proxy, SSL, and many security features, with pricing from "free" to "surprisingly reasonable." This significantly cuts down on the number of other vendors you need to work with and integrate.
Enable DNSSEC and periodically export your configuration files to backups. You should also be enabling HSTS, making a habit of always assuming HTTPS, and not supporting HTTP at all. This is much easier to do from the get-go as opposed to enabling HTTPS later and dealing with surprise mixed-content issues forever.
Most people benefit from starting and staying on a monolith architecture with a single repo per deployment target - e.g., one for web, one for iOS, etc. Dependency management and sharing of common libraries otherwise becomes very hard way too early, sucking up valuable time on menial trash like how to sync schemas between your microservices. Public services like GitHub do not do "monorepo with multiple deployment targets" well out of the box, since you usually want to build (and maybe deploy) on every commit, but only for the specific targets that changed (e.g., if your backend and frontend are separate deployment targets, you might not want to trigger a backend deployment when someone checks in a frontend change, but this requires extra work to configure in a monorepo).
Low operational overhead is usually the name of the game (unless - for example - you are an infrastructure provider and expect this to be where your margin comes from). So if you can, favor one of the managed platform providers like Heroku, Fly.io, or Render. And consider using a managed database like Crunchy Data.
Make sure you have encryption at rest and protect sensitive information in transit.
Make sure you have high availability and failover instances of every tier - frontend, web, database, and so on (if you have good backups, there is a case to be made that very early stage startups in certain domains don't need high availability, which can sometimes reduce your costs by 2x or more).
Make sure backups are enabled, and you have a documented policy to test these at least annually.
Just like your dev setup, make sure your production environment is documented, versioned, and reproducible.
A build should be a single command in both dev and the test environment. A single script should be able to produce a production build at any time. If you're creating a service, this should extend to deployment as well - deploying to production should be a single command. You'll want to do this often; maintaining a green build all the time is one of those areas where automation always pays for itself, and changes the way you develop for the better.
In general this should be run on every checkin with results visible in your VCS. I highly recommend investigating if your hosted VCS provides this out of the box: GitHub Actions, GitLab CI/CD, etc. Minimizing the number of vendors and logins you deal with is always important, but helps a lot early on since you probably don't have a corporate identity service yet.
Your test system may depend on external vendors - Chromatic, Browserstack, etc. More realistic tests that use I/O and depend on external vendors tend to be slow. Set up parallelism in your tests as soon as feasible, so that you don't have to go and rearchitect them later.
Don't spend too much time actually writing tests early on. There isn't enough value/complexity yet to make them pay off. Just make sure there's a good test system in place, you have a few high level end-to-end tests, and that they're being regularly run and kept green.
If you're running a service, you need a way to deploy to production in a single step. You'll also need to configure other environments: staging, test, pre-production, or whatever name(s) make sense for your workflow. You will need to secure these and potentially register separate domain names (etc) for each environment, as you want each environment to be 100% segregated and independent of any other environment, with no resource sharing whatsoever.
Design the system to allow one-click rollbacks of code. You'll need it eventually. I like Heroku because of this. Your goal is to facilitate high deployment confidence and lower the cost of mistakes so builds and deployments can be fast and iterative.
Make sure everyone knows where to see the current status of the service and deployment history. "Hey prod is down, did someone just deploy? How can I reach them?" is a 100% preventable problem.
Each time you run into problems in your pipeline, update your documentation on edge cases (e.g., most people learn quickly they can't roll back certain changes naively, if they include a database migration or API change).
Once you're up and live, make sure you invest in your devops system to maximize visibility and confidence.
Initially, your release process is going to be fairly coupled with your build/deployment process. This is fine for most domains. The first thing you'll want to specify is a deployment checklist. Keep this up to date and use it every time you deploy. Shorter lists increase compliance, so rely on automation and automated checks to fail builds whenever possible. Only put things in deployment checklists that can't be (or haven't been) automated yet.
You may eventually want to decouple deploying and releasing, which is when building or integrating a service like LaunchDarkly will help. You can also build your own if you know your requirements well. This will let you deploy code as soon as it's done, and then pick a specific time to go live in coordination with other teams. Feature flags will naturally be (ab)used for runtime configuration and client-specific overrides, especially if you don't have a great way of managing that yet. You can either embrace this or try to clamp down on it, but in either case you will want to document specific uses that are OK and not OK, and best practices around default values, expiration, evaluating server-side vs client-side, etc.
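If you do build your own, the core of a flag system is tiny. This is a hypothetical homegrown sketch - not LaunchDarkly's actual API - and the flag names and client overrides are made up; a real system would persist flags and add expiration metadata.

```python
from typing import Optional

# Hypothetical flag store. Each flag carries a default plus per-client
# overrides - the (ab)use case for client-specific behavior mentioned above.
FLAGS = {
    "new_checkout": {"default": False, "overrides": {"acme-corp": True}},
}

def flag_enabled(name: str, client_id: Optional[str] = None) -> bool:
    flag = FLAGS.get(name)
    if flag is None:
        # Unknown flags fail closed: a deploy can't flip a feature by accident.
        return False
    if client_id in flag["overrides"]:
        return flag["overrides"][client_id]
    return flag["default"]
```

The design choice worth documenting up front is the default-value policy: unknown or missing flags should resolve to the safe, existing behavior so that deploying flagged code is never itself a release.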
For rich clients, there's often a number of manual steps (especially at first) to get a "release" out, and this may involve waiting on approval from a third party ecosystem gatekeeper. Write these steps down as you do them to minimize the mental overhead and risk of errors (or extended review cycles). Your goal as you do this the first few times is to automate the mundane parts as much as possible and make this readme as short as possible - it should still be a single command to build a release artifact, even if there are a bunch of other inputs to submit for app review.
Sooner or later you will need to add some kind of queuing to your system so that it can handle surges in activity and long-running tasks. There are some great language-specific libraries for this (Sidekiq for Ruby, Celery for Python, etc), but resist the temptation to use them for generic "async message queues" - for those, you'll want something much more suited to message passing like AWS SQS, RabbitMQ, etc. Those will generally scale better and interoperate better.
Having multiple different queues will be critical for ensuring responsiveness, and lets you scale bottlenecked jobs independently. This is extremely hard to get right without understanding the load characteristics of the system, and how latency-sensitive each type of job is. Some examples of separate queues might be 1) a queue for small, time-sensitive jobs like transactional emails (password resets, order confirmations, etc), and 2) a queue for long-running jobs like batch operations that can take minutes or hours. You don't want email jobs stuck behind 30-minute batch export jobs. Beware of grouping jobs by how fast they are on average; if the p95 or p99 of some jobs is really high, this can cause huge latency spikes. Job performance can also change over time. Make sure you configure good monitoring on queue metrics (queue latency is best, although "queue size" can sometimes be helpful), and adjust things as necessary.
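Most job libraries (Sidekiq, Celery, etc.) support named queues; the part you own is the routing policy. This sketch shows the idea in plain Python - the job types and queue names are illustrative assumptions, grouped by latency class rather than by feature.

```python
# Route jobs to queues by how latency-sensitive they are, not by what
# feature they belong to. Job types and queue names here are made up.
JOB_QUEUES = {
    "send_password_reset": "fast",      # seconds matter
    "send_order_confirmation": "fast",
    "nightly_export": "bulk",           # minutes-to-hours is fine
    "rebuild_search_index": "bulk",
}

def queue_for(job_type: str) -> str:
    # Unknown jobs default to bulk so a forgotten mapping can never
    # clog the fast lane with a slow job.
    return JOB_QUEUES.get(job_type, "bulk")
```

Note the failure mode the default protects against: the dangerous mistake is a slow job sneaking into the fast queue, not the reverse.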
Eventually, you may want to introduce rate limiting in your queues -- either in terms of how quickly your users can perform certain operations, or how often you can use an external API. There are a number of strategies and tradeoffs here, and depending on how many different queues your architecture can handle, your designs will be different. For example, if you will only ever have a few hundred or thousand clients at most, it might be feasible to have client-specific queues, which will allow you to rate limit or throttle jobs without unintentionally affecting other clients.
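A token bucket is a common building block for this kind of throttling: each client gets a bucket that refills at a steady rate, allowing short bursts up to a capacity. This is a sketch of the technique (single-process, not distributed); a production version would typically live in Redis or similar so all workers share state.

```python
import time

class TokenBucket:
    """Per-client token bucket: refills at `rate` tokens/sec, bursts to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable for testing
        self.tokens = capacity      # start full: new clients can burst immediately
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a queue worker you would keep one bucket per client (or per external API) and requeue or delay jobs whose bucket says no, which throttles a noisy client without starving the others.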
As you build, there will likely not be a discrete point where you go from "monolith" to "microservices," but rather the team will gradually accrete microservices naturally. Avoid the temptation to put a queue in front of every service, as it will increase operational costs and add end-to-end latency without commensurate benefit. Instead, decide what layer(s) will be responsible for queuing, and add backpressure and load shedding to the system around that layer.
Beware that adding a queuing service introduces much more operational and conceptual complexity to your application than you might expect. This is because it introduces an asynchronous layer, allowing race conditions where none existed before. It also makes it much more complicated to enforce happens-before order guarantees. At a certain point these become worthwhile tradeoffs, but they require you to think about what types of work will be done asynchronously, and how progress will be tracked and reported to end users. You may need a mental framework for this, and eventually a way to document and communicate it.
The robustness of your error reporting is critical, so make sure you configure all the integrations and libraries correctly, or you might not even know when you've shipped a showstopping bug. Take the time to connect your VCS and set up any de-duplication rules as needed when the first bugs come in. A good exception tracker has the robustness and flexibility to support accurate de-duping of common crashes and exceptions. Also make sure you can distinguish what version of the code is running, so you can group by build and trace regressions back to the code that introduced them.
Make sure you are streaming all your application logs to somewhere they can be searched. They can help you chase down anomalies and find patterns in production that don't show up as crashes or exceptions.
Separately from logs and exceptions, you'll also need operational monitoring - how many requests are you serving, and how quickly? What resources are near capacity? When was the last deploy? This last one is important to instrument so you can quickly tell if a recent change broke something.
AWS Cloudwatch (for example) lets you cobble a lot of this together, but NewRelic or DataDog might be more suited and offer many drop-in libraries for collecting and reporting operational statistics you can alert on.
You'll want to set up a vendor like PagerDuty or Opsgenie to schedule any on-call rotations, if your service has uptime requirements.
Set up key transactions to monitor - your most important and common user actions. Configure baseline metrics to fire alerts if they go out of range - e.g., response times, queue delays, etc. Initially, tighter tolerances are helpful as things can go from "degraded" to "completely down" very quickly. Over time, you'll want to tune these so that every alert is both important and actionable with a pre-written runbook.
You may want a public-facing status page so you can share updates with customers during incidents.
Salt and hash your passwords with a modern, deliberately slow algorithm like bcrypt, scrypt, or Argon2 - never a plain fast hash like SHA-256.
Encrypt sensitive data at rest at the application level. Separately, enable encryption at your data tier.
Encrypt in transit with TLS. Enable DNSSEC and HSTS.
Store your secrets separately from your code. Defend against XSS and content injection.
Once deploying to production is seamless and automatic, you'll need to add supporting infrastructure to deal with real live users and your initial customers.
Use Stripe for handling billing and money. Nothing else really comes close.
Stripe has a built-in Subscriptions API for handling recurring billing. Depending on how complicated your billing ends up being, you sometimes have no choice but to custom-build on top of Stripe's core payments APIs instead. This gets complicated fast, and is rarely worth it when (for example) selling to SMBs. Stick to something simple like month-to-month or annual prepaid billing, or a pay-as-you-go balance for usage-based pricing.
You'll have all kinds of exceptions you need to support - internal staff, partners, investors, etc, are all potential groups that you want to sometimes treat "specially."
You'll need to weigh the benefits of pre vs post paid billing, and other details like whether you'll issue invoices based on signup date or month-aligned. Complexity in billing builds up fast, and part of the complexity is because dates are complex! Taking time to keep things conceptually simple (like month-aligned billing, for example) can really help in the long term. This might have business implications (e.g., maybe it makes your cashflow unacceptably spiky), so try to iron all these out beforehand.
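To make the "dates are complex" point concrete, here is what month-aligned billing with first-month proration looks like. A sketch under assumptions: integer cents, proration by calendar day, and round-down division are all policy choices your finance side needs to sign off on.

```python
import calendar
from datetime import date

def billing_period(today: date) -> tuple[date, date]:
    """The month-aligned billing period containing `today`, inclusive."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    return date(today.year, today.month, 1), date(today.year, today.month, last_day)

def prorated_cents(monthly_cents: int, signup: date) -> int:
    """First invoice: charge only for signup day through end of month."""
    start, end = billing_period(signup)
    days_in_month = (end - start).days + 1
    days_used = (end - signup).days + 1
    # Round down: undercharging fractions of a cent beats overcharging.
    return monthly_cents * days_used // days_in_month
```

Even this tiny example forces decisions (leap years, inclusive day counts, rounding direction) - which is exactly why keeping the billing model conceptually simple pays off.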
One subtle problem you'll need to solve is: you must agree as a team on where your source of truth for billing information lives. As an engineer, it can be tempting to say "I know! I'll store it in Stripe!" Except - if you deal with mid market or enterprise, your customer data is probably in Salesforce or Zendesk, or written up in some sales contracts, potentially sitting in someone's email inbox. Things are fine if you truly have a self-serve product and everyone is paying the same price for a Pro or Business plan by credit card. But as soon as someone can configure a "custom" plan or wants to pay by check, and you're tracking customers and payment terms somewhere else in addition to Stripe, things quickly go off the rails (but you probably won't notice until later when you try to audit your financials).
You'll quickly realize you can't just run SQL against your production database to provide support. Building a staff interface will be critical to safely supporting your users.
You generally shouldn't have to build from scratch. Framework-specific libraries (like Avo or rails_admin for Ruby on Rails, or the default Django admin for Python) can bootstrap a lot of the heavy lifting and CRUD actions you need. Invest in this so you don't waste cycles doing dangerous one-off data fixes in production.
If you're in a regulatory domain that allows it, it is extremely useful to build a feature that lets you "assume" the identity of one of your users. Days of email back-and-forth can be eliminated by simply seeing the world as if you were logged in as that user. You can easily find and fix bugs this way, and it can have some unexpected benefits occasionally as well (like spotting or dealing with abuse).
You'll need to differentiate between who can and can't access your staff interface, so you'll need to add authorization and access control. If you're running a multi-tenant cloud application, you'll also want to enforce those access rules now. Users and internal support staff will ask for customizable role-based access control; how urgent this is depends on customer size and product domain.
You'll want to audit and prominently display sensitive actions like signups, account closures, etc, and basically any actions taken in your staff interface. This can help you understand and debug accounts, and is required for many compliance scenarios as well.
It is essentially a law of physics that whatever action can be done in your app, eventually someone will desperately want to undo it. You should consider auditing everything, including things that don't seem to be of interest now. Your staff interface will need more and more "undo" controls over time, and the audit log can be a valuable source of information to reconstruct state that was accidentally changed.
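The shape of an audit record matters more than the storage. This sketch appends to an in-memory list for illustration (a real system would write to an append-only table); the field names are assumptions, but the before/after snapshots are the part that makes undo possible later.

```python
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # stand-in for an append-only audit table

def audit(actor: str, action: str, target: str, before: dict, after: dict) -> None:
    """Record who did what, to which record, when - plus prior state."""
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "target": target,
        "before": before,  # the snapshot that makes "undo" reconstructible
        "after": after,
    })

audit(
    actor="support@example.com",
    action="update_email",
    target="user:42",
    before={"email": "old@example.com"},
    after={"email": "new@example.com"},
)
```

Capturing `before` on every mutation feels redundant right up until someone asks you to restore a value nobody wrote down anywhere else.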
Soft deletion is sometimes controversial. Any approach to soft deletion comes with substantial trade-offs, but one with the fewest unintended side effects is along the lines of "create a table called `deleted_records` and stick deleted entities in there." That way, you can undo accidental deletions without every other query in your app having to remember to filter out soft-deleted rows.
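A sketch of that `deleted_records` approach, using SQLite for illustration: the deleted row is serialized to JSON and moved aside, leaving the original table clean. The table name is interpolated into SQL here, which is only acceptable because it comes from code, never from user input - an assumption worth keeping.

```python
import json
import sqlite3

def soft_delete(conn: sqlite3.Connection, table: str, row_id: int) -> None:
    """Move a row into deleted_records instead of destroying it.

    `table` must be a trusted, code-defined name (it is interpolated into
    SQL), never user input.
    """
    conn.row_factory = sqlite3.Row
    row = conn.execute(f"SELECT * FROM {table} WHERE id = ?", (row_id,)).fetchone()
    if row is None:
        return
    conn.execute(
        "INSERT INTO deleted_records (original_table, original_id, data) VALUES (?, ?, ?)",
        (table, row_id, json.dumps(dict(row))),
    )
    conn.execute(f"DELETE FROM {table} WHERE id = ?", (row_id,))
    conn.commit()
```

Restoring is then a staff-interface feature: read the JSON back out of `deleted_records` and re-insert it, with no `WHERE deleted_at IS NULL` clauses haunting the rest of your codebase.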
You'll need to send transactional emails for things like password resets, so consider integrating with a service that can act as your SMTP relay and get all your tricky deliverability issues on autopilot. The gold standard is probably a paid service like Postmark, since they have a product that only delivers transactional mail. Otherwise you can try your luck with more budget providers like Amazon SES, Sendgrid, Mailgun, or Sparkpost.
For rich client apps, integration with OneSignal or Pusher can help you manage notification messages; otherwise Twilio is the go-to for voice and SMS.
Create a system to run things on a regular schedule, as you'll need to do this periodically (rotating out old data, sending reports, rolling up statistics, etc).
Consider setting up a system to alert you when these fail, as they can otherwise fail silently for long periods until someone notices. Dead Man's Snitch is an example service that can integrate with your monitoring setup.
At a certain point you'll need to keep track of your major dependencies and vendors. Early on, a spreadsheet with some key data is fine: name of vendor, purpose, approximate cost, who has access.
As you start to get users, customers, and the people to support them, you need to build features and systems to deal with increased scale.
Dealing with multi-factor authentication, password resets, and "I-didn't-get-the-email" problems is a distraction from the real issues you need to solve. Adding domain-specific single sign-on is usually a no-brainer and can help cut down on those types of support headaches.
If you're building & selling to enterprise, SAML SSO and such becomes a necessary checkbox very quickly. There are some services (like WorkOS) that try to handle this for you, because it's deceptively complicated, and every implementation has unique quirks that aren't necessarily captured in specifications or documentation.
Be careful about how you select and match identifiers. For example, if someone can sign up with Google, are they allowed to change their email address to one that doesn't match their Google account? The more conservative answer prevents people from accidentally creating more than one account.
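The conservative rule can be expressed in a few lines. This is a hypothetical sketch - the `sso_provider`/`sso_email` fields are made-up model attributes - showing the policy that SSO-backed accounts stay pinned to their provider-verified address.

```python
# Hypothetical user model fields: sso_provider / sso_email are assumptions.
def can_change_email(user: dict, new_email: str) -> bool:
    if user.get("sso_provider") is None:
        # Password accounts may change email (presumably with re-verification).
        return True
    # SSO accounts stay pinned to the provider-verified address, so the same
    # person can't end up with two accounts under diverging identifiers.
    return new_email.lower() == user["sso_email"].lower()
```

The case-insensitive comparison is itself a matching decision: email local parts are technically case-sensitive per the RFCs, but treating them that way in account matching mostly creates duplicate-account bugs.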
In many domains, your power users and best customers will want to manage multiple accounts from one login. Note that "manage" means different things to different people - the high order bits might be about sharing resources or user permissions between accounts, or it might be about billing. You may need to do substantial refactoring of your system to support multi account management, but this can have a substantial payoff if your power users are your best customers.
In a very strict multi-tenant setup, you might be enforcing multi-tenancy via some form of "every customer gets their own schema," "every customer gets their own partition key," or something to that effect. This is often the foundation for a scale-out strategy as well - many use cases lend themselves well to sharding by customer key. Multi account management throws a wrench in this because it implies the need for a "global" schema somewhere. Avoid partitioning or sharding your users and permissions tables until you have a solid understanding of your multi-account requirements.
Initially, every instance of abuse will be new and novel. In most domains, it's OK to plug each hole as a bad actor finds it.
After you achieve some scale, you'll need to be more proactive.
If you are building an app or service with user-generated content that's displayed to other people, you'll need to work with leadership to figure out policy enforcement around spam, botting, doxxing, pornography, terrorism, malware distribution, sanctions evasion, and so on.
Deploy reCAPTCHA or hCaptcha on any actions that can be botted. This includes sign-up, sign-in, password changes/resets, email forms, etc. Correctly implemented, these will also act as a sort of natural rate limit. You'll need to enforce rate limits in other parts of your code as well, such as on API endpoints and API clients.
Make use of quarantining / shadowban mechanics whenever possible. They can be frustrating for real users, but you need to withhold as much information as possible about how your anti-abuse mechanisms work in order to foil real adversaries.
Ideally, your business intelligence tool can plug into the data exhaust of your operational monitoring tools. Most likely, though, this will separately require a lot of conscious effort. The gold standard seems to be to get everything into Snowflake (or Redshift, if you like to trade pain for money). But if you are clever, you can create structured logs, or send everything to (for instance) StatsD for operational monitoring, and then instrument your code to emit business events through that same pipeline.
This does not solve the problem of complex analytics queries against your production database - and you don't want to run complex analytics queries against your production database. There isn't a single correct solution here because it's domain-specific, but most answers collapse to one of: querying a read replica, periodically ETL-ing data into a dedicated warehouse, or change data capture.
Much ink has been spilled about the Right Way to do change data capture. The important thing to keep in mind is that you need to treat your production system as a single producer (out of many) of data, and that many downstream systems will eventually care about every change that happens in that system. This will let downstream systems reliably construct the state of your production database without actually having to directly query it.
Depending on the volume and value of your data, you may also want to send everything to a data lake for easy re-processing later. This is a lot harder than it may seem, as it involves changes to your product development processes to ensure that any new features or changes account for the data pipeline.
Your product org is always going to want as much information about what users are doing as possible. Vendors like Amplitude will probably be your gold standard here. Heap Analytics is also plausible. At really low volumes, screen recording software like Hotjar, FullStory, or Mouseflow are viable.
You will need both one-time and ongoing investment with these products (don't believe the sales pitches). In fact, as your organization grows in sophistication, you'll need to introduce governance so that everyone is operating off the same definitions, and is looking at the same types of reports. It will save tremendous headache in the future to plan out your usage patterns and naming conventions.
Remember that thing about building a robust data pipeline because everything is architected to be stream-oriented? If you did that, you can plug in Amplitude (and similar) integrations to it here. If you didn't, then you need to directly integrate with them from your application.
Sales and marketing leaders (etc) will eventually request changes or data. Decide how you'll want to handle these, keeping in mind: you probably have better things to do than updating promotion text on the login or pricing page. As you scale you will want to move most content into a CMS so that non-technical teams can go and update marketing copy.
Google Tag Manager and other toolkits like it can be your friend here. Be thoughtful about governance and training, since this can devolve into a backdoor rat's nest of untested/unreviewed code and a source of security liabilities.
Remember that thing about building a robust data pipeline because everything is architected to be stream-oriented? If you did that, you can plug in Hubspot and Salesforce integrations to it here - hooray for loose coupling. If you didn't, then you need to directly integrate with them from your application.