This blog is my attempt to design Google Drive from scratch as a system design exercise. This is my first time writing such a blog, so I will try to keep it as simple and beginner-friendly as possible. Let us begin.
Everyone has used Google Drive (or one of its alternatives) at some point. We use it to store and retrieve documents, edit them, move them, and share them with other people, among other things.
Technical Specifications
We need this service to be highly available at all times.
We need this service to create folders and to upload, view, rename, delete, and download files. When I say files in this blog, I am generalizing the term to include documents of all file formats and extensions.
User should be able to download any of their files at any point of time.
The user should be able to move the files to any folder they want to.
Google Drive provides 15 GB of space in its free tier. We will be using the same metric for all of our assumptions here.
Let us understand the scaling aspect of our service.
Since we are dealing with a lot of users with a lot of storage (note that users on paid plans get even higher storage limits), managing all of that data in real time is a challenge.
Let us try to visualize using a rough estimation.
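As a rough sketch, here is a back-of-the-envelope calculation. The user count and the utilization rate are my own assumptions for illustration, not real Google Drive figures:

```python
# Back-of-the-envelope storage estimate. All figures below are
# assumptions for illustration, not real Google Drive numbers.
free_tier_gb = 15
assumed_users = 500_000_000       # hypothetical user count
assumed_utilization = 0.10        # assume users fill ~10% of their quota

total_gb = assumed_users * free_tier_gb * assumed_utilization
total_pb = total_gb / 1_000_000   # 1 PB = 1,000,000 GB (decimal units)

print(f"{total_pb:,.0f} PB")      # → 750 PB
```

Even with these conservative assumptions, we are in petabyte territory, which is why the storage and database choices below matter so much.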
Now, keep in mind that traditional database reads and writes are notoriously slow at this scale. This is where we start to look for alternatives, namely key-value stores like RonDB. Regardless, the database is going to be massive, so sharding is an important step here. But on what basis should the sharding be done? Suppose John from Canada uploads a file and wants Codey from Australia to download it in real time. Sharding by region is awkward in this use case; sharding by accountOwnerId works better, since every file lookup goes through its owner.
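A minimal sketch of owner-based sharding might look like this. The shard count and the use of SHA-256 are my assumptions; any stable hash would do:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(account_owner_id: int) -> int:
    """Map an accountOwnerId to a shard deterministically.

    Hashing (rather than applying modulo to raw ids) keeps the
    distribution even when ids are assigned sequentially.
    """
    digest = hashlib.sha256(str(account_owner_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# John in Canada and Codey in Australia both resolve the file through
# the owner's shard, regardless of where they themselves are located.
print(shard_for(100192))
```

Because the mapping depends only on the owner's id, every client worldwide ends up at the same shard for a given file.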
Quick side note: one can build a Google Drive clone that checks almost all the boxes apart from scalability. I built one using Appwrite as the backend service that manages auth, storage, and security, and while it works fine, it is not scalable. Check it out here
Sharding
You will see sharding used in any scale-worthy service. It is the process of splitting a large database into multiple smaller shards. This lowers the load on any single database and reduces the chance of a single point of failure. Note that sharding and partitioning are related but distinct: partitioning splits data within a single database instance, while sharding distributes it across multiple machines. Refer to the illustration below to understand sharding in detail.
As can be seen from the illustration above, we have a large database with 1 million records. Because of its sheer size, it is slow to query and is also a single point of failure. To mitigate this, we shard it by the first letter of the username and end up with 4 shards, each containing roughly a quarter of the records. This helps us scale our application and keeps our data highly available.
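The routing logic from the illustration can be sketched in a few lines. The shard names and letter boundaries here are assumptions matching the four-shard example:

```python
# Route a user record to one of four shards by the first letter of the
# username, mirroring the illustration above. Boundaries are assumed.
SHARD_RANGES = {
    "shard_1": ("a", "f"),
    "shard_2": ("g", "m"),
    "shard_3": ("n", "s"),
    "shard_4": ("t", "z"),
}

def shard_for_user(username: str) -> str:
    first = username[0].lower()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"unexpected first character: {first!r}")

print(shard_for_user("John"))   # 'j' falls in g-m
```

One caveat worth knowing: range-based keys like this can produce uneven shards (far more names start with "a" than "x"), which is why the hash-based approach discussed earlier is often preferred in practice.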
Blob and Object Storage
Unlike your college project that stores files in a relational database like MySQL, our Google Drive cannot use a relational database for storage. It is simply too slow and complex for something that is so data-heavy and does not necessarily require relationships with objects. Object storage is where the files or blobs are placed in a flat data lake with no hierarchy. Think of a data lake as a large collection of unstructured data. Due to the sheer scale of Google Drive and the extremely high amount of storage being dealt with, storing them in a structured format is not an option.
A blob, on the other hand, stands for Binary Large Object: a collection of data of unknown size. Blobs do not follow any given format and are simply a series of bits (0s and 1s), making them suitable for storing any kind of data. They work well at massive scale and can be replicated, which Google Drive requires because it keeps multiple copies of the same data. This ensures that if any server goes down, the availability of those files is not affected.
A few examples of blob storage services are Amazon S3, Azure Blob Storage, and Cloudflare R2.
File and Folder schema
For such a massive service, a robust and structured schema is a must. Below attached is an illustration of such a schema.
{
  name: fileName/folderName,
  accountOwnerId: 100192,
  id: file or folder ID,
  isFolder: true/false,
  parentId: ID of the parent folder, // for moving the file from one folder to another
  children: [] // a list of child IDs, if any
}
This schema works well for Google Drive's functionality. The accountOwnerId links the item to a specific user. The isFolder boolean indicates whether the object is a file or a folder. The children list contains the IDs of all child files and folders. The parentId is helpful when moving an item, as every file or folder has a parent, making the hierarchy easy to manage.
The infrastructure of our service
Since this is Google Drive that we are dealing with, expect the infrastructure to be pretty hefty. Nonetheless, I will try to abstract it out as much as possible without losing valuable details.
Components:
Clients: These are the users accessing the storage service. They could be mobile clients or desktop/laptop clients.
Notification Server: This server is responsible for push notifications using WebSockets (or a similar technology). It ensures that the client stays synchronized with the latest data.
Metadata Server: These servers keep track of the files and store data about the data (metadata). They also manage permissions and user preference settings.
Storage: As discussed earlier, being a storage-heavy service, we need storage that is scalable, persistent, and backed up.
Load Balancer: A key component in any large system, used to ensure high availability by preventing any one server from being overwhelmed with requests, distributing the load accordingly.
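The simplest load-distribution strategy a balancer might use is round-robin. This toy sketch (server names are made up) shows the idea; real balancers also weigh server health and load:

```python
from itertools import cycle

# Toy round-robin load balancer. Server names are illustrative.
servers = ["meta-1", "meta-2", "meta-3"]
next_server = cycle(servers)

def route_request() -> str:
    """Return the next server in rotation for an incoming request."""
    return next(next_server)

print([route_request() for _ in range(4)])  # wraps back to meta-1
```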
Refer below for a visual to understand how it all works together.
Uploading a file
Now comes the fun part. A lot happens when the user starts uploading their files, but before that, we need to discuss chunking and why it matters here. Imagine the user is uploading a 1 GB video from their phone. Uploading it in one piece would consume roughly that much server memory and would keep the request handler busy until the upload finishes. Now imagine the upload failing at 90% due to network issues: the user would have to re-upload the entire file. That does not scale. Instead, we split the large file into multiple smaller chunks. Chunks use less server memory at a time, and if an upload fails, progress can simply resume from the last successful chunk. The next illustration will make things clearer.
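A minimal sketch of the client-side chunking step, assuming a 4 MB chunk size (the size is my assumption; real services pick their own):

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per chunk (an assumed size)

def split_into_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield (index, sha256, data) triples for each chunk.

    The hash lets the server detect corrupt or duplicate chunks, and
    the index lets an interrupted upload resume from the last
    acknowledged chunk instead of starting over.
    """
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield index, hashlib.sha256(data).hexdigest(), data
        index += 1

# Simulate a 10 MB file: it splits into 4 MB + 4 MB + 2 MB.
fake_file = io.BytesIO(b"x" * (10 * 1024 * 1024))
chunks = list(split_into_chunks(fake_file))
print(len(chunks))  # → 3
```

If the upload died after chunk 1 was acknowledged, the client would simply resend chunks starting from index 2, not the whole file.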
Renaming a file or a folder
When you try to rename a file or a folder, the service first checks whether the item is fully synced to the server. If it is, the rename request can be sent immediately; if not, it is queued until the sync is complete.
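The check-then-queue flow above can be sketched as follows. The in-memory dicts stand in for the metadata store, and all names are illustrative:

```python
from collections import deque

# Stand-in metadata store: item id -> {name, synced}. Illustrative only.
metadata = {"f1": {"name": "old.txt", "synced": True},
            "f2": {"name": "draft.txt", "synced": False}}
pending_renames = deque()

def rename(item_id: str, new_name: str) -> str:
    """Apply a rename immediately if synced, otherwise queue it."""
    if metadata[item_id]["synced"]:
        metadata[item_id]["name"] = new_name
        return "applied"
    pending_renames.append((item_id, new_name))
    return "queued"

def on_sync_complete(item_id: str) -> None:
    """Drain any renames that were waiting on this item's sync."""
    metadata[item_id]["synced"] = True
    for queued_id, new_name in list(pending_renames):
        if queued_id == item_id:
            metadata[item_id]["name"] = new_name
            pending_renames.remove((queued_id, new_name))
```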
Moving a file or a folder
Moving a file or a folder is slightly more complex than it seems. Every file or folder has a parentId pointing to its parent, much like a pointer to another folder. To move an item, you update its parentId to the destination folder, and you also update the children lists of both the old and the new parent. The next illustration will make things a little clearer.
As can be seen, we are moving the Images folder from its original parent, Folder1, to its new parent, Folder2. This requires updating the parentId of the Images folder as well as the children lists of both the old and the new parent folders to reflect the change.
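The three updates described above can be sketched directly. The dict-of-dicts here stands in for the metadata store, and the folder names match the illustration:

```python
# Stand-in metadata store mirroring the illustration: Images starts
# inside folder1. Field names match the schema discussed earlier.
items = {
    "folder1": {"parent_id": "root", "children": ["images"]},
    "folder2": {"parent_id": "root", "children": []},
    "images":  {"parent_id": "folder1", "children": []},
}

def move(item_id: str, new_parent_id: str) -> None:
    """Reparent an item: one update on the item, one on each parent."""
    old_parent_id = items[item_id]["parent_id"]
    items[old_parent_id]["children"].remove(item_id)  # detach from old parent
    items[new_parent_id]["children"].append(item_id)  # attach to new parent
    items[item_id]["parent_id"] = new_parent_id       # repoint the item

move("images", "folder2")
```

In a real service these three writes would need to happen atomically (or be made safely retryable), otherwise a crash mid-move could leave the item orphaned or listed in two folders.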
Deleting a file or a folder
Deleting is an interesting aspect because the file you are trying to delete could have been shared with other users. In such a case, the service needs to notify those users about the deletion. Also, many storage services have a concept of a bin or trash, which keeps recently deleted items for a short period of time. Every file that is deleted gets a timer attached to it and is put in the bin. Once this timer expires, the metadata for that file is removed from the servers, and the data chunks are then deleted permanently. Refer to the illustration below.
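The bin-with-a-timer idea can be sketched like this. The 30-day retention period is an assumption (it matches what many storage services use), and the dict stands in for the trash table:

```python
TRASH_RETENTION_SECONDS = 30 * 24 * 3600  # assumed 30-day bin

trash = {}  # item id -> timestamp when it was deleted

def soft_delete(item_id: str, now: float) -> None:
    """Move an item to the bin and start its retention timer."""
    trash[item_id] = now

def purge_expired(now: float) -> list:
    """Permanently remove items whose timer has expired.

    In the real service this step would also delete the item's
    metadata and its data chunks from blob storage.
    """
    expired = [i for i, t in trash.items()
               if now - t >= TRASH_RETENTION_SECONDS]
    for item_id in expired:
        del trash[item_id]
    return expired

soft_delete("f1", now=0)
print(purge_expired(now=TRASH_RETENTION_SECONDS))  # → ['f1']
```

A periodic background job would call the purge step, so deletion cost is paid lazily rather than on the user's request path.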
Conclusion
This blog provides a beginner-friendly guide to designing a scalable system like Google Drive. It covers essential components such as high availability, sharding strategies, and blob storage. The article details how various operations—like uploading, renaming, moving, and deleting files and folders—are managed. Additionally, it discusses the infrastructure components, including clients, metadata servers, and load balancers, and explains the need for efficient data management with chunking and robust schema design.