If you want to leverage the power of the cloud to start deriving valuable insights from your data, you’ll need to start by finding somewhere to store it. Before you do that, you’ll need to pick a platform. Let’s say you’ve carefully considered the offerings of a few cloud providers against your requirements and you’ve decided that you want Google’s big data experience in your corner. After all, Google is responsible for a number of innovations in big data, such as MapReduce and Dremel. Before you start cramming all of your data into BigQuery though, let’s look at a more flexible storage service – Cloud Storage.
There is a hands-on element to this post, so if you want to follow along I’ll assume that you’ve already signed up for Google Cloud and you’ve got a project ready to go. If you haven’t done this already, you’ll need to sign up for a free trial. You’ll need to provide a payment method for this, but don’t worry – you won’t be charged unless you opt to activate your account, even if the trial expires or you run out of credits. Simply fill in the forms and you’ll soon be set up with your very first Google Cloud project.
What is Cloud Storage?
Cloud Storage is an object storage service. When you set up your first Cloud Storage bucket, it looks and feels much like the file system you’re used to using on your everyday device. The cloud console does a great job in this respect, but it’s important to know that that’s pretty much where the similarities end. With Cloud Storage, files are stored as objects (or blobs), allowing for features that a traditional file system simply can’t support, such as customisable metadata, versioning and lifecycle rules. Objects are immutable – once you upload a file, it can’t be modified in place. You can, however, download a copy of the object, modify it and then re-upload it, overwriting the original object.
Unlike traditional file systems, Cloud Storage stores your data in a manner that allows petabyte-scale growth without performance problems. Your data is also replicated across data centres, either at a regional level or at a multi-regional level. This means that should an incident cause an outage at one data centre (or an entire region, for multi-regional storage), you will still have access to replicas of your data in other locations.
Cloud Storage offers a number of different storage classes to suit a range of data access needs. These range from the Standard class for high-frequency data access, all the way down to the Archive class for data that is accessed once per year (or less). While choosing the appropriate storage class can help you keep storage costs down, you shouldn’t try to cut costs by opting for a storage class that doesn’t meet your access needs. There are additional costs associated with accessing data in Nearline, Coldline and Archive storage, so an overly cold class may end up costing you more in the long run.
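To make that trade-off concrete, here’s a back-of-the-envelope comparison. The prices below are illustrative assumptions for the sake of the arithmetic, not real Cloud Storage list prices – always check the pricing page before you decide:

```python
# Rough storage class comparison. PRICES ARE MADE-UP EXAMPLES - check the
# Cloud Storage pricing page for real, current figures for your region.
STANDARD = {"storage_per_gb": 0.026, "retrieval_per_gb": 0.0}
NEARLINE = {"storage_per_gb": 0.010, "retrieval_per_gb": 0.01}

def monthly_cost(storage_class, stored_gb, read_gb):
    """Storage cost plus retrieval cost for one month."""
    return (stored_gb * storage_class["storage_per_gb"]
            + read_gb * storage_class["retrieval_per_gb"])

# 1 TB that's rarely read: the colder class wins comfortably.
print(monthly_cost(STANDARD, 1024, 10))  # about 26.62
print(monthly_cost(NEARLINE, 1024, 10))  # about 10.34

# The same 1 TB read in full five times a month: retrieval fees flip it.
print(monthly_cost(STANDARD, 1024, 5 * 1024))  # about 26.62
print(monthly_cost(NEARLINE, 1024, 5 * 1024))  # about 61.44
```

The point isn’t the exact numbers – it’s that retrieval charges scale with how often you read, so a cold class only pays off for genuinely cold data.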
You can control who has access to the data in your Cloud Storage buckets with Cloud IAM. Access control can either be fine-grained, where you control who has access to individual objects, or uniform, where permissions apply to all objects in the bucket. Cloud Storage also provides Google-managed server-side encryption for your data by default, but you can provide your own encryption keys if you prefer.
When Should I use Cloud Storage?
Cloud Storage is a great option for storing unstructured and raw data. In fact, you can store almost anything in a Cloud Storage bucket. Text data? No problem. Images? Sure. Code artefacts? Absolutely. Your extensive archive of cat videos? You bet. There are plenty of use cases for Cloud Storage; some examples include:
Data lake – Cloud Storage plays nicely with many of the Google Cloud big data & AI offerings. You don’t need to spend time wrangling with schemas: just load your data into your chosen data processing platform and away you go. This can help you start getting insights from your data quickly. You have a number of tools at your disposal for analysing data, including Dataproc, Dataflow, AI Platform Notebooks and BigQuery.
Staging area for code artefacts – you can use Cloud Storage to store your artefacts before you deploy them to your target service as part of your build/release process. Many Google Cloud services do this behind the scenes already (such as App Engine and Dataflow).
Backups and disaster recovery storage – data replication and high durability make Cloud Storage a good choice for backups. You’re not likely to need access to this type of data frequently, so choosing a storage class suited to lower access frequency would be a good option here.
Extract, Transform, Load (ETL) – sometimes your data will need to go through multiple stages of transformation, and you’ll need somewhere to put it as you transfer it between services. For example, you can use Cloud Storage as a landing zone for your raw data before spinning up a Dataflow pipeline to transform and load it into a BigQuery table.
Cloud Storage costs are low, which makes it a good choice when you need to store large volumes of data. At the time of writing, it costs around $0.026 per GB per month for a Standard class multi-regional bucket. This doesn’t include the costs associated with reading and writing data, and costs will vary depending on storage location and class. If you plan to hold large quantities of data, consider compressing it to keep storage costs down. You can read more about Cloud Storage pricing here.
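As a quick illustration of the compression tip, here’s a stdlib-only sketch – repetitive data such as logs or CSV exports often shrinks dramatically before upload:

```python
# Compressing data before upload to cut storage costs. gzip ships with
# the Python standard library; repetitive text compresses very well.
import gzip

raw = b"2020-01-01,INFO,request served\n" * 10_000  # a fake log file
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")

# The gzipped bytes are what you'd upload; naming the object with a
# .gz suffix tells consumers they need to decompress it first.
assert gzip.decompress(compressed) == raw  # round-trips losslessly
```

The trade-off is that compressed objects can’t be read partially, and tools downstream need to decompress before processing.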
Cloud Storage In Action
It’s time to get hands-on. We’re going to walk through some Cloud Storage basics using the cloud console UI. You can also interact with Cloud Storage programmatically using the client libraries for your preferred language, or the Cloud SDK and Cloud Shell command line tools. We won’t be covering these tools here, but you can follow the links for more detail on those.
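If you’d rather script the steps in this walkthrough than click through the console, a minimal sketch using the official Python client library (`google-cloud-storage`) might look like this. The bucket and file names are placeholders, and nothing runs until you call `main()` with the library installed and application default credentials configured:

```python
# A sketch of this post's workflow with the official Python client
# library (pip install google-cloud-storage). Names are placeholders.
def main():
    from google.cloud import storage  # imported here: needs the library

    client = storage.Client()  # uses your application default credentials

    # Create a regional, Standard class bucket (names are globally unique).
    bucket = client.create_bucket("my-example-bucket", location="europe-west2")

    # Upload a local file as an object; the "folder" is just a name prefix.
    blob = bucket.blob("raw-data/sales.csv")
    blob.upload_from_filename("sales.csv")

    # List everything under that prefix.
    for obj in client.list_blobs("my-example-bucket", prefix="raw-data/"):
        print(obj.name)
```

Treat this as a starting point rather than production code – you’d normally add error handling and reuse an existing bucket instead of creating one each run.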
Head on over to the console and open the Cloud Storage browser so we can get started. You’ll need to click the navigation menu hamburger icon to do this.
Before we can start storing data, we need to create a bucket. A Standard class bucket with regional replication will work fine for our needs. To start, click the Create bucket button and fill in the form as follows:
- In the name your bucket section, type a unique name for your bucket.
- In the choose where to store your data section, choose the regional location type, and then select a location of your choice from the location drop down.
- In the choose a default storage class for your data section, leave the class as standard.
- In the choose how to control access to objects section, leave the access control as fine-grained.
- In the advanced settings section, leave everything as default.
Once you’ve filled in the form, hit the create button to create your bucket. This will open your new bucket where you’ll be able to configure your bucket and start adding data.
Working with Buckets
Before you start adding data, it’s a good idea to consider whether you need to configure retention policies, lifecycle rules or object versioning on your bucket. With your bucket open in the storage browser, you’ll notice the Bucket Lock tab towards the top of the page. From this tab, you can set a retention policy and object lifecycle rules.
A retention policy protects your objects from deletion for a minimum period of time after they are uploaded. You can set a retention policy by clicking the ‘+’ icon under the retention policy section. This will open a dialog and allow you to define a minimum retention period that can range from a few seconds to several years. Once a retention policy is in place, you will then have the option to lock the policy. When the policy is locked, it can no longer be removed from the bucket.
Lifecycle rules can be applied by clicking the ‘add lifecycle rule’ button at the top of the page. This opens a new page that allows you to set the conditions that trigger the rule and the action to take when those conditions are met. You can use lifecycle rules to delete or archive objects once they reach a certain age. This can help keep storage costs down by automatically deleting objects you no longer need, or moving objects you access less often to cheaper storage classes.
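If you prefer to manage lifecycle rules as code rather than through the UI, the same rules can be expressed as a JSON config and applied with `gsutil lifecycle set lifecycle.json gs://<your-bucket-name>`. Here’s a small sketch that generates such a config – the ages are arbitrary examples, so tune them to your data:

```python
# Lifecycle rules as a JSON config, in the format accepted by
# `gsutil lifecycle set`. The ages below are arbitrary examples.
import json

lifecycle = {
    "rule": [
        # After 30 days, demote objects to the cheaper Nearline class.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        # After a year, delete them entirely.
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}

with open("lifecycle.json", "w") as f:
    json.dump(lifecycle, f, indent=2)
```

Keeping the config in version control makes it easy to apply the same rules consistently across buckets.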
Lastly, to enable versioning of your objects in Cloud Storage, you will need to use the gsutil command line tool. A quick way to access this is by opening Cloud Shell from the console toolbar. You can enable versioning with the command:
gsutil versioning set on gs://<your-bucket-name>
Object versioning can be useful for tracking changes to objects over time. Be aware that versioning will increase storage costs, as you will be storing multiple versions of each object as they change over time. You can read more on how to work with versioned objects here.
With your shiny new bucket open, you can now upload some data. To add data, you can either drag files from your local device onto the bucket browser, or you can click the Upload files button. First though, let’s create a folder to organise our data. Click the Create folder button and give your folder a name.
Now, let’s upload some data to our new folder. Click your new folder, then grab a few files from your local machine and drop them onto the browser. A popup will appear in the corner of your screen as the files start uploading to the bucket.
You should note that there isn’t really a notion of folders in object storage – this is a trick of the UI. Objects are stored in a flat namespace, so when you create a ‘folder’, what you’re actually doing is adding a prefix to the object’s name.
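To illustrate, here’s a small sketch of how a UI might reconstruct ‘folders’ from a flat list of object names (the names are made up):

```python
# "Folders" in a bucket are just name prefixes over a flat namespace.
# A sketch of how a UI can group flat object names into folders:
object_names = [
    "raw-data/2020/jan.csv",
    "raw-data/2020/feb.csv",
    "processed/jan.parquet",
    "readme.txt",
]

def top_level_folders(names):
    """Collect the first path segment of every name containing a '/'."""
    return sorted({name.split("/", 1)[0] for name in names if "/" in name})

print(top_level_folders(object_names))  # ['processed', 'raw-data']
```

This is also why listing by prefix (for example, everything under `raw-data/`) is the natural way to query objects in a bucket.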
Working With Objects
Now that we have some objects in our storage bucket, let’s look a little closer at the ways we can interact with them. In the storage browser, you should now see a list of all the objects created from the files that you uploaded. You’ll notice each object has a number of columns describing its properties, such as its size and type. Clicking on an object’s name will open a page that gives you more information about the object, displaying a brief summary and a link to its content.
To the far right of the storage browser, you’ll notice each object has an icon that opens a drop-down list of actions you can perform on the object. If you click this, you’ll see that we can add metadata, modify permissions (because we chose fine-grained access control) and perform other common tasks such as renaming, copying and moving our objects. Let’s try adding some custom metadata to an object. Click the icon next to an object of your choice and then choose add metadata. In the popup that follows, add a new item, then type a new key-value pair. I’ve opted to add a Language property with English as the value, but you can add anything you like.
Metadata items are key-value pairs that describe objects. A good use case for metadata is facilitating decisions about how an object should be processed by your applications or data processing pipelines. For example, we could route objects whose content is in different languages to different storage buckets for further processing.
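As a sketch of that idea, here’s a tiny routing function keyed on the custom Language metadata we just added. The bucket names are hypothetical:

```python
# A sketch of routing objects by custom metadata, as described above.
# The destination bucket names are made-up examples.
def pick_destination(metadata):
    """Choose a destination bucket from an object's metadata dict."""
    language = metadata.get("Language", "unknown")
    routes = {
        "English": "processing-english",
        "French": "processing-french",
    }
    return routes.get(language, "processing-other")

print(pick_destination({"Language": "English"}))  # processing-english
print(pick_destination({}))                       # processing-other
```

In practice a function like this might run in Cloud Functions or a Dataflow step, reading each object’s metadata and copying it to the chosen bucket.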
Last of all, you can delete an object by ticking the check box to its left in the storage browser and then hitting the delete button. If a retention policy is in place, you can only delete objects once the retention period has passed.
We’ve covered a lot of ground, so let’s summarise some of the key points to finish:
- Cloud Storage is an object storage service.
- To store data in Cloud Storage, you first need to create a bucket.
- Your data is stored as objects in Cloud Storage. Objects have metadata you can use to describe their content.
- You can store petabytes of data in Cloud Storage without performance problems.
- You can store almost any type of data in Cloud Storage.
- When storing large volumes of data, compression can help you keep storage costs down.
- If you don’t need frequent access to your data, choosing the Nearline, Coldline or Archive storage class can help reduce storage costs.
- Your data is replicated across multiple data centres in a region for regional storage, or multiple regions for multi-regional storage. This helps protect you from data loss.
- Setting a retention policy on a cloud storage bucket will prevent objects from being deleted for a specified period of time.
- Lifecycle rules can automatically delete or archive cloud storage objects for you, to help keep costs under control.
- You can enable versioning of cloud storage objects with the gsutil command line tool to track changes to your data over time.