Karan Singh

Code Never Lies, Comments Sometime Do !!

Ceph Object Storage : Part-I (the Internals)

Ceph Object Storage

There is some performance difference between pure RADOS writes (e.g. via RadosBench) and RGW writes. Several factors contribute to this:

  • Object storage access protocols (S3 / Swift) carry more overhead than native RADOS writes
  • Client write requests are translated by RGW, which adds latency and can create additional bottlenecks
  • Most importantly, RGW maintains bucket indices that must be updated every time a write operation is done, whereas native RADOS writes have no such overhead of maintaining indices / metadata

In this blog post I will talk about a new feature that landed in Ceph Jewel v10.1.0, officially known as Indexless Buckets and unofficially as Blind Buckets. Before diving into indexless buckets, let’s understand what RGW does under the covers with a write request.

How does RGW perform a write operation?

The RGW object body consists of two sections, HEAD and TAIL. The HEAD section holds the first stripe of the object along with its metadata, while the TAIL section holds the subsequent object stripes.

When a write request comes to RGW, it takes the following actions:

  1. RGW stripes the object based on the rgw_obj_stripe_size setting
  2. It then divides these stripes into smaller chunks based on rgw_max_chunk_size
  3. It then opens RADOS handles (threads) to write these chunks to the cluster
  4. RGW synchronously writes the HEAD section with the first object chunk and performs the first phase of the bucket index update
  5. It then completes writing the subsequent object chunks in the TAIL section
  6. Finally, RGW asynchronously performs the second phase of the bucket index update to record that the write is complete
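
To make steps 1 and 2 concrete, here is a minimal Python sketch of the stripe / chunk split. It is only a model of the steps listed above, not RGW's actual code: the function names are hypothetical, and the 4 MiB stripe size and 512 KiB chunk size are illustrative values I picked for the example.

```python
def split_ranges(length, piece_size, base=0):
    """Return (offset, length) pairs covering `length` bytes in pieces of at most `piece_size`."""
    return [(base + off, min(piece_size, length - off))
            for off in range(0, length, piece_size)]


def plan_write(object_size, stripe_size, chunk_size):
    """Model steps 1-2: cut the object into stripes, then cut each stripe into chunks."""
    plan = []
    for stripe_off, stripe_len in split_ranges(object_size, stripe_size):
        chunks = split_ranges(stripe_len, chunk_size, base=stripe_off)
        plan.append({"stripe": (stripe_off, stripe_len), "chunks": chunks})
    return plan


MiB = 1024 * 1024
# Assumed sizes for illustration only: 10 MiB object, 4 MiB stripes, 512 KiB chunks.
plan = plan_write(10 * MiB, stripe_size=4 * MiB, chunk_size=512 * 1024)
for i, entry in enumerate(plan):
    # Per the description above, the first stripe belongs to HEAD, the rest to TAIL.
    role = "HEAD" if i == 0 else "TAIL"
    print(role, entry["stripe"], len(entry["chunks"]), "chunks")
```

The last stripe is simply whatever is left over, so it can be shorter than the others and carry fewer chunks.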

So each RGW object write costs (1 x HEAD write + n x TAIL writes) + 2 index update operations.
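
The formula above can be turned into a tiny calculator. This is a sketch under a simplifying assumption of mine: that n equals the number of stripes after the first one, i.e. one RADOS write per stripe. The function name and the 4 MiB stripe size are hypothetical and only there for illustration.

```python
import math


def write_ops(object_size, stripe_size):
    """Count operations for one object write on an indexed bucket, per the formula above."""
    # Assumption: one write per stripe; even an empty object still needs its HEAD.
    stripes = max(1, math.ceil(object_size / stripe_size))
    head_writes = 1
    tail_writes = stripes - 1          # the "n" in the formula
    index_updates = 2                  # two-phase bucket index update
    return {
        "head": head_writes,
        "tail": tail_writes,
        "index": index_updates,
        "total": head_writes + tail_writes + index_updates,
    }


MiB = 1024 * 1024
ops = write_ops(10 * MiB, stripe_size=4 * MiB)
print(ops)
```

Note how the two index updates are a fixed cost per object, so they weigh much more on small objects than on large ones.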

How are RGW bucket indices stored?

  • RGW stores bucket indices as normal RADOS objects in the cluster
  • If bucket sharding is enabled (which is a good thing from a performance point of view), these indices are sharded across multiple RADOS objects to improve parallelism, which contributes to better performance
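
To illustrate the idea of sharding an index, here is a minimal sketch that maps each object key to one of several index shards by hashing the key. Everything in it is an assumption for illustration: RGW uses its own hash function and its own naming for index objects, while this sketch uses zlib.crc32 and made-up key names.

```python
import zlib
from collections import Counter


def shard_for_key(object_key, num_shards):
    """Pick the index shard for an object key (illustrative hash, not the one RGW uses)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return zlib.crc32(object_key.encode("utf-8")) % num_shards


# Hypothetical workload: 10,000 object keys spread over 8 index shards.
keys = [f"photos/img-{i:05d}.jpg" for i in range(10_000)]
spread = Counter(shard_for_key(k, 8) for k in keys)
for shard_id in sorted(spread):
    print(f"shard {shard_id}: {spread[shard_id]} entries")
```

With a single shard every index update lands on the same RADOS object; with several shards, updates for different keys can proceed in parallel on different objects.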

How does write performance improve with Indexless Buckets?

When blind / indexless buckets are configured, RGW initializes the bucket index only once, when the bucket is created; the index is never updated on subsequent writes to that bucket. This saves quite a lot of small write IO, which greatly improves object storage write performance.
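
A rough way to see where the saving comes from is to count operations for a workload with and without index updates. This sketch counts operations only; it says nothing about actual throughput or latency. The one-write-per-stripe model, the function names, and the sizes are assumptions of mine.

```python
import math


def ops_per_object(object_size, stripe_size, indexless):
    """Data writes (one per stripe, assumed) plus index updates for a single object."""
    data_writes = max(1, math.ceil(object_size / stripe_size))
    index_updates = 0 if indexless else 2   # blind buckets skip both index phases
    return data_writes + index_updates


def compare(object_sizes, stripe_size):
    """Return (ops on indexed bucket, ops on indexless bucket, fraction of ops avoided)."""
    indexed = sum(ops_per_object(s, stripe_size, False) for s in object_sizes)
    blind = sum(ops_per_object(s, stripe_size, True) for s in object_sizes)
    return indexed, blind, 1 - blind / indexed


MiB = 1024 * 1024
# Hypothetical workload: 1,000 small objects of 64 KiB each, 4 MiB stripes.
indexed, blind, avoided = compare([64 * 1024] * 1000, stripe_size=4 * MiB)
print(indexed, blind, f"{avoided:.0%}")
```

For small objects the index updates dominate the operation count, which is why blind buckets help most on small-object workloads.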

So what’s the trade-off?

With indexless buckets RGW performs significantly fewer index updates, which saves quite a lot of small write IO; that means lower disk saturation and higher overall performance. According to community references, performance improvements of up to 60-80% have been seen after implementing indexless buckets. (I will share some more data on that in upcoming blog posts.)

Since indexless buckets are blind, i.e. the buckets hold no metadata about the objects stored in them, such buckets can’t list objects. Your data is all safe and secure inside the bucket; you just can’t list that bucket.

There are several use cases where you don’t care about bucket indexes and metadata, so why lose performance on something you don’t want to use? Or maybe you can store the index / metadata at the application level, so you don’t need to duplicate that effort at the storage level and thus gain some additional performance.
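
As an example of keeping the index at the application level, here is a minimal sketch of a catalog that records every key the application uploads, so the application can list its own objects even though the bucket can't. All names here (ObjectCatalog, upload, put_object) are hypothetical, SQLite is just one possible store, and put_object stands in for whatever S3 / Swift client call your application already uses.

```python
import sqlite3


class ObjectCatalog:
    """Application-side record of which keys were written to which bucket."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "bucket TEXT NOT NULL, key TEXT NOT NULL, size INTEGER NOT NULL, "
            "PRIMARY KEY (bucket, key))"
        )

    def record(self, bucket, key, size):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)", (bucket, key, size)
            )

    def forget(self, bucket, key):
        with self.db:
            self.db.execute(
                "DELETE FROM objects WHERE bucket = ? AND key = ?", (bucket, key)
            )

    def list_keys(self, bucket, prefix=""):
        rows = self.db.execute(
            "SELECT key FROM objects WHERE bucket = ? AND substr(key, 1, ?) = ? "
            "ORDER BY key",
            (bucket, len(prefix), prefix),
        )
        return [row[0] for row in rows]


def upload(catalog, put_object, bucket, key, data):
    """Write the object first, then record it, so the catalog never lists a missing object."""
    put_object(bucket, key, data)       # hypothetical callable wrapping your storage client
    catalog.record(bucket, key, len(data))


# Demo with an in-memory dict standing in for the object store.
fake_store = {}
catalog = ObjectCatalog()
for name in ("logs/b.txt", "logs/a.txt", "img/c.png"):
    upload(catalog, lambda b, k, d: fake_store.__setitem__((b, k), d), "blind", name, b"data")
print(catalog.list_keys("blind", prefix="logs/"))
```

The catalog is only as good as the discipline around it: anything written to the bucket without going through upload() will be invisible to list_keys().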

Enough talking, show me the code!

To keep this easy to follow, I promise to write a second episode of this blog post, showing you the actual implementation of Indexless Buckets!!! Stay tuned!!!

Special thanks to my partner in crime, Kyle @ Red Hat

(Update): Here is the pointer to Episode-2, Implementation of Indexless Buckets