I know I have been stuck with SQL server for too long and a lot behind in adopting newer database technologies, but late is better than never. While updating a web application which uses SQL Server, I realized the current relational structure of my database is not really the optimal solution. I have more than 120 million rows of data in one of the tables which represent Option Chains for Stocks, one row per Option. I store this data as snapshots in time and do not change them after they are stored. Anyone familiar with this type of data knows that these individual rows are not really interesting by themselves and they are normally looked at alongside others which belong to the same Stock, and with same expiration. In real world you are presented with the whole chain (see an example of such data here).
Now, imagine you have to reconstruct this chain from those individual rows in my large table with 120 million rows. Size of this database on disk is more than 22 GB and it needs a decent machine to run with acceptable performance in a multi-user environment.
This data is ideal candidate for a document storage solution like MongoDB or ElasticSearch. The document structure looks like this:
{
"options" : {
"mappings" : {
"goog" : {
"properties" : {
"calls" : {
"properties" : {
"a" : {
"type" : "string"
},
"b" : {
"type" : "string"
},
"o" : {
"type" : "string"
},
"st" : {
"type" : "string"
},
"v" : {
"type" : "string"
}
}
},
"expires" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"p" : {
"type" : "double"
},
"puts" : {
"properties" : {
"a" : {
"type" : "string"
},
"b" : {
"type" : "string"
},
"o" : {
"type" : "string"
},
"st" : {
"type" : "string"
},
"v" : {
"type" : "string"
}
}
},
"stamp" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
}
}
To make sure I'm choosing the appropriate solution for storing this data, I decided to test and compare both MongoDB and ElasticSearch, and my comparison vectors are these:
- Disk Storage consumed
- Memory utilization
- CPU Utilization
- Search Performance (time)
I used identical virtual machines with these configurations:
- 1 vCPU
- 4 GB RAM
- 50 GB storage
- CentOS (2.6.32-431)
Unfortunately I did not measure the time it took to migrate data from SQL server (which was in the range of hours), but I'm sure it took more time in MongoDB than ElasticSearch.
After data was loaded, I rebooted both VMs so I can start with a clean slate. My test consist of 10,000 searches for documents, from which 6,451 do exist in the data stores and 3,549 do not. Test tool runs on a separate VM running on the same hardware, and it is configured to use 4 threads in parallel. 2 tests are executed on each server, one uses the ID of the document and the other uses 2 date fields as the query criteria.
Here is the comparison of the Disk and Memory used before and after the first execution of the test, and clearly MongoDB uses more disk space to store the same data, and more RAM during the test:
|
MongoDB |
ElasticSearch |
Disk space used |
7.3 GB |
2.3 GB |
Memory used Before |
241 MB |
426 MB |
Memory used After |
3,716 MB |
604 MB |
Average CPU usage |
Cold Start: mostly under 20% Subsequent runs: mostly 99% |
mostly 99% |
And this table shows the time it took to do the 10,000 searches for each test, during the first run, when no data is cached:
Cold Start |
MongoDB |
ElasticSearch |
Lookup by 2 date fields |
78.46 Sec |
23.65 Sec |
Lookup by ID |
12.55 Sec |
5.76 Sec |
So in cold start, ElasticSearch was 200% to 300% faster.
Following table shows the execution times but after a few repeated runs (to show impact of caching):
Subsequent runs |
MongoDB |
ElasticSearch |
Lookup by 2 date fields |
12.34 Sec |
8.33 Sec |
Lookup by ID |
1.42 Sec |
1.67 Sec |
Conclusion:
ElasticSearch turned out to be using less disk space, less memory and have overall better performance. Now, some people may argue that with more tweaks I could get different results. For example, I did not create an index in MongoDB, but I did not do any tweaks in ElasticSearch either.
Obviously this is the way I chose to test and compare the two solutions, and everyone should do their own due diligence before selecting a technology. So please keep me posted about your results.
35034884-487d-4e54-aab1-d72c98ab90aa|1|5.0|96d5b379-7e1d-4dac-a6ba-1e50db561b04