Quote:
Originally Posted by kevin_owner
So I guess that text files are way faster than a database.
As a programmer, you have to learn to be more scientific with your research and analysis and come to more realistic conclusions, rather than just taking what you see on the surface and applying it to the entire subject.
Right now, it's like you are using a fork to cut a stick of butter in half and coming to the conclusion that a fork can cut as fast as a knife, simply because of one test. Cutting a stick of butter with a fork is as fast as using a knife only because of how small sticks of butter are. However, you know that logic is flawed, because if you had to cut a cake with a fork, it wouldn't be as fast or as clean as using a knife.
The same is true of comparing flat files to a database. You can't take one test's results and come to a broad conclusion; you can only come to a conclusion in the context of what you tested. The test might have been flawed (as yours currently is), in which case you come to the wrong general conclusion, build a system based on the wrong results, and that comes back to hurt you later on.
Now, to get to why your current test is flawed.
The first stage of the test would be opening the data for access. For a flat file, that means calling fopen or the open member function of an input stream. For a database, that's calling the appropriate connect function. You can benchmark these if you want, but the results only matter if your design performs that logic frequently. I.e., if it takes a couple of seconds to connect to a database, but you only need to connect once at startup, it's irrelevant compared to a flat file taking no time.
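For example, here's a minimal timing sketch for that first stage. It assumes SQLite as the database (the original test doesn't name one), and the file names are placeholders:

Code:
#include <chrono>
#include <cstdio>
#include <sqlite3.h> // assuming SQLite; swap in your database's connect call

int main()
{
    using clock = std::chrono::steady_clock;

    // Stage one for the flat file: open it.
    auto t0 = clock::now();
    FILE* file = std::fopen("data.txt", "rb"); // placeholder file name
    auto t1 = clock::now();

    // Stage one for the database: connect to it.
    sqlite3* db = nullptr;
    auto t2 = clock::now();
    sqlite3_open("data.db", &db); // placeholder database name
    auto t3 = clock::now();

    using us = std::chrono::microseconds;
    std::printf("fopen:        %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("sqlite3_open: %lld us\n", (long long)std::chrono::duration_cast<us>(t3 - t2).count());

    if (file) std::fclose(file);
    if (db) sqlite3_close(db);
}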
The second stage of the test would be loading the data to memory, since that's what you seem to want to be doing here. With a flat file, you just read the data into your linear array so it is ready for searching.
However, for a database, you incorrectly benchmarked a different task. Querying a database for data vs. searching a linear array for it has nothing to do with flat file access vs. database access; it has to do with in-memory access vs. database access. Obviously, with such a small set of data, searching memory is significantly faster than querying the database, simply by design. The correct thing to do would be to query the entire database table once and then load the results into an array.
As you can see, if you actually do that, there is nothing to benchmark really because the solution to the problem is the same for both methods. That is, you load data from one storage medium (flat files / database) into memory and then search it. The time it takes to open/connect and then load/query can be measured, but the searching itself will be the same.
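As a rough sketch of the load-once approach on the database side (again assuming SQLite, with a hypothetical entities table and an integer id column):

Code:
#include <algorithm>
#include <vector>
#include <sqlite3.h> // assuming SQLite again

// Query the whole (hypothetical) "entities" table once and load it into memory.
std::vector<int> load_ids(sqlite3* db)
{
    std::vector<int> ids;
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, "SELECT id FROM entities", -1, &stmt, nullptr) == SQLITE_OK)
    {
        while (sqlite3_step(stmt) == SQLITE_ROW)
            ids.push_back(sqlite3_column_int(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return ids;
}

// Every search after that is a plain in-memory search,
// identical to what you'd do after reading a flat file.
bool has_id(const std::vector<int>& ids, int id)
{
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}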
In the context of your original test, you can't really compare flat file searching to database searching because of how Windows' file caching works. You'd always get skewed results, since the file contents would always be in memory, so access times are a lot faster than if the cache had to be flushed and the file reloaded for each search. I came across that problem when writing one of my PK2 APIs and using memory-mapped files, for example. It greatly changed the results of my tests!
Another reason you can't accurately benchmark loading a flat file into memory and compare it to a database, generally speaking, is that you will hit physical memory limitations with flat files that you won't with a database. Let's say you had 10 GB of database data and 10 GB of flat file data. Unless you actually had a 64-bit system and a lot of RAM, you could not run your current benchmark of loading a flat file into memory and searching it linearly vs. querying a database. Even if you could, which do you think would win then? I'd be willing to put my money on the DB.
This is why you have to be more scientific in your benchmarks. You can't say "flat files are faster than a database" because that statement makes no sense. You can say, "given a small set of data, accessing data in memory is faster than querying a database each time", and that would be generally true according to your test results. However, just because you came to those results doesn't mean they are always true. The database's settings matter, as does the system running the tests.
Lastly, "flat files" generally describes free-standing files in a system. Text files are a specific type of flat file in which data is represented as, well, text. There are other formats you can use, and text files are not that efficient for the type of data you are working with. Binary files would be a better choice, since you process the data once, dump it to a file, and can then load it straight into memory without the additional parsing overhead text files carry. Note that in some cases there will be additional overhead from database type conversions as well! You have to factor that into a benchmark too, according to how the database access semantics you are using work.
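As a sketch of what a preprocessed binary format buys you (the Record struct and its fields are made up for illustration; dumping raw bytes like this is only safe for trivially copyable types with no pointers or dynamic members):

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record
{
    std::uint32_t id;
    float x, y, z; // made-up fields for illustration
};

// Preprocess once: dump the records as raw bytes.
void save_records(const char* path, const std::vector<Record>& records)
{
    FILE* f = std::fopen(path, "wb");
    if (!f) return;
    std::uint32_t count = (std::uint32_t)records.size();
    std::fwrite(&count, sizeof(count), 1, f);
    std::fwrite(records.data(), sizeof(Record), records.size(), f);
    std::fclose(f);
}

// At startup: read the bytes straight back into memory,
// no text parsing or type conversion needed.
std::vector<Record> load_records(const char* path)
{
    std::vector<Record> records;
    FILE* f = std::fopen(path, "rb");
    if (!f) return records;
    std::uint32_t count = 0;
    if (std::fread(&count, sizeof(count), 1, f) == 1)
    {
        records.resize(count);
        if (std::fread(records.data(), sizeof(Record), count, f) != count)
            records.clear();
    }
    std::fclose(f);
    return records;
}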
For your project, if all you need to do is load static data into memory and then search by a 'key', then preprocessed binary flat files are the way to go. They can be loaded from disk to memory the fastest and require no additional processing. You do not have to do a linear search on them, though. You can simply load them into a vector or a list and then store iterators in a map that maps each id to the object's iterator in the list/vector. That way, you get fast lookup times at only the small overhead cost of maintaining the map, a tradeoff well worth it to avoid performing a linear search each time.
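A minimal sketch of that lookup structure, reusing the made-up Record type from above (std::list is used here so the stored iterators stay valid; with std::vector you'd store indices instead, since vector iterators are invalidated on reallocation):

Code:
#include <cstdint>
#include <iterator>
#include <list>
#include <map>

struct Record
{
    std::uint32_t id;
    float x, y, z; // made-up fields for illustration
};

// Owning container plus an index for fast lookups by id.
struct RecordStore
{
    std::list<Record> records;
    std::map<std::uint32_t, std::list<Record>::iterator> by_id;

    void add(const Record& r)
    {
        records.push_back(r);
        by_id[r.id] = std::prev(records.end()); // list iterators stay valid
    }

    // O(log n) map lookup instead of an O(n) linear search.
    const Record* find(std::uint32_t id) const
    {
        auto it = by_id.find(id);
        return it == by_id.end() ? nullptr : &*it->second;
    }
};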
A database is only useful in your case if you wanted to search by asking "I want a list of entity ids that are N units away from point X,Y,Z" or "I want a list of entity ids that are between a height of A and B". If you have no need for such context-specific queries, the data is not going to change at run time, and you are working with a small set of data, there's no real benefit to using a database to load the data in this particular case.
Going back to what was said earlier about startup times: if all you do is load this data once, then the overhead of any method is really irrelevant. Loading it from the PK2, as was mentioned, is a viable solution as well. However, I'd prefer custom tools that process the PK2 into your own useful formats, simply because of the flexibility that grants you.
Anyways, keep these things in mind with anything you do. Don't focus solely on raw performance, as it is often meaningless in most contexts. People who get obsessed with performance in C++ end up costing themselves a lot of productivity, and their projects suffer as a result.
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” Get something working that you are comfortable with and profile later. Don't worry about having to recode something because of a design decision you made because it will happen regardless. That's just the way it works when you code first without a design.
Good luck with your project!