Quote:
Originally Posted by kevin_owner
So I guess that text files are way faster than a database.
As a programmer, you have to learn to be more scientific with your research and analysis and come to more realistic conclusions, rather than just taking what you see on the surface and applying it to the entire subject.
Right now, it's like you are using a fork to cut a stick of butter in half and coming to the conclusion that a fork can cut as fast as a knife, simply because of one test. Cutting a stick of butter with a fork is as fast as using a knife only because of how small sticks of butter are. However, you know that logic is flawed, because if you had to cut a cake with a fork, it wouldn't be as fast or as clean as using a knife.
The same is true of comparing flat files to a database. You can't take one test's results and come to a broad conclusion; you can only come to a conclusion in the context of what you tested. The test might have been flawed (as yours currently is), in which case you come to the wrong general conclusion, build a system based on the wrong results, and that comes back to hurt you later on.
Now, to get to why your current test is flawed.
The first stage of the test would be opening the data for access. For a flat file, that means calling fopen or the open member function of an input stream. For a database, that's calling the appropriate connect function. You can benchmark these if you want, but the results only matter if your design performs that logic frequently. I.e., if it takes a couple of seconds to connect to a database, but you only need to connect once at startup, it's irrelevant compared to a flat file taking no time.
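For example, here's a minimal timing sketch for that first stage. It assumes SQLite as the database (the original test doesn't name one), and the file names are placeholders:

Code:
#include <chrono>
#include <cstdio>
#include <sqlite3.h> // assuming SQLite; swap in your database's connect call

int main()
{
    using clock = std::chrono::steady_clock;

    // Stage one for the flat file: open it.
    auto t0 = clock::now();
    FILE* file = std::fopen("data.txt", "rb"); // placeholder file name
    auto t1 = clock::now();

    // Stage one for the database: connect to it.
    sqlite3* db = nullptr;
    auto t2 = clock::now();
    sqlite3_open("data.db", &db); // placeholder database name
    auto t3 = clock::now();

    using us = std::chrono::microseconds;
    std::printf("fopen:        %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("sqlite3_open: %lld us\n", (long long)std::chrono::duration_cast<us>(t3 - t2).count());

    if (file) std::fclose(file);
    if (db) sqlite3_close(db);
}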
The second stage of the test would be loading the data to memory, since that's what you seem to want to be doing here. With a flat file, you just read the data into your linear array so it is ready for searching.
However, for a database, you incorrectly benchmarked a different task. Querying a database for data vs. searching a linear array for it has nothing to do with flat file access vs. database access; it has to do with in-memory access vs. database access. Obviously, with such a small set of data, searching memory is significantly faster than querying the database, simply by design. The correct thing to do would be to query the entire database table once and then load the results into an array.
As you can see, if you actually do that, there is nothing to benchmark really because the solution to the problem is the same for both methods. That is, you load data from one storage medium (flat files / database) into memory and then search it. The time it takes to open/connect and then load/query can be measured, but the searching itself will be the same.
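As a rough sketch of the load-once approach on the database side (again assuming SQLite, with a hypothetical entities table and an integer id column):

Code:
#include <algorithm>
#include <vector>
#include <sqlite3.h> // assuming SQLite again

// Query the whole (hypothetical) "entities" table once and load it into memory.
std::vector<int> load_ids(sqlite3* db)
{
    std::vector<int> ids;
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, "SELECT id FROM entities", -1, &stmt, nullptr) == SQLITE_OK)
    {
        while (sqlite3_step(stmt) == SQLITE_ROW)
            ids.push_back(sqlite3_column_int(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return ids;
}

// Every search after that is a plain in-memory search,
// identical to what you'd do after reading a flat file.
bool has_id(const std::vector<int>& ids, int id)
{
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}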
In the context of your original test, you can't really compare flat file searching to database searching because of how Windows' file caching works. You'd always get skewed results, since the file contents would always be in memory, so access times are a lot faster than if the cache had to be flushed and the file reloaded for each search. I came across that problem when writing one of my PK2 APIs and using memory-mapped files, for example. It greatly changed the results of my tests!
Another reason you can't accurately benchmark loading a flat file into memory and compare it to a database, generally speaking, is that you will hit physical memory limitations with flat files that you won't with a database. Let's say you had 10 GB of database data and 10 GB of flat file data. Unless you actually had a 64-bit system and a lot of RAM, you could not run your current benchmark of loading a flat file into memory and searching it linearly vs. querying a database. Even if you could, which do you think would win then? I'd be willing to put my money on the DB.
This is why you have to be more scientific in your benchmarks. You can't say "flat files are faster than a database" because that statement makes no sense. You can say, "given a small set of data, accessing data in memory is faster than querying a database each time", and that would be generally true according to your test results. However, just because you came to those results doesn't mean they are always true. The database's settings matter, as does the system running the tests.
Lastly, "flat files" generally describes free-standing files in a system. Text files are a specific type of flat file in which data is represented as, well, text. There are other formats you can use, and text files are not that efficient for the type of data you are working with. Binary files would be a better choice, since you process the data once, dump it to a file, and can then load it straight into memory without the additional parsing overhead text files carry. Note that in some cases there will be additional overhead from database type conversions as well! You have to factor that into a benchmark too, according to how the database access semantics you are using work.
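As a sketch of what a preprocessed binary format buys you (the Record struct and its fields are made up for illustration; dumping raw bytes like this is only safe for trivially copyable types with no pointers or dynamic members):

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record
{
    std::uint32_t id;
    float x, y, z; // made-up fields for illustration
};

// Preprocess once: dump the records as raw bytes.
void save_records(const char* path, const std::vector<Record>& records)
{
    FILE* f = std::fopen(path, "wb");
    if (!f) return;
    std::uint32_t count = (std::uint32_t)records.size();
    std::fwrite(&count, sizeof(count), 1, f);
    std::fwrite(records.data(), sizeof(Record), records.size(), f);
    std::fclose(f);
}

// At startup: read the bytes straight back into memory,
// no text parsing or type conversion needed.
std::vector<Record> load_records(const char* path)
{
    std::vector<Record> records;
    FILE* f = std::fopen(path, "rb");
    if (!f) return records;
    std::uint32_t count = 0;
    if (std::fread(&count, sizeof(count), 1, f) == 1)
    {
        records.resize(count);
        if (std::fread(records.data(), sizeof(Record), count, f) != count)
            records.clear();
    }
    std::fclose(f);
    return records;
}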
For your project, if all you need to do is load static data into memory and then search by a 'key', then preprocessed binary flat files are the way to go. They can be loaded from disk to memory the fastest and require no additional processing. You do not have to do a linear search on them, though. You can simply load them into a vector or a list and then store iterators in a map that maps each id to the object's iterator in the list/vector. That way, you get fast lookup times at only the small overhead cost of maintaining the map, a tradeoff well worth it to avoid performing a linear search each time.
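A minimal sketch of that lookup structure, reusing the made-up Record type from above (std::list is used here so the stored iterators stay valid; with std::vector you'd store indices instead, since vector iterators are invalidated on reallocation):

Code:
#include <cstdint>
#include <iterator>
#include <list>
#include <map>

struct Record
{
    std::uint32_t id;
    float x, y, z; // made-up fields for illustration
};

// Owning container plus an index for fast lookups by id.
struct RecordStore
{
    std::list<Record> records;
    std::map<std::uint32_t, std::list<Record>::iterator> by_id;

    void add(const Record& r)
    {
        records.push_back(r);
        by_id[r.id] = std::prev(records.end()); // list iterators stay valid
    }

    // O(log n) map lookup instead of an O(n) linear search.
    const Record* find(std::uint32_t id) const
    {
        auto it = by_id.find(id);
        return it == by_id.end() ? nullptr : &*it->second;
    }
};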
A database is only useful in your case if you wanted to search by asking "I want a list of entity ids that are N units away from point X,Y,Z" or "I want a list of entity ids that are between a height of A and B". If you have no need for such context-specific queries, the data is not going to change at run time, and you are working with a small set of data, there's no real benefit to using a database to load the data in this particular case.
Going back to what was said earlier about startup times: if all you do is load this data once, then the overhead of any method is really irrelevant. Loading it from the PK2, as was mentioned, is a viable solution as well. However, I'd prefer custom tools that process the PK2 into your own useful formats, simply because of the flexibility that grants you.
Anyways, keep these things in mind with anything you do. Don't focus solely on raw performance, as it is often meaningless in most contexts. People who get obsessed with performance in C++ end up costing themselves a lot of productivity, and their projects suffer as a result.
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” Get something working that you are comfortable with and profile later. Don't worry about having to recode something because of a design decision you made because it will happen regardless. That's just the way it works when you code first without a design.
Good luck with your project!