STL in Games, causing me headaches!

    This site uses cookies. By continuing to browse this site, you are agreeing to our Cookie Policy.

    • STL in Games, causing me headaches!

      Hey Guys,

      So I have been putting my engine to use on a game for school, it is very dense in actors, however the world is very small, it has a small area, which just happens to be packed full of actors.

      I noticed how my performance was dipping quite a bit and decided to run some profiling on my code, to my surprise, every single hotspot in my code was leading to std::map::find , some were using std::string as the key and some (like the actors) using unsigned integers.

      I have been trying to find things to use then map, but really I can't think of anything to compare strings that would be faster than map, I have tried unordered_map however the results aren't a whole lot different. I tried hashing my strings and using an integer compare, but the string hashing is just as slow as actually comparing.

      There are SOME things I can 'pre-hash' but there are some things that I need to compare using a string, for instance my shader system grabs GLSL uniform locations by string, but instead of having to do that twice it will store the location with a string index.

      This is the biggest bottleneck in my engine so far, and I am having trouble resolving it, everywhere else says 'just use unordered map'.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz
    • My guess is that your problem isn't actually with maps. How many elements are we talking about? Maps are balanced trees, so you're looking at O(logn) for look-up times. You could try to switch to a hash table (std::hash_map in most STL implementations). This gives you near O(1) look-up time for the typical use-case.

      Still, I think the problem is that you are likely accessing the data structure too often. You should look into a caching mechanism so you're not doing as many look ups. If it turns out that you do have too many objects (thousands), than you should use a different scheme. One common data structure is a quad tree (or octree in 3D) that allows you to do simple spatial partitioning. This can GREATLY speed up your ability to find objects.

      Another thing to watch out for is using strings as keys. Strings are not at all efficient so you really should come up with another scheme. Hashing the strings is one approach. I have a StringId class I use that basically has a static set of strings and uses a pointer to that string for comparisons. Thus, comparing two StringId's is very fast. If I have to have a map keyed by string, I try to use StringId's.

      If you can provide some sample code where you're hitting your major bottlenecks, that would also help quite a bit.

      -Rez
    • the question is, what is your usecases for the strings? are the objects (or actors) indexed by strings and everytime you want to access an actor you have to use a find-operation with the string as index?

      to fight that use (int) id's you get when you register an object for example in the world:
      class world {
      int registerObject(char *name);
      void setPosition(int objectId);
      };
      and then you can use this int to access the data. That has also the positiv side effect that you can change the underlying data structure from map to vector and get much faster access in the setPosition function and so on... (but you have to be careful when you want to remove an object because you can't erase the object, an easy fix would be to include an "isValid" or "isAlive" flag in the object and reuse them when you register new objects).

      when you need to use strings perhaps you also can use compile-time-hashed strings where the hash is calculated during compilation. here are some pointers for this idea: lol.zoy.org/blog/2011/12/20/cpp-constant-string-hash and altdevblogaday.com/2011/10/27/…pile-time-string-hashing/

      I've also written my own implementation based on the murmur3-hash, if there is interest I can share the source code...
    • For instance, in my ShaderProgram::GetParameter
      function, it searches to see if the parameter has already been retrieved from the shader, 6.3% of my program is spent on one line, almost all of my bottlenecks (aside from rendering) eventually lead to the 'find' method of maps, whether it is a map, unordered_map, or hash_map , they are all taking up a huge amount of my CPU time, every actor that spawns brings the frame rate down about 5-20 FPS.

      Source Code

      1. //name is std::string, there are 5 elements in the hash_map here
      2. auto it = m_ShaderParameters.find(name);


      Some of the lookups I do are purely integer keys, and they are still quite slow even with with small container sizes (about 5 elements). Running the Visual Studio Profiler says the time is actually spent in the find function itself.

      Here is my SVN source
      SVN

      The majority of my bottlenecks are in my actor class in
      ProfileBranch/src/Logic/Actor

      But another major one is at
      ProfileBranch/src/Application/Graphics/Shader.cpp
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz

      The post was edited 2 times, last by mholley519 ().

    • So the performance gradually building, and by the time it reaches a stand still it is at about 100 actors, I really don't have any way of optimizing this with partitioning as all of the actors are seen at once (puzzle game), the code is slowing right down however I am not even drawing anything besides physics debug drawing.

      [Edit]
      Yeah this has kind of made my game unplayable, I am trying to work around the bottlenecks with different container types, but to no avail it plumits, if anyone was free sometime to take a scan over my code with me I would really appreciate it, let me know if possible.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz

      The post was edited 1 time, last by mholley519 ().

    • the most obvious question: it's release mode, right?

      the next thing: in which function of the actor is the most time spent and how much? I've looked a bit through the code but aside from using strings everywhere I've not seen something obvious. have you checked that not via virtual functions or something like that the program is doing something completly different than you would think?
      can you temporary remove some actor-functions and look which functions costs you the most performance?
    • It is in Debug mode, I use breakpoints and the debugging info alot, it speeds up in release mode but still not alot.

      I already know my performance is going to the map searching, so I ended up doing delayed updates on things that didn't need to be updated every frame, however, even though my performance skyrockets with this, it is still only as strong as the weakest link, in this case that link is now Box2D physics step, I am wondering whether running it in a seperate thread would solve this problem.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz
    • Uargs ;) do NOT profile your code in debug-mode, the stl-container do a lot of additional stuff in debug mode. for this type of analysis I normally use a "release mode with debug information", normally you have to copy the release-build information and switch in the compiler- and additional in the linker-options the debug infos to on. with that you should get a much better picture where the time is spent... and you can still use the debugging stuff you're used to (but the programm pointer in single step mode jumps around because the compiler normally reorder the instructions).

      after that try to replace the strings as much as possible with int-id's and use them....
    • Almost everything is using int id's, Actors use int Id's, components use int id's, they do however have additional string tags that are occasionally used.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz
    • I think it's really about how many times you're calling find() in a frame. If you're searching through your actor tree 10,000 per actor per frame, that'll definitely cause a slow-down and cause the profiler to tell you that find() is slow. It's not that the function is slow, it's that you're calling it way too much.

      I agree with questor that you should create a new build target for profiling to get more correct information.

      Another thing to consider is that you're approaching the limits of the singular actor/component model. The system I built for GCC is really good for the typical use-case of having several dozen relatively complex objects, but it doesn't work well for having hundreds or thousands of relatively simple objects. Still, for a simple game, having 100 actors isn't unreasonable. If you had 1000 actors, I'd question the use of the actor system.

      I'll see if I can sync to the code in the next few days and do some basic analysis. Is there anything I need to know about the build environment, or should I be able to just sync it down, build, and run?

      -Rez
    • When you go to the Assembla SVN page there is a tab named files, grab the 3rd party libs, and extract it alongside the trunk and bin folders. It uses Visual Studio 2010 however the 'Performance' branch uses 2012, I would however look at the trunk as I have been experimenting in performance.

      I really appreciate you taking a look, I actually did get it running faster, but I had to end up only updating actors on only so many seconds, I think I had it set at 1/30th of a second. This did give a huge increase in speed, I was getting 130,000 frames per second lol, then as I added more actors, the physics engine seemed to bottleneck things.

      I decided the only way to get around that was running the Physics in it's own thread, but as I have never done threading, I have run into a ton of issues with it. I will definitely plan for that next time around.

      It would still be great to hear your opinion on how to speed it up though.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz
    • So I have been looking around, and also reading in a book I have, and it is talking about how the most prominent method of multi-programming in games is not ideal, he's talking about the splitting of sub-systems into there own thread, and instead the best method is to divide the work load of a single game loop into jobs, for instance if you are currently updating actors, you would divide your actor list between several threads and have each thread update actors.

      I have attempted to do this as I have had lots of trouble integrating the other method into my system, I am unsure as to what I am doing wrong.

      Source Code

      1. //Copy the list to be used as a queue of actors
      2. list<CActor*> actorCopy = m_ActorList;
      3. //Create a semaphore so that not more than one thread is accessing the copied list at one time
      4. m_pSemaphore = SDL_CreateSemaphore(1);
      5. //Create the threads, passing the copied list as the argument
      6. SDL_Thread* pThreads[4];
      7. pThreads[0] = SDL_CreateThread(ActorUpdateThread, &actorCopy);
      8. pThreads[1] = SDL_CreateThread(ActorUpdateThread, &actorCopy);
      9. pThreads[2] = SDL_CreateThread(ActorUpdateThread, &actorCopy);
      10. pThreads[3] = SDL_CreateThread(ActorUpdateThread, &actorCopy);
      11. //A blank return which I don't use
      12. int ret;
      13. //Wait for the threads to finish processing the actors
      14. SDL_WaitThread(pThreads[0], &ret);
      15. SDL_WaitThread(pThreads[1], &ret);
      16. SDL_WaitThread(pThreads[2], &ret);
      17. SDL_WaitThread(pThreads[3], &ret);
      18. //Destroy the semaphore
      19. SDL_DestroySemaphore(m_pSemaphore);
      Display All


      This is in my game logic update, I used to just simply update my actors, it may not be the best thing to use multithreading on but I am trying to start out slow. I seem to have everything right here (I Think, if I understand threads right).

      Source Code

      1. int ActorUpdateThread(void* pData)
      2. {
      3. //Cast the void pointer
      4. list<CActor*> * pActors = (list<CActor*> *) pData;
      5. //retrieve the delta time
      6. double deltaT = g_pApp->m_pTimer->GetDeltaTime();
      7. //run until further notice
      8. while(1)
      9. {
      10. //If the list is empty this thread should quit
      11. if(pActors->empty())
      12. {
      13. return 0;
      14. }
      15. //Otherwise lock the semaphore, retrieve the front actor, then pop the front, the next thread should grab the next actor for processing
      16. SDL_SemWait(g_pLogic->m_pSemaphore);
      17. CActor* pActor = pActors->front();
      18. pActors->pop_front();
      19. SDL_SemPost(g_pLogic->m_pSemaphore);
      20. //If there was an actor, update it
      21. if(pActor)
      22. pActor->Update(deltaT);
      23. }
      24. return 0;
      25. }
      Display All


      I am however getting list iterator not deferenceable errors, I am doing all my proper error checking, if the list was empty then it would have returned, otherwise dereferencing the front of the list should be fine, then popping it as well.
      Is there some kind of issue I am not seeing where one thread is popping right before another is accessing? I thought thats what the semaphore was supposed to block against.

      *Edit*
      Ok so threading is really annoying, maybe I will try again my next engine, once I finish this game.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz

      The post was edited 1 time, last by mholley519 ().

    • Hey mholley,

      If you've never worked with threads before, I would be very wary of trying to immediately make your engine multi-threaded. For the scope of your project, it sounds like you should be able to perform just fine as a single-threaded engine. Multi-threading opens up a lot of potential issues like race conditions, dead locks, and starvation.

      As I asked earlier, if the profiler telling you that your individual calls to std::map::find are taking a lot time, or is it telling you that in total, you're making a lot of calls to there which add up to a big chunk of time?

      James
    • Yeah that is the problem, the actors could potentially add up to about, hmm, maybe between 500-1000 (haven't counted completely), I ended up delaying my calls to update the actor which really brought my fps up (130,000 FPS), however as soon as the physics system had to do any work it choked it down to less then 60 with alot of actors.

      I am really at a loss as to where to go from here, I have explored delaying the physics, which gives me really wacky results, threading things (not working), and other small things, but nothing is helping, I guess I really need someone who has more experience to take a look, I have never got this far with a game so really I am racking my brain on this part, it's not even an option to try a simpler game, as it is for my production class assignment. Rez said he would take a look if he could in the next few days, hopefully he can give me some tips on how to better improve on this system.
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz
    • Originally posted by mholley519
      Rez said he would take a look if he could in the next few days, hopefully he can give me some tips on how to better improve on this system.

      I'll try and get to it this weekend if I can, but I can't make any promises. If not, it'll have to be sometime next week.

      -Rez
    • I've taken a quick look at your code and compiled it on my machine and got during startup this:
      shader.cpp - 83 : ERROR: 0:3: 'location' : syntax error parse error
      shader.cpp - 83 : ERROR: 0:4: 'location' : syntax error parse error

      but I could see something happening on the screen, so I think the errors aren't as important :)

      first question: how can I reproduce the situation with many actors, simply wait some time?

      and you've written that the update of the actors is the slow thing, have you tried to find which component takes so much time to update? (from a first look I've seen there are transform-, luascript-, spriterender-, spriteanimation and physics components)
    • next finding: you are accessing your components by strings (for example in gamemanager.cpp line 65) which is really slow.. idea for optimisation: if you know that you have to access some special components every frame build a list with pointers to these components and use them directly... (as example in gamemanager make something like that:

      Source Code

      1. typedef struct {
      2. TransformComponentPtr transform;
      3. PhysicsComponentPtr physic;
      4. } UpdateInfo;
      5. std::vector<UpdateInfo> mUpdateInfos;
      6. // now you insert for every actor the needed infos
      7. // and save them in an own list in the game manager
      8. // but this is only important if this is really your bottleneck which is not clear for me yet..
      9. // it will not help if only 0.5% cpu time is spent in this function

      )
    • In the performance branch I am accessing the components ID by it's integer, I haven't updated the trunk yet, also the error you get there is with the font rendering shader, which I will commit now (didn't get uploaded). You basically have to wait about 2 minutes for
      a ton of actors.

      Whenever you get a chance Rez, I appreciate the help
      PC - Custom Built
      CPU: 3rd Gen. Intel i7 3770 3.4Ghz
      GPU: ATI Radeon HD 7959 3GB
      RAM: 16GB

      Laptop - Alienware M17x
      CPU: 3rd Gen. Intel i7 - Ivy Bridge
      GPU: NVIDIA GeForce GTX 660M - 2GB GDDR5
      RAM: 8GB Dual Channel DDR3 @ 1600mhz

      The post was edited 1 time, last by mholley519 ().

    • I've tested your release-build with verysleepy and found no performance problems so far. I've had the same framerate, independently how many actors were on screen. but in my build there is no font visible?!

      and in the task manager the release-build is taking only 10% cpu time...

      [edit] I've the impression that the font is loaded every frame and that is clearly visible in the fps... it's the VGetRawResourceSize that is called too often and is very expensive! I've seen that it's also used for example in the spriterendernode.cpp in the vrender function or in the meshrendernode stuff. when called for every used resource during rendering it's really a cpu-burner!

      The post was edited 2 times, last by questor/fused ().