[Ns-developers] GSOC - Parallel Simulations | First ideas

George Riley riley at ece.gatech.edu
Thu Mar 27 07:24:17 PDT 2008


Here are my comments on the parallelization design.
(Interspersed below).

--------------------------------------------------
George Riley
Associate Professor
Georgia Tech
Electrical and Computer Engineering
riley at ece.gatech.edu
404-894-4767
Klaus Advanced Computing Building
Room 3360
266 Ferst Drive
Atlanta, Georgia 30332-0765

ECE4110 Web page:
http://users.ece.gatech.edu/~riley/ece4110/

ECE4112 Web page:
http://users.ece.gatech.edu/~riley/ece4112/
>
> o Investigate and compare current parallelization techniques:
>
>  - Threads (namely POSIX threads)
Threads can be a good approach for tightly-coupled environments as you  
mention.
We have to be careful to avoid the need for "interlocking" (mutex  
protection) of
state variables however.  Are you imagining a single event list shared  
by
all threads, or multiple lists (one per thread).  How will you  
determine which
node object are "assigned" to each thread?  Or will this be dynamic?

>
>
>    o Enables the possibility to utilize CMP/SMP cores
>      as well as other "physically distributed systems" nodes  
> (cluster).  Of
>      course, threads by them self are bounded to a CPU, but an  
> abstract layer
>      based on thread logic can enable this at a later increment (see  
> the next
>      paragraphs for an in deep description)/ Thats a big advantage,  
> compared
>      to TBB and other micro level optimizations.
>
>    o Fits more natural into the concept of simulator  
> parallelization. Cause
>      many processes can be considered as standalone (dependencies are
>      considered separately at the and of this document).
>
>    o Figure out how MPI (message passing interface) can be utilized  
> and
>      combined with the treading approach. Maybe as an start point  
> for data
>      distribution. These are the first shots, _could_ be changed in  
> the design
>      phase ... :)
MPI is the clear winner for distributed (across a network of  
workstations), but
am not sure it's best for a multi-threaded approach.  However there is  
an advantage
towards a single approach for both (tightly-coupled and distributed).

>
>
>    o POSIX threads are well known considered and applied in a wide  
> area -
>      community support is backed, a not to undervalued aspect!
>
>  - TBB/OpenMP
>
>    o Limited to CMP/SMP Systems
>    o Simpler approach (easier to implement).
>    o "Micro approach", where threads on the other hand, reflect more  
> an
>      "overall approach" and permits fine grained parallelization with
>      dedicated locking primitives (reader/writer locks).  On the  
> other hand
>      consider TBB the whole structure as one major blob with  
> additional
>      concurrency. IMHO an more suitable approach for our field of
>      application.
I would prefer we don't design something requiring the locking  
primitives.
THis is much too invasive, in the extreme requiring every state variable
access to be locked.

>
>
>
>    There are more issues with the underlying technique like platform
>    support, number of users (expert knowledge), future outlook, et  
> cetera,
>    But I incline to POSIX threads - they seem to deliever the highest
>    potential!
>
> o Proposed Architecture (still first shots, of course ;):
>
>  Add an additional parallelization abstraction layer (with well  
> defined
>  interfaces). This enables several possibilities, namely to enable/ 
> disable
>  the whole parallelization via a compile time switch ("mark" this  
> feature
>  as experimental in the beginning), make then replaceable and enable  
> the
>  possibility to extend the whole subsystem with additional  
> distribution
>  (cluster functionality) without major code subsitution in the ns-3  
> core.
>
>  A clean but powerful parallelization layer is the goal to split
>  implementation issues from the core ns3 logic! At least the newly  
> added
>  functionality should _interference_ the common functionality as  
> less as
>  possible to prevent "simulation result anomalies".
Ideally, someone using ns3 for simple serial simulations can ignore  
completely
everything related to distributed sim; since the distributed sims will  
in practice
comprise a small fraction of use cases, we don't want this to be  
intrusive.

>
>
>
> o In the first part an profiling analysis (code coverage test,  
> oprofile,   et cetera,.)
>  to detect CPU hogs to understand where processing power is  
> consumed. Use this
>  information to find approaches to split workload into several pieces.
>  Currently my knowlendge is limited in this sector an may help me  
> where are
>  the major CPU hogs.
>
>  - Spot causal dependencies within the model like global variables.  
> Dig into
>    major components (like wireless node dependencies, like radio
>    interference, et cetera,). Categorize these into several groups  
> to spot out
>    the parallelization possibilities.
>
>  - The parallelization trade-off should be determined to predicate the
>    overall newly introduced overhead.
>
>  - Study already published literature within the sector of  
> parallelization of
>    simulators. (thats my everyday job currently - I quite an novice  
> in the
>    field of simulator parallelization and there are an quite a lot of
>    publications out there, if you also consider some "border- 
> literature".
>
>  - Dig into the implementation details of other distribution system  
> with
>    a similar algorithm scheme - don't reinvent the wheel a second  
> time.
Definitely agree with not re-inventing the wheel; there are a couple  
of existing
distributed simulators (ns2, pdns, GTNetS) that we should leverage.

>
>
> o Data Dependencies and CPU Characteristics.
>
>  A major challenge are inter-thread dependencies. Dependencies here  
> denotes
>  to data dependencies. To detect these dependencies the design phase  
> are a
>  major part, because they affect important performance critical  
> issues. Which
>  have impact for the overall performance and prevent unnecessary,
>  uncoordinated synchronization between the newly invented threads.
>
>  Furthermore CPU cache line trashing and CPU design should also be
>  considered. There are a bunch of parallelization approaches  
> (implemented
>  with BTT for example) who reduce the overall performance compared to
>  "unoptimized" code. The algorithm should consider therefore also CPU
>  architecture issues into context!
Getting into the details of cache line behavior is, as you likely  
know, very
complex and platform dependent.  If we optimize for one processor, we
might end up sub-optimal (or even poor) for another.  I would prefer
we focus on the design and implementation of the distributed simulation
independently of this. We can of course look in to cache optimization at
a later time.
>
>
>
> o Procedure/Agenda
>
>  1. Dig into the source
>  2. Analyze possible parallelization areas (task 1 and 2 can be done
>     synchronously ;)
>  3. Study the literature and other applications with similar behavior
>  4. Start Programming (and documentation ;)
>  5. Build adequate unit test cases to verify simulator results (is  
> is almost
>     always a good feeling that you are backed up through strong unit  
> tests! ;-)
>
>
>
> This is GSOC proposal, most likely not that stable (more alpha like  
> ns3) but
> an attempt! ;)
Here are a couple of more points.

We need to decide on how to do "time management".  As you should know,
any conservative distributed simulation needs some way to determine if
a given event is "safe" to process.  There are a number of approaches  
to his.
See "Fujimoto: Parallel and Distributed Simulation Systems".

We need some way (perhaps even manual by the ns3 user) to map the
topology (all nodes/interfaces/links, etc) to individual processors/ 
systems.
How this is done has a tremendous effect on overall performance.  
People have
looked at various graph partitioning algorithms but there is no "one  
size fits
all solution".  At a minimum we need a ns3-user specified mapping,  
either
with an external file, or in the C++ code itself.

You can safely assume that the only simulation data flowing from one
processor to another is a packet (except for overhead messages related
to time synchronization).  We need to pay particular attention the  
packet
serialization (converting a packet class object to a serial stream of  
bytes)
and reconstruction at the receiving end.  THe packet payload data is  
essentially
serialized for free, but the meta-data and packets tags I'm not sure  
about.
This will need to be handled correctly.

The notion of "lookahead" is well known and understood within
the distributed simulation community.  We need a way to specify the
lookahead value (automatically preferably) and to utilize this in the
time synchronization protocol.

George



>
>
>
> Best regards, HGN
>
>
> -- 
> Hagen Paul Pfeifer <hagen at jauu.net>  ||  http://jauu.net/
> Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
> Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
> Always in motion, the future is.



More information about the Ns-developers mailing list