[Ns-developers] GSOC - Parallel Simulations | First ideas
Hagen Paul Pfeifer
hagen at jauu.net
Mon Mar 24 17:20:14 PDT 2008
Hello NS3 Dev,
here is my _open_ approach to the area of parallel simulations:
o Investigate and compare current parallelization techniques:
  - Threads (namely POSIX threads)
    o Make it possible to utilize CMP/SMP cores and, later, the nodes of
      other "physically distributed systems" (clusters). Of course, threads
      by themselves are bound to the CPUs of a single machine, but an
      abstraction layer based on thread logic can enable distribution in a
      later increment (see the following paragraphs for an in-depth
      description). That's a big advantage compared to TBB and other
      micro-level optimizations.
    o Fit more naturally into the concept of simulator parallelization,
      because many processes can be considered standalone (dependencies are
      considered separately at the end of this document).
    o Figure out how MPI (Message Passing Interface) can be utilized and
      combined with the threading approach, maybe as a starting point for
      data distribution. These are first shots and _could_ change in the
      design phase ... :)
    o POSIX threads are well known and applied in a wide area, so community
      support is assured: an aspect not to be undervalued!
  - TBB/OpenMP
    o Limited to CMP/SMP systems.
    o Simpler approach (easier to implement).
    o "Micro approach": TBB treats the whole structure as one major blob
      with additional concurrency. Threads, on the other hand, reflect more
      of an "overall approach" and permit fine-grained parallelization with
      dedicated locking primitives (reader/writer locks). IMHO the thread
      approach is more suitable for our field of application.
  There are more issues with the underlying technique, like platform
  support, number of users (expert knowledge), future outlook, et cetera.
  But I lean towards POSIX threads - they seem to deliver the highest
  potential!
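The reader/writer locking mentioned above could look roughly like the following minimal sketch. It assumes a hypothetical shared table that is read often and updated rarely; `SharedTable`, `WriterMain` and `RunWriters` are illustrative names, not ns-3 types, and only the `pthread_*` calls are real POSIX API.

```cpp
#include <pthread.h>
#include <cassert>
#include <vector>

// Hypothetical shared state (not an ns-3 type): read often, updated rarely.
struct SharedTable {
    pthread_rwlock_t lock;
    std::vector<int> entries;

    SharedTable() { pthread_rwlock_init(&lock, nullptr); }
    ~SharedTable() { pthread_rwlock_destroy(&lock); }
    SharedTable(const SharedTable&) = delete;
    SharedTable& operator=(const SharedTable&) = delete;

    // Writers take the lock exclusively.
    void Add(int value) {
        pthread_rwlock_wrlock(&lock);
        entries.push_back(value);
        pthread_rwlock_unlock(&lock);
    }

    // Readers may hold the lock at the same time as other readers.
    long Sum() {
        pthread_rwlock_rdlock(&lock);
        long total = 0;
        for (int v : entries) total += v;
        pthread_rwlock_unlock(&lock);
        return total;
    }
};

struct WriterArgs {
    SharedTable* table;
    int count;
};

// Thread entry point: appends `count` ones to the table.
static void* WriterMain(void* p) {
    WriterArgs* args = static_cast<WriterArgs*>(p);
    for (int i = 0; i < args->count; ++i) args->table->Add(1);
    return nullptr;
}

// Starts `writers` threads, joins them all, and returns the final sum.
long RunWriters(int writers, int per_writer) {
    SharedTable table;
    std::vector<pthread_t> threads(writers);
    WriterArgs args{&table, per_writer};
    for (pthread_t& t : threads) {
        int rc = pthread_create(&t, nullptr, WriterMain, &args);
        assert(rc == 0);
        (void)rc;
    }
    for (pthread_t& t : threads) pthread_join(t, nullptr);
    return table.Sum();
}
```

Calling `RunWriters(4, 1000)` starts four writer threads. Whether a reader/writer lock pays off compared to a plain mutex depends on the read/write ratio, which would have to be measured.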
o Proposed Architecture (still first shots, of course ;):
  Add an additional parallelization abstraction layer (with well-defined
  interfaces). This opens several possibilities: enable/disable the whole
  parallelization via a compile-time switch ("mark" this feature as
  experimental in the beginning), make the implementation replaceable, and
  extend the whole subsystem with additional distribution (cluster
  functionality) without major code substitution in the ns-3 core.
  A clean but powerful parallelization layer is the goal, to split
  implementation issues from the core ns-3 logic! At the least, the newly
  added functionality should _interfere_ with the common functionality as
  little as possible to prevent "simulation result anomalies".
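A minimal sketch of such a layer, under loud assumptions: the `Executor` interface, the two backends and the `SIM_ENABLE_PARALLEL` macro are hypothetical names chosen for illustration, not existing ns-3 API. `std::thread` stands in for POSIX threads to keep the sketch short, and one thread per task is deliberately naive.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Task = std::function<void()>;

// Hypothetical interface: the simulator core would only ever see this.
class Executor {
public:
    virtual ~Executor() = default;
    // Runs every task and returns only after all of them have completed.
    virtual void RunAll(const std::vector<Task>& tasks) = 0;
};

// Backend 1: plain sequential execution (the behaviour without the layer).
class SerialExecutor : public Executor {
public:
    void RunAll(const std::vector<Task>& tasks) override {
        for (const Task& task : tasks) task();
    }
};

// Backend 2: one thread per task, joined before returning.
class ThreadedExecutor : public Executor {
public:
    void RunAll(const std::vector<Task>& tasks) override {
        std::vector<std::thread> workers;
        workers.reserve(tasks.size());
        for (const Task& task : tasks) workers.emplace_back(task);
        for (std::thread& w : workers) w.join();
    }
};

// Compile-time switch (hypothetical macro name).
#ifdef SIM_ENABLE_PARALLEL
using DefaultExecutor = ThreadedExecutor;
#else
using DefaultExecutor = SerialExecutor;
#endif

// Example client code: it does not know which backend is in use.
// Every task writes a distinct element, so no locking is needed here.
std::vector<int> SquareAll(Executor& exec, const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::vector<Task> tasks;
    for (std::size_t i = 0; i < in.size(); ++i) {
        tasks.push_back([&in, &out, i] { out[i] = in[i] * in[i]; });
    }
    exec.RunAll(tasks);
    return out;
}
```

Because both backends sit behind the same interface, the serial result can serve as the reference when checking the threaded one for "simulation result anomalies", and a cluster backend could be added later without touching the client code.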
o In the first part, a profiling analysis (code coverage test, oprofile, et
  cetera) to detect CPU hogs and understand where processing power is
  consumed. Use this information to find approaches to split the workload
  into several pieces. Currently my knowledge in this sector is limited, and
  you may help me find where the major CPU hogs are.
  - Spot causal dependencies within the model, like global variables. Dig
    into major components (like wireless node dependencies, radio
    interference, et cetera). Categorize these into several groups to spot
    the parallelization possibilities.
  - The parallelization trade-off should be determined to predict the
    overall newly introduced overhead.
  - Study already published literature in the field of simulator
    parallelization. (That's my everyday job currently - I am quite a novice
    in the field of simulator parallelization and there are quite a lot of
    publications out there, if you also consider some "border literature".)
  - Dig into the implementation details of other distributed systems with
    a similar algorithm scheme - don't reinvent the wheel a second time.
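One way to put a first number on the parallelization trade-off is Amdahl's law with an added overhead term. This is a swapped-in textbook model, not something taken from the proposal, and the linear per-thread overhead is purely an assumption; real values would have to come from the profiling step.

```cpp
#include <cassert>

// Estimated speedup over the serial run.
//   p        : fraction of the serial runtime that can run in parallel (0..1)
//   n        : number of threads (>= 1)
//   overhead : assumed extra cost per additional thread, expressed as a
//              fraction of the serial runtime (synchronization, scheduling)
double EstimatedSpeedup(double p, int n, double overhead) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1 && overhead >= 0.0);
    double parallel_time = (1.0 - p) + p / n + overhead * (n - 1);
    return 1.0 / parallel_time;
}

// Thread count in [1, max_n] with the highest estimated speedup.
int BestThreadCount(double p, int max_n, double overhead) {
    int best_n = 1;
    double best = EstimatedSpeedup(p, 1, overhead);
    for (int n = 2; n <= max_n; ++n) {
        double s = EstimatedSpeedup(p, n, overhead);
        if (s > best) {
            best = s;
            best_n = n;
        }
    }
    return best_n;
}
```

With 90% parallelizable work and an assumed overhead of 5% per extra thread, the model peaks at four threads and gets worse beyond that, which is the kind of result that would argue against spawning one thread per simulated node.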
o Data Dependencies and CPU Characteristics.
  A major challenge is inter-thread dependencies. Dependencies here denotes
  data dependencies. Detecting these dependencies is a major part of the
  design phase, because they affect performance-critical issues: they have
  an impact on the overall performance, and knowing them prevents
  unnecessary, uncoordinated synchronization between the newly introduced
  threads.
  Furthermore, CPU cache line thrashing and CPU design should also be
  considered. There are a bunch of parallelization approaches (implemented
  with TBB, for example) that reduce the overall performance compared to
  "unoptimized" code. The algorithm should therefore also take CPU
  architecture issues into account!
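Cache line thrashing can occur even without a logical data dependency, when per-thread data happens to share a cache line (false sharing). A minimal sketch of the usual countermeasure, padding each thread's data to its own line, follows. The 64-byte line size is an assumption (common on x86-64, not universal), and `PaddedCounter` and `CountInParallel` are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache line size in bytes.
constexpr std::size_t kAssumedCacheLine = 64;
constexpr int kWorkers = 4;

// alignas pads and aligns the struct so that each counter occupies its own
// cache line; neighbouring counters then never share a line.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should fill exactly one assumed cache line");

// Every worker increments only its own counter, so there is no data
// dependency between the threads; the totals are read after joining.
long CountInParallel(long increments_per_worker) {
    std::array<PaddedCounter, kWorkers> counters{};
    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i) {
        workers.emplace_back([&counters, i, increments_per_worker] {
            for (long k = 0; k < increments_per_worker; ++k) {
                ++counters[i].value;
            }
        });
    }
    for (std::thread& w : workers) w.join();

    long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

The result is the same with or without the padding; only the memory traffic differs. How much that matters on a given CPU has to be measured, for example by comparing against an unpadded `long` array.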
o Procedure/Agenda
  1. Dig into the source
  2. Analyze possible parallelization areas (tasks 1 and 2 can be done
     in parallel ;)
  3. Study the literature and other applications with similar behavior
  4. Start programming (and documentation ;)
  5. Build adequate unit test cases to verify simulator results (it is almost
     always a good feeling to be backed up by strong unit tests! ;-)
This is a GSOC proposal, most likely not that stable (more alpha, like ns3) but
an attempt! ;)
Best regards, HGN
--
Hagen Paul Pfeifer <hagen at jauu.net> || http://jauu.net/
Telephone: +49 174 5455209 || Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
Always in motion, the future is.