[Ns-developers] Multithreading code status update

Guillaume Seguin guillaume.seguin at ens.fr
Fri Aug 14 06:42:03 PDT 2009


During the past few weeks I've been working on a new version of the
multithreaded simulator implementation.

Summary
=======
This new implementation is no more based on Chandy, Misra, Bryant algorithm bus
uses synchronization barriers to perform critical operations, which mostly are
computing a global maximum date up to which events can be computed. This way no
messages are used, so that the algorithm doesn't lose all its time processing
messages while doing minimal work.

Where
=====
I uploaded the new version to

http://code.nsnam.org/guillaume/ns-3-multithreading/

and moved the previous version to

http://code.nsnam.org/guillaume/ns-3-multithreading-old/

Details
=======
The current multithreaded simulator code can currently two implementations of
simulation barriers (POSIX (pthread) ones and spinlock ones, implemented by
Mathieu Lacage), and splits the partitions into two categories :
- dedicated partitions, which are specific to a given thread
- shared partitions, which are stored in one or more global list, each list
  being processed by several threads

Dedicated partitions are used to reduce contention, while shared partitions are
used to balance threads run time at each iteration : threads which will have
processed their dedicated partitions quickly will pick up shared partitions, so
that they don't waste too much time waiting for slower threads.

Two implementations of the shared partitions lists are available, a lock-based
one and a lockless one, though I highly disadvice you from using the lock-based
one since it has proved being really slow.

Making your script use the multithreaded simulator
==================================================

This is actually quite easy.
You first have to enable multithreading right after processing command line
arguments with :

  MultiThreadingHelper multiThreadingHelper;
  multiThreadingHelper.Enable ();

Then, right before Simulator::Run (), add :

  multiThreadingHelper.Install ();

And that's it !

For reference, see examples/multithreaded-udp-echo.cc

Parameters
==========

Four specific runtime parameters are available :
- ns3::MultiThreadedSimulatorImpl::ThreadsCount, which specifies the number of
  computation threads
- ns3::MultiThreadedSimulatorImpl::BarrierType, which specifies the type of
  barriers you want to use : either "posix" (for pthread barriers) or "spin"
  (for global spinlock based barriers) or the default, "spin-tree" (for tree
  framed spinlock based barriers)
- ns3::MultiThreadedSimulatorImpl::ThreadDedicatedPercent, which specifies the
  percentage of thread dedicated partitions
- ns3::MultiThreadedSimulatorImpl::SharedPartitionsSets, which specifies the
  number of shared partitions lists, which are equally distributed among
  threads
Example :

./your-script --ns3::MultiThreadedSimulatorImpl::ThreadsCount=8 \
              --ns3::MultiThreadedSimulatorImpl::BarrierType=posix \
              --ns3::MultiThreadedSimulatorImpl::ThreadDedicatedPercent=60 \
              --ns3::MultiThreadedSimulatorImpl::SharedPartitionsSets=4

This makes the simulator use 8 threads, synchronizing using POSIX barriers,
with 60% of dedicated partitions and 4 shared partitions sets.
If there are 100 partitions, each thread will have :
- 7 or 8 dedicated partitions (60% of 100 divided by 8)
- access to a shared partitions set which will hold 10 partitions (40% of 100
  divided by 4) and which will be shared by 2 threads (8 / 4)

In addition, one compile time parameters is available, defined in
src/simulator/multithreaded-simulator-impl.h :
- LOCKLESS_SHARED_PARTITIONS : 0 means using lock-based shared partitions sets,
  1 means using lockless shared partitions sets

Notes
=====

As a side note, with this new version, if you specify a ThreadsCount value of 8,
you'll have exactly 8 threads created, while in the previous version you had 9 ;
the thread which used to be waiting for other threads is now doing actual work
before waiting others.

Please also note that Wifi networks still aren't supported in this version.

PowerPC architecture isn't supported yet either, mostly because of thread
specific storage and barrier implementations issues. The PowerPC code included
for atomic and high precision clock operations hasn't been tested yet (because
of the aforementioned issues).

Performances
============

We've been benchmarking the multithreaded simulator using the nms-p2p (and its
multithreaded clone multithreaded-nms-p2p) example.

Running the nms-p2p with an up to date ns-3-dev (as of this morning) we get :
Simulator init time: 11.3674
Simulator run time: 11.4909
Total elapsed time: 22.8583

Running nms-p2p on the current multithreading repo (but without multithreading)
we get :
Simulator init time: 24.9997
Simulator run time: 15.0179
Total elapsed time: 40.0177

Running multithreaded-nms-p2p (default parameters, not optimized yet) we get :
Simulator init time: 25.0096
Simulator run time: 14.0174
Total elapsed time: 39.027

So as you can see, there is *some* gain from the multithreaded code compared to
the non-multithreaded one. Sadly, the thread-safety bits introduced in the
codebase also lead to a huge performance decrease compared to ns-3-dev current
state.
Better performances were achieved at some point, which were around the
non-multithreaded code performances (for run time only) and I'm aiming to reach
this point again soon.

Sadly, it appears that this example has some bad bottlenecks (specifically some
PointToPointNetDevice::Receive events taking a loooooot of time to run because
of the shape of the network) and is more a nice routing benchmark than a
multithreading one, so that performance gains can't be as awesome as we would
want them to be.

Call for examples
=================

Consequently, if you have any real-world example that you'd like to run using
the multithreaded simulator, just send it to me and I'll take care of it
(private communications will remain private, though it'd be great to include the
examples in the ns-3 distribution for future works).

Further work
============

Well, further work includes mostly finding the best parameters and finding new
examples, either from the litterature or from YOU *wink* and using them to
benchmark the simulator. This will allow me to try more tricky optimizations
and heuristics to balance threads or optimize memory details.
Doing better memory (and more specifically L1/L2) caches benchmarks and
improvements is also a path to follow.
Last, a good solution to support Wifi networks has to be found.

Feel free to send any question you might have about this implementation :)

Best regards,
Guillaume Seguin


More information about the Ns-developers mailing list