Re: ideas

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Mon, 12 Jun 2000 22:53:37 +0200

Andres Kroonmaa wrote:

> I wouldn't hope that threads could add anything to reliability.
> Not to argue with you, but trying to be eagerly helpful ;)

Haven't claimed it would. Threads for SMP support is Adrian's idea. I
doubt it will work on many of the target platforms of Squid (well, it
will work on the same platforms where async-io currently works, but I
have doubts about the *BSD family).

> There are many different types of "threads", and Squid's current design
> is definitely and clearly "threaded". Each request is handled from start
> to end by some thread, as the codepath between two setSelect() calls
> is equivalent to a thread. All requests are given their own thread, and
> each thread is suspended in the comm_select loop while waiting for I/O.
> Basically, this is where threads are scheduled to get the CPU time.
> All code is reused as with reentrant threads, and the only data that
> identifies a thread is the request structure (or storeEntry). Squid-style
> code is just never called threaded, to avoid confusion with
> system-supported thread implementations.

From a coding perspective Squid is better described as a large engine of
state machines than as threads. But yes, some of the flows can be viewed
as threads.
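
To make the "engine of state machines" picture concrete, here is a toy
version of the pattern (names made up for illustration, not the actual
setSelect()/comm_select() code):

#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

typedef void handler_t(int fd, void *data);

static handler_t *read_handler[FD_SETSIZE];
static void *read_data[FD_SETSIZE];

static void set_read_handler(int fd, handler_t *h, void *data)
{
    read_handler[fd] = h;
    read_data[fd] = data;
}

/* One "thread of control": runs from here until it registers the next
 * handler and returns to the loop. */
static void client_read(int fd, void *data)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) {
        printf("got %d bytes\n", (int) n);
        set_read_handler(fd, client_read, data);  /* suspend until next event */
    }
}

static void event_loop(void)
{
    for (;;) {
        fd_set rfds;
        int fd, maxfd = -1;
        FD_ZERO(&rfds);
        for (fd = 0; fd < FD_SETSIZE; fd++)
            if (read_handler[fd]) {
                FD_SET(fd, &rfds);
                if (fd > maxfd)
                    maxfd = fd;
            }
        if (maxfd < 0 || select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
            break;
        for (fd = 0; fd <= maxfd; fd++)           /* ascending fd order */
            if (FD_ISSET(fd, &rfds) && read_handler[fd]) {
                handler_t *h = read_handler[fd];
                read_handler[fd] = NULL;          /* one-shot registration */
                h(fd, read_data[fd]);
            }
    }
}

int main(void)
{
    set_read_handler(0, client_read, NULL);       /* watch stdin as a demo */
    event_loop();
    return 0;
}

Each handler runs to completion, registers its successor and returns;
the loop, not a scheduler, decides who runs next, and it does so in
ascending fd order, which is where the "predictable" remark below comes
from.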

> The current scheme is very similar to what we'd get if we used
> user-level threads and wrote Squid in a fully threaded manner.
> The only difference is that currently Squid is limited to a single
> thread executing at a time, while a real threaded design could
> potentially let each thread progress on a separate CPU, though this
> ability depends somewhat on the OS thread implementation.
> As a benefit, compared to a fully threaded design, Squid's current code
> has absolutely predictable execution order, vs. absolutely unpredictable
> with threads (unless you enforce order by other means).

If you count file descriptor number as predictable, then yes.

Most user-level thread implementations are also quite predictable, with
very simple schedulers (usually plain round-robin or FIFO).

> What's quite important is that user-level threads are not able to enter
> the kernel concurrently, because the kernel views all user-level threads
> of a process as a single thread, and most thread libraries enclose
> syscalls with mutexes (or worse: _all_ syscalls with a _single_ mutex).
> This means that any syscall potentially ends up hitting a locked mutex
> and rescheduling all threads. This means that all syscalls are
> serialised and all threads can run concurrently only between syscalls.
> So we only add the overhead of switching between threads, and gain
> very little on average. This isn't quite what we want.

Yes, and I am well aware of this. So is Adrian. What we are looking at
is a combination where threads/processes are used to scale on SMP
machines, and within each thread/process connection scheduling is done
the way Squid currently operates.
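
Roughly, the shape we have in mind looks like the sketch below: fork one
network I/O process per CPU, and inside each one keep the existing
comm_select-style loop. Nothing here is decided; names and numbers are
placeholders.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define N_CPUS 2            /* one network I/O process per CPU */

static void worker_loop(int id)
{
    /* In the real thing this would be the existing select()/poll() based
     * comm loop, scheduling many concurrent connections in this process. */
    printf("worker %d (pid %d): running event loop\n", id, (int) getpid());
    sleep(1);
}

int main(void)
{
    int i;

    for (i = 0; i < N_CPUS; i++) {
        pid_t pid = fork();
        if (pid == 0) {     /* child: becomes a network I/O process */
            worker_loop(i);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)  /* parent simply reaps the workers here */
        ;
    return 0;
}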

> So for real parallelism we are forced to use kernel threads. Of course
> we can expect much better concurrency and, with it, more work done in
> the same time. But we'd have to deal with all the headaches of
> concurrent threads, or add bottlenecks ourselves.

Of course. On systems not supporting kernel threads we have to use
processes to accomplish SMP support.

> To write a classically threaded Squid we'd most probably need a total
> rewrite, which is most probably not desired. This leaves us with
> adding threads where they give the most benefit. The current async-io is
> one good example. In async-io, threads are used to good effect and, in a
> way, are unavoidable if we want a non-blocking Squid.

Yes.
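
For reference, the async-io trick boils down to handing the blocking
disk call to a worker thread so the main comm loop never waits on disk.
A stripped-down sketch of the idea, not the real aiops code (the file
name is just an example):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct aio_req {
    const char *path;
    char buf[4096];
    ssize_t result;
};

static void *aio_worker(void *arg)
{
    struct aio_req *req = arg;
    int fd = open(req->path, O_RDONLY);

    if (fd < 0) {
        req->result = -1;
        return NULL;
    }
    req->result = read(fd, req->buf, sizeof req->buf); /* may block; fine here */
    close(fd);
    return NULL;
}

int main(void)
{
    struct aio_req req;
    pthread_t tid;

    req.path = "/etc/passwd";
    pthread_create(&tid, NULL, aio_worker, &req);
    /* ... the main thread keeps running its select()/poll() loop here; the
     * real code gets told about completion via a pipe or similar ... */
    pthread_join(tid, NULL);
    printf("read returned %d bytes\n", (int) req.result);
    return 0;
}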

> By using kernel threads we pretty much unavoidably face inter-thread
> communication and synchronisation, which adds overhead proportional
> to the number of threads running. There are several reasons for that,
> and without digging deeper into how threads are actually implemented
> this is hard to see, and it may leave a false sense of scalability if
> not accounted for.

Very much true, and that is why we are looking at "fat" threads where as
much as possible is done within the thread. It is not like we are looking
at a single-thread-per-connection design (if you want that, then go for
"oops").

> I don't think threads can in any way help us write fault-tolerant code,
> although I'd agree that using threads could help us better follow the
> flow of the code, and perhaps avoid several hard-to-track errors.

No. Threads won't help fault tolerance much.

> In fact, on the contrary, having lots of thread-specific details might
> make the code even harder to debug, and by definition, if any thread of
> control can bring the whole process down, there isn't much difference
> whether you try to write code that can tolerate some errors or try to
> write code without any errors ;)

On that I do not agree. There are lots of things which can be done to
write code that tolerates errors; however, that is not what we are
discussing in the threads discussion.

Some ideas were in my notes on a multi-process design (not threads within
one process), but automatic crash recovery is only the tip of the
iceberg in writing fault-tolerant code.

> Actually, it is possible to rewrite Squid so that it looks exactly like
> fully threaded code while at the same time not using a single
> thread library. All it needs is to wrap all system calls that could
> block and implement stack save/restore for each such call, and then
> actually block in comm_select. Later, when the socket is ready, instead
> of calling callbacks, simply restore the stack and return to the
> caller until it shortly blocks again. In fact this is exactly what
> user-level threads do, just hidden from the programmer.
> Later, we could actually implement those wrapped calls in several
> different ways, like kernel threads or separate processes, depending
> on OS support.

We are heading for a processing model quite similar to the current
design, based around select/poll handlers, not threading.
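
Still, to make the stack save/restore idea concrete: with POSIX ucontext
a wrapped "blocking" call would look roughly like this. A toy sketch
only, not something we plan to build on.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t scheduler_ctx;   /* the comm_select-style main loop */
static ucontext_t request_ctx;     /* one "thread of control" per request */

/* A wrapped "blocking" read: park this flow and return to the scheduler. */
static void my_read_wrapper(void)
{
    printf("request: would block in read(), yielding to scheduler\n");
    swapcontext(&request_ctx, &scheduler_ctx);  /* save stack, resume later */
    printf("request: socket ready, continuing where we left off\n");
}

static void handle_request(void)
{
    printf("request: started\n");
    my_read_wrapper();             /* looks like a plain blocking call */
    printf("request: finished\n");
}

int main(void)
{
    static char stack[64 * 1024];

    getcontext(&request_ctx);
    request_ctx.uc_stack.ss_sp = stack;
    request_ctx.uc_stack.ss_size = sizeof stack;
    request_ctx.uc_link = &scheduler_ctx;       /* return here when done */
    makecontext(&request_ctx, handle_request, 0);

    /* "comm_select loop": run the request until it parks itself ... */
    swapcontext(&scheduler_ctx, &request_ctx);

    /* ... here the real loop would block in select()/poll() until the
     * socket became ready, then restore the saved stack instead of
     * calling a callback: */
    swapcontext(&scheduler_ctx, &request_ctx);

    printf("scheduler: request completed\n");
    return 0;
}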

> In terms of fault-tolerance, the building block could only be the
> process (we can't really cope with SIGSEGV in a reasonable way within
> a thread).

As I have said all along.

> If we can split several tasks of Squid into separate processes that
> are self-sufficient, then we can get some fault-tolerance. But
> definitely at a price, most probably performance-wise.

Well aware of the performance penalties. However, I do believe that the
design could be done in such a manner that the penalty will be quite
small compared to what is gained.

> In terms of splitting Squid into separate tasks (processes/threads),
> we should think very clearly about _why_ we would want some task to be
> separated, what it gives us when separated, and at what price.

There has been a great deal of thought on that.

> For example, if we had a separate thread for every client session, we'd
> most of the time be blocking in either client-write or server-read.
> So, basically, for every packet coming from the server, the system has
> to schedule the thread to run, and all it does is copy one packet from
> the server socket to the client socket before being blocked again. The
> overhead of switching kernel threads grows until it dominates all the
> CPU time at high loads with many concurrent sessions.
> While splitting tasks, we should always keep such things in mind.

Doing so has never been an option. If such a redesign were to be
undertaken, it wouldn't be Squid any longer. And I don't see the point,
as there already is at least one free proxy implementation along those
lines (oops).

> In general, it isn't very useful to split such a task into a separate
> thread if executing that task takes very little time and/or the
> overhead of scheduling it is comparable to the execution time.
> Using async-io to read/write network sockets makes very little sense.

No it doesn't.

> At the same time, for example, the ICP server could be very effective
> as a separate thread. It is pretty much independent of any other Squid
> tasks and runs very fast. The reason a separate thread may be
> useful here is that we can avoid regular polling of the ICP socket
> and make it totally asynchronous. ICP traffic is also relatively small,
> so waking the ICP thread for a single packet isn't very
> much overhead. We could also use a separate thread for the http accept
> socket.
>
> But making a separate thread only to issue lookups in the Store index
> database may be expensive, because current lookups are very fast
> with no overhead, and adding a layer of separation (a thread) can add
> noticeable overhead.
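
On the ICP thread idea: yes, a thread that simply blocks in recvfrom()
on the ICP socket would take that socket out of the poll set entirely.
Roughly like the sketch below (made-up code, not the real ICP handling;
error checking mostly omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *icp_thread(void *arg)
{
    int sock = *(int *) arg;
    char pkt[1500];
    struct sockaddr_in from;
    socklen_t fromlen;

    for (;;) {
        ssize_t n;

        fromlen = sizeof from;
        n = recvfrom(sock, pkt, sizeof pkt, 0,
                     (struct sockaddr *) &from, &fromlen);
        if (n <= 0)
            break;
        /* ... parse the ICP query, consult the (read-only) index and
         * sendto() an ICP_HIT/ICP_MISS reply right here ... */
        printf("icp: %d byte query from %s\n", (int) n,
               inet_ntoa(from.sin_addr));
    }
    return NULL;
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    pthread_t tid;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(3130);            /* the usual ICP port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(sock, (struct sockaddr *) &addr, sizeof addr) < 0)
        return 1;
    pthread_create(&tid, NULL, icp_thread, &sock);
    /* ... the main thread carries on with the normal comm loop ... */
    pthread_join(tid, NULL);
    close(sock);
    return 0;
}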

Ok. Maybe I should repeat the process split suggestion I made:

1. Network I/O processes. One per CPU. Each handling lots of concurrent
connections. All request forwarding for a single client connection takes
place in a single process.

2. Storage processes, one per cache_dir (disk spindle). Each takes care
of reading/writing to the disk and manages the definitive index of that
directory. These can in turn be multithreaded like the current async-io
if desired.

3. A master process, keeping an eye on everything and making sure
everything is up and running.

4. Other helper processes as needed.

The above might differ slightly from the previous two process
descriptions; however, the basic idea is the same.

The store index is shared between the storage and network processes via
compact hints, for example cache digests. Sharing could be done using,
for example, shared memory or memory-mapped data, depending on the OS
and taste. How does not matter for the design, only that the regions are
there, with one writer and multiple readers, and are not sensitive to
race conditions.
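
As an illustration only: one way to get the one-writer/many-readers
property is a shared mapping plus a sequence counter, so readers can
detect and retry a torn update. The real hints would be something like a
cache digest rather than a single counter, and a real implementation
would also need memory barriers, which this sketch ignores.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct hints {
    volatile unsigned seq;       /* odd while the writer is mid-update */
    volatile unsigned objects;   /* stand-in for the real digest/index data */
};

int main(void)
{
    struct hints *h = mmap(NULL, sizeof *h, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int i;

    if (h == MAP_FAILED)
        return 1;

    if (fork() == 0) {           /* reader: think network I/O process */
        for (i = 0; i < 5; i++) {
            unsigned s1, s2, val;
            do {                 /* retry if the writer was mid-update */
                s1 = h->seq;
                val = h->objects;
                s2 = h->seq;
            } while (s1 != s2 || (s1 & 1));
            printf("reader sees %u objects\n", val);
            usleep(1000);
        }
        _exit(0);
    }

    for (i = 1; i <= 5; i++) {   /* writer: think storage process */
        h->seq++;                /* mark update in progress (odd) */
        h->objects = i * 100;
        h->seq++;                /* publish (even again) */
        usleep(1000);
    }
    wait(NULL);
    return 0;
}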

The idea behind this multi-process design is that each unit is
self-contained for the operations it performs. If a network I/O process
dies and restarts, then only the requests currently being processed by
that process get affected.
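
The kind of watchdog loop the master process could run is sketched
below; the roles, counts and names are made up for illustration, not a
committed design.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define N_WORKERS 3      /* e.g. two network I/O processes + one storage */

static pid_t spawn(int slot)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* child: run the network I/O or storage main loop for this slot */
        printf("worker %d up as pid %d\n", slot, (int) getpid());
        pause();         /* stand-in for the real event loop */
        _exit(0);
    }
    return pid;
}

int main(void)
{
    pid_t pids[N_WORKERS];
    int slot;

    for (slot = 0; slot < N_WORKERS; slot++)
        pids[slot] = spawn(slot);

    for (;;) {           /* master: watch the workers and restart on death */
        int status;
        pid_t dead = wait(&status);

        if (dead < 0)
            break;
        for (slot = 0; slot < N_WORKERS; slot++)
            if (pids[slot] == dead) {
                fprintf(stderr, "worker %d died, restarting\n", slot);
                pids[slot] = spawn(slot);
            }
    }
    return 0;
}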

The exact details of the ICP/RPC mechanisms between the various parts
remain to be spelled out. Before that work is started we need to agree
on the basic principles of having a multi-process design.

Adrian is discussing a different design based on threads, but the split
is along similar lines. The main differences are in how the store index
is maintained and in the mechanisms used for communication between the
various parts. Also, the threaded design does not provide the
distributed crash recovery of the multi-process design.

/Henrik