On 15 Sep 2001, at 1:41, Henrik Nordstrom wrote:
> Andres Kroonmaa wrote:
>
> > - signals are expensive when most fds are active.
>
> well.. not entirely sure on this one. Linux RT signals is a very light
> weight notification model, only the implementation sucks due to
> worthless notification storms.. (if you have already received a
> notification that there is data available for reading, there is
> absolutely no value in receiving yet another notification when there is
> more data before you have acted on the first.. and similar for writing)
I don't know the details. I imagine that queueing a signal to the
process is very lightweight. It's the setting up and the dequeueing
that burn the benefits imho, mostly because they are done one FD at
a time. I can't see any benefit compared to devpoll, unless you need
to handle IO from a signal handler.
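
For reference, a rough sketch (not Squid code) of the per-FD setup
and the one-signal-at-a-time dequeue I mean, using the Linux 2.4
RT-signal mechanism; handle_fd() is a made-up handler, and error
handling plus the SIGIO overflow fallback are omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* per-FD setup: every socket needs its own fcntl() calls */
static void rtsig_setup(int fd)
{
    fcntl(fd, F_SETOWN, getpid());         /* deliver signals to us */
    fcntl(fd, F_SETSIG, SIGRTMIN + 1);     /* use a queued RT signal */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
}

/* dequeue: one siginfo_t, i.e. one FD event, per sigwaitinfo() call */
static void rtsig_loop(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN + 1);
    sigprocmask(SIG_BLOCK, &set, NULL);
    for (;;) {
        siginfo_t info;
        if (sigwaitinfo(&set, &info) < 0)
            continue;                      /* EINTR etc. */
        /* si_fd/si_band identify one FD and its events; we still get
           one wakeup per queued event, never a batch */
        handle_fd(info.si_fd, info.si_band);
    }
}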
> > - syscalls are expensive if doing little work, mostly for similar
> > reasons as threads.
>
> depends on platform and syscall, but generally true. The actual syscall
> overhead is however often overestimated in discussions like this. A
> typical syscall consists of
> * light context switch
> * argument verification
> * data copying
> * processing
>
> What you can optimize by aggregation is the light context switches.
> The rest will still be there.
Not sure what you mean by light context switches. Perhaps you
distinguish between the CPU protection-mode change, the kernel doing
its queueing, and the kernel going through the scheduling code. In my
view the protection-mode change alone burns a lot of CPU, and the
typical syscall in Squid is read() or write(). Both are cancellation
points, meaning some scheduling checks are done; the same holds for
sigwait(), poll(), etc. Perhaps only calls like fcntl() are
lightweight in this sense. Basically, any syscall that can possibly
block is a heavy syscall.
But again, the CPU-specific overhead shouldn't be underestimated. A
mode change disturbs the CPU caches on almost all CPUs (prefetching,
pipelines, VM maps), which means a high miss rate right after it.
With CPU clock rates very high compared to RAM clock rates this
translates into a lot of lost CPU cycles. A syscall could do a
single memory write and return, yet end up burning as much CPU as a
few hundred lines of code.
The only difference is that the CPU is stalled instead of running
code. All of this is felt only when the syscall rate is very high,
when the code leaves the process very often while doing only very
little work at a time in either kernel or userspace. We should stay
longer in userspace, preparing several sockets for IO, and then stay
longer in the kernel, handling the IO.
Imagine we had to loop through all the FDs in the poll array and
poll() each FD individually. That is where we are today with IO
queueing.
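
To illustrate the batching, this is roughly how /dev/poll lets you
register a whole array of interests with one write() and collect a
whole array of ready FDs with one ioctl(). Sketch only (Solaris
headers, error handling omitted, handle_ready_fd() is made up):

#include <sys/devpoll.h>
#include <sys/ioctl.h>
#include <poll.h>
#include <unistd.h>

#define MAXEVENTS 1024

/* dp comes from open("/dev/poll", O_RDWR) */
static void devpoll_cycle(int dp, struct pollfd *add, int nadd)
{
    struct pollfd ready[MAXEVENTS];
    struct dvpoll dvp;
    int i, nready;

    /* one syscall registers many FDs... */
    write(dp, add, nadd * sizeof(struct pollfd));

    /* ...and one syscall returns many ready FDs */
    dvp.dp_fds = ready;
    dvp.dp_nfds = MAXEVENTS;
    dvp.dp_timeout = 10;                  /* milliseconds */
    nready = ioctl(dp, DP_POLL, &dvp);
    for (i = 0; i < nready; i++)
        handle_ready_fd(ready[i].fd, ready[i].revents);
}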
> > Eventually we'll strike syscall rate that wastes most CPU cycles in
> > context-switches and cache-misses.
>
> The I/O syscall overhead should stay fairly linear with the I/O request
> rate I think. I don't see how context switches and cache misses can
> increase a lot only because the rate increases. It is still the same
> amount of code running in the same amount of execution units.
Under light loads the syscall overhead is small because the time
between two syscalls is relatively large. The problem is that when
the request rate increases, the expected overhead per request stays
the same, but the timeframe between switches shrinks, so the
proportion of CPU time burned in syscall overhead goes up: fixed
overhead per request at ten times the rate means ten times the
overhead per second, so roughly 1% of a CPU at 100 req/sec becomes
10% at 1K req/sec. But it is worse, possibly much worse, and very
difficult to measure. Eventually we lose CPU resources to nothing.
To make things worse, we optimise our code, the goal of which is to
further reduce the time between successive syscalls.
Saying "most" is exaggerated, though. ;)
> > Ideally, kernel should be given a list of sockets to work on, not in just
> > terms of readiness detection, but actual IO itself. Just like in devpoll,
> > where kernel updates ready fd list as events occur, it should be made to
> > actually do the IO as events occur. Squid should provide a list of FDs,
> > commands, timeout values and bufferspace per FD and enqueue the list.
> > This is like kernel-aio, but not quite.
>
> Sounds very much like LIO. Main theoretical problem is what kind of
> notification mechanism to use for good latency.
Yes, LIO. The problem is that most current LIO implementations are
done in a library, using aio calls per FD. And as aio is typically
implemented with a thread per FD, that is unacceptable. The
important part is a kernel-level syscall that takes a list of
commands and equally returns a list of results, not necessarily the
same set as the requests.
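
For comparison, the POSIX listio interface already has the right
shape on the submit side, even where the implementation just fans it
out to per-FD aio underneath. A sketch, assuming at most 64 requests
and with error handling omitted:

#include <aio.h>
#include <string.h>

/* queue a batch of reads with a single call */
static int submit_reads(struct aiocb *cbs, int *fds, char bufs[][4096], int n)
{
    struct aiocb *list[64];
    int i;

    if (n > 64)
        n = 64;
    for (i = 0; i < n; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fds[i];
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = sizeof(bufs[i]);
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* LIO_NOWAIT queues the whole list, but completion still has to
       be collected per aiocb via aio_error()/aio_return(); there is
       no result list coming back */
    return lio_listio(LIO_NOWAIT, list, n, NULL);
}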
> > From other end sleep in a wakeup function which returns a list of completed
> > events, thus dequeueing events that either complete or error. Then handle
> > all data in a manner coded into Squid, and enqueue another list of work.
> > Again, point is on returning list of completed events, not 1 event at a
> > time. Much like poll returns possibly several ready FD's.
>
> I am not sure I get this part.. are you talking about I/O or only
> notifications?
Both, combined. Most of the time you poll just as a means to know
when IO won't block. If you can enqueue the IO to the kernel and
read the results when it either completes or times out, you don't
really need poll. You need to wake up on a list of IOCBs that have
changed status: done, error, or timeout. But an IOCB could also
carry a command for doing just a poll() on an FD.
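
With plain POSIX aio the closest thing today is aio_suspend() on a
list, after which you still have to scan the whole list yourself for
whatever finished. A sketch (reap_one() is a made-up completion
handler):

#include <aio.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* wait until at least one outstanding request finishes, then reap
   everything that has completed */
static void reap_completed(struct aiocb **cbs, int n)
{
    int i, err;
    ssize_t res;

    aio_suspend((const struct aiocb *const *)cbs, n, NULL);
    for (i = 0; i < n; i++) {
        if (cbs[i] == NULL)
            continue;                      /* free slot */
        err = aio_error(cbs[i]);
        if (err == EINPROGRESS)
            continue;                      /* still pending */
        res = aio_return(cbs[i]);          /* bytes moved, or -1 */
        reap_one(cbs[i], err, res);
        cbs[i] = NULL;                     /* slot reusable */
    }
}

A kernel that handed back just the completed entries would save that
scan, which is exactly the point.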
> > I believe this is useful, because only kernel can really do work in
> > async manner - at packet arrival time. It could even skip buffering
> > packets in kernel space, but append data directly to userspace buffs,
> > or put data on wire from userspace. Same for disk io.
>
> Apart from the direct userspace copy, this is already what modern
> kernels does on networking..
Oh, no. I'm talking about a different model of communicating with
the kernel. Sure the kernel works in an async manner, there is no
other way in fact. There just has to be more work delegated to the
kernel at packet arrival time, to reduce the useless jerking back
and forth between process and kernel. Just as with devpoll.
> > In regards to eventio branch, new network API, seems it allows to
> > implement almost any io model behind the scenes. What seems to stick
> > is FD-centric and one-action-at-a-time approach. Also it seems that
> > it could be made more general and expandable, possibly covering also
> > disk io. Also, some calls assume that they are fulfilled immediately,
> > no async nature, no callbacks (close for eg). This makes it awkward
> > to issue close while read/write is pending.
>
> Regarding the eventio close call: This does not close, it only signals
> EOF. You can enqueue N writes, then close to signal that you are done.
> And there is a callback, registered when the filehandle is created.
> Serialization is guaranteed.
Ok, I should have realised that..
Btw, why is the close callback registration separated from the call?
To follow the existing code style more closely?
> > One thing that bothers me abit is that you can't proceed before FD
> > is known. For disk io, for eg. it would help if you could schedule
> > open/read/close in one shot. For that some kind of abstract session
> > ID could be used I guess. Then such triplets could be scheduled to
> > the same worker-thread avoiding several context-switches.
>
> The eventio does not actually care about the Unix FD. The exact same API
> can be used just fine with asyncronous file opens or even aggregated
> lowlevel functions if you like (well.. aggregation of close may be a bit
> hard unless there is a pending I/O queue)
Hmm, I assumed that you cannot call read/write until the filehandle
is provided by the initial callback. Do you mean we can?
> > Also, how about some more general ioControlBlock struct, that defines
> > all the callback, cbdata, iotype, size, offset, etc... And is possibly
> > expandable in future.
>
> ???
struct IOCB {
    int error;             /* errno-style result code; "errno" itself
                              clashes with the libc macro */
    int command;           /* read, write, poll, accept, close, ... */
    COMMIOCB *callback;
    void *cbdata;
    IOBUF *buf;
    size_t max_size;
    off_t offset;
    COMMCLOSECB *handler;
    /* ... etc ... */
};
Dunno, maybe this packing is a job for the actual io-model behind
the api. It just seems to me that it would be nice if we could pass
an array of such IOCBs to the api.
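
Something like the following is what I have in mind. The names
commio_submit/commio_reap are made up on the spot, they are not part
of eventio, and the dispatch is only schematic:

/* enqueue a whole array of commands in one call;
   returns the number of IOCBs accepted */
int commio_submit(struct IOCB *list[], int nitems);

/* sleep until at least one IOCB changes status (done, error or
   timeout), then return up to maxitems completed IOCBs in 'done' */
int commio_reap(struct IOCB *done[], int maxitems, int timeout_ms);

The main loop would then look roughly like:

    for (;;) {
        struct IOCB *done[64];
        int i, n;
        n = commio_reap(done, 64, 10);
        for (i = 0; i < n; i++)
            dispatch(done[i]);        /* hand results back to Squid */
        /* ...prepare and commio_submit() the next batch here... */
    }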
> > Hmm, probably it would be even possible to generalise the api to such
> > extent, that you could schedule acl-checks, dns, redirect lookups all
> > via same api. Might become useful if we want main squid thread to do
> > nothing else but be a broker between worker-threads. Not sure if that
> > makes sense though, just a wild thought.
>
> Define threads in this context.
Of course I'm into scaling here.
pthreads. All DNS lookups could be done by one worker thread, which
might set up and handle multiple dnsservers, whatever. ACL checks
that eat CPU could also be done on separate cpu-threads.
All this can only make sense if message passing is very lightweight
and does not require a thread/context switch per action. It's about
pipelining. The same or a separate worker thread could handle
redirectors and the message passing to/from them.
Suppose we read from /dev/poll 10 results that notify about 10
accepted sockets. We can now check all the ACLs in the main thread,
or we can queue all 10 to a separate thread, freeing the main
thread. If we also had 20 sockets ready for write, we could either
handle them in the main thread, or enqueue-append (gather) them to a
special worker thread. The difference is that in one case all the
process and kernel overhead consumes a single CPU, while in the
other case the load is shared, or the work is postponed to a time
when the main thread is idle in its wakeup.
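
The kind of message passing I mean only stays cheap if a whole batch
crosses the thread boundary per wakeup. A pthreads sketch with
made-up names (mutex/cond must be initialised elsewhere, and the
overflow case is simply ignored):

#include <pthread.h>
#include <string.h>

#define QMAX 256

/* tiny batch queue: the producer appends N items and signals once,
   the worker drains everything it finds per wakeup */
struct batch_queue {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    void *items[QMAX];
    int count;
};

static void bq_push_batch(struct batch_queue *q, void **items, int n)
{
    int i;
    pthread_mutex_lock(&q->lock);
    for (i = 0; i < n && q->count < QMAX; i++)
        q->items[q->count++] = items[i];
    pthread_cond_signal(&q->wake);        /* one wakeup per batch */
    pthread_mutex_unlock(&q->lock);
}

static int bq_pop_all(struct batch_queue *q, void **out)
{
    int n;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->wake, &q->lock);
    n = q->count;
    memcpy(out, q->items, n * sizeof(void *));
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}

The 10 accepted sockets above would then be one bq_push_batch() call
and at most one worker wakeup.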
> > Also, I think we should think about trying to be more threadsafe.
> > Having one compact IOCB helps here. Maybe even allowing to pass a
> > list of IOCB's.
>
> See previous discussions about threading. My view is that threading is
> good to certain extent but locality should be kept strong. The same
> filehandle should not be touched by more than one thread (with the
> exception of accept()).
I agree. A thread is bad if it does little work, but good if it has
a decent amount of work. It only makes sense with more than one CPU.
> The main goal of threading is scalability on SMP.
In this context it's not so much about SMP. It's more about
offloading work from an overloaded CPU in a decently efficient
manner. Better SMP scaling is a wanted byproduct, but not the main
goal as such. To pursue perfect SMP scaling we would need to
redesign too much.
> The same goal can be acheived by a multi-process design, which also
> scales on assymetric architectures, but for this we need some form of
> low overhead shared object broker (mainly for the disk cache).
Yes. It just seems that it would be easier to start using threads
for limited tasks, at least meanwhile.
------------------------------------
Andres Kroonmaa
CTO, Microlink Online
Tel: 6501 731, Fax: 6501 725
Pärnu mnt. 158, Tallinn,
11317 Estonia