Socket problem using beta2 on Windows-XP

Thomas Hallgren

Sep 29, 2005, 2:50:30 AM

Hi,
I've installed PostgreSQL 8.1-beta2 as a service on my Windows-XP box.
It runs fine but I get repeated messages like this in the log:

2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use
in backend: error code 10038

and for each message printed, a new postgres process is created. To make
things worse, those processes do not die when I stop the service.

I use sysinternals tcpview to monitor my sockets. I know that no other
process is using 1880. Each started postgres process will occupy two,
seemingly random ports that apparently form a loop somehow. This is a
typical entry:

<non-existent>:3136 TCP 127.0.0.1:1554 127.0.0.1:1555 ESTABLISHED
<non-existent>:3136 TCP 127.0.0.1:1555 127.0.0.1:1554 ESTABLISHED

The weird thing is that there is no process with pid 3136 (hence the
name <non-existent>). There is a postgres process with another pid in my
process listing. If I kill that, the <non-existent> entries go away.

Looks like pid 3136 is talking to itself. A pipe() followed by failure
to start the new process perhaps?

Regards,
Thomas Hallgren

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Magnus Hagander

Sep 29, 2005, 3:07:59 AM

> Hi,
> I've installed PostgreSQL 8.1-beta2 as a service on my
> Windows-XP box.
> It runs fine but I get repeated messages like this in the log:
>
> 2005-09-29 00:41:09 FATAL: could not duplicate socket
> 1880 for use in backend: error code 10038
>
> and for each message printed, a new postgres process is
> created. To make things worse, those processes do not die
> when I stop the service.
>
> I use sysinternals tcpview to monitor my sockets. I know that
> no other process is using 1880. Each started postgres process
> will occupy two, seemingly random ports that apparently form
> a loop somehow. This is a typical entry:
>
> <non-existent>:3136 TCP 127.0.0.1:1554 127.0.0.1:1555 ESTABLISHED
> <non-existent>:3136 TCP 127.0.0.1:1555 127.0.0.1:1554 ESTABLISHED
>
> The weird thing is that there is no process with pid 3136
> (hence the name <non-existent>). There is a postgres process
> with another pid in my process listing. If I kill that, the
> <non-existent> entries go away.
>
> Looks like pid 3136 is talking to itself. A pipe() followed
> by failure to start the new process perhaps?

Do you by any chance run any antivirus or firewall software? If so, can
you try removing it? (Note: an actual uninstall, not just disabling it!)

//Magnus


Thomas Hallgren

Sep 29, 2005, 3:48:25 AM

Nope, no anti-virus and no firewall (other than the box that fronts my
home-network to the outside world).

- thomas


Martijn van Oosterhout

Sep 29, 2005, 4:11:12 AM

On Thu, Sep 29, 2005 at 08:50:30AM +0200, Thomas Hallgren wrote:
> Hi,
> I've installed PostgreSQL 8.1-beta2 as a service on my Windows-XP box.
> It runs fine but I get repeated messages like this in the log:
>
> 2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use
> in backend: error code 10038

That's from postmaster.c:write_inheritable_socket(). Error 10038 is
WSAENOTSOCK. Very odd, time to get out the debugger? Get a backtrace at
least.
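
For readers decoding similar log lines: the number after "error code" is a Winsock error value, and 10038 is WSAENOTSOCK (an operation was attempted on something that is not a socket). A minimal lookup sketch in portable C follows. The table is only a small hand-picked subset of the documented Winsock codes, and the helper name wsa_error_name is made up for illustration; on Windows itself, FormatMessage() can produce the message text.

```c
#include <stddef.h>

/* A few well-known Winsock error codes (subset only, for illustration). */
struct wsa_error
{
    int         code;
    const char *name;
};

static const struct wsa_error wsa_errors[] = {
    {10004, "WSAEINTR"},
    {10009, "WSAEBADF"},
    {10013, "WSAEACCES"},
    {10022, "WSAEINVAL"},
    {10024, "WSAEMFILE"},
    {10035, "WSAEWOULDBLOCK"},
    {10038, "WSAENOTSOCK"},
    {10048, "WSAEADDRINUSE"},
    {10054, "WSAECONNRESET"},
    {10061, "WSAECONNREFUSED"},
};

/* Hypothetical helper: map a numeric code to its symbolic name. */
const char *
wsa_error_name(int code)
{
    size_t i;

    for (i = 0; i < sizeof(wsa_errors) / sizeof(wsa_errors[0]); i++)
    {
        if (wsa_errors[i].code == code)
            return wsa_errors[i].name;
    }
    return "unknown";
}
```

Calling wsa_error_name(10038) returns "WSAENOTSOCK", which matches the code in the log line quoted above.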

Hope this helps,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Magnus Hagander

Sep 29, 2005, 4:03:39 PM

Hmm. Bummer.

Anyway. The netstat indicates that the pipe() call works. The order is
pretty much:

parent: create socket pair, connected to each other.
parent: Duplicate socket [this is what fails]
parent: close own copy of socket
child: recreate socket from structure [this is never called, thus the
new socket is never "attached" to a process]
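
The first step in the list above would explain the tcpview entries: a pipe emulated as a pair of TCP sockets connected over loopback shows up as two ESTABLISHED rows pointing at each other. A minimal sketch of building such a pair follows. It uses the POSIX socket headers so that it can be tried anywhere; a Windows build would use winsock2.h and closesocket() instead. The function name loopback_pair is hypothetical, and this is an illustration rather than the code PostgreSQL actually uses.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a connected TCP pair over 127.0.0.1.
 * On success returns 0 with fds[0] = accepted end, fds[1] = connecting end.
 * On failure returns -1.
 */
int
loopback_pair(int fds[2])
{
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);
    int         listener;
    int         client = -1;
    int         server;

    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;          /* let the OS pick a free port */

    /* Bind, listen, then ask which port the OS assigned. */
    if (bind(listener, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *) &addr, &len) < 0)
        goto fail;

    /* Connect a second socket to the listener and accept it. */
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0)
        goto fail;
    if (connect(client, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        goto fail;
    server = accept(listener, NULL, NULL);
    if (server < 0)
        goto fail;

    close(listener);            /* only the connected pair is kept */
    fds[0] = server;
    fds[1] = client;
    return 0;

fail:
    if (client >= 0)
        close(client);
    close(listener);
    return -1;
}
```

Bytes written to fds[1] can be read from fds[0]. If both ends stay open but nothing ever picks them up, a connection monitor keeps listing the two rows, which is consistent with the entries reported earlier in the thread.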

Now *why* it's doing this, I have no idea.

Questions:
1) Does it actually work? ;-) And just logs the error anyway?
2) Does this happen on *every* connection?
3) Can you reproduce this on a different machine, or just one?

//Magnus


Thomas Hallgren

Sep 29, 2005, 5:43:37 PM

Hi,
I'm sorry, time was short today. To answer your questions:
1. I can run a psql and other client programs. Everything works fine.
But while doing it, I get a lot of zombies in the tcpview and
eventually, I think I run out of ports. Psql just hangs when I try to
connect. When that happens, I have two choices; a) Stop the service and
then kill off all processes by hand (there's now a *lot* of them), or b)
reboot.
2. It happens while the postmaster is idle. If I leave it idle for a
while and then come back, I'll have a whole bunch of new processes in my
task-manager and zombies in tcpview.
3. I don't have another machine handy for this right now.

It sounds like you know where it happens. Martijn requested a
stacktrace. Do you still need that? If you do, I'll try to get some time
over this weekend.

Regards,
Thomas Hallgren


Alvaro Herrera

Sep 29, 2005, 6:23:50 PM

On Thu, Sep 29, 2005 at 11:43:37PM +0200, Thomas Hallgren wrote:

> 2. It happens while the postmaster is idle. If I leave it idle for a
> while and then come back, I'll have a whole bunch of new processes in my
> task-manager and zombies in tcpview.

Hmm ... how many processes? Did you enable autovacuum perchance? If
so, does the number of processes correspond approximately to the
"autovacuum_naptime"?

--
Alvaro Herrera http://www.advogato.org/person/alvherre
"La espina, desde que nace, ya pincha" (Proverbio africano)


Magnus Hagander

Sep 30, 2005, 1:44:39 AM

> > 2. It happens while the postmaster is idle. If I leave it idle for a
> > while and then come back, I'll have a whole bunch of new processes in
> > my task-manager and zombies in tcpview.
>
> Hmm ... how many processes? Did you enable autovacuum perchance? If
> so, does the number of processes correspond approximately to the
> "autovacuum_naptime"?

IIRC, the win32 installer will enable autovacuum by default. And yes,
autovacuum was my first thought as well after Thomas's last mail - that
would be a good explanation for why it happens when the postmaster is
idle.

//Magnus


Thomas Hallgren

Sep 30, 2005, 1:50:52 AM

Magnus Hagander wrote:

>IIRC, the win32 installer will enable autovacuum by default. And yes,
>autovacuum was my first thought as well after Thomas's last mail - that
>would be a good explanation for why it happens when the postmaster is
>idle.
>
>

I used the win32 installer defaults so autovacuum is probably a safe
assumption.

- thomas

Magnus Hagander

Sep 30, 2005, 2:09:48 AM

> >IIRC, the win32 installer will enable autovacuum by default. And yes,
> >autovacuum was my first thought as well after Thomas's last mail - that
> >would be a good explanation for why it happens when the postmaster is
> >idle.
> >
> >
> I used the win32 installer defaults so autovacuum is probably a safe
> assumption.

Right. Please try turning it off and see if the problem goes away.

//Magnus


Thomas Hallgren

Sep 30, 2005, 2:21:44 AM

Magnus Hagander wrote:

>Right. Please try turning it off and see if the problem goes away.
>
>

It does (go away).

- thomas


Thomas Hallgren

Sep 30, 2005, 2:29:07 AM

Magnus Hagander wrote:

>Right. Please try turning it off and see if the problem goes away.
>
>

No, wait! It does *not* go away. Do I need to do anything more than
setting this in my postgresql.conf file:

autovacuum = false # enable autovacuum subprocess?

and restart the service?

The two zombie entries occur directly when I start the service, then
there are two new entries popping up every minute.

- thomas


Magnus Hagander

Sep 30, 2005, 2:31:09 AM

> >Right. Please try turning it off and see if the problem goes away.
> >
> >
> No, wait! It does *not* go away. Do I need to do anything
> more than setting this in my postgresql.conf file:
>
> autovacuum = false # enable autovacuum subprocess?
>
> and restart the service?
>
> The two zombie entries occur directly when I start the
> service, then there are two new entries popping up every minute.

Yes, that should be enough.

Hmm. Weird!

If you can get a backtrace from the point where the error msg shows up,
that certainly would help - this means it's not coming from where we
thought it was coming from :-(

//Magnus


Alvaro Herrera

Sep 30, 2005, 8:06:53 AM

On Fri, Sep 30, 2005 at 08:29:07AM +0200, Thomas Hallgren wrote:
> Magnus Hagander wrote:
>
> >Right. Please try turning it off and see if the problem goes away.
> >
> >

> No, wait! It does *not* go away. Do I need to do anything more than
> setting this in my postgresql.conf file:
>
> autovacuum = false # enable autovacuum subprocess?
>
> and restart the service?
>
> The two zombie entries occur directly when I start the service, then
> there are two new entries popping up every minute.

If it's two zombies per minute, then I bet it's the stat collector and
stat bufferer. They are restarted by the postmaster if not found to be
running.

The weird thing is that the postmaster _should_ call wait() for them if
it detects that they died (when receiving a SIGCHLD signal AFAIR). If
it doesn't, maybe it indicates there's a problem with the signal
handling on Win32.

--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 17.7", W 73º 14' 26.8"
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)

Tom Lane

Sep 30, 2005, 9:58:51 AM

Alvaro Herrera <alvh...@alvh.no-ip.org> writes:
> If it's two zombies per minute, then I bet it's the stat collector and
> stat bufferer. They are restarted by the postmaster if not found to be
> running.

That would make some sense, because the stat processes need to set up new
sockets (for the pipe between them). The autovacuum theory didn't hold
any water in my eyes because autovacuum doesn't create any new sockets.

However, why two zombies? That would mean that the grandchild process
started, which should mean that the pipe was already created ...

Does Windows have any equivalent of strace whereby we could watch what's
happening during stats process launch?

regards, tom lane

Magnus Hagander

Sep 30, 2005, 10:13:37 AM

> > If it's two zombies per minute, then I bet it's the stat collector and
> > stat bufferer. They are restarted by the postmaster if not found to
> > be running.
>
> That would make some sense, because the stat processes need to set up
> new sockets (for the pipe between them). The autovacuum theory didn't
> hold any water in my eyes because autovacuum doesn't create any new
> sockets.
>
> However, why two zombies? That would mean that the grandchild process
> started, which should mean that the pipe was already created ...
>
> Does Windows have any equivalent of strace whereby we could watch
> what's happening during stats process launch?

First of all, I won't be able to dig into this any more until next week
- sorry about that. But others are always free to :-)

There is no strace equivalent built in, but you can get an add-on from
http://www.bindview.com/Services/RAZOR/Utilities/Windows/strace_readme.cfm.
Don't put it on a production box permanently, though; it tends to
cause BSODs in some cases.

//Magnus

Thomas Hallgren

Sep 30, 2005, 12:12:03 PM

Tom Lane wrote:

>However, why two zombies? That would mean that the grandchild process
>started, which should mean that the pipe was already created ...
>
>

To clarify, I'm talking about the tcpview window and its connections, and
thus zombie connections. They both belong to the same pid and seem to
point to each other. The actual process no longer exists (it can't be
viewed anywhere).

Regards,
Thomas Hallgren

Thomas Hallgren

Oct 2, 2005, 3:46:27 AM

Martijn van Oosterhout wrote:

>On Thu, Sep 29, 2005 at 08:50:30AM +0200, Thomas Hallgren wrote:
>
>
>>Hi,
>>I've installed PostgreSQL 8.1-beta2 as a service on my Windows-XP box.
>>It runs fine but I get repeated messages like this in the log:
>>
>> 2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use
>>in backend: error code 10038
>>
>>
>
>That's from postmaster.c:write_inheritable_socket(). Error 10038 is
>WSAENOTSOCK. Very odd, time to get out the debugger? Get a backtrace at
>least.
>
>

I finally managed to debug the postmaster and I'm now pretty sure the
message is not from the postmaster itself. I put a breakpoint where the
message is printed (postmaster.c:3762) and in errstart() where elevel >=
ERROR (elog.c:152) but I never get there although the message is
printed. I know that my debugger works because if I put a break on
elog.c:194 it stops for other messages.

Regards,
Thomas Hallgren


Thomas Hallgren

Oct 2, 2005, 6:20:05 AM

I added some traces to the code. I know that the following happens when
I start a postmaster.

StartupDatabase will call internal_fork_exec, it calls
write_inheritable_socket 4 times and succeeds.

During the first iteration of ServerLoop:
StartBackgroundWriter will call internal_fork_exec and succeed.
pgstat_forkexec will call internal_fork_exec and succeed.

In the second iteration of ServerLoop, pgstat_forkexec will again call
internal_fork_exec. This time it fails.
According to the log it fails on line:

write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid);

i.e. on its second call to write_inheritable_socket. The failure is in a
postgres.exe process, not postmaster.exe (and that's why I can't debug
properly on Windoz).

Hope this helps.

Regards,
Thomas Hallgren


Martijn van Oosterhout

Oct 2, 2005, 11:12:23 AM

On Sun, Oct 02, 2005 at 12:20:05PM +0200, Thomas Hallgren wrote:
> I added some traces to the code. I know that the following happens when
> I start a postmaster.

<snip>

> In the second iteration of ServerLoop, pgstat_forkexec will again call
> internal_fork_exec. This time it fails.
> According to the log it fails on line:
>
> write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid);

Well, pgStatSock is the only SOCK_DGRAM socket, all the others are
SOCK_STREAM, maybe that's the difference? It's also connected to
itself, although for DGRAM sockets that's not that special.
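
To illustrate the "connected to itself" remark: a datagram socket can be bound to a loopback port and then connect()ed to its own address, after which anything it sends lands in its own receive queue. The sketch below assumes POSIX socket headers (Windows would need winsock2.h and closesocket()), and self_connected_udp is a hypothetical name, not a PostgreSQL function.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a UDP socket on 127.0.0.1 that is
 * connected to its own address. Returns the descriptor, or -1 on failure.
 */
int
self_connected_udp(void)
{
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);
    int         s;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;          /* let the OS pick a free port */

    /* Bind, learn the assigned port, then connect to that same address. */
    if (bind(s, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        getsockname(s, (struct sockaddr *) &addr, &len) < 0 ||
        connect(s, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        close(s);
        return -1;
    }
    return s;
}
```

After this call, send() on the returned descriptor followed by recv() on the same descriptor returns the datagram just sent.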

The documentation isn't totally clear about this. Also, the error thrown
should terminate the process, yet it obviously doesn't. Very odd.

Any Windows programmers with ideas?

Thomas Hallgren

Oct 12, 2005, 4:41:12 AM

With great help from Magnus, who advised me to use lspfix from cexx.org
to list my lsp's, I found that I had gapsp.dll, "Neoteris DNS Provider"
installed. An uninstall of the Neoteris software made this problem go away.

Regards,
Thomas Hallgren


Alvaro Herrera

Oct 12, 2005, 7:21:07 AM

Thomas Hallgren wrote:

> With great help from Magnus, who advised me to use lspfix from cexx.org
> to list my lsp's, I found that I had gapsp.dll, "Neoteris DNS Provider"
> installed. An uninstall of the Neoteris software made this problem go away.

I guess the question is, why is a "DNS Provider" software blocking
socket creation? Is there a way we could work around that?

--
Alvaro Herrera Architect, http://www.EnterpriseDB.com
"El destino baraja y nosotros jugamos" (A. Schopenhauer)

Magnus Hagander

Oct 12, 2005, 7:30:53 AM

> > With great help from Magnus, who advised me to use lspfix from
> > cexx.org to list my lsp's, I found that I had gapsp.dll,
> > "Neoteris DNS Provider" installed. An uninstall of the Neoteris
> > software made this problem go away.
>
> I guess the question is, why is a "DNS Provider" software
> blocking socket creation? Is there a way we could work around that?
>

It's just another version of the "Broken LSP" that we've been having
problems with before. But before, it's only been AV and firewall stuff.

I guess they somehow put an LSP in there to intercept DNS packets or
something. Completely broken design IMHO, but that's a different thing
;-) And they apparently don't support socket inheritance. The only way
we can work around them breaking the concept of socket inheritance is to
stop using it. Which would mean going multithread instead of
multiprocess, which isn't very likely...

To reiterate the basic point: The broken LSP breaks a fundamental
promise in the sockets API that we absolutely require. The bug is
completely within the LSP.

//Magnus


Magnus Hagander

Oct 12, 2005, 10:35:54 AM

> > To reiterate the basic point: The broken LSP breaks a fundamental
> > promise in the sockets API that we absolutely require. The bug is
> > completely within the LSP.
>

> ISTM that maybe what we have here is a documentation shortcoming.
> I'm thinking that our Windows FAQ ought to suggest
> troubleshooting socket-related problems by removing LSPs one
> at a time.

We used to have this, but we removed it when we added the code that fixed
the problem in 95% of the cases. It's probably a good idea to bring it
back :-(

Tom Lane

Oct 12, 2005, 10:34:46 AM

"Magnus Hagander" <m...@sollentuna.net> writes:
> To reiterate the basic point: The broken LSP breaks a fundamental
> promise in the sockets API that we absolutely require. The bug is
> completely within the LSP.

ISTM that maybe what we have here is a documentation shortcoming.

I'm thinking that our Windows FAQ ought to suggest troubleshooting
socket-related problems by removing LSPs one at a time.

regards, tom lane
