[SOLVED] Infuriating network problem

Postby OnlyTheTony » Sun Jul 11, 2010 3:15 pm

I've recently updated my home server from OpenSuse 10.3 to Lucid Server.

I replaced the motherboard with an Intel DQ35JO that I had lying around. I've dropped in a Core 2 Quad and 4GB RAM.

Additionally I added a 3Ware/AMCC 9650-2LP RAID controller for the disks, which runs off one of the PCI-E x1 slots.

From the "old" server I brought across an Intel Pro 1000 PT Dual Gigabit PCI-E Server Adaptor which is utilising the mobo's PCI-E x16 slot. On the old motherboard (an Asus) this worked without issue.

The problem I'm having is that the NIC seems to be going into some kind of sleep mode once other computers on the network disconnect. Booting up any of the machines directly attached to the same gigabit hub (a Netgear ProSafe GS116) reinstates the connection.

The server logs don't show any indication that the network link is dropping at any point - it just seems to be waiting for a signal from any LAN-connected device. If I try to connect using any wireless devices through the wireless router (Netgear DGN2000) I get no response.

Once it's up and running it works flawlessly - but I didn't have any of these problems under OpenSuse with the older motherboard.

In case it helps - the hub is connected to the router by a single cable. All other wired network connections are made through the gigabit hub. Internet and wireless links are made through the router via the single link.

The server is used for web (public facing development server), emails (SMTP/IMAP), NFS/Samba (internal network only) and VPN.

It really is driving me mad - I'm considering replacing the motherboard to see if that has any effect so any help you guys could give would be REALLY appreciated.

T.

Edit: I forgot to mention it's using the 1.0.2-k2 driver. I've downloaded the latest e1000e driver 1.2.8 - I'll install that later and see if it makes a difference. I'll post the result on here in case anyone else has a similar problem....
Last edited by OnlyTheTony on Tue Jul 20, 2010 8:59 am, edited 1 time in total.
If at first you don't succeed, call it v1.0
OnlyTheTony
LXF regular
 
Posts: 303
Joined: Mon Jan 08, 2007 11:51 am

Postby Dutch_Master » Mon Jul 12, 2010 12:05 am

Things to consider:
1) static IP, no DHCP
2) longer lease times
3) new kernel

My tuppence :)
Dutch_Master
LXF regular
 
Posts: 2439
Joined: Tue Mar 27, 2007 1:49 am

Postby ollie » Mon Jul 12, 2010 11:14 am

Check the BIOS for power settings and check the power management settings in YaST to ensure the network Wake On LAN (WOL) is turned off. This is what shuts down the ethernet connection.

Ref: http://www.lesswatts.org/tips/ethernet.php
ollie
Moderator
 
Posts: 2749
Joined: Mon Jul 25, 2005 11:26 am
Location: Bathurst NSW Australia

Postby OnlyTheTony » Mon Jul 12, 2010 12:47 pm

Thanks for your answers guys.

It turns out it was much simpler(?).

Intel motherboards have a BIOS-based "lights out" management system, "Intel ME", which prioritises the onboard LAN adapter - for obvious reasons. Once I disabled "Intel ME" and switched the onboard LAN off it worked a treat. The connection has been fine ever since!

Can't believe I wasted a week trying to fix that!!!

Postby OnlyTheTony » Tue Jul 13, 2010 12:51 pm

Okay.. I was wrong.

Even updating the driver to 1.1.2 (1.2.8 wouldn't build) hasn't solved the problem. Shortly after network connections are removed (either an IMAP connection or NFS) the server's network connection just drops. There's nothing in the syslogs or even dmesg. Only re-establishing a connection from another desktop machine restarts it!

Dutch_Master - thanks for your input but it's already running on a static IP and I've updated the kernel to the latest one in the repos.

The network hardware is the same as I used under opensuse 10.3 - so it's either an Ubuntu bug or a problem with the motherboard (which is nearly 2 years old so it's possible).

I'm open to any other suggestions here!!
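Since nothing shows up in the server's own logs, one option is to watch it from another machine and record when it stops answering. The following is only a rough sketch under assumptions: the names `is_reachable` and `watch`, the host address and the port are all hypothetical, and any service the server listens on (IMAP, SSH, HTTP) could be probed.

```python
import socket
import time


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def watch(host, port, checks=10, delay=30.0):
    """Probe repeatedly; return (time, state) for every change of state."""
    changes = []
    last = None
    for i in range(checks):
        state = is_reachable(host, port)
        if state != last:
            changes.append((time.strftime("%H:%M:%S"), state))
            last = state
        if i < checks - 1:
            time.sleep(delay)
    return changes


if __name__ == "__main__":
    # Hypothetical address and port; substitute the server's real ones.
    for stamp, up in watch("192.168.0.10", 143, checks=3, delay=5.0):
        print(stamp, "reachable" if up else "NOT reachable")
```

Bear in mind that the probe is itself network traffic, so in a case like this one it may keep the link alive and hide the symptom. Running it from a wireless client, which reportedly got no response while the server was "asleep", is probably the more telling test.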

Postby wyliecoyoteuk » Tue Jul 13, 2010 2:07 pm

Could be a keepalive issue?
http://en.wikipedia.org/wiki/Keepalive
The sig between the asterisks is so cool that only REALLY COOL people can even see it!

*************** ************
wyliecoyoteuk
LXF regular
 
Posts: 3452
Joined: Sun Apr 10, 2005 10:41 pm
Location: Birmingham, UK

Postby OnlyTheTony » Wed Jul 14, 2010 2:57 pm

It could be.

I've been doing some research and there were a lot of threads about the e1000e driver closing the connection - as of yet nobody's posted a solution.

To see whether it's an OS or driver issue I've deactivated the (expensive) Intel adapter and I'm trying the onboard gigabit LAN to see if that maintains the connection overnight.

If it is a keepalive issue what would I look for and how could I get around it?
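For what it's worth, TCP keepalive is something an application switches on per socket; the kernel then sends periodic probes on an idle connection so that a dead peer gets noticed. The sketch below is illustrative only: the helper name `enable_keepalive` and the timing values are assumptions, and the three tuning options are Linux-specific, so they are skipped where the platform lacks them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on an existing socket.

    idle     -- seconds of silence before the first probe
    interval -- seconds between probes
    count    -- unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

On Linux the system-wide defaults live under /proc/sys/net/ipv4/ (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes). The default idle time is normally 7200 seconds, so keepalive would only matter here if something between the machines forgets idle connections sooner than that.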

Postby nelz » Wed Jul 14, 2010 3:16 pm

Why are you using an external driver and not the e1000e driver in the kernel?
"Insanity: doing the same thing over and over again and expecting different results." (Albert Einstein)
nelz
Site admin
 
Posts: 8469
Joined: Mon Apr 04, 2005 11:52 am
Location: Warrington, UK

Postby OnlyTheTony » Wed Jul 14, 2010 7:08 pm

Because the e1000e driver in the kernel was outdated so I installed a new version in the hope it would solve the timeout problem.

It didn't.

The thing is, the problem still exists whether I use the Intel LAN adapter or the Realtek onboard - which makes me suspect it's an issue with Ubuntu rather than any of the hardware or drivers.

I'm considering ditching Ubuntu for something like CentOS to see if this removes the problem.

It's a total pain - I never had this problem under opensuse 10.3 - the only reason I "upgraded" was because the install failed and I thought I'd take the opportunity to rebuild. I wish I hadn't.

If it's any help - dmesg returns:

[37402.040022] NETDEV WATCHDOG: eth3 (r8169): transmit queue 0 timed out

and the last few lines are:

[37402.080072] r8169: eth3: link up
[37426.080071] r8169: eth3: link up
[37456.080069] r8169: eth3: link up
[37498.080070] r8169: eth3: link up
[37540.080057] r8169: eth3: link up
[37582.080064] r8169: eth3: link up
[37624.080063] r8169: eth3: link up
[37666.080065] r8169: eth3: link up
[37708.080068] r8169: eth3: link up
[37750.080066] r8169: eth3: link up

As you can see it's not reporting the link as being down - just constantly coming back up.
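To put numbers on how often the link is "coming back up", the timestamps in the dmesg lines above can be turned into intervals. This is just a small sketch: the function name `link_up_intervals` is made up for illustration, and the sample text is the ten "link up" lines copied from the output above.

```python
import re

# The "link up" lines quoted from dmesg above.
DMESG_SAMPLE = """\
[37402.080072] r8169: eth3: link up
[37426.080071] r8169: eth3: link up
[37456.080069] r8169: eth3: link up
[37498.080070] r8169: eth3: link up
[37540.080057] r8169: eth3: link up
[37582.080064] r8169: eth3: link up
[37624.080063] r8169: eth3: link up
[37666.080065] r8169: eth3: link up
[37708.080068] r8169: eth3: link up
[37750.080066] r8169: eth3: link up
"""

# Matches "[seconds.micros] driver: iface: link up".
LINK_UP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\S+): (\S+): link up\s*$")


def link_up_intervals(text, iface="eth3"):
    """Return whole-second gaps between successive 'link up' events on iface."""
    times = []
    for line in text.splitlines():
        match = LINK_UP.match(line)
        if match and match.group(3) == iface:
            times.append(float(match.group(1)))
    return [round(later - earlier) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(link_up_intervals(DMESG_SAMPLE))
    # -> [24, 30, 42, 42, 42, 42, 42, 42, 42]
```

After the first two events the link is reported up every 42 seconds, which looks like something repeating on a timer rather than random drops. What that timer is, the output alone doesn't say.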

My internet connection keeps dropping for some weird reason and I'm also wondering if the two are linked. I'm frustrated and confused.

Postby wyliecoyoteuk » Wed Jul 14, 2010 7:42 pm

There seem to be a lot of posts saying that this is an APIC-related bug.

https://bugs.launchpad.net/ubuntu/+sour ... bug/574281

Postby OnlyTheTony » Wed Jul 14, 2010 7:59 pm

Wylie, doing more digging I've come across that too. I've added "noapic" to the boot parameters and restarted. I've also reinstated the Intel card - I'll see how that goes....

Edit: Further research indicates that kernel 2.6.34 doesn't have this problem - so just waiting for that to hit the repos now.
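After adding a boot parameter such as "noapic" it is worth confirming that the running kernel actually picked it up, since an edit to the bootloader config that never got regenerated is easy to miss. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper names are hypothetical.

```python
def has_boot_param(cmdline, param):
    """Return True if param appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    active = has_boot_param(running_cmdline(), "noapic")
    print("noapic active" if active else "noapic NOT on the kernel command line")
```

Splitting on whitespace rather than using a substring search avoids false matches against longer parameters that merely contain the same letters.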

Postby OnlyTheTony » Thu Jul 15, 2010 10:38 am

Still doing it!

I've decided to back up everything and switch distros to CentOS over the weekend, as the errors I'm getting on Ubuntu don't seem to be present on that!

Fingers crossed...

Postby OnlyTheTony » Sun Jul 18, 2010 10:37 am

After 2 x motherboards, 2 x distros and several nights of sitting around until 2am sobbing I may have found the culprit.

It was nothing to do with the server at all - it would appear to be a problem with my desktop machine. The LAN was always activated and didn't appear to be sending a disconnect signal to the server, which consequently hung waiting for a response.

I've replaced the onboard LAN with a PCIe x4 dual Marvell LAN adapter - let's see if this solves the problem.

Oh, and I went back to Ubuntu because CentOS, whilst good, was far too slow on my hardware.

Postby OnlyTheTony » Tue Jul 20, 2010 9:01 am

*****SOLVED*****

It turns out that it was the LAN adapter on the desktop PC that was causing the problem. I've had no trouble with network connectivity since swapping the onboard LAN for a PCIe card.

Thanks to everyone who offered potential solutions.