ENOLCK errors on dib


#1

Hello.

I am using Procmail with Bogofilter to handle my incoming mail. Bogofilter works by (1) opening a wordlist BDB file, (2) locking the file (fcntl() with F_SETLK, apparently), and (3) mumble mumble. Lately, this has been failing in step 2. I’m seeing errors in my procmail log like

----(cut here)----

...
From bounce-indiv-skunk=iskunk.org@craigslist.org Mon Jun 21 15:39:01 2004
 Subject: (rooms & shares) 600 - Sunny Room in 4BR/1BA near Coolidge Corner (
  Folder: /dev/null 2275
Can't open file 'wordlist.db' in directory '/home/iskunk/.bogofilter'.
error #37 - No locks available.
procmail: Program failure (3) of "/home/iskunk/bin/bogofilter"
procmail: Rescue of unfiltered data succeeded
From bounce-indiv-skunk=iskunk.org@craigslist.org Mon Jun 21 15:40:27 2004
 Subject: (rooms & shares) 600 - JULY 1! Great room, Near Kendall (Cambridge
  Folder: /home/iskunk/Maildir/new/1087857631.12053_3.plunder 1562
...

----(cut here)----

The machine seems to be running out of file locks. (Errno 37 corresponds to ENOLCK, “No record locks available.”)
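For readers who want to reproduce step 2 outside Bogofilter, here is a minimal Python sketch of a non-blocking fcntl() record lock that reports ENOLCK separately from ordinary contention. The function name `try_lock` and the returned strings are made up for illustration; this is not Bogofilter's own code, only an approximation of the same system call.

```python
import errno
import fcntl
import os


def try_lock(path):
    """Attempt a non-blocking exclusive fcntl() lock on path.

    Returns a short status string (hypothetical labels, not Bogofilter's).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # lockf() wraps fcntl() record locking; LOCK_NB makes it use F_SETLK
        # (fail immediately) rather than F_SETLKW (wait for the lock).
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno == errno.ENOLCK:
            # Lock table full, or (per fcntl(2)) a remote locking failure.
            return "no locks available"
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return "held by another process"
        raise
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return "locked"
    finally:
        os.close(fd)
```

Running `try_lock()` against a file in the affected home directory during a failure window would show whether the problem is specific to Bogofilter or hits any fcntl() lock on that filesystem.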

I believe this may be a system configuration issue; is there a way to prevent this from coming up?

–iskunk


#2

We have mostly gotten the fcntl() locking problems under control, as far as I know. However, the safest solution is probably to find the patch that uses dotlocking instead. (I believe that such a patch exists, or existed at one point, and I know that at least one of our users reported success after applying it.)

Are you seeing this error consistently, or just sometimes?

I sent a quick couple of test messages on the mail machine itself, and they went through just fine (no errors in the procmail log) - and running bogofilter by hand on them worked ok too.

plunder:~$ ./bin/bogofilter -v < /home/iskunk/Maildir/new/1088641156.21936_3.plunder
X-Label: ham, score=0.009469

I also did an strace on this, and verified that the fcntl() function seems to be working properly. The problems we were seeing before were intermittent and hard to track down, though.

Removing the lock on the procmail recipe that invokes bogofilter doesn’t help, does it?

When I used bogofilter, I just did:
:0fw
| [bogofilter commands]


#3

The error comes up very intermittently. Once on the 21st, and then a few times today (Wednesday) in the late morning (between 10:20am and 11:30 or so, your time). It’s annoying mainly because the few times that Bogofilter fails like this, spam always gets through :-]

Procmail’s lockfile won’t affect this; there, the program’s not using a kernel lock record. (It does do some magic when creating the lockfile, but nothing that would fail in this context.)

You’re saying there’s a patch for Bogofilter that makes it use a lockfile to lock the database, instead of fcntl()? That would be fantastic, though Google’s not turning up any hits; would you have a link for that?
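To make the idea concrete, dot-locking in general looks roughly like the Python sketch below: a `.lock` file is created exclusively next to the data file, and removing it releases the lock. This is not the Bogofilter patch in question (which has not been located); `acquire_dotlock`, `release_dotlock`, and the retry parameters are hypothetical names. It also assumes `O_EXCL` is atomic on the filesystem, which is not guaranteed on older NFS versions, so real tools tend to use a more careful link()-based scheme.

```python
import os
import time


def acquire_dotlock(path, retries=5, wait=1.0, sleep=time.sleep):
    """Create path + '.lock' exclusively; return the lock path, or None on failure."""
    lock_path = path + ".lock"
    for attempt in range(retries):
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            if attempt < retries - 1:
                sleep(wait)  # someone else holds the lock; wait and retry
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))  # record the holder for debugging
        return lock_path
    return None


def release_dotlock(lock_path):
    """Release the lock by removing the lock file."""
    os.unlink(lock_path)
```

No kernel lock record is involved here, which is why this approach cannot fail with ENOLCK; the trade-off is that a crashed holder leaves a stale lock file behind.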

–iskunk


#4

I checked the customer’s support history and couldn’t find it. I’ll ask him if he remembers.

If the error is happening that intermittently, it may not be worth fixing (the message should still get delivered).


#5

I’m going on the (perhaps unfounded) assumption that the fix would amount to bumping up some kernel sysctl, or (at worst) a compile-time parameter. The last time DH ran into this problem, how was it addressed?

These lock droughts don’t come up terribly often, fortunately, but when they do, they hit doggedly. I told you that it previously happened on the 21st, but what I didn’t mention was that it hit about twenty times between the 19th and 20th. The droughts aren’t short-lived, either. I recently revised my .procmailrc to retry Bogofilter after 10, 40, and 100 seconds in case of error—and every time it has failed since, it has failed a straight four times in a row.
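The retry schedule described above can be sketched in Python as follows. This is only an illustration of the logic, not the actual .procmailrc rules; `run_with_retries` and `DEMO_FAILING_CMD` are hypothetical names, and treating exit status 3 as the error case is taken from the "Program failure (3)" line in the procmail log.

```python
import subprocess
import sys
import time

# Stand-in for the real filter command: a child process that always exits 3.
DEMO_FAILING_CMD = [sys.executable, "-c", "raise SystemExit(3)"]


def run_with_retries(cmd, delays=(10, 40, 100), failure_status=3, sleep=time.sleep):
    """Run cmd once, then retry after each delay while it exits with failure_status.

    Returns (exit status of the last attempt, number of attempts made).
    """
    attempts = 0
    status = None
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        attempts += 1
        status = subprocess.run(cmd).returncode
        if status != failure_status:
            break  # anything other than the error status ends the retries
    return status, attempts
```

With three delays, a persistent failure produces four attempts in total, which matches the "four times in a row" pattern. A real filter would also need the message fed to it again on each attempt, which this sketch leaves out.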

–iskunk


#6

My understanding is that the fix involved upping some resource limits on the machine. Unfortunately, the customer I was thinking of doesn’t recall patching bogofilter - so I may be thinking of someone else (or I may be mistaken entirely). If I have time, I’ll do some more extensive poking around through support histories…

I have passed this thread on to the rest of our team; we’ll try to look into the file locking problems in more detail.


#7

Kind thanks for having them follow this up. One last point worth mentioning:

I’ve noticed that dib runs a security-hardened kernel, and that some effort has been put into isolating users from each other (not least w(1) and ps(1) only showing information for oneself). Here, it does seem that the kernel has a fixed [though manually adjustable] number of lock records to be shared between all users. This scenario is ripe for a local DoS attack. It may be worth looking into some way of limiting each user’s individual usage of this resource.

(I don’t think the problems I’ve encountered are the result of a deliberate DoS; more likely, someone’s carelessly-written script, etc., was inadvertently doing the same. Either way, I’ll wager that when the locks run out, it’s because one user on the system is holding >90% of them.)

Thanks again, and keep up the good work.

–iskunk


#8

Just for the record, NONE of the stuff we’re talking about is happening on your user machine (and we don’t run the grsec kernels on the mail machines). I’m not an expert on this stuff, but I would think the general process and file descriptor limits for users would probably limit damage at least somewhat.

What particular kernel parameters are you concerned about?


#9

My bad; should’ve remembered that dib doesn’t do everything :-] (Granted, it could always be someone else’s badly-behaved .procmailrc running on the mail server…)

I think you’ve got a point there, w.r.t. ulimits. I don’t suppose you can have more than one lock per open file descriptor, and file descriptors are definitely regulated. So if the kernel allows at least as many locks as fds (makes sense, eh?), then you should never hit ENOLCK before hitting EMFILE/ENFILE on your open() call… *scratches head*

I don’t know enough about this issue, and I haven’t had much luck finding useful information on it, so there’s not much more I can say—other than that I’d love to find out what sysctls/parameters/etc. will make it go away. (About the only tidbit I could find is that /proc/locks is basically the system lock table. I might just try adding a rule that saves a copy whenever Bogofilter fails…)

–iskunk


#10

I think I’ve found what’s eating up all the locks on the mail server: Postfix itself.

I did what I’d mentioned earlier: added a procmail rule that would make a copy of /proc/locks whenever Bogofilter failed. The fifth column of information there (if you skip any “->” tokens) indicates the PID of the lock holder; ps(1) identified the postfix user as the owner of most of the listed processes.
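The per-PID tally can be sketched in Python along these lines. The function name `locks_per_pid` is hypothetical, and the `SAMPLE` text is made-up data in the /proc/locks layout, not a copy of the logged files.

```python
from collections import Counter


def locks_per_pid(text):
    """Count lock records per PID in /proc/locks-style text.

    Rows marked with "->" are blocked requests; the marker is dropped so
    that the PID is always the fifth field, and those rows are counted too.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = [f for f in line.split() if f != "->"]
        if len(fields) < 5 or not fields[4].lstrip("-").isdigit():
            continue  # skip blank or malformed lines
        counts[int(fields[4])] += 1
    return counts


# Invented sample in the /proc/locks layout, for illustration only.
SAMPLE = """\
1: POSIX  ADVISORY  WRITE 2101 03:02:48213 0 EOF
2: POSIX  ADVISORY  READ  2101 03:02:48213 0 EOF
3: FLOCK  ADVISORY  WRITE 877 03:02:1199 0 EOF
3: -> FLOCK  ADVISORY  WRITE 912 03:02:1199 0 EOF
"""
```

Calling `locks_per_pid(open("/proc/locks").read()).most_common(10)` on a Linux machine would list the ten PIDs with the most lock records, which can then be passed to ps(1).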

Which is curious, as I thought that DH used maildir inboxes partly to avoid the need for file locking. (Granted, I can’t tell the filename of what was being locked; only the inode number—a couple of inodes did have more than 30 locks held on them, so it could be e.g. a global database file or the like.) A bit of Googling turned up pages like this one, which hint that Postfix can use dot-locking instead of standard fcntl() (though only if the data being locked is not on NFS). Might there be room for a solution in Postfix’s configuration?

FYI: Have a look at ~iskunk/adm/failure.*.{locks,processes} to see what I logged. (These were generated by ~iskunk/bin/log_lock_drought, which is run from my .procmailrc.) This page describes the /proc/locks format. The *.processes file is the output of “ps uww” invoked on all PIDs listed in *.locks.

–iskunk


#11

Maybe the problem is due to a large number of users not using Postfix’s builtin LDA (i.e., users with Procmail filters set up)?

I will look into this…


#12

Hi Will,

Were you able to find anything new on this? The problem has recurred several more times (esp. Wednesday and this afternoon, if you look in ~iskunk/adm/).

–iskunk


#13

I talked this over with Jason, and we noticed a few things:

  1. The number of locks isn’t that high - the range seems to be from 32 to 250 (with only a couple cases where there are over 200 locks) when you’re seeing these failures. Also, it’s not consistent - if you were bumping up against some limit, I think the number of existing locks would be at least somewhat consistent at the time of failure.

  2. If you read the fcntl(2) man page, you’ll see that ENOLCK can also refer to “remote locking protocol failed”. I suspect that this is more likely the problem. This could be due to a network problem of some sort, or the NFS file locking problems mentioned elsewhere in this thread.

We are doing some general kernel upgrades on the mail machines, and new ulimits should take effect when these reboots happen as well. I /think/ these two things may help a little.

That said, we don’t guarantee that NFS file locking will always work properly - that’s why we use Maildir in the first place. So you’re welcome to continue doing what you’re doing, but I don’t think we can really help you too much beyond that. The information you’ve provided is very helpful / useful, and we’ll certainly look into these problems when we have time, but I can’t promise anything beyond that.


#14

[Sorry I didn’t reply sooner; was out of town.]

Argh, yes, an NFS locking problem sounds much more plausible than a local one. I’ll keep an eye on how the kernel upgrade(s) affect this.

Do you know what it is that Postfix is locking, by the way? I don’t suppose it’s the user mailboxes, which makes me wonder what else would need so many locks that NFS can’t handle it.

At any rate, I’ll be happy to leave the current logging mechanism in place, and you are of course welcome to continue reviewing the logs it produces. (There has been, alas, no shortage of them lately…) Thank you for looking into this, and whatever the outcome, NFS is already a difficult enough beast to deal with. I’ll be happy if DH puts up a good fight :-)

–iskunk


#15

I don’t think it’s the number of locks that’s the problem. Given the number of users on our mail machines, the number of locks you’re seeing is pretty small.

Could be problems with lockd; could be general resource usage problems somewhere…

Also, remember that the same home directories are being mounted by the user machines as well, so if it /were/ a problem based on the number of locks, it could well be related to another machine.


#16

It looks like there are some newer versions of lockd, statd, etc. that we can try that may fix these… I believe this has worked for us on other machines. We’ll be trying that out in the next few days, hopefully.


#17

After a long search, I finally found this thread. I have the exact same problem, using bogofilter.
The error: Can’t open file ‘wordlist.db’ in directory …
error #37 - No locks available.
Has anyone found a fix yet?