Arc Forum
1 point by akkartik 5171 days ago | link | parent

Update: I've been running 4.2.1 for several hours, and I haven't been able to reproduce either Bug A or Bug C (from http://akkartik.name/blog/2010-03-02-01-45-31-soc)


1 point by akkartik 5171 days ago | link

Correction: that's v4.2.1 on 32-bit Fedora 8 on EC2.

-----

1 point by akkartik 5171 days ago | link

Argh, my new server died. So Bug C is alive and well even though the Scheme sample I isolated works.

-----

1 point by aw 5171 days ago | link

Not sure if I'm quite following you. I think you're saying that you're getting segfaults on your new server even though your sample regex test is working; but if that's true, how do you know that it's Bug C, that the segfaults are coming from the regexes?

-----

1 point by akkartik 5171 days ago | link

It's dying in the same phase as before. I emit prints before and after the regex-heavy phase. That's where it's dying. The error is still "SIGSEGV fault on 0x10000084"[1]

I mean, clearly I'm not capturing some aspect of the bug since it's not showing up in the sample test. It seems simpler to assume there's one bug in PLT Scheme rather than two, and that we just don't fully understand it.

[1] update: I suppose that's just some error handler

-----

1 point by aw 5171 days ago | link

If you want to pursue this, here are some steps...

If your server dies when running live, you can get it to die yourself by feeding it the same input as it gets from your browser and/or your users.

Now you have a large and complex program that you can get to segfault by feeding it some large and complex set of data.

Then you can run the same program with the same data in the same version of MzScheme on your personal computer, and see if it segfaults there as well. (If not, perhaps it is an OS library problem).

If it also segfaults on your computer, that's good news: you're on your way to tracking it down.

You can ask on the PLT mailing list what debugging information they'd like to see to track down a segfault.

And, if you feel like it, you can try removing chunks of your program to find a smaller test case that triggers the segfault.

If you can get it to segfault with chunks of code that are non-proprietary to you, that you don't mind publishing, then you can even post a link to a zip or a tar containing the code and the input data and instructions on how to trigger the segfault. Then it would be possible for one of the PLT people to find their bug, even if you have to apologize that your test case is so giant because you couldn't find a smaller test case to trigger the bug.
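
A rough sketch of the "removing chunks" step, in Python only for illustration. The name `shrink` and the predicate `still_crashes` are hypothetical: the predicate stands in for whatever rebuilds the program from the remaining chunks, runs it, and reports whether it still segfaults. For a crash that only shows up sometimes, the predicate would probably need to run the program several times before answering.

```python
from typing import Callable, List, Sequence


def shrink(chunks: Sequence[str],
           still_crashes: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop chunks for as long as the crash still reproduces.

    `still_crashes` is a hypothetical callback supplied by the caller.
    """
    current = list(chunks)
    size = max(len(current) // 2, 1)  # start by trying to drop big pieces
    while True:
        i = 0
        removed_any = False
        while i < len(current):
            # Try the program without the chunks in [i, i + size).
            candidate = current[:i] + current[i + size:]
            if still_crashes(candidate):
                current = candidate   # still crashes: keep the smaller version
                removed_any = True
            else:
                i += size             # these chunks matter: move past them
        if size > 1:
            size //= 2                # retry with finer-grained removals
        elif not removed_any:
            return current            # nothing more can be removed one at a time
```

This is a simplified greedy reduction, not a full delta-debugging implementation, and the result is only as trustworthy as the predicate.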

Alternatively, if this regex code is something you could push off to do in another process (such as by shelling out to Perl to do the regex stuff for you), you could try that.
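
To make the shelling-out idea concrete, here is a minimal sketch in Python (not Arc, and not the hack mentioned elsewhere in this thread). The helper name `perl_matches` is made up for the example, and it assumes `perl` is on the PATH and that input lines contain no embedded newlines. The point is only that the regex work happens in a child process, so a crash there surfaces as an error in the parent rather than taking it down.

```python
import shutil
import subprocess
from typing import List

# Perl one-liner: take the pattern from the first argument and print
# every line from stdin that matches it.
PERL_FILTER = 'my $re = shift @ARGV; while (my $l = <STDIN>) { print $l if $l =~ /$re/ }'


def perl_matches(pattern: str, lines: List[str]) -> List[str]:
    """Return the lines matching `pattern`, matched by a separate Perl process."""
    perl = shutil.which("perl")
    if perl is None:
        raise FileNotFoundError("perl not found on PATH")
    result = subprocess.run(
        [perl, "-e", PERL_FILTER, pattern],
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # a failing or crashing child raises CalledProcessError
    )
    return result.stdout.splitlines()
```

Starting a process per call is slow, so in practice you would probably batch a whole import's worth of lines into one call.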

-----

1 point by akkartik 5171 days ago | link

You know, those steps were approximately what I've been trying to follow :) I thought I had boiled it down to a test case, but now I'm betrayed by it. Betrayed! :)

I don't really have a problem posting the code, but it'll be a huge ball of mud since it includes Arc as well. I'm not sure the PLT folks will want to mess with that to find a bug that only occurs sometimes and may well be Arc's fault.

It doesn't seem as easy as replaying a request. I encounter the error during a periodic data import phase in my server. It's not being triggered by a specific request. And it doesn't happen every time the program goes into that phase. Sometimes I see the segfault 6 times a day (the phase runs once an hour), and sometimes everything's peachy for 3 days.

Shelling out to Perl is a good idea. I think I'll try using your hack for that.

-----

1 point by aw 5171 days ago | link

If you do the exact same data import with the program in the same initial state, will it segfault again? (You may need to add code to record e.g. the contents of your data files prior to the import and what exactly is being imported).

Or, if you start in a known state and run fifty data imports in a row, can you get it to segfault?
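
Something like the following could drive the "fifty imports in a row" experiment. It is a sketch in Python with made-up names (`first_segfault`), and it assumes the command you pass resets the program to its known state and performs one import each time it is run. It also assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code. If the runtime catches the fault itself and exits normally, you would have to look at its output instead.

```python
import signal
import subprocess
import sys
from typing import List, Optional


def first_segfault(cmd: List[str], runs: int = 50) -> Optional[int]:
    """Run `cmd` up to `runs` times; return the 1-based run that died of SIGSEGV.

    Returns None if no run was killed by SIGSEGV.
    """
    for attempt in range(1, runs + 1):
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        # On POSIX, a negative return code is the number of the killing signal.
        if proc.returncode == -signal.SIGSEGV:
            return attempt
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python repeat.py ./run-one-import.sh
    hit = first_segfault(sys.argv[1:])
    print("no segfault seen" if hit is None else f"segfault on run {hit}")
```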

Arc is not supposed to be able to segfault MzScheme. To MzScheme, Arc is just a big program written in Scheme. And MzScheme running Scheme programs is not supposed to segfault.

(If you aren't using the C foreign function interface, of course. A messed up pointer can cause unrelated code to blow up later. So the above paragraph is true if you are only using libraries which are either written in Scheme [and/or Arc, since Arc is written in Scheme] or part of the official MzScheme distribution).

It's true that the PLT folks certainly don't want to be debugging your Arc program, but they do want to be able to get MzScheme to work. And the way to enable them to do that is to give them a test case that they can run to see the segfault. If you can. Even if it is thousands of lines of code and megabytes of data. If you give them a shell script "diedie" that runs a MzScheme program (that runs Arc and loads your program and imports your data) that segfaults, then you make it possible for them to find out what is causing the segfault.

-----

1 point by akkartik 5171 days ago | link

"If you aren't using the C foreign function interface, of course."

Yeah, that phase actually performs stemming as well using the FFI. For a long time that was my prime suspect, but I have never been able to get stemming to segfault in isolation, either within Arc or from a C program. And I have been able to get my program to segfault without the stemming.

You're right, I'll take another stab at a test case. And I'll not try so hard to make it tiny.

-----

1 point by akkartik 5171 days ago | link

You know, that was one reason I was inclined to move to v372: version 4 is a huge change. If PLT seems less mature now because of these errors, I wanted to treat pre-4.0 and post-4.0 as separate beasts.

But I gave up on porting json.ss to v372 (it's probably something really simple; I'll mail you a file just FYI) and now I'm going to go fix non-stability issues on readwarp for a couple of days.

-----