The strangest crash I’ve seen in a while

Last week, I wrote about a new logging infrastructure I built to track down
a bug. In the original post, I warned that logging wasn’t a panacea – that it
could cause more trouble than it solved. I was more right than I knew, as I
added a bit of logging which led inexorably to a crash by JS_ASSERT failure.
(For those who don’t know what a JS_ASSERT is, it’s an assertion inside the
JavaScript Engine that all is well. Since JavaScript is one of the primary
languages of the World Wide Web, these assertions are really important. If
JavaScript is misbehaving, it can’t be trusted. Fail one such assertion, and
the program aborts, forcing the developer to take notice.)

At first, the crash didn’t make a damned bit of sense. I prefer to write JavaScript,
not just because I entered into software engineering (and Mozilla development)
through JavaScript, but also because it’s super-stable. In fact, the only new
code I was introducing was in JavaScript. It shouldn’t crash. Through
an unlikely chain of events, it does.

This article details how I was able to analyze the crash itself, how logging
set up the conditions for the crash, and how other logging ultimately
proved that the crash resulted from a logging call. I’ve filed this as bug 543304. Ultimately, reducing
this test case to a usable bug report will not be easy – but worth it. More details in the extended entry. (Warning: technobabble content ratio is high. If you do not want to see how this
developer thinks and debugs Mozilla code, do not read on.)

Where this all began

First, I’ll show you the code I wrote, and what I was trying to catch:

const serial = C_c["@mozilla.org/xmlextras/xmlserializer;1"]
.createInstance(C_i.nsIDOMSerializer);
function serializeNode(aNode)
{
var node = aNode.cloneNode(true);
return serial.serializeToString(node, "application/xml");
}
/* Much later... */
// xeIMarkupTemplate
showTemplate: function showTemplate(aNode)
{
dump("Entering showTemplatenn");
// ...
var panel = this.getUserInterfacePanel();
getLoggingService().log("Templates", C_i.xeILoggingService.INFO,
"Panel retrieved", serializeNode(panel));
// This should force <markup:children/> elements to take XBL bindings.
aNode.ownerDocument.documentElement.appendChild(panel);
// ...
},

Basically, I built a <xul:panel/> element, and I wanted to see what
its contents were. The panel had as its only child one of my
<markup:ui-section/> elements. Then I would append the panel to the
document, and go on with other tasks. The logging statement was so I could see
what the panel’s contents were.

That was the plan.
Unfortunately, it
didn’t frakking happen
.

Stack trace (read from the bottom upwards)

Notes Stack line
*crunch* KernelBase.dll!_DebugBreak@0() + 0x2 bytes
Wait, you’re in garbage collection? js3250.dll!JS_Assert(const char * s=0x6f3e6478, const char *
file=0x6f3e6444, int ln=110) Line 65C++
js3250.dll!js_FillPropertyCache(JSContext * cx=0x04120970, JSObject *
obj=0x05143120, unsigned int scopeIndex=0, unsigned int protoIndex=0,
JSObject * pobj=0x05143120, JSScopeProperty * sprop=0x06149a58, int
adding=0) Line 110 + 0x24 bytesC++
js3250.dll!js_GetPropertyHelper(JSContext * cx=0x04120970, JSObject *
obj=0x05143120, int id=59513756, int cacheResult=1, int *
vp=0x002fa900) Line 4274 + 0x1d bytesC++
js3250.dll!js_Interpret(JSContext * cx=0x04120970) Line 1520 + 0x23
bytesC++
js3250.dll!js_Invoke(JSContext * cx=0x04120970, unsigned int argc=1,
int * vp=0x0613f3c8, unsigned int flags=0) Line 1368 + 0x9 bytesC++
Execute JS for the XTF element xpc3250.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS *
wrapper=0x060cdc28, unsigned short methodIndex=9, const
XPTMethodDescriptor * info=0x04421ce0, nsXPTCMiniVariant *
nativeParams=0x002fadc8) Line 1696 + 0x1b bytesC++
xpc3250.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex=9,
const XPTMethodDescriptor * info=0x04421ce0, nsXPTCMiniVariant *
params=0x002fadc8) Line 571C++
xpcom_core.dll!PrepareAndDispatch(nsXPTCStubBase * self=0x081f29e0,
unsigned int methodIndex=9, unsigned int * args=0x002fae88, unsigned
int * stackBytesToPop=0x002fae78) Line 114 + 0x21 bytesC++
xpcom_core.dll!SharedStub() Line 142C++
XTF element is an orphan… gklayout.dll!nsXTFElementWrapper::UnbindFromTree(int aDeep=0, int
aNullParent=1) Line 231C++
gklayout.dll!nsAttrAndChildArray::Clear() Line 664C++
gklayout.dll!nsAttrAndChildArray::~nsAttrAndChildArray() Line
135C++
gklayout.dll!nsGenericElement::~nsGenericElement() Line 1792 + 0xb
bytesC++
gklayout.dll!nsXULElement::~nsXULElement() + 0x1a bytesC++
gklayout.dll!nsXULElement::`scalar deleting destructor'() + 0xf
bytesC++
No more XUL element! gklayout.dll!nsNodeUtils::LastRelease(nsINode * aNode=0x06701b98)
Line 272 + 0x21 bytesC++
gklayout.dll!nsGenericElement::Release() Line 4155 + 0xc8
bytesC++
XUL element refcount– gklayout.dll!nsXULElement::Release() Line 369 + 0xd bytesC++
xpc3250.dll!XPC_SWN_Finalize(JSContext * cx=0x04120970, JSObject *
obj=0x05143240) Line 1535 + 0xe bytesC++
js3250.dll!FinalizeObject(JSContext * cx=0x04120970, JSObject *
obj=0x05143240) Line 3190 + 0x10 bytesC++
js3250.dll!js_GC(JSContext * cx=0x04120970, JSGCInvocationKind
gckind=GC_NORMAL) Line 3622 + 0xd bytesC++
JS Garbage Collection js3250.dll!JS_GC(JSContext * cx=0x04120970) Line 2439 + 0xb
bytesC++
xpc3250.dll!nsXPConnect::Collect() Line 477 + 0xa bytesC++
xpcom_core.dll!nsCycleCollector::Collect(unsigned int
aTryCollections=1) Line 2434 + 0x19 bytesC++
xpcom_core.dll!nsCycleCollector_collect() Line 3129 + 0x16
bytesC++
gklayout.dll!nsJSContext::CC() Line 3578 + 0x6 bytesC++
Cycle collection starts gklayout.dll!nsJSContext::IntervalCC() Line 3667C++
gklayout.dll!nsUserActivityObserver::Observe(nsISupports *
aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t *
aData=0x00000000) Line 269C++
xpcom_core.dll!nsObserverList::NotifyObservers(nsISupports *
aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t *
someData=0x00000000) Line 131C++
xpcom_core.dll!nsObserverService::NotifyObservers(nsISupports *
aSubject=0x00000000, const char * aTopic=0x6e50fcb0, const wchar_t *
someData=0x00000000) Line 185C++
gklayout.dll!nsUITimerCallback::Notify(nsITimer * aTimer=0x02f34d80)
Line 277C++
xpcom_core.dll!nsTimerImpl::Fire() Line 431C++
xpcom_core.dll!nsTimerEvent::Run() Line 521C++
xpcom_core.dll!nsThread::ProcessNextEvent(int mayWait=1, int *
result=0x002ff17c) Line 527 + 0x19 bytesC++
xpcom_core.dll!NS_ProcessNextEvent_P(nsIThread * thread=0x008d4e88,
int mayWait=1) Line 250 + 0x16 bytesC++
gkwidget.dll!nsBaseAppShell::Run() Line 170 + 0xc bytesC++
tkitcmps.dll!nsAppStartup::Run() Line 182 + 0x1c bytesC++
xul.dll!XRE_main(int argc=5, char * * argv=0x00a6c890, const
nsXREAppData * aAppData=0x008d0118) Line 3505 + 0x25 bytesC++
firefox.exe!NS_internal_main(int argc=5, char * * argv=0x00a6c890)
Line 158 + 0x12 bytesC++
firefox.exe!wmain(int argc=5, wchar_t * * argv=0x00a6f590) Line 120 +
0xd bytesC++
firefox.exe!__tmainCRTStartup() Line 583 + 0x19 bytesC
firefox.exe!wmainCRTStartup() Line 403C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes

I had added a bunch of logging at the same time, and I couldn’t figure out
what was causing it. I kept looking at my XTF element, because I think bugs are
most likely my fault. Several days later, I looked a few levels deeper and
spotted the nsXULElement::~nsXULElement line… which was my first
clue.

That line indicates a destructor, which JavaScript authors rarely
have to think about. For C++ authors, this is routine: you have a function that
potentially cleans up after anything you left hanging around in memory as your
object goes away. But why would the destructor be executing? That means
nobody’s holding any references to it: the “reference count” (which most XPCOM
objects depend on) was zero, indicating no one needed the object. Hence why the
“cycle collection” and garbage collection was destroying it.

Now, having the debugger told me the XTF element was one of my
<markup:ui-section/> elements. In my current implementation of the markup
language, these elements can be children of only two elements: a
<markup:template/> element, and a <xul:panel/> element. From the
stack, it obviously was a XUL element, so it had to be a panel.

Except as far as I could remember, the XUL panel was always appended to the
document’s root element. Always. (I’d forgotten about the
cloneNode part above.) After all, in the showTemplate
method which retrieved the panel, the next thing it did was append that panel
to the root element. So somebody – namely, the parent element of the panel –
should’ve held a strong reference on it. The panel element ultimately should
have had a reference count of at least one. It should not have been
garbage-collected.

Now, this is an equally bizarre thought. Just as JavaScript shouldn’t crash,
XUL elements (in fact, pretty much anything in the DOM) should have its
reference counting accurate. You might have extra reference counts (that’s one
reason the cycle collector is around), but you never find yourself
short one. The destructor call, though, was telling me that’s exactly what
happened.

The depths of reference counting

Well, at this point, I didn’t see I had much of a choice: I had to track
down that one element and see who was “addref’ing” it (holding strong
references) and “releasing” it (letting those strong references go). There’s
two difficulties with this choice: there are a lot of XUL elements, far more
than just panel elements, and I had to write out stack traces.

First, I tried using my logging service (implemented in JavaScript) to
capture reference counts.That didn’t work – there was so much static from
additional addrefs/releases for XPConnect that it started getting crazy – and
then crashing on another garbage-collection assertion while calling my logger.
I gave up and invented another new interface:

[uuid(...)]
interface xeIRefCntLogger : nsISupports
{
void logRefCnt(in nsISupports component,
in PRUint32 count,
in short offset);
};

This had to be C++ to reduce the noise. Then I tried another new interface,
this one for XUL elements:

[scriptable, uuid(...)]
interface xeILogRefCnt : nsISupports
{
void LogAllRefCnts(in PRBool shouldLog);
};

I figured I’d call that on the XUL panel I knew about. As for the actual
logging, I had to do some dirty hacks. (Don’t try this at home.) In
nsXULElement, it ended up looking like this:

diff --git a/content/xul/content/src/nsXULElement.cpp b/content/xul/content/src/nsXULElement.cpp
--- a/content/xul/content/src/nsXULElement.cpp
+++ b/content/xul/content/src/nsXULElement.cpp
@@ -234,16 +234,17 @@ NS_INTERFACE_MAP_BEGIN_CYCLE_COLLECTION(
NS_INTERFACE_MAP_END_AGGREGATED(mElement)
//----------------------------------------------------------------------
// nsXULElement
//
nsXULElement::nsXULElement(nsINodeInfo* aNodeInfo)
: nsGenericElement(aNodeInfo),
+      mLogAllRefCnts(PR_FALSE),
mBindingParent(nsnull)
{
XUL_PROTOTYPE_ATTRIBUTE_METER(gNumElements);
}
nsXULElement::nsXULSlots::nsXULSlots(PtrBits aFlags)
: nsXULElement::nsDOMSlots(aFlags)
{
@@ -258,16 +259,24 @@ nsXULElement::nsXULSlots::~nsXULSlots()
}
nsINode::nsSlots*
nsXULElement::CreateSlots()
{
return new nsXULSlots(mFlagsOrSlots);
}
+NS_IMETHODIMP
+nsXULElement::LogAllRefCnts(PRBool shouldLog)
+{
+  RefCntToLoggingService(mRefCnt.get(), 0);
+  mLogAllRefCnts = shouldLog;
+  return NS_OK;
+}
+
/* static */
already_AddRefed<nsXULElement>
nsXULElement::Create(nsXULPrototypeElement* aPrototype, nsINodeInfo *aNodeInfo,
PRBool aIsScriptable)
{
nsXULElement *element = new nsXULElement(aNodeInfo);
if (element) {
NS_ADDREF(element);
@@ -360,24 +369,50 @@ NS_NewXULElement(nsIContent** aResult, n
NS_IMPL_CYCLE_COLLECTION_CLASS(nsXULElement)
NS_IMPL_CYCLE_COLLECTION_TRAVERSE_BEGIN_INHERITED(nsXULElement,
nsGenericElement)
NS_IMPL_CYCLE_COLLECTION_TRAVERSE_NATIVE_MEMBER(mPrototype,
nsXULPrototypeElement)
NS_IMPL_CYCLE_COLLECTION_TRAVERSE_END
-NS_IMPL_ADDREF_INHERITED(nsXULElement, nsGenericElement)
-NS_IMPL_RELEASE_INHERITED(nsXULElement, nsGenericElement)
+NS_IMETHODIMP_(nsrefcnt) nsXULElement::AddRef(void)
+{
+  nsrefcnt r = nsGenericElement::AddRef();
+  //NS_LOG_ADDREF(this, r, nsXULElement, sizeof(*this));
+  RefCntToLoggingService(r, +1);
+  return r;
+}
+NS_IMETHODIMP_(nsrefcnt) nsXULElement::Release(void)
+{
+  RefCntToLoggingService(mRefCnt.get() - 1, -1);
+  nsrefcnt r = nsGenericElement::Release();
+  //NS_LOG_RELEASE(this, r, nsXULElement);
+  return r;
+}
+
+void
+nsXULElement::RefCntToLoggingService(PRUint32 count, PRInt16 offset)
+{
+  if (!mLogAllRefCnts)
+    return;
+
+  nsCOMPtr<xeIRefCntLogger> logger
+    = do_GetService(XE_REFCNTLOGGER_CONTRACTID);
+  if (!logger)
+    return;
+  logger->LogRefCnt((nsIContent*) this, mRefCnt.get(), offset);
+};
NS_INTERFACE_TABLE_HEAD_CYCLE_COLLECTION_INHERITED(nsXULElement)
NS_NODE_OFFSET_AND_INTERFACE_TABLE_BEGIN(nsXULElement)
NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMNode)
NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMElement)
NS_INTERFACE_TABLE_ENTRY(nsXULElement, nsIDOMXULElement)
+        NS_INTERFACE_TABLE_ENTRY(nsXULElement, xeILogRefCnt)
NS_OFFSET_AND_INTERFACE_TABLE_END
NS_ELEMENT_INTERFACE_TABLE_TO_MAP_SEGUE
NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIScriptEventHandlerOwner,
new nsScriptEventHandlerOwnerTearoff(this))
NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIDOMElementCSSInlineStyle,
new nsXULElementTearoff(this))
NS_INTERFACE_MAP_ENTRY_TEAROFF(nsIFrameLoaderOwner,
new nsXULElementTearoff(this))
diff --git a/content/xul/content/src/nsXULElement.h b/content/xul/content/src/nsXULElement.h
--- a/content/xul/content/src/nsXULElement.h
+++ b/content/xul/content/src/nsXULElement.h
@@ -73,16 +73,18 @@
#include "nsLayoutCID.h"
#include "nsAttrAndChildArray.h"
#include "nsGkAtoms.h"
#include "nsAutoPtr.h"
#include "nsGenericElement.h"
#include "nsDOMScriptObjectHolder.h"
#include "nsIFrameLoader.h"
+#include "xeILoggingService.h"
+
class nsIDocument;
class nsString;
class nsIDocShell;
class nsICSSStyleRule;
class nsIObjectInputStream;
class nsIObjectOutputStream;
class nsIScriptGlobalObjectOwner;
@@ -464,17 +466,18 @@ public:
The XUL element.
*/
#define XUL_ELEMENT_TEMPLATE_GENERATED 1 << NODE_TYPE_SPECIFIC_BITS_OFFSET
class nsScriptEventHandlerOwnerTearoff;
-class nsXULElement : public nsGenericElement, public nsIDOMXULElement
+class nsXULElement : public nsGenericElement, public nsIDOMXULElement,
+                     public xeILogRefCnt
{
public:
/** Typesafe, non-refcounting cast from nsIContent.  Cheaper than QI. **/
static nsXULElement* FromContent(nsIContent *aContent)
{
if (aContent->IsNodeOfType(eXUL))
return static_cast<nsXULElement*>(aContent);
@@ -571,16 +574,24 @@ public:
NS_FORWARD_NSIDOMNODE(nsGenericElement::)
// nsIDOMElement
NS_FORWARD_NSIDOMELEMENT(nsGenericElement::)
// nsIDOMXULElement
NS_DECL_NSIDOMXULELEMENT
+    // xeILogRefCnt
+    NS_DECL_XEILOGREFCNT
+private:
+  PRBool mLogAllRefCnts;
+  void RefCntToLoggingService(PRUint32 count, PRInt16 offset);
+
+public:
+
virtual nsresult Clone(nsINodeInfo *aNodeInfo, nsINode **aResult) const;
virtual PRInt32 IntrinsicState() const;
nsresult GetStyle(nsIDOMCSSStyleDeclaration** aStyle);
nsresult GetFrameLoader(nsIFrameLoader** aFrameLoader);
nsresult SwapFrameLoaders(nsIFrameLoaderOwner* aOtherOwner);

I only recommend replacing NS_IMPL_ADDREF and NS_IMPL_RELEASE if you really,
really have to. As for the logging code itself,

#ifndef __xeRefCountLogger_h__
#define __xeRefCountLogger_h__
#include "nsCOMPtr.h"
#include "nsAutoPtr.h"
#include "nsIOutputStream.h"
#include "nsIStreamListener.h"
#include "nsIException.h"
#include "xeILoggingService.h"
class xeRefCountLogger : public xeIRefCntLogger
{
public:
NS_DECL_ISUPPORTS
NS_DECL_XEIREFCNTLOGGER
xeRefCountLogger();
~xeRefCountLogger();
nsresult Init();
void LogStackFrame(void *aPC);
private:
nsCOMPtr<nsIOutputStream> mLogStream;
PRPackedBool mIsInitialized;
nsCOMPtr<nsIStackFrame> mCurrentJSFrame;
protected:
/* additional members */
};
/* 8383d82e-4046-444a-aa73-5b420fb1a6f2 */
#define XE_REFCNTLOGGER_CID 
{ 
0x8383d82e, 
0x4046, 
0x444a, 
{ 0xaa, 0x73, 0x5b, 0x42, 0x0f, 0xb1, 0xa6, 0xf2 } 
}
#endif // __xeRefCountLogger_h__
(C++ here)
#include "xeRefCountLogger.h"
#include "nsIFileStreams.h"
#include "nsServiceManagerUtils.h"
#include "nsComponentManagerUtils.h"
#include "nsIProperties.h"
#include "nsILocalFile.h"
#include "nsIStreamListener.h"
#include "nsIInputStreamPump.h"
#include "nsStringGlue.h"
#include "nsStackWalk.h"
NS_IMPL_ISUPPORTS1(xeRefCountLogger, xeIRefCntLogger)
xeRefCountLogger::xeRefCountLogger() :
mIsInitialized(PR_FALSE)
{
/* member initializers and constructor code */
}
xeRefCountLogger::~xeRefCountLogger()
{
/* destructor code */
}
nsresult
xeRefCountLogger::Init()
{
if (mIsInitialized)
return NS_OK;
nsresult rv;
nsCOMPtr<nsIProperties> propertiesService
= do_GetService("@mozilla.org/file/directory_service;1", &rv);
NS_ENSURE_SUCCESS(rv, rv);
nsCOMPtr<nsILocalFile> profDFile;
rv = propertiesService->Get("ProfD", NS_GET_IID(nsILocalFile),
getter_AddRefs(profDFile));
NS_ENSURE_SUCCESS(rv, rv);
nsCOMPtr<nsIFile> refCntLogFile;
rv = profDFile->Clone(getter_AddRefs(refCntLogFile));
NS_ENSURE_SUCCESS(rv, rv);
rv = refCntLogFile->Append(NS_LITERAL_STRING("refcnt.log"));
NS_ENSURE_SUCCESS(rv, rv);
rv = refCntLogFile->CreateUnique(nsIFile::NORMAL_FILE_TYPE, 777);
NS_ENSURE_SUCCESS(rv, rv);
nsCOMPtr<nsIFileOutputStream> fileOut
= do_CreateInstance("@mozilla.org/network/file-output-stream;1", &rv);
NS_ENSURE_SUCCESS(rv, rv);
rv = fileOut->Init(refCntLogFile, -1, -1, 0);
NS_ENSURE_SUCCESS(rv, rv);
mLogStream = do_QueryInterface(fileOut, &rv);
NS_ENSURE_SUCCESS(rv, rv);
mIsInitialized = PR_TRUE;
return NS_OK;
}
extern "C" {
static void PrintStackFrame(void *aPC, void *aClosure)
{
xeRefCountLogger* logger = (xeRefCountLogger*) aClosure;
logger->LogStackFrame(aPC);
}
}
void
xeRefCountLogger::LogStackFrame(void *aPC)
{
nsCodeAddressDetails details;
char buf[1024];
NS_DescribeCodeAddress(aPC, &details);
NS_FormatCodeAddressDetails(aPC, &details, buf, sizeof(buf));
nsCString data(buf);
PRUint32 writeCount;
nsresult rv = mLogStream->Write(data.get(), data.Length(), &writeCount);
NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), "Oops");
}
/* void logRefCnt (in nsISupports component, in PRUint32 count); */
NS_IMETHODIMP
xeRefCountLogger::LogRefCnt(nsISupports *component,
PRUint32 count,
PRInt16 offset)
{
nsresult rv = Init();
if (NS_FAILED(rv))
return rv;
nsCString msg;
msg.Truncate();
msg.AppendInt(count);
msg.Append(" (");
if (offset == 0)
msg.Append("---");
else if (offset == +1)
msg.Append("Add");
else if (offset == -1)
msg.Append("Rel");
else
return NS_ERROR_ILLEGAL_VALUE;
msg.Append(") ");
PRUint32 addr = (unsigned int) &(*component);
msg.AppendInt(addr, 16);
msg.Append('n');
// C++ Stack
PRUint32 writeCount;
rv = mLogStream->Write(msg.get(), msg.Length(), &writeCount);
NS_ENSURE_SUCCESS(rv, rv);
NS_StackWalk(PrintStackFrame, 2, this);
rv = mLogStream->Write("nn", 2, &writeCount);
NS_ENSURE_SUCCESS(rv, rv);
return mLogStream->Flush();
}

Here I found something which was actually pretty cool – the NS_StackWalk
function. I’d seen other cases where Firefox code could generate a stack, and I
said to myself, “I want that.” It turned out to be a pretty good idea for
figuring out why something is adding or releasing references. The stacks would
(on Windows) look like this:

2 (Add) 7bb0998
xpcom_core!NS_TableDrivenQI+0x0000000000000040 (c:edit-templatefx-debugxpcombuildnsisupportsimpl.cpp, line 50)
gklayout!nsXULElement::QueryInterface+0x00000000000000AD (c:edit-templatemozillacontentxulcontentsrcnsxulelement.cpp, line 410)
xpc3250!XPCConvert::JSObject2NativeInterface+0x0000000000000153 (c:edit-templatemozillajssrcxpconnectsrcxpcconvert.cpp, line 1484)
xpc3250!XPCConvert::JSData2Native+0x000000000000104F (c:edit-templatemozillajssrcxpconnectsrcxpcconvert.cpp, line 1017)
xpc3250!XPCWrappedNative::CallMethod+0x0000000000000C7E (c:edit-templatemozillajssrcxpconnectsrcxpcwrappednative.cpp, line 2564)
xpc3250!XPC_WN_CallMethod+0x00000000000001C6 (c:edit-templatemozillajssrcxpconnectsrcxpcwrappednativejsops.cpp, line 1740)
js3250!js_Invoke+0x000000000000087D (c:edit-templatemozillajssrcjsinterp.cpp, line 1360)
... // and so on for another fifty or so lines.

By the way, if you’re in a debug build and trying to match reference counts,
don’t forget this patch either:

diff --git a/xpcom/glue/nsCOMPtr.h b/xpcom/glue/nsCOMPtr.h
--- a/xpcom/glue/nsCOMPtr.h
+++ b/xpcom/glue/nsCOMPtr.h
@@ -87,17 +87,17 @@
//  about (unused) inline functions being removed.  This is to be expected with
//  templates, so we disable the warning.
#pragma warning( disable: 4514 )
#endif
#define NSCAP_FEATURE_USE_BASE
#ifdef NS_DEBUG
-  #define NSCAP_FEATURE_TEST_DONTQUERY_CASES
+//#define NSCAP_FEATURE_TEST_DONTQUERY_CASES
#undef NSCAP_FEATURE_USE_BASE
//#define NSCAP_FEATURE_TEST_NONNULL_QUERY_SUCCEEDS
#endif
/*
|...TEST_DONTQUERY_CASES| and |...DEBUG_PTR_TYPES| introduce some code that is
problematic on a select few of our platforms, e.g., QNX.  Therefore, I'm providing
a mechanism by which these features can be explicitly disabled from the command-line.

As I found out the hard way, that DONTQUERY_CASES thing generates a new
nsCOMPtr for every nsCOMPtr you are using in a QueryInterface call. That means
an extra addref and release pair. Very, very noisy when you’re trying to track
down a possible extra release or missing addref. This one change cut my log
file sizes in half.

So once I cleared out all that clutter, I found that the panel element I was
following didn’t look like the one being destroyed. At the end, it had a
reference count of 12, which didn’t make sense for something being deleted. So
I tried something nastier, and added this code to both nsXULElement::AddRef and
nsXULElement::Release:

+    if (mNodeInfo->Equals(nsGkAtoms::panel))
+    {
+        mLogAllRefCnts = PR_TRUE;
+    }
+

Here comes the horde. This time I did find the panel element with a refcount
approaching zero… and several other panels. Fortunately, my debugger offered
an easy way to tell which panel was dying: in its stack it had a pointer value
for the dead panel. That same pointer value (the address) matched the address
line for several of my addref/release stack traces. Also, in filtering for just
that panel’s memory address, I found the ref count going all the way down to 1,
right before the crash. Yippie-kay-yay.

This is where the investigation began in earnest. I noticed the first
appearance of the dead panel (while it was being born) had an interesting
signature:

2 (Add) 7bb0998
gklayout!nsXULElement::Clone+0x0000000000000182 (c:edit-templatemozillacontentxulcontentsrcnsxulelement.cpp, line 465)
gklayout!nsNodeUtils::CloneAndAdopt+0x00000000000002AF (c:edit-templatemozillacontentbasesrcnsnodeutils.cpp, line 574)
gklayout!nsNodeUtils::Clone+0x0000000000000026 (c:edit-templatemozillacontentbasesrcnsnodeutils.h, line 171)
gklayout!nsNodeUtils::CloneNodeImpl+0x000000000000004A (c:edit-templatemozillacontentbasesrcnsnodeutils.cpp, line 450)
gklayout!nsGenericElement::CloneNode+0x0000000000000018 (c:edit-templatemozillacontentbasesrcnsgenericelement.h, line 537)
gklayout!nsXULElement::CloneNode+0x0000000000000016 (c:edit-templatemozillacontentxulcontentsrcnsxulelement.h, line 574)
xpc3250!nsIDOMNode_CloneNode+0x0000000000000198 (c:edit-templatefx-debugjssrcxpconnectsrcdom_quickstubs.cpp, line 3868)
js3250!js_Interpret+0x000000000000FE71 (c:edit-templatemozillajssrcjsops.cpp, line 2208)
js3250!js_Invoke+0x00000000000008C2 (c:edit-templatemozillajssrcjsinterp.cpp, line 1368)

The panel was being born of a call to node.cloneNode(). I don’t know many
XUL panels being cloned, so why not stick the flag for turning on refcount
logging there instead?

@@ -407,16 +442,21 @@ nsXULElement::Clone(nsINodeInfo *aNodeIn
element->SetScriptTypeID(GetScriptTypeID());
}
}
if (!element) {
return NS_ERROR_OUT_OF_MEMORY;
}
+    if (aNodeInfo->Equals(nsGkAtoms::panel))
+    {
+        element->mLogAllRefCnts = PR_TRUE;
+    }
+
// XXX TODO: set up RDF generic builder n' stuff if there is a
// 'datasources' attribute? This is really kind of tricky,
// because then we'd need to -selectively- copy children that
// -weren't- generated from RDF. Ugh. Forget it.
// Note that we're _not_ copying mControllers.
nsresult rv = CopyInnerTo(element);
 

Bingo. That gave me only one panel in the refcount log. I wanted to see
where this panel was being inserted in a document. Scanning through the log, I
was surprised: it was never inserted into any document. It would be
cloned, go through a DOM serializer, and then be caught in cycle collection.
That’s all that ever happened to it.

By now I had more than enough to place a breakpoint and see who was creating
this dead panel in the first place. The JavaScript frames pointed right to my
own serializeNode() function… at which point I remembered where and why I was
serializing a panel. To diagnose another bug. Whoops.

So there you have it… how one piece of logging code blamed another logging
fragment for crashing my Firefox builds.

But wait, THERE’S MORE!!!

At this point, I was pretty much convinced that I was dead, unable to make
further progress on this project until someone could solve that JS_ASSERT
trickiness. I decided as almost the first step in reducing this monster
testcase to move the offending logging call after the appendChild call.

      // This should force <markup:children/> elements to take XBL bindings.
aNode.ownerDocument.documentElement.appendChild(panel);
getLoggingService().log("Templates", C_i.xeILoggingService.INFO,
"Panel retrieved", serializeNode(panel));

I don’t remember why I did that, but this was where I got another big
surprise. It stopped crashing. I swapped it back, and it crashed. I
moved the logging call after the appendChild call again, and it didn’t
crash.

So let me try to get this straight, because it’s still mind-boggling to
me:

  • XTF + CloneNode + SerializeToString + XUL, followed by an appendChild,
    leads to garbage collection and a JS_ASSERT() failure…
  • AppendChild, followed by XTF + CloneNode + SerializeToString + XUL, leads
    to normal (well, buggy), not-crashing behavior.

What the hell is going on here? I really can’t
understand this, how just the order of operations in JavaScript is the
difference between a death spiral, and a shiny day. I know this isn’t a
Heisenbug, since it’s 100% reproducible in one case and 0% reproducible in the
other… but this just goes so far outside my understanding of “How Things
Work” that it blows my mind this is even possible.

That’s three times over the course of a month trying to figure this sick,
twisted little freak out, where something that makes absolutely no sense
appeared to be real:

  1. A JS_ASSERT being triggered where I had written no native-level code,
    just JavaScript, XUL, XBL, CSS, and IDL.
  2. XUL reference counting leading to garbage collection when the item being
    collected looked like 12 consumers had a hold on it. (Fortunately, this
    turned out to be false.)
  3. The program does not crash if I insert the panel into the document before
    trying to serialize the panel.

All this in a pet project, found by one guy working alone in his apartment
for years.

So now I have to take some 300KB of extension + Mochitest apart (again!), to
a reasonably small reproducible test case. This I have to do before someone else will find the bug painless enough to actually look at and fix. I still don’t know why it failed, and my confidence in my ability to understand – and in the Mozilla platform – have been badly shaken.

Worst of all, I fear that when I reduce the testcase and submit it, someone
competent will decide that the cloned node – after its original has been
appended to the document – should have been garbage-collected just as if the
original hadn’t been appended. That means I will universally crash again, and
there will be nothing I can do to stop it. Except wait for that ever-moving
target, XBL 2.0, to replace XTF.

For those of you boldly going where no one has gone before:

If you really want to build this test and see for yourself, here’s my build
instructions:

  1. hg clone http://hg.mozilla.org/releases/mozilla-1.9.2 mozilla
  2. hg clone http://hg.mozdev.org/verbosio verbosio
  3. cd verbosio
  4. hg update mozilla-bug534412
  5. cd ..
  6. Copy verbosio/src/experimental/edit-template to
    mozilla/extensions/edit-template
  7. In your .mozconfig, add ac_add_options
    --enable=extensions=default,edit-template
  8. Also add ac_add_options --enable-tests
  9. Build Mozilla Firefox.
  10. cd fx-debug
  11. python _tests/testing/mochitest/runtests.py –chrome
    –test-path=extensions/edit/tests/test_templates.xul
  12. When the page appears, click on the “Run Test” button, and buckle your
    seatbelt.

As I said, I have to try to reduce this to a manageable testcase. If you want to try debugging it now, go ahead. I’ve reserved this branch specifically for the crash bug I saw here.