{"id":228970,"date":"2022-05-17T10:34:14","date_gmt":"2022-05-17T17:34:14","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/java\/?p=228970"},"modified":"2022-09-19T13:25:44","modified_gmt":"2022-09-19T20:25:44","slug":"debugging-a-jvm-crash-for-linkedin-part-2","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/java\/debugging-a-jvm-crash-for-linkedin-part-2\/","title":{"rendered":"Debugging a JVM Crash for LinkedIn &#8211; Part 2"},"content":{"rendered":"<h2><a name=\"_Toc99970364\"><\/a>Introduction<\/h2>\n<p>Welcome to Part 2 of our investigation into a JVM crash for LinkedIn.\u00a0This blog post continues the investigation we began in <a href=\"https:\/\/devblogs.microsoft.com\/java\/debugging-a-jvm-crash-for-linkedin-part-1\/\">Debugging a JVM Crash for LinkedIn &#8211; Part 1<\/a>. In Part 1, we analyzed the HotSpot error log file to identify the instruction where the JVM crashed.<\/p>\n<p>As a reminder, this series is broken down as follows:<\/p>\n<ul>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/java\/debugging-a-jvm-crash-for-linkedin-part-1\/\">Part 1: Analyzing the Log<\/a><\/li>\n<li><a href=\"#part2-analyzing-core-dump\">Part 2: Analyzing the Core Dump<\/a><\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/java\/debugging-a-jvm-crash-for-linkedin-part-3\/\">Part 3: Looking for a Fix<\/a><\/li>\n<\/ul>\n<h2><a name=\"part2-analyzing-core-dump\"><\/a>Analyzing the Core Dump<\/h2>\n<p>In addition to validating the disassembly we see with ODA, we can use the core dump to view the disassembly of an entire compiled method. The HotSpot error log file only provides an excerpt from the compiled function, but the core dump has access to all of it.<\/p>\n<p>After obtaining the 15 GB core dump file from LinkedIn, we run Ubuntu on <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/wsl\/about\">Windows Subsystem for Linux (WSL)<\/a> (which allows one to run a Linux distribution on a Windows machine) and load the core dump using <span style=\"font-family: terminal, monaco, monospace;\">gdb<\/span>:<\/p>\n<pre class=\"prettyprint\">gdb .\/jdk-11.0.8+10\/bin\/java core.19538<\/pre>\n<p>We need to make sure that we are using the same Java executable that the Espresso-snapshots service is using, which in this case was version 11.0.8+10 of the <a href=\"https:\/\/www.microsoft.com\/openjdk\">Microsoft Build of OpenJDK<\/a>.<\/p>\n<p>Next, we want to look at the disassembled code where the crash occurred. The log file gives us this information about the compiled method:<\/p>\n<pre class=\"prettyprint\">J 52708 c2 sun.security.jgss.krb5.Krb5Context.initSecContext(Ljava\/io\/InputStream;I)[B java.security.jgss@11.0.8 (698 bytes) @ 0x00007ffb860d705f [0x00007ffb860d47c0+0x000000000000289f]<\/pre>\n<p>From this we can see that the compiled method starts at <span style=\"font-family: terminal, monaco, monospace;\">0x00007ffb860d47c0<\/span>, and the crash occurred at an offset of <span style=\"font-family: terminal, monaco, monospace;\">0x000000000000289f<\/span> from there, which results in an address of <span style=\"font-family: terminal, monaco, monospace;\">0x00007ffb860d705f (0x00007ffb860d47c0 + 0x000000000000289f). <\/span>This resulting address is also indicated here and elsewhere in the log file. Note that <span style=\"font-family: terminal, monaco, monospace;\">0x000000000000289f<\/span> in decimal is <span style=\"font-family: terminal, monaco, monospace;\">10399<\/span>.<\/p>\n<p>In <span style=\"font-family: terminal, monaco, monospace;\">gdb<\/span>, we can view the contents of the memory where the compiled method resides. We display <span style=\"font-family: terminal, monaco, monospace;\">10500<\/span> bytes (&#8216;<span style=\"font-family: terminal, monaco, monospace;\">b<\/span>&#8216;), starting at the beginning of the method, and show machine instructions (&#8216;<span style=\"font-family: terminal, monaco, monospace;\">i<\/span>&#8216;), using the following command:<\/p>\n<pre class=\"prettyprint\">x\/10500ib 0x00007ffb860d47c0<\/pre>\n<p>This gives us a paginated view of memory, starting from the very beginning of the compiled method. We can scroll through these pages by pressing <span style=\"font-family: terminal, monaco, monospace;\">&lt;return&gt;<\/span> until we can see the instruction at <span style=\"font-family: terminal, monaco, monospace;\">0x00007ffb860d705f<\/span>:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb.png\"><img decoding=\"async\" class=\"alignnone wp-image-228976\" src=\"https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb-300x294.png\" alt=\"Image gdb\" width=\"549\" height=\"539\" srcset=\"https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb-300x294.png 300w, https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb-24x24.png 24w, https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb-48x48.png 48w, https:\/\/devblogs.microsoft.com\/java\/wp-content\/uploads\/sites\/51\/2022\/05\/gdb.png 522w\" sizes=\"(max-width: 549px) 100vw, 549px\" \/><\/a><\/p>\n<p>Here&#8217;s the instruction at the address in question:<\/p>\n<pre class=\"prettyprint\">0x7ffb860d705f:\u00a0\u00a0\u00a0\u00a0\u00a0 vpxor\u00a0 0x18(%rdi,%r9,1),%xmm0,%xmm0<\/pre>\n<p>Note that <span style=\"font-family: terminal, monaco, monospace;\">gdb<\/span> is displaying the disassembly here in AT&amp;T syntax, not Intel syntax (see <a href=\"http:\/\/web.mit.edu\/rhel-doc\/3\/rhel-as-en-3\/i386-syntax.html\">AT&amp;T Syntax versus Intel Syntax<\/a> for further details on the differences). However, it is the same instruction we saw when using ODA, so we take this as confirmation that this is the correct instruction we need to examine. So, let\u2019s examine it!<\/p>\n<h2>Examining the Instruction<\/h2>\n<p>We note that <a href=\"https:\/\/www.felixcloutier.com\/x86\/pxor\"><span style=\"font-family: terminal, monaco, monospace;\">vpxor<\/span> <\/a>is a vector instruction, so it looks like the code is doing some vector operations here. Let\u2019s take a look at the Java code in question: \u00a0<a href=\"https:\/\/github.com\/openjdk\/jdk11u\/blob\/master\/src\/java.security.jgss\/share\/classes\/sun\/security\/jgss\/krb5\/Krb5Context.java\">Krb5Context.java<\/a>. This code is security-related, and it\u2019s not surprising that cryptographic algorithms require vector math. It&#8217;s also possible that the offending code is from a method inlined in <span style=\"font-family: terminal, monaco, monospace;\">initSecContext<\/span>, for example something related to <span style=\"font-family: terminal, monaco, monospace;\">InitSecContextToken<\/span> (see <a href=\"https:\/\/github.com\/openjdk\/jdk11u\/blob\/master\/src\/java.security.jgss\/share\/classes\/sun\/security\/jgss\/krb5\/InitSecContextToken.java\">InitSecContextToken.java<\/a>).<\/p>\n<p>Let&#8217;s examine what this instruction is doing:<\/p>\n<ol>\n<li>It calculates an address based off of <span style=\"font-family: terminal, monaco, monospace;\">0x18(%rdi, %r9,1)<\/span>.\u00a0 This addressing mode results in the following calculation: <span style=\"font-family: terminal, monaco, monospace;\">%rdi + %r9 * 1 + 0x18<\/span>. In other words, it scales the contents of <span style=\"font-family: terminal, monaco, monospace;\">r9<\/span> by <span style=\"font-family: terminal, monaco, monospace;\">1<\/span>, adds it to the contents of <span style=\"font-family: terminal, monaco, monospace;\">rdi<\/span>, and adds the constant <span style=\"font-family: terminal, monaco, monospace;\">0x18<\/span>.<\/li>\n<li>Loads a value from that address.<\/li>\n<li>Computes the <span style=\"font-family: terminal, monaco, monospace;\">xor<\/span> of that value and the contents of the <span style=\"font-family: terminal, monaco, monospace;\">xmm0<\/span> register (a vector register).<\/li>\n<li>Stores it back into the <span style=\"font-family: terminal, monaco, monospace;\">xmm0<\/span> register.<\/li>\n<\/ol>\n<p>Let&#8217;s see what this address calculation gives us. To do this, we look at the value of the registers at the time of the crash. You can find those values in the log:<\/p>\n<pre class=\"prettyprint\">Registers:\r\n\r\nRAX=0x0000000000000007, RBX=0x0000000000000020, RCX=0x00007ffb75418ce0, RDX=0x000000000000001f\r\nRSP=0x00007ff751956420, RBP=0x00007ffb75417b40, RSI=0x0000000000000019, RDI=0x00007ffb79ffffc8\r\nR8 =0x0000000000000019, R9 =0x0000000000000018, R10=0x0000000000001f40, R11=0x0000000000000019\r\nR12=0x00007ff751956900, R13=0x0000000000000001, R14=0x0000000000000000, R15=0x00007ff7fbaa0000\r\nRIP=0x00007ffb860d705f, EFLAGS=0x0000000000010297, CSGSFS=0x0000000000000033, ERR=0x0000000000000004\r\nTRAPNO=0x000000000000000e<\/pre>\n<p>Here are the values of the registers we&#8217;re interested in:<\/p>\n<pre class=\"prettyprint\">RDI: 0x00007ffb79ffffc8\r\nR9: 0x0000000000000018<\/pre>\n<p>The base address that we&#8217;re accessing with this instruction is therefore:<\/p>\n<pre class=\"prettyprint\">Addr: 0x00007ffb79ffffc8 + 0x0000000000000018 * 1 + 0x18\u00a0 = 0x7FFB79FFFFF8<\/pre>\n<p>Note that the address that is causing us trouble, <span style=\"font-family: terminal, monaco, monospace;\">0x00007ffb7a000000<\/span>, is only <span style=\"font-family: terminal, monaco, monospace;\">(0x00007ffb7a000000 &#8211; 0x7FFB79FFFFF8) = 8<\/span> bytes away from this calculated address.<\/p>\n<p>What appears to be happening is that the vector instruction here, <span style=\"font-family: terminal, monaco, monospace;\">vpxor<\/span>, is trying to load 128 bits (16 bytes) from <span style=\"font-family: terminal, monaco, monospace;\">0x7FFB79FFFFF8<\/span>. As a result, it&#8217;s reading 8 bytes from accessible, legally mapped memory, and trying to read the following <span style=\"font-family: terminal, monaco, monospace;\">8 <\/span>bytes from <span style=\"font-family: terminal, monaco, monospace;\">0x00007ffb7a000000<\/span>, which is inaccessible, unmapped memory. The result is that we get a <span style=\"font-family: terminal, monaco, monospace;\">SIGSEGV<\/span>.<\/p>\n<p>Let&#8217;s take a deeper look at <span style=\"font-family: terminal, monaco, monospace;\">RDI<\/span>. The error log gives us as much information as it can about the data that resides in registers, in the &#8220;Register to memory mapping:&#8221; section:<\/p>\n<pre class=\"prettyprint\">Register to memory mapping:\r\n\r\nRAX=0x0000000000000007 is an unknown value\r\nRBX=0x0000000000000020 is an unknown value\r\nRCX=0x00007ffb75418ce0 is an oop: [B\r\n{0x00007ffb75418ce0} - klass: {type array byte}\r\n- length: 32\r\nRDX=0x000000000000001f is an unknown value\r\nRSP=0x00007ff751956420 is pointing into the stack for thread: 0x00007ff7fbaa0000\r\nRBP=0x00007ffb75417b40 is an oop: sun.security.krb5.EncryptionKey\r\n{0x00007ffb75417b40} - klass: 'sun\/security\/krb5\/EncryptionKey'\r\nRSI=0x0000000000000019 is an unknown value\r\nRDI=0x00007ffb79ffffc8 is an oop: [B\r\n{0x00007ffb79ffffc8} - klass: {type array byte}\r\n\u00a0- length: 32\r\nR8 =0x0000000000000019 is an unknown value\r\nR9 =0x0000000000000018 is an unknown value\r\nR10=0x0000000000001f40 is an unknown value\r\nR11=0x0000000000000019 is an unknown value\r\nR12=0x00007ff751956900 is pointing into the stack for thread: 0x00007ff7fbaa0000\r\nR13=0x0000000000000001 is an unknown value\r\nR14=0x0 is NULL\r\nR15=0x00007ff7fbaa0000 is a thread<\/pre>\n<p>It looks like the value in <span style=\"font-family: terminal, monaco, monospace;\">RDI<\/span> is an ordinary object pointer (oop) which represents a byte array. What we seem to be seeing here is that the compiler has generated code that is either improperly indexing this byte array or is reading too much data from this address. If this was a write instruction, the error would likely be catastrophic, more frequently occurring, and probably would have been found and fixed a long time ago.<\/p>\n<p>However, because it is a read instruction, this kind of error in the code can go unnoticed for a long time, as it does not produce an error in most cases (this is assuming that the code after this instruction operates only on the data it is supposed to operate on, without looking at the extra data that is read). A read becomes a problem, though, once it reaches into unmapped territory. In the case here, the data that this instruction operates on must be at the end of a region for it to be a problem. In most instances, this instruction will be operating on data away from the end of the region and everyone is blissfully unaware that it&#8217;s reading more than it should.<\/p>\n<p>The thing that initially looks suspicious here is the repeated use of <span style=\"font-family: terminal, monaco, monospace;\">0x18<\/span>. It\u2019s added twice, both from the contents of <span style=\"font-family: terminal, monaco, monospace;\">R9<\/span> and from the literal offset in the instruction. This makes us wonder whether the compiler is inadvertently adding an offset of <span style=\"font-family: terminal, monaco, monospace;\">0x18<\/span> twice, pushing the address into territory it shouldn&#8217;t be in.<\/p>\n<h2>What&#8217;s Next?<\/h2>\n<p>In Part 3 of this series, we\u2019ll look for a fix and\/or a workaround for this issue. Stay tuned!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Welcome to Part 2 of our investigation into a JVM crash for LinkedIn.\u00a0This blog post continues the investigation we began in Debugging a JVM Crash for LinkedIn &#8211; Part 1. In Part 1, we analyzed the HotSpot error log file to identify the instruction where the JVM crashed. As a reminder, this series is [&hellip;]<\/p>\n","protected":false},"author":83088,"featured_media":227205,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,249],"tags":[793,25,248,741,319,28],"class_list":["post-228970","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-java","category-openjdk","tag-crash","tag-debugging","tag-java","tag-java-11","tag-openjdk","tag-testing"],"acf":[],"blog_post_summary":"<p>Introduction Welcome to Part 2 of our investigation into a JVM crash for LinkedIn.\u00a0This blog post continues the investigation we began in Debugging a JVM Crash for LinkedIn &#8211; Part 1. In Part 1, we analyzed the HotSpot error log file to identify the instruction where the JVM crashed. As a reminder, this series is [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/posts\/228970","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/users\/83088"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/comments?post=228970"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/posts\/228970\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/media\/227205"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/media?parent=228970"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/categories?post=228970"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/java\/wp-json\/wp\/v2\/tags?post=228970"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}