{"id":26368,"date":"2020-02-10T10:58:47","date_gmt":"2020-02-10T17:58:47","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=26368"},"modified":"2020-02-10T14:32:42","modified_gmt":"2020-02-10T21:32:42","slug":"using-net-for-apache-spark-to-analyze-log-data","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/using-net-for-apache-spark-to-analyze-log-data\/","title":{"rendered":"Using .NET for Apache\u00ae Spark\u2122 to Analyze Log Data"},"content":{"rendered":"<p>At Spark + AI Summit in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/introducing-net-for-apache-spark\/\">May 2019<\/a>, we released <a href=\"https:\/\/dot.net\/spark\">.NET for Apache Spark<\/a>. .NET for Apache Spark is aimed at making <a href=\"https:\/\/spark.apache.org\/\">Apache\u00ae Spark\u2122<\/a>, and thus the exciting world of big data analytics, accessible to .NET developers.<\/p>\n<p>.NET for Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc queries. In this blog post, we\u2019ll explore how to use .NET for Spark to perform a very popular big data task known as <strong>log analysis<\/strong>.<\/p>\n<p>The remainder of this post describes the following topics:<\/p>\n<ul>\n<li><a href=\"#whatislog\">What is log analysis?<\/a><\/li>\n<li><a href=\"#sparkforlog\">Writing a .NET for Spark log analysis app<\/a><\/li>\n<li><a href=\"#running\">Running a .NET for Spark app<\/a><\/li>\n<li><a href=\"#wrapup\">Wrap Up<\/a><\/li>\n<\/ul>\n<h3><a id=\"whatislog\"><\/a>What is log analysis?<\/h3>\n<p>Log analysis, also known as <em>log processing<\/em>, is the process of analyzing computer-generated records called logs. Logs tell us what\u2019s happening on a system such as a computer or web server \u2013 for example, which applications are being used or the top websites users visit.<\/p>\n<p>The goal of log analysis is to gain meaningful insights from these logs about the activity and performance of our tools or services. 
.NET for Spark enables us to analyze anywhere from megabytes to petabytes of log data with blazing-fast, efficient processing!<\/p>\n<p>In this blog post, we\u2019ll be analyzing a set of <a href=\"https:\/\/httpd.apache.org\/docs\/1.3\/logs.html\">Apache log entries<\/a> that express how users are interacting with content on a web server. You can view a sample of Apache log entries <a href=\"https:\/\/raw.githubusercontent.com\/elastic\/examples\/master\/Common%20Data%20Formats\/apache_logs\/apache_logs\">here<\/a>.<\/p>\n<h3><a id=\"sparkforlog\"><\/a>Writing a .NET for Spark log analysis app<\/h3>\n<p>Log analysis is an example of <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/spark\/tutorials\/batch-processing\">batch processing<\/a> with Spark. Batch processing is the transformation of data at rest, meaning that the source data has already been loaded into data storage. In our case, the input text file is already populated with logs and won\u2019t be receiving new or updated logs as we process it.<\/p>\n<p>When creating a new .NET for Spark application, there are just a few steps we need to follow to start getting those interesting insights from our data:<\/p>\n<ol>\n<li>Create a Spark Session.<\/li>\n<li>Read input data, typically using a DataFrame.<\/li>\n<li>Manipulate and analyze input data, typically using Spark SQL.<\/li>\n<\/ol>\n<h4><a id=\"createsession\"><\/a> Create a Spark Session<\/h4>\n<p>In any Spark application, we start off by establishing a new <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/microsoft.spark.sql.sparksession?view=spark-dotnet\">SparkSession<\/a>, which is the entry point to programming with Spark:<\/p>\n<pre class=\"prettyprint\">SparkSession spark = SparkSession\n    .Builder()\n    .AppName(\"Apache User Log Processing\")\n    .GetOrCreate();<\/pre>\n<p>Through the <code>spark<\/code> object created above, we can now access Spark and DataFrame functionality throughout our program \u2013 great! 
But what is a DataFrame? Let\u2019s learn about it in the next step.<\/p>\n<h4><a id=\"readindata\"><\/a> Read input data<\/h4>\n<p>Now that we have access to Spark functionality, we can read in the log data we\u2019ll be analyzing. We store input data in a <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/microsoft.spark.sql.dataframe?view=spark-dotnet\">DataFrame<\/a>, which is a distributed collection of data organized into named columns:<\/p>\n<pre class=\"prettyprint\">DataFrame df = spark.Read().Text(\"&lt;path to input data set>\");\ndf.CreateOrReplaceTempView(\"Logs\");<\/pre>\n<p>When our input is contained in a <em>.txt<\/em> file, we use the <code>.Text()<\/code> method, as shown above. There are <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/microsoft.spark.sql.dataframereader?view=spark-dotnet#methods\">other methods<\/a> to read in data from other sources, such as <code>.Csv()<\/code> to read in comma-separated values files. We also call <code>CreateOrReplaceTempView()<\/code> to register our DataFrame as a temporary view named <em>Logs<\/em>, which lets us query it with Spark SQL later on.<\/p>\n<h4><a id=\"manipulatedata\"><\/a> Manipulate and analyze input data<\/h4>\n<p>With our input logs stored in a DataFrame, we can start analyzing them \u2013 now things are getting exciting!<\/p>\n<p>An important first step is <strong>data preparation<\/strong>. Data prep involves cleaning up our data in some way. This could include removing incomplete entries to avoid errors in later calculations or removing irrelevant input to improve performance.<\/p>\n<p>In our example, we should first ensure all of our entries are complete logs. 
We can do this by comparing each log entry to a <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/standard\/base-types\/regular-expression-language-quick-reference\">regular expression<\/a> (also known as a regex), which is a sequence of characters that defines a pattern.<\/p>\n<p>Let\u2019s define a regex expressing a pattern that all valid Apache log entries should follow:<\/p>\n<pre class=\"prettyprint\">string s_apacheRx = \"^(\\\\S+) (\\\\S+) (\\\\S+) \\\\[([\\\\w:\/]+\\\\s[+-]\\\\d{4})\\\\] \\\"(\\\\S+) (\\\\S+) (\\\\S+)\\\" (\\\\d{3}) (\\\\d+)\";<\/pre>\n<p>How do we perform a calculation on each row of a DataFrame, like comparing each log entry to the above regex? The answer is <em>Spark SQL<\/em>.<\/p>\n<h4>Spark SQL<\/h4>\n<p>Spark SQL provides many great functions for working with the structured data stored in a DataFrame. One of the most popular features of Spark SQL is <em>UDFs<\/em>, or user-defined functions. We define the type of input they take and the type of output they produce, and then the actual calculation or filtering they perform.<\/p>\n<p>Let\u2019s define a new UDF <code>GeneralReg<\/code> to compare each log entry to the <code>s_apacheRx<\/code> regex. Our UDF requires an Apache log entry, which is a string, and will return true or false depending on whether the log matches the regex:<\/p>\n<pre class=\"prettyprint\">spark.Udf().Register&lt;string, bool>(\"GeneralReg\", log => Regex.IsMatch(log, s_apacheRx));<\/pre>\n<p>So how do we call <code>GeneralReg<\/code>?<\/p>\n<p>In addition to UDFs, Spark SQL provides the ability to write <strong>SQL calls<\/strong> to analyze our data \u2013 how convenient! 
It\u2019s common to write a SQL call to apply a UDF to each row of data.<\/p>\n<p>To call <code>GeneralReg<\/code> from above, let\u2019s use the following SQL call:<\/p>\n<pre class=\"prettyprint\">DataFrame generalDf = spark.Sql(\"SELECT logs.value, GeneralReg(logs.value) FROM Logs\");<\/pre>\n<p>This SQL call applies <code>GeneralReg<\/code> to each log entry, so each row of the resulting <code>generalDf<\/code> tells us whether that entry is a valid and complete log.<\/p>\n<p>We can use <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/microsoft.spark.sql.dataframe.filter?view=spark-dotnet\">.Filter()<\/a> to only keep the complete log entries in our data, and then <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/microsoft.spark.sql.dataframe.show?view=spark-dotnet\">.Show()<\/a> to display our newly filtered DataFrame:<\/p>\n<pre class=\"prettyprint\">generalDf = generalDf.Filter(generalDf[\"GeneralReg(value)\"]);\ngeneralDf.Show();<\/pre>\n<p>Now that we\u2019ve performed some initial data prep, we can continue filtering and analyzing our data. 
Let\u2019s find log entries from IP addresses starting with 10 and related to spam in some way:<\/p>\n<pre class=\"prettyprint\">\/\/ Choose valid log entries that start with 10\nspark.Udf().Register&lt;string, bool>(\n    \"IPReg\",\n    log => Regex.IsMatch(log, \"^(?=10)\"));\n\ngeneralDf.CreateOrReplaceTempView(\"IPLogs\");\n\n\/\/ Apply UDF to get valid log entries starting with 10\nDataFrame ipDf = spark.Sql(\n    \"SELECT iplogs.value FROM IPLogs WHERE IPReg(iplogs.value)\");\nipDf.Show();\n\n\/\/ Choose valid log entries that start with 10 and deal with spam\nspark.Udf().Register&lt;string, bool>(\n    \"SpamRegEx\",\n    log => Regex.IsMatch(log, \"\\\\b(?=spam)\\\\b\"));\n\nipDf.CreateOrReplaceTempView(\"SpamLogs\");\n\n\/\/ Apply UDF to get valid entries that start with 10 and mention spam\nDataFrame spamDf = spark.Sql(\n    \"SELECT spamlogs.value FROM SpamLogs WHERE SpamRegEx(spamlogs.value)\");<\/pre>\n<p>Finally, let\u2019s count the number of GET requests in our final cleaned dataset. The magic of .NET for Spark is that we can combine it with other popular .NET features to write our apps. We\u2019ll use <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/csharp\/programming-guide\/concepts\/linq\/\">LINQ<\/a> to analyze the data in our Spark app one last time:<\/p>\n<pre class=\"prettyprint\">int numGetRequests = spamDf\n    .Collect()\n    .Where(r => ContainsGet(r.GetAs&lt;string>(\"value\")))\n    .Count();<\/pre>\n<p>In the above code, <code>ContainsGet()<\/code> checks for GET requests using <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.text.regularexpressions.regex.match?view=netframework-4.8\">regex matching<\/a>:<\/p>\n<pre class=\"prettyprint\">\/\/ Use regex matching to group data\n\/\/ Each group matches a column in our log schema\n\/\/ i.e. 
first group = first column = IP\npublic static bool ContainsGet(string logLine)\n{\n    Match match = Regex.Match(logLine, s_apacheRx);\n\n    \/\/ Determine if valid log entry is a GET request\n    if (match.Success)\n    {\n        Console.WriteLine(\"Full log entry: '{0}'\", match.Groups[0].Value);\n\n        \/\/ 5th column\/group in schema is \"method\"\n        if (match.Groups[5].Value == \"GET\")\n        {\n            return true;\n        }\n    }\n\n    return false;\n}<\/pre>\n<p>As a final step in our Spark apps, we call <code>spark.Stop()<\/code> to shut down the underlying Spark Session and Spark Context.<\/p>\n<p>You can view the <a href=\"https:\/\/github.com\/dotnet\/spark\/blob\/master\/examples\/Microsoft.Spark.CSharp.Examples\/Sql\/Batch\/Logging.cs\">complete log processing example<\/a> in our GitHub repo.<\/p>\n<h3><a id=\"running\"><\/a> Running your app<\/h3>\n<p><a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/spark\/tutorials\/get-started\">To run a .NET for Apache Spark app<\/a>, you need to use the <code>spark-submit<\/code> command, which will submit your application to run on Apache Spark.<\/p>\n<p>The main parts of <code>spark-submit<\/code> include:<\/p>\n<ul>\n<li><code>--class<\/code>, to call the DotnetRunner.<\/li>\n<li><code>--master<\/code>, to determine if this is a local or cloud Spark submission.<\/li>\n<li>Path to the Microsoft.Spark jar file.<\/li>\n<li>Any arguments or dependencies for your app, such as the path to your input file or the DLL containing UDF definitions.<\/li>\n<\/ul>\n<p>You\u2019ll also need to download and set up some dependencies before running a .NET for Spark app locally, such as Java and Apache Spark.<\/p>\n<p>A sample Windows command for running your app is as follows:<\/p>\n<p><code>spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local \/path\/to\/microsoft-spark-&lt;version&gt;.jar dotnet \/path\/to\/netcoreapp&lt;version&gt;\/LoggingApp.dll<\/code><\/p>\n<h3><a 
id=\"wrapup\"><\/a>.NET for Apache Spark Wrap Up<\/h3>\n<p>We\u2019d love to help you get started with .NET for Apache Spark and hear your feedback.<\/p>\n<p>You can <a href=\"https:\/\/dot.net\/spark\">Request a Demo<\/a> from our landing page and check out the <a href=\"https:\/\/github.com\/dotnet\/spark\">.NET for Spark GitHub repo<\/a> to learn more about how you can apply .NET for Spark in your apps and get involved with our effort to make .NET a great tech stack for building big data applications!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>.NET for Apache Spark makes Apache\u00ae Spark\u2122, and thus the exciting world of big data analytics, accessible to .NET developers. .NET for Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc queries. In this post, we explore how to use .NET for Spark to perform log analysis.<\/p>\n","protected":false},"author":16274,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,6071],"tags":[6075,6072,2854,6074,6073],"class_list":["post-26368","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-apache","tag-net-for-spark","tag-apache-spark","tag-big-data","tag-log-analysis","tag-spark-net"],"acf":[],"blog_post_summary":"<p>.NET for Apache Spark makes Apache\u00ae Spark\u2122, and thus the exciting world of big data analytics, accessible to .NET developers. .NET for Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc queries. 
In this post, we explore how to use .NET for Spark to perform log analysis.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/26368","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/16274"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=26368"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/26368\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=26368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=26368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=26368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}