{"@attributes":{"version":"2.0"},"channel":{"title":"FireDucks \u2013 FireDucks","link":"https:\/\/fireducks-dev.github.io\/","description":"Recent content on FireDucks","generator":"Hugo -- gohugo.io","language":"en","lastBuildDate":"Fri, 31 Jan 2025 00:00:00 +0000","item":[{"title":"Posts: Unveiling the Optimization Benefit of FireDucks Lazy Execution: Part #3","link":"https:\/\/fireducks-dev.github.io\/posts\/data_flow_optimization\/","pubDate":"Fri, 31 Jan 2025 00:00:00 +0000","guid":"https:\/\/fireducks-dev.github.io\/posts\/data_flow_optimization\/","description":"\n<p>In the previous <a href=\"..\/efficient_caching\">article<\/a>, we discussed how FireDucks&rsquo; lazy execution can take care of\ncaching intermediate results in order to avoid recomputing an expensive operation.\nIn today&rsquo;s article, we will focus on the <strong>efficient data flow optimization<\/strong> performed by its JIT compiler.\nWe will first look at some best practices for large-scale data analysis in pandas\nand then discuss how the FireDucks lazy execution model can take care of them automatically.<\/p>\n<h2 id=\"challenge-1\">Challenge #1<\/h2>\n<p>Let&rsquo;s consider the following two queries solving the same problem: <em>Find the top 2 &ldquo;A&rdquo; values based on the &ldquo;B&rdquo; column<\/em>.<\/p>\n<p>\ud83d\udc49 <strong>Can you guess which one is better from a performance point of view?<\/strong><\/p>\n<p>(1) version 1<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>res <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;B&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;A&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>head(<span 
style=\"color:#ae81ff\">2<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>(2) version 2<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>tmp <span style=\"color:#f92672\">=<\/span> df[[<span style=\"color:#e6db74\">&#34;A&#34;<\/span>, <span style=\"color:#e6db74\">&#34;B&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span>res <span style=\"color:#f92672\">=<\/span> tmp<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;B&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;A&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">2<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>When we ran this quiz at a recent data science event,\n45% of the participants answered that the first one is more efficient, while the remaining 55% chose the second.<\/p>\n<p>Congratulations if your answer is (2) as well. \ud83d\udc4f<\/p>\n<p>In real-world situations, the target data might have many columns. When we invoke the sort operation on the <code>df<\/code> instance,\nit sorts the entire data, incurring a significant cost in memory and computation.<\/p>\n<p>As depicted in the following diagram, if the data has columns from &lsquo;a&rsquo; to &lsquo;j&rsquo;, the first query\nalso sorts columns &lsquo;c&rsquo; to &lsquo;j&rsquo;, which are of no interest to us. Hence, it is wise to create a view of\nthe part of the data that is of interest (as shown in the following figure) before performing a computationally intensive operation\nlike sort, groupby, join etc. 
This way, we can save a significant amount of runtime memory and computation time.\nSuch an optimization is typically known as <code>projection pushdown<\/code>.\n<img src=\"pushdown_projection.png\" alt=\"projection pushdown example\"><\/p>\n<h2 id=\"challenge-2\">Challenge #2<\/h2>\n<p>Let&rsquo;s now consider another example:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>m <span style=\"color:#f92672\">=<\/span> employee<span style=\"color:#f92672\">.<\/span>merge(country, on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;C_Code&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>f <span style=\"color:#f92672\">=<\/span> m[m[<span style=\"color:#e6db74\">&#34;Gender&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;Male&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>r <span style=\"color:#f92672\">=<\/span> f<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;C_Name&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;E_Name&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>count()\n<\/span><\/span><\/code><\/pre><\/div><p>The following diagram illustrates the operations that take place while executing the query above:\n<img src=\"illustration1.png\" alt=\"country-wise count of male employees\"><\/p>\n<p>\ud83d\udc49 <strong>Can you guess the performance bottleneck involved in the above query?<\/strong><\/p>\n<p>You probably guessed it correctly!<\/p>\n<p>The query analyzes only the <code>male<\/code> employees.\nSo why include all the employees in the very first step, when joining the two dataframes <code>employee<\/code> and <code>country<\/code>?\nWe could simply filter the male employees from the <code>employee<\/code> data first and\nperform 
the rest of the operations (merge, groupby, etc.) on the filtered result, as shown below.\nThis way, we could save significant execution time and memory during the expensive merge operation.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>f <span style=\"color:#f92672\">=<\/span> employee[employee[<span style=\"color:#e6db74\">&#34;Gender&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;Male&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>m <span style=\"color:#f92672\">=<\/span> f<span style=\"color:#f92672\">.<\/span>merge(country, on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;C_Code&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>r <span style=\"color:#f92672\">=<\/span> m<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;C_Name&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;E_Name&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>count()\n<\/span><\/span><\/code><\/pre><\/div><p>Such an optimization is typically known as <code>predicate pushdown<\/code>.\n<img src=\"predicate_pushdown.png\" alt=\"predicate pushdown example\"><\/p>\n<h2 id=\"lets-follow-these-best-practices\">Let&rsquo;s follow these best practices<\/h2>\n<p>When dealing with large-scale data, we are often not interested in every part of it.\nHence, it is always a good practice to reduce the scope of your data before applying an\nexpensive operation to it, saving a significant amount of runtime memory and computation time.<\/p>\n<ul>\n<li>When you know that an operation involves only some of the columns,\nit is recommended to project the target columns first to reduce the data in the horizontal direction.<\/li>\n<li>Again, if your 
operation targets only selected rows of the data,\nit is recommended to filter the target rows before performing the operation to reduce the data further in the vertical direction.<\/li>\n<\/ul>\n<p>For example, consider the following query:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>sort_values(<span style=\"color:#e6db74\">&#34;A&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>query(<span style=\"color:#e6db74\">&#34;B &gt; 1&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;E&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">2<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>Let&rsquo;s consider data with the following color codes, where the expected sorted order is: <strong>yellow, red, green, blue.<\/strong><\/p>\n<p>Also, let&rsquo;s assume B=1 for the darker shades and B=2 for the lighter shades.\nThe flow of the above operation will be as follows:\n<img src=\"illustration2.png\" alt=\"sample data flow\"><\/p>\n<p>As you can see, the columns <code>C<\/code> and <code>D<\/code> are carried through all of the first three steps, but they are never required in the final result.\nAlso, the sort operation is performed on all the rows of the data, whereas we are only interested in the rows with the lighter shades.<\/p>\n<p>Hence, the optimized data flow could be as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>loc[:, [<span 
style=\"color:#e6db74\">&#34;A&#34;<\/span>, <span style=\"color:#e6db74\">&#34;B&#34;<\/span>, <span style=\"color:#e6db74\">&#34;E&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>query(<span style=\"color:#e6db74\">&#34;B &gt; 1&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>sort_values(<span style=\"color:#e6db74\">&#34;A&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;E&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">2<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>It efficiently reduces the data in the horizontal (projection pushdown) and vertical (predicate pushdown) directions\nbefore applying the expensive sort operation, as depicted below:\n<img src=\"illustration3.png\" alt=\"sample optimized data flow\"><\/p>\n<h2 id=\"case-study\">Case Study<\/h2>\n<p>Now let&rsquo;s see how such optimizations can be useful in real-world situations.<\/p>\n<p><a href=\"https:\/\/www.tpc.org\/tpch\/\">TPC-H<\/a> is a decision support benchmark that consists of a suite of business-oriented ad-hoc queries and concurrent data modifications.\nWe will use <a href=\"https:\/\/www.tpc.org\/TPC_Documents_Current_Versions\/pdf\/TPC-H_v3.0.1.pdf#page=33\">Query-3<\/a> as the example in this demonstration; it deals with three large tables,\nnamely <code>lineitem<\/code>, <code>customer<\/code>, and <code>orders<\/code>, with complex join, groupby, and sort operations.<\/p>\n<p>The original query is written in SQL. 
We can write the following pandas equivalent of the same query:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">q3<\/span>():\n<\/span><\/span><span style=\"display:flex;\"><span> (\n<\/span><\/span><span style=\"display:flex;\"><span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;customer.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;orders.parquet&#34;<\/span>), left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;lineitem.parquet&#34;<\/span>), left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;BUILDING&#34;<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span 
style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>] <span style=\"color:#f92672\">&lt;<\/span> datetime<span style=\"color:#f92672\">.<\/span>date(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> datetime<span style=\"color:#f92672\">.<\/span>date(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>assign(revenue<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>] <span style=\"color:#f92672\">*<\/span> (<span style=\"color:#ae81ff\">1<\/span> <span style=\"color:#f92672\">-<\/span> df[<span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>]))\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby([<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>], as_index<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">False<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>agg({<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>: <span style=\"color:#e6db74\">&#34;sum&#34;<\/span>})[[<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span 
style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>sort_values([<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>], ascending<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#66d9ef\">False<\/span>, <span style=\"color:#66d9ef\">True<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>reset_index(drop<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">True<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">10<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>to_parquet(<span style=\"color:#e6db74\">&#34;result.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> )\n<\/span><\/span><\/code><\/pre><\/div><p>The above implementation doesn&rsquo;t follow these &ldquo;best practices&rdquo;.\nIt loads the entire data from all three tables and directly merges them to construct a large table\nbefore performing the rest of the filter, groupby etc. 
operations as required for the query.<\/p>\n<p>When we executed the above program in pandas for a scale-factor 10,\nit <strong>took around 203 seconds and the peak memory consumption was around 56 GB<\/strong>.\n<img src=\"q3_pandas.png\" alt=\"pandas q3 metrics\"><\/p>\n<p>Let&rsquo;s now implement the best-practices discussed in the previous section to manually optimize the query as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">optimized_q3<\/span>():\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#75715e\"># load only required columns from respective tables<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> req_customer_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#75715e\"># (2\/8)<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> req_lineitem_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>] <span style=\"color:#75715e\">#(4\/16)<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> req_orders_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>] <span style=\"color:#75715e\">#(4\/9)<\/span>\n<\/span><\/span><span 
style=\"display:flex;\"><span> customer <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;customer.parquet&#34;<\/span>, columns <span style=\"color:#f92672\">=<\/span> req_customer_cols)\n<\/span><\/span><span style=\"display:flex;\"><span> lineitem <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;lineitem.parquet&#34;<\/span>, columns <span style=\"color:#f92672\">=<\/span> req_lineitem_cols)\n<\/span><\/span><span style=\"display:flex;\"><span> orders <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;orders.parquet&#34;<\/span>, columns <span style=\"color:#f92672\">=<\/span> req_orders_cols)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201ccustomer\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> f_cust <span style=\"color:#f92672\">=<\/span> customer[customer[<span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;BUILDING&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201corders\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> f_ord <span style=\"color:#f92672\">=<\/span> orders[orders[<span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>] <span style=\"color:#f92672\">&lt;<\/span> datetime<span style=\"color:#f92672\">.<\/span>date(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)]\n<\/span><\/span><span 
style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201clineitem\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span> f_litem <span style=\"color:#f92672\">=<\/span> lineitem[lineitem[<span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> datetime<span style=\"color:#f92672\">.<\/span>date(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span> (\n<\/span><\/span><span style=\"display:flex;\"><span> f_cust<span style=\"color:#f92672\">.<\/span>merge(f_ord, left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(f_litem, left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>assign(revenue<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>] <span style=\"color:#f92672\">*<\/span> (<span style=\"color:#ae81ff\">1<\/span> <span style=\"color:#f92672\">-<\/span> df[<span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>]))\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby([<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span 
style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>], as_index<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">False<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>agg({<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>: <span style=\"color:#e6db74\">&#34;sum&#34;<\/span>})[[<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>sort_values([<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>], ascending<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#66d9ef\">False<\/span>, <span style=\"color:#66d9ef\">True<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>reset_index(drop<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">True<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">10<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>to_parquet(<span style=\"color:#e6db74\">&#34;result.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> )\n<\/span><\/span><\/code><\/pre><\/div><p>Instead of loading all the 8 columns from the <code>customer<\/code> table,\nall the 16 columns from the <code>lineitem<\/code> table,\nand all the 9 columns from the <code>orders<\/code> table,\nit loads only the target columns that would be required to implement the query\nby reducing the data in the horizontal direction (applying projection pushdown).<\/p>\n<p>Also, 
since we need only specific rows from these tables based on the given conditions,\nwe filter the loaded data early to reduce it further in the vertical direction (applying predicate pushdown).<\/p>\n<p>When we executed the above optimized implementation using pandas at scale factor 10,\nit <strong>took around 13 seconds and the peak memory consumption was around 5.5 GB<\/strong>.\n<img src=\"opt_q3_pandas.png\" alt=\"pandas optimized-q3 metrics\"><\/p>\n<p>From this experiment, it is quite evident that optimizing the implementation of a pandas program\ncan by itself improve its performance and memory consumption to a great extent.<\/p>\n<p>\ud83d\udc49 <strong>Q. Can we automate such optimization so that one can focus on in-depth data analysis, relying on a tool or library for such expert-level optimization?<\/strong><\/p>\n<p>The answer is <strong>&ldquo;YES&rdquo;<\/strong>. You can rely on FireDucks for such optimization. \ud83d\ude80<\/p>\n<h2 id=\"fireducks-offerings\">FireDucks Offerings<\/h2>\n<p>While being highly compatible with pandas,\nFireDucks can perform such expert-level optimization automatically when using its default lazy execution mode.<\/p>\n<p>To verify this, we executed both functions as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># use FireDucks for all the processing<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> fireducks.pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>q3()\n<\/span><\/span><span style=\"display:flex;\"><span>optimized_q3()\n<\/span><\/span><\/code><\/pre><\/div><p>And the 
execution completed within 4-5 seconds in both cases, showing FireDucks&rsquo; strength\nin performing such optimizations automatically, even when the program itself doesn&rsquo;t take care of them (as in q3).<\/p>\n<p>We used a <code>v2-8 TPU<\/code> instance from Google Colab for this evaluation; here are the detailed findings:<\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\"><\/th>\n<th style=\"text-align:right\">pandas exec. time (s)<\/th>\n<th style=\"text-align:right\">pandas memory (GB)<\/th>\n<th style=\"text-align:right\">FireDucks exec. time (s)<\/th>\n<th style=\"text-align:right\">FireDucks memory (GB)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">q3<\/td>\n<td style=\"text-align:right\">203.18<\/td>\n<td style=\"text-align:right\">56<\/td>\n<td style=\"text-align:right\">4.24<\/td>\n<td style=\"text-align:right\">3.3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">optimized_q3<\/td>\n<td style=\"text-align:right\">12.97<\/td>\n<td style=\"text-align:right\">5.5<\/td>\n<td style=\"text-align:right\">4.81<\/td>\n<td style=\"text-align:right\">3.4<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You might like to try this <a href=\"https:\/\/colab.research.google.com\/github\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/tpch-query3-pandas-fireducks-cudf.ipynb\">notebook<\/a>\non Google Colab to reproduce these results.<\/p>\n<h2 id=\"wrapping-up\">Wrapping-up<\/h2>\n<p>Thank you for taking the time to read this article.\nWe have discussed a couple of best practices that one should follow when performing large-scale data analysis in pandas\nand how FireDucks can apply them automatically. 
The experimental results show that switching from pandas to FireDucks\ncan improve the performance of a poorly written program by about 48x (203.18 s -&gt; 4.24 s)\nwhile reducing memory consumption by about 17x (56 GB -&gt; 3.3 GB).<\/p>\n<p>In case you have any questions or an issue to report,\nplease feel free to get in touch with us through any of your preferred channels listed below:<\/p>\n<ul>\n<li>\ud83e\udd86github : <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new\">https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new<\/a><\/li>\n<li>\ud83d\udce7mail : <a href=\"mailto:contact@fireducks.jp.nec.com\">contact@fireducks.jp.nec.com<\/a><\/li>\n<li>\ud83e\udd1dslack : <a href=\"https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg\">https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg<\/a><\/li>\n<\/ul>"},{"title":"Posts: Pitfalls of Time Measurement for FireDucks with %%time in Notebooks","link":"https:\/\/fireducks-dev.github.io\/posts\/2024-12-26-time-pitfalls\/","pubDate":"Thu, 26 Dec 2024 09:35:10 +0900","guid":"https:\/\/fireducks-dev.github.io\/posts\/2024-12-26-time-pitfalls\/","description":"\n<p>This is Osamu Daido from the FireDucks development team. 
In today's developers' blog, I would like to present a subtle pitfall in time measurement.<\/p>\n<h2 id=\"quick-overview\">Quick Overview<\/h2>\n<p>When measuring the execution time of FireDucks using the <code>%%time<\/code> magic command in IPython Notebooks, make sure to always call the <code>_evaluate()<\/code> method of DataFrames or Series to ensure proper evaluation!<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">%%<\/span>time\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;input.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>_evaluate()\n<\/span><\/span><\/code><\/pre><\/div><h2 id=\"time-measurement-in-notebooks\">Time Measurement in Notebooks<\/h2>\n<p>Jupyter and other IPython Notebooks provide the <code>%%time<\/code> magic command to measure the execution time of the code written in a cell. For instance, a single percent sign <code>%time<\/code> measures the execution time of only one line of code, while double percent signs <code>%%time<\/code> measure the execution time for the entire cell. You may be interested in or curious about measuring the execution time of FireDucks because it can process data faster while offering the same API as pandas.<\/p>\n<p>However&hellip;, hmm? 
Do you think the following time measurement is correct?<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> fireducks.pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><\/code><\/pre><\/div><div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">%%<\/span>time\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;sample-dataset-tips.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>head()\n<\/span><\/span><\/code><\/pre><\/div><blockquote>\n<pre><code>CPU times: user 3.37 ms, sys: 4.06 ms, total: 7.43 ms\nWall time: 6.87 ms\n<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:right\"><\/th>\n<th style=\"text-align:right\">total_bill<\/th>\n<th style=\"text-align:right\">tip<\/th>\n<th style=\"text-align:right\">sex<\/th>\n<th style=\"text-align:right\">smoker<\/th>\n<th style=\"text-align:right\">day<\/th>\n<th style=\"text-align:right\">time<\/th>\n<th style=\"text-align:right\">size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:right\">0<\/td>\n<td style=\"text-align:right\">16.99<\/td>\n<td style=\"text-align:right\">1.01<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">No<\/td>\n<td style=\"text-align:right\">Sun<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">1<\/td>\n<td 
style=\"text-align:right\">10.34<\/td>\n<td style=\"text-align:right\">1.66<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">No<\/td>\n<td style=\"text-align:right\">Sun<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">2<\/td>\n<td style=\"text-align:right\">21.01<\/td>\n<td style=\"text-align:right\">3.50<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">No<\/td>\n<td style=\"text-align:right\">Sun<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">3<\/td>\n<td style=\"text-align:right\">23.68<\/td>\n<td style=\"text-align:right\">3.31<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">No<\/td>\n<td style=\"text-align:right\">Sun<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">4<\/td>\n<td style=\"text-align:right\">24.59<\/td>\n<td style=\"text-align:right\">3.61<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">No<\/td>\n<td style=\"text-align:right\">Sun<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">4<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/blockquote>\n<h3 id=\"time-measurement-for-fireducks\">Time Measurement for FireDucks<\/h3>\n<p>As explained on <a href=\"https:\/\/fireducks-dev.github.io\/docs\/user-guide\/02-exec-model\/\">the execution model page<\/a>, FireDucks uses a lazy execution model. In simple terms, FireDucks DataFrames do not begin actual processing until explicitly displayed on the screen with functions like <code>print()<\/code> or <code>display()<\/code>, or when the <code>_evaluate()<\/code> method is called. FireDucks can process data more quickly by optimizing the accumulated operations before executing them. 
This execution of accumulated operations is referred to as &ldquo;evaluation.&rdquo;<\/p>\n<p>In IPython Notebooks, if the last line of a cell is not an assignment statement but simply a value, it is automatically displayed on the screen, similar to the Python interpreter's REPL. In IPython terms, it's as if <code>display()<\/code> is automatically called. This means that if you place a FireDucks DataFrame at the end of a cell, it will also be automatically evaluated.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;sample-dataset-tips.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;tip&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#75715e\"># THIS!<\/span>\n<\/span><\/span><\/code><\/pre><\/div><p>However, there is a subtle pitfall when you want to measure execution time using the <code>%%time<\/code> magic command. Just because placing a FireDucks DataFrame at the end of a cell triggers automatic evaluation, it does not necessarily mean that the correct execution time is measured.<\/p>\n<h3 id=\"example-of-incorrect-time-measurement\">Example of Incorrect Time Measurement<\/h3>\n<p>I prepared a CSV file of about 10GB for experimentation. 
When executing the following cell, something strange happens.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">%%<\/span>time\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;sample-dataset-tips10gb.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;tip&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df\n<\/span><\/span><\/code><\/pre><\/div><blockquote>\n<pre><code>CPU times: user 18.2 ms, sys: 4.31 ms, total: 22.5 ms\nWall time: 15.2 ms\n<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:right\"><\/th>\n<th style=\"text-align:right\">total_bill<\/th>\n<th style=\"text-align:right\">tip<\/th>\n<th style=\"text-align:right\">sex<\/th>\n<th style=\"text-align:right\">smoker<\/th>\n<th style=\"text-align:right\">day<\/th>\n<th style=\"text-align:right\">time<\/th>\n<th style=\"text-align:right\">size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:right\">67<\/td>\n<td style=\"text-align:right\">3.07<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">1<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">92<\/td>\n<td style=\"text-align:right\">5.75<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td 
style=\"text-align:right\">Fri<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815362<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815606<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>319815680 rows x 7 columns<\/p>\n<\/blockquote>\n<p>Let's look at the line labeled &ldquo;Wall time.&rdquo; Imagine if it only took 15 milliseconds to read and sort data with 300 million rows \u2014 wouldn't that be incredible? In reality, it took about 10 seconds from the start of the cell's execution until the results were displayed. You might wonder, &ldquo;The results are displayed on the screen, so shouldn't they be properly evaluated?&rdquo; That's half true and half false.<\/p>\n<p>In fact, with this approach, the evaluation of the DataFrame begins only after the <code>%%time<\/code> timer has stopped. 
In other words, because the order is <strong>timer stops \u2192 evaluation \u2192 display<\/strong>, the actual processing of the DataFrame is outside the measurement range of <code>%%time<\/code>.<\/p>\n<h3 id=\"example-of-correct-time-measurement\">Example of Correct Time Measurement<\/h3>\n<p>Therefore, even in IPython Notebooks, when you want to measure execution time, make sure to explicitly call the <code>_evaluate()<\/code> method of DataFrames to properly evaluate them. Writing it as shown below will execute in the order of <strong>evaluation \u2192 timer stops \u2192 display<\/strong>. This way, the actual processing of the DataFrame falls within the measurement range of <code>%%time<\/code>.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">%%<\/span>time\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;sample-dataset-tips10gb.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;tip&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>_evaluate()\n<\/span><\/span><\/code><\/pre><\/div><blockquote>\n<pre><code>CPU times: user 3min 58s, sys: 1min 2s, total: 5min\nWall time: 11.1 s\n<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:right\"><\/th>\n<th style=\"text-align:right\">total_bill<\/th>\n<th style=\"text-align:right\">tip<\/th>\n<th style=\"text-align:right\">sex<\/th>\n<th style=\"text-align:right\">smoker<\/th>\n<th style=\"text-align:right\">day<\/th>\n<th 
style=\"text-align:right\">time<\/th>\n<th style=\"text-align:right\">size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:right\">67<\/td>\n<td style=\"text-align:right\">3.07<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">1<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">92<\/td>\n<td style=\"text-align:right\">5.75<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Fri<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815362<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815606<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>319815680 rows x 7 
columns<\/p>\n<\/blockquote>\n<h3 id=\"slightly-different-solution\">Slightly Different Solution<\/h3>\n<p>If you write it as shown below, the order will be <strong>evaluation \u2192 display \u2192 timer stops<\/strong>. In this case, the process to display the DataFrame on the screen inadvertently becomes part of the time measurement. Do you notice any other differences? You might notice that the order of the DataFrame output and the timing result output is reversed.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">%%<\/span>time\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(<span style=\"color:#e6db74\">&#34;sample-dataset-tips10gb.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>sort_values(by<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;tip&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>display(df)\n<\/span><\/span><\/code><\/pre><\/div><blockquote>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:right\"><\/th>\n<th style=\"text-align:right\">total_bill<\/th>\n<th style=\"text-align:right\">tip<\/th>\n<th style=\"text-align:right\">sex<\/th>\n<th style=\"text-align:right\">smoker<\/th>\n<th style=\"text-align:right\">day<\/th>\n<th style=\"text-align:right\">time<\/th>\n<th style=\"text-align:right\">size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:right\">67<\/td>\n<td style=\"text-align:right\">3.07<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td 
style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">1<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">92<\/td>\n<td style=\"text-align:right\">5.75<\/td>\n<td style=\"text-align:right\">1.0<\/td>\n<td style=\"text-align:right\">Female<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Fri<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">2<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<td style=\"text-align:right\">&hellip;<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815362<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:right\">319815606<\/td>\n<td style=\"text-align:right\">50.81<\/td>\n<td style=\"text-align:right\">10.0<\/td>\n<td style=\"text-align:right\">Male<\/td>\n<td style=\"text-align:right\">Yes<\/td>\n<td style=\"text-align:right\">Sat<\/td>\n<td style=\"text-align:right\">Dinner<\/td>\n<td style=\"text-align:right\">3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>319815680 rows x 7 columns<\/p>\n<pre><code>CPU times: user 3min 58s, sys: 55.4 s, total: 4min 53s\nWall time: 10.6 s\n<\/code><\/pre>\n<\/blockquote>\n<h2 id=\"wrap-up\">Wrap-up<\/h2>\n<p>FireDucks has the same API as pandas while adopting a lazy evaluation mechanism, so it\u2019s important to pay attention to such subtle details when measuring processing time. 
However, FireDucks allows you to speed up processing while using nearly the same code as you would with pandas, making it a powerful ally for data science.<\/p>\n<p>Recently, many people have shown interest in FireDucks, and we are receiving more feedback about its improved speed as well as bug reports. As the development team, we are committed to making sure FireDucks gets widely used and continues to be valuable over the long term. Please stay tuned for future updates!<\/p>\n<p>May the Acceleration be with you, FireDucks Development Team<\/p>"},{"title":"Posts: Exploring performance benefits of FireDucks over cuDF","link":"https:\/\/fireducks-dev.github.io\/posts\/cudf_vs_fireducks\/","pubDate":"Wed, 18 Dec 2024 00:00:00 +0000","guid":"https:\/\/fireducks-dev.github.io\/posts\/cudf_vs_fireducks\/","description":"\n<p><a href=\"https:\/\/www.anaconda.com\/resources\/whitepapers\/state-of-data-science-2020\">Research<\/a> says that data\nscientists spend about 45% of their time on data preparation tasks, including loading (19%) and\ncleaning (26%) the data. <a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a> is one of the most popular Python\nlibraries for tabular data processing because of its diverse utilities and large community support.\nHowever, due to its performance issues with large-scale data processing, there is a strong need\nfor high-performance DataFrame libraries in the community. Although many alternatives\nare available at the moment, compatibility issues with pandas mean that some of them compel\na user either to learn completely new APIs (incurring migration cost) or to switch to a more\nefficient computational system, such as a GPU (incurring hardware cost).<\/p>\n<p>In this article we will discuss two high-performance pandas alternatives that can help a pandas programmer\nto smoothly migrate an existing application while offering promising speed. 
They are:<\/p>\n<ul>\n<li><a href=\"https:\/\/docs.rapids.ai\/api\/cudf\/stable\">cuDF<\/a>: A GPU-accelerated DataFrame library with highly compatible pandas APIs<\/li>\n<li><a href=\"https:\/\/fireducks-dev.github.io\/\">FireDucks<\/a>: A compiler-accelerated DataFrame library with highly compatible pandas APIs for speedup even on CPU-only systems<\/li>\n<\/ul>\n<h2 id=\"fireducks-vs-cudf\">FireDucks vs. cuDF<\/h2>\n<p>Both FireDucks and cuDF offer the following:<\/p>\n<ul>\n<li>a promising speedup with zero code changes<\/li>\n<li>highly compatible pandas APIs for seamless integration with an existing pandas application<\/li>\n<li>an import-hook feature for seamless integration with third-party libraries that use pandas<\/li>\n<li>parallel implementations of the kernel algorithms (such as join and groupby) to leverage all the available cores<\/li>\n<\/ul>\n<p>However, the key differences are:<\/p>\n<ul>\n<li>FireDucks can speed up an existing pandas application even on CPU-only systems, whereas\none needs to prepare a GPU environment before trying cuDF.<\/li>\n<li>FireDucks supports a lazy execution model aiming for JIT query optimization, whereas\ncuDF supports only an eager execution model (similar to pandas). Therefore, if the program\nis not written carefully with the right data flow, cuDF might suffer performance issues, while\nFireDucks can outperform cuDF even on CPU-only systems due to its efficient query optimization.<\/li>\n<\/ul>\n<h2 id=\"evaluation\">Evaluation<\/h2>\n<h3 id=\"multi-threaded-benefit\">Multi-threaded Benefit<\/h3>\n<p>Here is an <a href=\"https:\/\/developer.nvidia.com\/blog\/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes\">article<\/a>\nexplaining the key features of cuDF along with its performance. 
We used the notebook provided in that article\nto evaluate <code>pandas<\/code>, <code>fireducks.pandas<\/code>, and <code>cudf.pandas<\/code>.<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/nyc_demo\/pandas_nyc_demo.ipynb\">test drive for native pandas<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/nyc_demo\/fireducks_pandas_nyc_demo.ipynb\">test drive for fireducks.pandas<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/nyc_demo\/cudf_pandas_nyc_demo.ipynb\">test drive for cudf.pandas<\/a><\/li>\n<\/ul>\n<p>Here are some details related to the evaluation environment:<\/p>\n<ul>\n<li>CPU model: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz<\/li>\n<li>CPU cores: 48<\/li>\n<li>Main memory: 256 GB<\/li>\n<li>GPU model: NVIDIA Tesla V100<\/li>\n<\/ul>\n<p>Note that simply enabling the extension <code>%load_ext fireducks.pandas<\/code>\nor <code>%load_ext cudf.pandas<\/code> is enough to speed up the operations in an\nexisting pandas notebook using FireDucks or cuDF. 
For this experiment, we\ndisabled the FireDucks lazy-execution mode as follows for a fair comparison among these three libraries:<\/p>\n<pre tabindex=\"0\"><code>from fireducks.core import get_fireducks_options\nget_fireducks_options().set_benchmark_mode(True)\n<\/code><\/pre><p>The table below summarizes the query-wise execution times for these libraries:<\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\"><\/th>\n<th style=\"text-align:right\">pandas (sec)<\/th>\n<th style=\"text-align:right\">FireDucks (sec)<\/th>\n<th style=\"text-align:right\">cuDF (sec)<\/th>\n<th style=\"text-align:left\">FireDucks speedup over pandas<\/th>\n<th style=\"text-align:left\">cuDF speedup over pandas<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\">data_loading<\/td>\n<td style=\"text-align:right\">1.85<\/td>\n<td style=\"text-align:right\">0.53<\/td>\n<td style=\"text-align:right\">0.42<\/td>\n<td style=\"text-align:left\">3.49x<\/td>\n<td style=\"text-align:left\">4.4x<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">query_1<\/td>\n<td style=\"text-align:right\">2.4<\/td>\n<td style=\"text-align:right\">0.08<\/td>\n<td style=\"text-align:right\">0.35<\/td>\n<td style=\"text-align:left\">30.0x<\/td>\n<td style=\"text-align:left\">6.86x<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">query_2<\/td>\n<td style=\"text-align:right\">0.75<\/td>\n<td style=\"text-align:right\">0.03<\/td>\n<td style=\"text-align:right\">0.01<\/td>\n<td style=\"text-align:left\">25.0x<\/td>\n<td style=\"text-align:left\">75.0x<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\">query_3<\/td>\n<td style=\"text-align:right\">6.38<\/td>\n<td style=\"text-align:right\">0.15<\/td>\n<td style=\"text-align:right\">0.08<\/td>\n<td style=\"text-align:left\">42.53x<\/td>\n<td style=\"text-align:left\">79.75x<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Due to the difference in the underlying hardware, cuDF operations (which ran on the GPU) definitely performed much better\nwhen 
compared to pandas, but the performance gain from FireDucks over pandas even on CPU is quite promising.\nIn fact, the <strong>overall speedup is ~13x (11.37s -&gt; 0.87s) when using cuDF,\nwhereas it is ~14x (11.37s -&gt; 0.79s) when using FireDucks<\/strong> for the same pandas program.<\/p>\n<h3 id=\"jit-optimization-benefit\">JIT Optimization Benefit<\/h3>\n<p>The above case shows how efficiently FireDucks can leverage the available CPU cores to speed up an existing pandas program.<\/p>\n<p>Let&rsquo;s now understand how FireDucks JIT query optimization can make it even better!<\/p>\n<p>We used a <a href=\"https:\/\/www.tpc.org\/TPC_Documents_Current_Versions\/pdf\/TPC-H_v3.0.1.pdf#page=33\">sample query<\/a>\nfrom the <a href=\"https:\/\/www.tpc.org\/tpch\/\">TPC-H benchmark<\/a> that joins three tables of different dimensions\nat scale factor 10.<\/p>\n<p>\ud83d\udc49 <strong>Purpose: To retrieve the 10 unshipped orders with the highest value.<\/strong><\/p>\n<p>Here is the pandas implementation for this query:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>(\n<\/span><\/span><span style=\"display:flex;\"><span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;customer.parquet&#34;<\/span>))\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;orders.parquet&#34;<\/span>)),\n<\/span><\/span><span style=\"display:flex;\"><span> left_on<span style=\"color:#f92672\">=<\/span><span 
style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;lineitem.parquet&#34;<\/span>)),\n<\/span><\/span><span style=\"display:flex;\"><span> left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;BUILDING&#34;<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>] <span style=\"color:#f92672\">&lt;<\/span> datetime(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>pipe(<span style=\"color:#66d9ef\">lambda<\/span> df: df[df[<span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> datetime(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>assign(revenue<span 
style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>] <span style=\"color:#f92672\">*<\/span> (<span style=\"color:#ae81ff\">1<\/span> <span style=\"color:#f92672\">-<\/span> df[<span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>]))\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby([<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>], as_index<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">False<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>agg({<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>: <span style=\"color:#e6db74\">&#34;sum&#34;<\/span>})[[<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>sort_values([<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>], ascending<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#66d9ef\">False<\/span>, <span style=\"color:#66d9ef\">True<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>reset_index(drop<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">True<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">10<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>to_parquet(os<span 
style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;q3_result.parquet&#34;<\/span>))\n<\/span><\/span><span style=\"display:flex;\"><span>)\n<\/span><\/span><\/code><\/pre><\/div><p>This time we used the default lazy-execution mode in FireDucks to demonstrate its true strength.\nThe execution time of this query for each DataFrame library is as follows:<\/p>\n<ul>\n<li>native pandas: 215.47 sec<\/li>\n<li>fireducks.pandas: 1.69 sec<\/li>\n<li>cudf.pandas: 26.79 sec<\/li>\n<\/ul>\n<p>\ud83d\ude80\ud83d\ude80 <strong>FireDucks outperformed pandas by up to 127x (215.47s -&gt; 1.69s) and cuDF by up to 15x (26.79s -&gt; 1.69s) for the above query.<\/strong><\/p>\n<p>\ud83e\udd14 You might be wondering how a CPU-based implementation in FireDucks can be\nfaster than a GPU-based implementation in cuDF!<\/p>\n<p>This speedup from FireDucks is due to the efficient query planning and optimization that\nis performed by the internal JIT compiler. Instead of executing the input query as it is,\nit attempts to optimize the query by reducing the scope of the input data for the\ntime-consuming join, groupby, and similar operations, mainly through the following steps:<\/p>\n<ul>\n<li>loading only the required columns from the input parquet files to reduce the data horizontally<\/li>\n<li>performing early filtering to reduce the data vertically<\/li>\n<\/ul>\n<p>\ud83d\udcd3 In FireDucks lazy-execution mode, calling a method such as <code>to_parquet<\/code>, <code>plot<\/code>, or <code>print<\/code>\ntriggers the compiler to start optimizing the accumulated data flow. Once
Once\nthe optimization phase is completed, it is executed by a multi-threaded CPU kernel backed by\narrow memory helping you to experience superfast data processing, along with remarkable\nreduction in the computational memory.<\/p>\n<p>The optimized implementation for the same query could be as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>req_customer_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#75715e\"># selecting (2\/8) columns<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>req_lineitem_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>, <span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>] <span style=\"color:#75715e\"># selecting (4\/16) columns<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>req_orders_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>] <span style=\"color:#75715e\"># selecting (4\/9) columns<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>customer <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;customer.parquet&#34;<\/span>), columns <span style=\"color:#f92672\">=<\/span> 
req_customer_cols)\n<\/span><\/span><span style=\"display:flex;\"><span>lineitem <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;lineitem.parquet&#34;<\/span>), columns <span style=\"color:#f92672\">=<\/span> req_lineitem_cols)\n<\/span><\/span><span style=\"display:flex;\"><span>orders <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;orders.parquet&#34;<\/span>), columns <span style=\"color:#f92672\">=<\/span> req_orders_cols)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201ccustomer\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>f_cust <span style=\"color:#f92672\">=<\/span> customer[customer[<span style=\"color:#e6db74\">&#34;c_mktsegment&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#e6db74\">&#34;BUILDING&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201corders\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>f_ord <span style=\"color:#f92672\">=<\/span> orders[orders[<span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>] <span style=\"color:#f92672\">&lt;<\/span> datetime(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span 
style=\"color:#75715e\"># advanced-filter: to reduce scope of \u201clineitem\u201d table to be processed<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>f_litem <span style=\"color:#f92672\">=<\/span> lineitem[lineitem[<span style=\"color:#e6db74\">&#34;l_shipdate&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> datetime(<span style=\"color:#ae81ff\">1995<\/span>, <span style=\"color:#ae81ff\">3<\/span>, <span style=\"color:#ae81ff\">15<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>(\n<\/span><\/span><span style=\"display:flex;\"><span> f_cust<span style=\"color:#f92672\">.<\/span>merge(f_ord, left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;c_custkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_custkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>merge(f_litem, left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;o_orderkey&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>assign(revenue<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;l_extendedprice&#34;<\/span>] <span style=\"color:#f92672\">*<\/span> (<span style=\"color:#ae81ff\">1<\/span> <span style=\"color:#f92672\">-<\/span> df[<span style=\"color:#e6db74\">&#34;l_discount&#34;<\/span>]))\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby([<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>], as_index<span 
style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">False<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>agg({<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>: <span style=\"color:#e6db74\">&#34;sum&#34;<\/span>})[[<span style=\"color:#e6db74\">&#34;l_orderkey&#34;<\/span>, <span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_shippriority&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>sort_values([<span style=\"color:#e6db74\">&#34;revenue&#34;<\/span>, <span style=\"color:#e6db74\">&#34;o_orderdate&#34;<\/span>], ascending<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#66d9ef\">False<\/span>, <span style=\"color:#66d9ef\">True<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>reset_index(drop<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">True<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>head(<span style=\"color:#ae81ff\">10<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>to_parquet(os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(datapath, <span style=\"color:#e6db74\">&#34;opt_q3_result.parquet&#34;<\/span>))\n<\/span><\/span><span style=\"display:flex;\"><span>)\n<\/span><\/span><\/code><\/pre><\/div><p>The execution time of this optimized implementation for each DataFrame library is as follows:<\/p>\n<ul>\n<li>native pandas: 11.13 sec<\/li>\n<li>fireducks.pandas: 1.72 sec<\/li>\n<li>cudf.pandas: 0.76 sec<\/li>\n<\/ul>\n<p>It can be noted that:<\/p>\n<ul>\n<li>the native pandas could itself be optimized upto <strong>~19x (215.47 sec -&gt; 11.13 sec)<\/strong><\/li>\n<li>there is no visible change 
in the execution time of FireDucks (<strong>since the compiler already performs the same optimization automatically in the earlier case<\/strong>)<\/li>\n<li>cudf.pandas could be optimized up to <strong>~35x (26.79 sec -&gt; 0.76 sec)<\/strong><\/li>\n<\/ul>\n<p>Most importantly, the optimization has no impact on the final result.\nYou can reproduce these results using this <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/tpch-query3-pandas-fireducks-cudf.ipynb\">notebook<\/a>.<\/p>\n<h2 id=\"wrapping-up\">Wrapping up<\/h2>\n<p>Thank you for taking the time to read this article. We have discussed the performance benefit of FireDucks\nover cuDF. While cuDF shows significant speedup without modifying an existing pandas program,\nits performance relies on the underlying GPU specification and how well the program is written, whereas\nFireDucks can optimize an existing pandas program efficiently, like an expert programmer would, and\nexecute it without any extra overhead, even on CPU-only systems.<\/p>\n<p>That being said, <strong>a GPU version of FireDucks is under development<\/strong>. It internally uses <code>cudf.pandas<\/code>\nfor the kernel operations (like groupby, join etc.), while adding the JIT optimization for further\nacceleration as explained in this article. 
For example, even when you write the query as in the\nfirst implementation, the FireDucks compiler would auto-optimize it into something similar to the\noptimized implementation and then pass it to the cuDF kernel for execution\non the GPU (so that the query can finish in ~0.76 sec).\nWe will discuss the GPU version of FireDucks in detail in a future article.<\/p>\n<p>We look forward to your continued feedback to make FireDucks even better.\nPlease feel free to get in touch with us through any of your preferred channels mentioned below:<\/p>\n<ul>\n<li>\ud83e\udd86github : <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new\">https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new<\/a><\/li>\n<li>\ud83d\udce7mail : <a href=\"mailto:contact@fireducks.jp.nec.com\">contact@fireducks.jp.nec.com<\/a><\/li>\n<li>\ud83e\udd1dslack : <a href=\"https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg\">https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg<\/a><\/li>\n<\/ul>"},{"title":"Posts: Unveiling the Optimization Benefit of FireDucks Lazy Execution: Part #1","link":"https:\/\/fireducks-dev.github.io\/posts\/lazy_execution_offering_part1\/","pubDate":"Thu, 05 Dec 2024 00:00:00 +0000","guid":"https:\/\/fireducks-dev.github.io\/posts\/lazy_execution_offering_part1\/","description":"\n<p>Available runtime memory is often a challenge when processing larger-than-memory datasets with pandas.\nTo solve the problem, one can either move to a system with larger memory capacity or consider switching to alternative libraries that support distributed data processing (like Dask, PySpark etc.).<\/p>\n<p>Well, did you know that when working with data stored in file formats like csv, parquet etc. 
and only some part of the data is to be processed, manual optimization is possible even in pandas?\nFor example, let&rsquo;s assume the data below is stored in a parquet file named sample_data.parquet (or in a csv file named sample_data.csv):<\/p>\n<pre tabindex=\"0\"><code>   a    b   c  x   y   z\n0  1  0.1   1  0  t1  10\n1  2  0.2   4  1  t2  20\n2  3  0.3   9  1  t3  30\n3  4  0.4  16  0  t1  40\n4  5  0.5  25  1  t2  50\n5  6  0.6  36  1  t1  60\n6  7  0.7  49  0  t2  70\n7  8  0.8  64  1  t3  80\n<\/code><\/pre><p>Suppose you want to compute the sum of the &ldquo;c&rdquo; column over the rows where the &ldquo;x&rdquo; column is 1. You may simply write the program as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>res <span style=\"color:#f92672\">=<\/span> df[df[<span style=\"color:#e6db74\">&#34;x&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#ae81ff\">1<\/span>][<span style=\"color:#e6db74\">&#34;c&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum() <span style=\"color:#75715e\"># filter data based on condition and calculate sum of &#34;c&#34; column from filtered frame<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>print(res)\n<\/span><\/span><\/code><\/pre><\/div><p>A problem may occur when the parquet file is too large to fit in your system memory, although you are interested in only a part of it (columns &ldquo;x&rdquo; and &ldquo;c&rdquo;).\nThankfully, the read_parquet() method has a parameter named <code>columns<\/code> with which you can specify the target columns to be loaded from the input parquet file:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" 
style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>, columns<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#e6db74\">&#34;x&#34;<\/span>, <span style=\"color:#e6db74\">&#34;c&#34;<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span>res <span style=\"color:#f92672\">=<\/span> df[df[<span style=\"color:#e6db74\">&#34;x&#34;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#ae81ff\">1<\/span>][<span style=\"color:#e6db74\">&#34;c&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum() <span style=\"color:#75715e\"># filter data based on condition and calculate sum of &#34;c&#34; column from filtered frame<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>print(res)\n<\/span><\/span><\/code><\/pre><\/div><p>Similarly, read_csv() has a parameter named <code>usecols<\/code> that can be specified to load only the target columns from a CSV file.<\/p>\n<h2 id=\"fireducks-offerings\">FireDucks Offerings<\/h2>\n<p>Although such parameters can be specified to optimize runtime memory consumption when using pandas, it\nmight be difficult to know which columns are required at the very beginning of analysing the data.\nAn automatic optimization for such cases would definitely be useful for users of pandas-like libraries.<\/p>\n<p>Since <strong>FireDucks 1.1.1<\/strong>, its internal JIT compiler takes care of such optimization.\nEven when such parameters are not manually specified, the JIT compiler can inspect the projection targets\nat various stages for the given data and automatically specify such parameters when generating the optimized code.\nSuch optimization is commonly known as 
<strong>pushdown-projection<\/strong>. By setting the environment variable <strong>FIRE_LOG_LEVEL=3<\/strong>,\nyou can inspect the IR before and after optimization for the example below.<\/p>\n<pre tabindex=\"0\"><code>$ cat read_parquet_opt_demo.py\nimport pandas as pd\ndf = pd.read_parquet(&#34;sample_data.parquet&#34;)\nr1 = df[df[&#34;x&#34;] == 1][&#34;c&#34;].sum()\nprint(r1)\n<\/code><\/pre><p>Execute the program as follows:<\/p>\n<pre tabindex=\"0\"><code>$ FIRE_LOG_LEVEL=3 python -mfireducks.pandas read_parquet_opt_demo.py\n<\/code><\/pre><p>It will then show the intermediate representation (IR) generated for the above program before execution as follows:<\/p>\n<pre tabindex=\"0\"><code>2024-12-04 13:12:40.618398: 543780 fireducks\/lib\/fireducks_core.cc:64] Input IR:\nfunc @main() {\n%t0 = read_parquet(&#39;sample_data.parquet&#39;, []) &lt;- load the input parquet file\n%t1 = project(%t0, &#39;x&#39;) &lt;- project &#34;x&#34; column from loaded data (df[&#34;x&#34;])\n%t2 = eq.vector.scalar(%t1, 1) &lt;- generate mask with equality check with scalar value, 1 (mask = df[&#34;x&#34;] == 1)\n%t3 = filter(%t0, %t2) &lt;- perform filter with computed mask (fdf = df[mask])\n%t4 = project(%t3, &#39;c&#39;) &lt;- project &#34;c&#34; column from filtered data (fdf[&#34;c&#34;])\n%v5 = aggregate_column.scalar(%t4, &#39;sum&#39;) &lt;- calculate sum of projected column (fdf[&#34;c&#34;].sum())\nreturn(%t4, %v5)\n}\n<\/code><\/pre><p>And the Optimized IR (target for execution) is as follows.\nYou can see that it is mostly the same, except that the read_parquet() instruction now automatically specifies\nthe target columns to be loaded for the computation of this specific result (r1), and only the &ldquo;c&rdquo; column is projected before the filter.<\/p>\n<pre tabindex=\"0\"><code>2024-12-04 13:12:40.619360: 543780 fireducks\/lib\/fireducks_core.cc:73] Optimized IR:\nfunc @main() {\n%t0 = read_parquet(&#39;sample_data.parquet&#39;, [&#39;c&#39;, &#39;x&#39;])\n%t1 = project(%t0, &#39;x&#39;)\n%t2 = 
eq.vector.scalar(%t1, 1)\n%t3 = project(%t0, [&#39;c&#39;])\n%t4 = filter(%t3, %t2)\n%t5 = project(%t4, &#39;c&#39;)\n%v6 = aggregate_column.scalar(%t5, &#39;sum&#39;)\nreturn(%t5, %v6)\n}\n<\/code><\/pre><p>The Python equivalent of the above optimized IR (which will be executed by the FireDucks multi-threaded kernel) is as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>, columns<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#e6db74\">&#34;c&#34;<\/span>, <span style=\"color:#e6db74\">&#34;x&#34;<\/span>]) <span style=\"color:#75715e\"># load only the columns required for the analysis<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>t1 <span style=\"color:#f92672\">=<\/span> df[<span style=\"color:#e6db74\">&#34;x&#34;<\/span>] <span style=\"color:#75715e\"># project the target column for the equality check<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>t2 <span style=\"color:#f92672\">=<\/span> (t1 <span style=\"color:#f92672\">==<\/span> <span style=\"color:#ae81ff\">1<\/span>) <span style=\"color:#75715e\"># boolean mask for x == 1<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>t3 <span style=\"color:#f92672\">=<\/span> df[[<span style=\"color:#e6db74\">&#34;c&#34;<\/span>]] <span style=\"color:#75715e\"># project only the target column to be filtered (a one-column frame, as in the IR)<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>t4 <span style=\"color:#f92672\">=<\/span> t3[t2] <span style=\"color:#75715e\"># filter the projected frame with the mask<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>t5 <span style=\"color:#f92672\">=<\/span> t4[<span style=\"color:#e6db74\">&#34;c&#34;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>v6 <span style=\"color:#f92672\">=<\/span> t5<span 
style=\"color:#f92672\">.<\/span>sum()\n<\/span><\/span><\/code><\/pre><\/div><p>\u26a0\ufe0f Please note that inspecting the IR through this environment variable is mainly intended for developers, and\nwe might change the way the IRs are represented in the future. For now, though, users may find it useful\nfor inspecting the optimization.<\/p>\n<h2 id=\"lets-put-it-into-a-test-drive\">Let&rsquo;s take it for a test drive<\/h2>\n<p>You can refer to the <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/blob\/main\/notebooks\/read_parquet_optimization.ipynb\">notebook<\/a>.\nIt demonstrates the performance benefit of such optimization on a real dataset.\nYou may like to experiment with the query to see the efficiency of the FireDucks optimization.\nFor a sample query, <strong>FireDucks performed 45x faster than pandas, without any modification to the source program and without affecting the result<\/strong>.<\/p>\n<p>It also explains some Do&rsquo;s and Don&rsquo;ts when executing a query on a notebook-like platform. In a notebook, execution takes place cell by cell.\nThus, when intermediate results are kept in variables across cells, the FireDucks compiler assumes that they might be used at some later stage.\nSo it will keep all of them alive, hindering the optimization. 
Therefore, it is highly recommended to write a query as a chained expression\nwhen using a notebook.<\/p>\n<h2 id=\"wrapping-up\">Wrapping-up<\/h2>\n<p>Thank you for taking the time to read this article.\nIf you have any questions or an issue to report, please feel free to get in touch with us through any of your preferred channels mentioned below:<\/p>\n<ul>\n<li>\ud83e\udd86github : <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new\">https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new<\/a><\/li>\n<li>\ud83d\udce7mail : <a href=\"mailto:contact@fireducks.jp.nec.com\">contact@fireducks.jp.nec.com<\/a><\/li>\n<li>\ud83e\udd1dslack : <a href=\"https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg\">https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg<\/a><\/li>\n<\/ul>"},{"title":"Posts: Unveiling the Optimization Benefit of FireDucks Lazy Execution: Part #2","link":"https:\/\/fireducks-dev.github.io\/posts\/efficient_caching\/","pubDate":"Thu, 05 Dec 2024 00:00:00 +0000","guid":"https:\/\/fireducks-dev.github.io\/posts\/efficient_caching\/","description":"\n<p>In the previous <a href=\"..\/lazy_execution_offering_part1\">article<\/a>, we talked about how FireDucks can take care of pushdown-projection related\noptimization for read_parquet(), read_csv() etc. 
In today&rsquo;s article, we will focus on the efficient caching mechanism\nby its JIT compiler.<\/p>\n<p>Let&rsquo;s consider the below sample query for the same data, used in previous article:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>f_df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>loc[df[<span style=\"color:#e6db74\">&#34;a&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">3<\/span>, [<span style=\"color:#e6db74\">&#34;x&#34;<\/span>, <span style=\"color:#e6db74\">&#34;y&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span>r1 <span style=\"color:#f92672\">=<\/span> f_df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;x&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;z&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum()\n<\/span><\/span><span style=\"display:flex;\"><span>print(r1)\n<\/span><\/span><\/code><\/pre><\/div><p>When executing the above program (saved as sample.py) as follows:<\/p>\n<pre tabindex=\"0\"><code>$ FIRE_LOG_LEVEL=3 python -mfireducks.pandas sample.py\n<\/code><\/pre><p>You can find the generated IR before and after optimization:<\/p>\n<pre tabindex=\"0\"><code>2024-12-05 12:37:21.012481: 958259 fireducks\/lib\/fireducks_core.cc:64] Input IR:\nfunc @main() {\n%t0 = 
read_parquet(&#39;sample_data.parquet&#39;, [])\n%t1 = project(%t0, [&#39;x&#39;, &#39;y&#39;, &#39;z&#39;])\n%t2 = project(%t0, &#39;a&#39;)\n%t3 = gt.vector.scalar(%t2, 3)\n%t4 = filter(%t1, %t3)\n%t5 = groupby_select_agg(%t4, [&#39;x&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v6 = get_shape(%t5)\nreturn(%t5, %v6)\n}\n2024-12-05 12:37:21.013462: 958259 fireducks\/lib\/fireducks_core.cc:73] Optimized IR:\nfunc @main() {\n%t0 = read_parquet(&#39;sample_data.parquet&#39;, [&#39;x&#39;, &#39;a&#39;, &#39;z&#39;])\n%t1 = project(%t0, [&#39;z&#39;, &#39;x&#39;])\n%t2 = project(%t0, &#39;a&#39;)\n%t3 = gt.vector.scalar(%t2, 3)\n%t4 = filter(%t1, %t3)\n%t5 = groupby_select_agg(%t4, [&#39;x&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v6 = get_shape(%t5)\nreturn(%t5, %v6)\n}\n<\/code><\/pre><p>Note that the compiler correctly identified the projection targets for read_parquet() as the &ldquo;x&rdquo;, &ldquo;a&rdquo;, and &ldquo;z&rdquo; columns.\nAlthough the &ldquo;y&rdquo; column is specified to be projected in the loc indexer, it is never used within the\nabove program. 
Hence, that is not even loaded during the read_parquet stage.<\/p>\n<h2 id=\"could-lazy-execution-be-expensive\">Could lazy execution be expensive?<\/h2>\n<p>Now, the question is what will happen if we want to perform another groupby-aggregation on the same filtered dataframe\nthat requires &ldquo;y&rdquo; column as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>f_df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>loc[df[<span style=\"color:#e6db74\">&#34;a&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">3<\/span>, [<span style=\"color:#e6db74\">&#34;x&#34;<\/span>, <span style=\"color:#e6db74\">&#34;y&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span>r1 <span style=\"color:#f92672\">=<\/span> f_df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;x&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;z&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum()\n<\/span><\/span><span style=\"display:flex;\"><span>print(r1)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>r2 <span style=\"color:#f92672\">=<\/span> f_df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;y&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;z&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum() <span style=\"color:#75715e\"># newly added groupby-sum<\/span>\n<\/span><\/span><span 
style=\"display:flex;\"><span>print(r2)\n<\/span><\/span><\/code><\/pre><\/div><p>Since FireDucks performs lazy execution,<\/p>\n<ol>\n<li><strong>will it process two expensive calls as follows<\/strong>?<\/li>\n<\/ol>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>r1 <span style=\"color:#f92672\">=<\/span> (\n<\/span><\/span><span style=\"display:flex;\"><span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>, columns<span style=\"color:#f92672\">=<\/span>[<span style=\"color:#e6db74\">&#34;x&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>, <span style=\"color:#e6db74\">&#34;a&#34;<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>loc[<span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;a&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">3<\/span>, [<span style=\"color:#e6db74\">&#34;x&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;x&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;z&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum()\n<\/span><\/span><span style=\"display:flex;\"><span>)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>r2 <span style=\"color:#f92672\">=<\/span> (\n<\/span><\/span><span style=\"display:flex;\"><span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(<span style=\"color:#e6db74\">&#34;sample_data.parquet&#34;<\/span>, columns<span style=\"color:#f92672\">=<\/span>[<span 
style=\"color:#e6db74\">&#34;y&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>, <span style=\"color:#e6db74\">&#34;a&#34;<\/span>])\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>loc[<span style=\"color:#66d9ef\">lambda<\/span> df: df[<span style=\"color:#e6db74\">&#34;a&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">3<\/span>, [<span style=\"color:#e6db74\">&#34;y&#34;<\/span>, <span style=\"color:#e6db74\">&#34;z&#34;<\/span>]]\n<\/span><\/span><span style=\"display:flex;\"><span> <span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;y&#34;<\/span>)[<span style=\"color:#e6db74\">&#34;z&#34;<\/span>]<span style=\"color:#f92672\">.<\/span>sum()\n<\/span><\/span><span style=\"display:flex;\"><span>)\n<\/span><\/span><\/code><\/pre><\/div><ol start=\"2\">\n<li><strong>Or, will it keep the intermediate filtered result (f_df) alive when processing <code>r1<\/code><\/strong>, since it will be used later in the given program when processing <code>r2<\/code>?<\/li>\n<\/ol>\n<p>\ud83d\udc49 <strong>The answer is (2)<\/strong>. 
It effectively keeps alive the intermediate results that will be required at some later stage.<\/p>\n<p>Let&rsquo;s look at the IR generated before and after optimization for the modified program:<\/p>\n<pre tabindex=\"0\"><code>2024-12-05 13:26:41.691496: 959435 fireducks\/lib\/fireducks_core.cc:64] Input IR:\nfunc @main() {\n%t0 = read_parquet(&#39;sample_data.parquet&#39;, [])\n%t1 = project(%t0, [&#39;x&#39;, &#39;y&#39;, &#39;z&#39;])\n%t2 = project(%t0, &#39;a&#39;)\n%t3 = gt.vector.scalar(%t2, 3)\n%t4 = filter(%t1, %t3)\n%t5 = groupby_select_agg(%t4, [&#39;x&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v6 = get_shape(%t5)\nreturn(%t5, %t4, %v6)\n}\n2024-12-05 13:26:41.692423: 959435 fireducks\/lib\/fireducks_core.cc:73] Optimized IR:\nfunc @main() {\n%t0 = read_parquet(&#39;sample_data.parquet&#39;, [&#39;z&#39;, &#39;x&#39;, &#39;a&#39;, &#39;y&#39;]) &lt;- this time it also loads &#34;y&#34; column (as needed for r2)\n%t1 = project(%t0, [&#39;x&#39;, &#39;y&#39;, &#39;z&#39;])\n%t2 = project(%t0, &#39;a&#39;)\n%t3 = gt.vector.scalar(%t2, 3)\n%t4 = filter(%t1, %t3)\n%t5 = groupby_select_agg(%t4, [&#39;x&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v6 = get_shape(%t5)\nreturn(%t5, %t4, %v6) &lt;- this time it also returns filtered dataframe (%t4)\n}\n2024-12-05 13:26:41.706225: 959435 fireducks\/lib\/fireducks_core.cc:64] Input IR:\nfunc @main(%arg0: !table) {\n%t1 = groupby_select_agg(%arg0, [&#39;y&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v2 = get_shape(%t1)\nreturn(%t1, %v2)\n}\n2024-12-05 13:26:41.706721: 959435 fireducks\/lib\/fireducks_core.cc:73] Optimized IR:\nfunc @main(%arg0: !table) {\n%t1 = groupby_select_agg(%arg0, [&#39;y&#39;], [&#39;sum&#39;], [], [], &#39;z&#39;)\n%v2 = get_shape(%t1)\nreturn(%t1, %v2)\n}\n<\/code><\/pre><p>The first &ldquo;Optimized IR&rdquo; is generated when processing <code>r1<\/code>.\nThis time the compiler identifies that the &ldquo;y&rdquo; column and the filtered dataframe (f_df) will 
be used at a later stage when computing <code>r2<\/code>.\nHence it will also load the &ldquo;y&rdquo; column and keep the intermediate filtered dataframe alive (in other words, cache it) by returning\nit (%t4) along with the result of <code>r1<\/code> (%t5), to avoid recomputing it later.<\/p>\n<p>\ud83d\udc49 Notice that the previous IR returned only <code>(%t5, %v6)<\/code>, when there was no computation related to <code>r2<\/code> in the input program.<\/p>\n<p>The second &ldquo;Optimized IR&rdquo; is generated when processing <code>r2<\/code>.\nThe input <code>%arg0<\/code> is the filtered dataframe (%t4) that the compiler kept alive.\nHence only the groupby-sum is performed when processing <code>r2<\/code>.<\/p>\n<h2 id=\"how-to-profile\">How to profile?<\/h2>\n<p>You can also check kernel-wise execution time, number of calls etc. by executing the program as follows:<\/p>\n<pre tabindex=\"0\"><code>$ FIREDUCKS_FLAGS=&#34;--trace=3 --trace-file=-&#34; python -mfireducks.pandas sample.py\n<\/code><\/pre><p>It will produce some profiling output as follows:<\/p>\n<pre tabindex=\"0\"><code class=\"language-duration\" data-lang=\"duration\">== kernel ==\nfireducks.gt.vector.scalar          0.004   8.26%   1\nfireducks.read_parquet              0.003   6.02%   1\nfireducks.groupby_select_agg        0.002   3.06%   2\nfireducks.to_pandas.frame.metadata  0.001   1.89%   2\nfireducks.filter                    0.001   1.34%   1\nfireducks.project                   0.000   0.03%   2\n<\/code><\/pre><p>It can clearly be seen that each of the kernels for read_parquet, filter etc. 
is called only once.\nTo produce similar profiling output in a Jupyter notebook, you can use the cell magic: <code>%%fireducks.profile<\/code>.<\/p>\n<h2 id=\"wrapping-up\">Wrapping-up<\/h2>\n<p>Thank you for taking the time to read this article.\nIf you have any questions or an issue to report, please feel free to get in touch with us through any of your preferred channels mentioned below:<\/p>\n<ul>\n<li>\ud83e\udd86github : <a href=\"https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new\">https:\/\/github.com\/fireducks-dev\/fireducks\/issues\/new<\/a><\/li>\n<li>\ud83d\udce7mail : <a href=\"mailto:contact@fireducks.jp.nec.com\">contact@fireducks.jp.nec.com<\/a><\/li>\n<li>\ud83e\udd1dslack : <a href=\"https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg\">https:\/\/join.slack.com\/t\/fireducks\/shared_invite\/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg<\/a><\/li>\n<\/ul>"},{"title":"Posts: Import hooks: how to use FireDucks without modifying your programs","link":"https:\/\/fireducks-dev.github.io\/posts\/importhook\/","pubDate":"Wed, 15 Nov 2023 09:35:10 +0900","guid":"https:\/\/fireducks-dev.github.io\/posts\/importhook\/","description":"\n<p>This is Osamu Daido from the FireDucks development team.\nIn today&rsquo;s developers&rsquo; blog, I would like to introduce the import hook feature of FireDucks.\nThis feature enables you to use FireDucks without modifying your existing programs at all.<\/p>\n<p>I&rsquo;ll explain how to use hooks when running Python files on the command line and how to enable hooks in IPython or Jupyter Notebook.<\/p>\n<h1 id=\"what-is-an-import-hook\">What is an import hook?<\/h1>\n<p>FireDucks behaves in the same way as the original pandas, so it&rsquo;s easy to get started by simply modifying an import statement as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" 
data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># import pandas as pd<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> fireducks.pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><\/code><\/pre><\/div><p>However, even if it&rsquo;s just a single line, finding and replacing import statements with FireDucks in your programs which use pandas may be annoying.\nMoreover, if you want to use FireDucks in a third-party library that works with pandas, it&rsquo;s not practical to modify all import statements in that library.<\/p>\n<p>As mentioned in <a href=\"https:\/\/fireducks-dev.github.io\/docs\/get-started\/#import-hook\">Get Started<\/a>, FireDucks has a utility called an import hook.\nPlease specify the following options for the Python interpreter when you run <code>your_script.py<\/code> on the command line.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>python3 -m fireducks.imhook your_script.py\n<\/span><\/span><\/code><\/pre><\/div><p>With this feature, <code>fireducks.pandas<\/code> is imported instead of <code>pandas<\/code> when the Python interpreter attempts to import <code>pandas<\/code>.\nKeep in mind that this does not edit the source code of <code>your_script.py<\/code>, but rather dynamically hacks the import process while executing the program.<\/p>\n<h2 id=\"example-of-an-import-hook\">Example of an import hook<\/h2>\n<p>Let&rsquo;s see it in action with a simple Python script, <code>print_classname.py<\/code>, as shown below.\nThis script outputs the repr string of the DataFrame class.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" 
data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><span style=\"display:flex;\"><span>print(pd<span style=\"color:#f92672\">.<\/span>DataFrame)\n<\/span><\/span><\/code><\/pre><\/div><p>If you run it normally, the output is as follows:<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>$ python3 print_classname.py\n<\/span><\/span><span style=\"display:flex;\"><span>&lt;class <span style=\"color:#e6db74\">&#39;pandas.core.frame.DataFrame&#39;<\/span>&gt;\n<\/span><\/span><\/code><\/pre><\/div><p>With the import hook, the output becomes different from the previous one, as follows! \ud83e\udd73<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>$ python3 -m fireducks.imhook print_classname.py\n<\/span><\/span><span style=\"display:flex;\"><span>&lt;class <span style=\"color:#e6db74\">&#39;fireducks.pandas.frame.DataFrame&#39;<\/span>&gt;\n<\/span><\/span><\/code><\/pre><\/div><p>So, yes, you can use dataframes of FireDucks even though you haven&rsquo;t edited the source code.<\/p>\n<h2 id=\"limitations\">Limitations<\/h2>\n<h3 id=\"no-shebang-support\">No shebang support<\/h3>\n<p>Currently, execution by shebang (<code>#!...<\/code>) is not supported.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\">#!\/usr\/bin\/python3<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span 
style=\"color:#f92672\">import<\/span> pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><span style=\"display:flex;\"><span>print(pd<span style=\"color:#f92672\">.<\/span>DataFrame)\n<\/span><\/span><\/code><\/pre><\/div><p>When such a script is run through its shebang line, you cannot enable an import hook, because there is no place to specify the <code>-m<\/code> option for the Python interpreter.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>$ chmod +x print_classname_shebang.py\n<\/span><\/span><span style=\"display:flex;\"><span>$ .\/print_classname_shebang.py\n<\/span><\/span><span style=\"display:flex;\"><span>&lt;class <span style=\"color:#e6db74\">&#39;pandas.core.frame.DataFrame&#39;<\/span>&gt;\n<\/span><\/span><\/code><\/pre><\/div><h3 id=\"no-combination-with-other-executable-modules\">No combination with other executable modules<\/h3>\n<p>The import hook feature cannot be used concurrently with other tools invoked by the <code>-m<\/code> option, as only one <code>-m<\/code> option can be passed to the Python interpreter.<\/p>\n<h3 id=\"no-subprocess-support\">No subprocess support<\/h3>\n<p>If you start a new Python process using the <code>subprocess<\/code> module, the import hook settings are not inherited by that subprocess.<\/p>\n<h1 id=\"how-to-use-import-hooks-in-jupyter-notebook\">How to use import hooks in Jupyter Notebook<\/h1>\n<p>The import hook feature is also available in Jupyter Notebook.\nCurrently, however, you cannot specify an option when starting Jupyter, so you must activate a hook explicitly in the first cell of your notebook.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span 
style=\"color:#f92672\">import<\/span> fireducks.importhook\n<\/span><\/span><span style=\"display:flex;\"><span>fireducks<span style=\"color:#f92672\">.<\/span>importhook<span style=\"color:#f92672\">.<\/span>activate_hook(<span style=\"color:#e6db74\">&#34;fireducks.pandas&#34;<\/span>, <span style=\"color:#e6db74\">&#34;pandas&#34;<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>There may not be much benefit to using import hooks if you&rsquo;re just using pandas in your own notebook.\nOn the other hand, import hooks also work with third-party libraries that use pandas, so it&rsquo;s useful if you want to utilize such libraries in your notebook.<\/p>\n<p>If you want to disable a hook, please call the following function.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span>fireducks<span style=\"color:#f92672\">.<\/span>importhook<span style=\"color:#f92672\">.<\/span>deactivate_hook()\n<\/span><\/span><\/code><\/pre><\/div><p>However, if you mix dataframes from the original pandas with ones from FireDucks, you will likely encounter errors (probably with complicated and mysterious error messages).\nBasically, it is recommended to keep a hook enabled once you enable it.<\/p>\n<h2 id=\"how-to-use-import-hooks-with-ipython-cli\">How to use import hooks with IPython CLI<\/h2>\n<p>With IPython, you can enable a hook manually in the same way as in Jupyter Notebook described above.\nAnother option is to start the IPython CLI as follows (this is an example with bash):<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>python3 -m fireducks.imhook <span style=\"color:#e6db74\">&#34;<\/span><span 
style=\"color:#66d9ef\">$(<\/span>which ipython<span style=\"color:#66d9ef\">)<\/span><span style=\"color:#e6db74\">&#34;<\/span>\n<\/span><\/span><\/code><\/pre><\/div><p>Well, it&rsquo;s a bit of an unusual way, but it works!<\/p>\n<h1 id=\"wrap-up\">Wrap-up<\/h1>\n<p>FireDucks is still under research and development, so you may face errors and problems if you switch from using pandas to FireDucks.\nWe have been working on improving features of FireDucks every day since the release of the beta version.\nYour feedback, bug reports, and feature requests are welcome!\nPlease see our <a href=\"https:\/\/fireducks-dev.github.io\/docs\/help\/contact\/\">contact information<\/a> for further details.<\/p>\n<p>To sum up, I&rsquo;ve shown you how to use FireDucks without modifying your existing programs at all.\nIf you want to try FireDucks, please refer to <a href=\"https:\/\/fireducks-dev.github.io\/docs\/get-started\/\">Get Started<\/a> and <a href=\"https:\/\/fireducks-dev.github.io\/docs\/user-guide\/01-intro\/\">User Guide<\/a> documents. 
For information on how much faster FireDucks is compared to pandas, please check out our <a href=\"https:\/\/fireducks-dev.github.io\/docs\/benchmarks\/\">Benchmarks<\/a>.<\/p>\n<p>May the Acceleration be with you,<!-- raw HTML omitted -->\nFireDucks Development Team<\/p>"},{"title":"Posts: Using Python's fast data frame library FireDucks","link":"https:\/\/fireducks-dev.github.io\/posts\/nes_taxi\/","pubDate":"Mon, 23 Oct 2023 08:47:36 +0000","guid":"https:\/\/fireducks-dev.github.io\/posts\/nes_taxi\/","description":"\n<p>pandas is a library that provides functions to support data analysis in the Python programming language.\nNEC Research Laboratories has developed a library called FireDucks, a faster version of pandas.<\/p>\n<h2 id=\"data-preparation\">Data Preparation<\/h2>\n<p>The analysis is performed on the data of passenger history of cabs in New York City.\nThe source of the data is as follows:<\/p>\n<blockquote>\n<p><a href=\"https:\/\/www.nyc.gov\/site\/tlc\/about\/tlc-trip-record-data.page\">https:\/\/www.nyc.gov\/site\/tlc\/about\/tlc-trip-record-data.page<\/a><\/p>\n<\/blockquote>\n<p>To analyze large data sets, we downloaded and merged the &ldquo;Yellow Taxi Trip Records&rdquo; data from January 2022 to June 2023 from the above link.\nThe data is provided in parquet format, but I converted it to csv format for testing.\nA script for preparing the data is included for reference.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> os\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>dir <span style=\"color:#f92672\">=<\/span> 
<span style=\"color:#e6db74\">&#34;xxx&#34;<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>df_list <span style=\"color:#f92672\">=<\/span> []\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">for<\/span> year <span style=\"color:#f92672\">in<\/span> [<span style=\"color:#ae81ff\">2022<\/span>, <span style=\"color:#ae81ff\">2023<\/span>]:\n<\/span><\/span><span style=\"display:flex;\"><span>    <span style=\"color:#66d9ef\">for<\/span> i <span style=\"color:#f92672\">in<\/span> range(<span style=\"color:#ae81ff\">12<\/span>):\n<\/span><\/span><span style=\"display:flex;\"><span>        month <span style=\"color:#f92672\">=<\/span> str(i<span style=\"color:#f92672\">+<\/span><span style=\"color:#ae81ff\">1<\/span>)<span style=\"color:#f92672\">.<\/span>zfill(<span style=\"color:#ae81ff\">2<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>        fn <span style=\"color:#f92672\">=<\/span> <span style=\"color:#e6db74\">f<\/span><span style=\"color:#e6db74\">&#34;yellow_tripdata_<\/span><span style=\"color:#e6db74\">{year}<\/span><span style=\"color:#e6db74\">-<\/span><span style=\"color:#e6db74\">{month}<\/span><span style=\"color:#e6db74\">.parquet&#34;<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>        file <span style=\"color:#f92672\">=<\/span> os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(dir, fn)\n<\/span><\/span><span style=\"display:flex;\"><span>        <span style=\"color:#66d9ef\">if<\/span> <span style=\"color:#f92672\">not<\/span> os<span style=\"color:#f92672\">.<\/span>path<span 
style=\"color:#f92672\">.<\/span>exists(file):\n<\/span><\/span><span style=\"display:flex;\"><span>            <span style=\"color:#66d9ef\">continue<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>        df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_parquet(file)\n<\/span><\/span><span style=\"display:flex;\"><span>        df_list<span style=\"color:#f92672\">.<\/span>append(df)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>all_df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>concat(df_list)\n<\/span><\/span><span style=\"display:flex;\"><span>all_df<span style=\"color:#f92672\">.<\/span>to_csv(<span style=\"color:#e6db74\">&#34;taxi_all.csv&#34;<\/span>)\n<\/span><\/span><\/code><\/pre><\/div><p>The data contains the following columns (only a subset is shown).<\/p>\n<table>\n<thead>\n<tr>\n<th>Column name<\/th>\n<th>Data type<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>passenger_count<\/code><\/td>\n<td>int<\/td>\n<td>The number of passengers<\/td>\n<\/tr>\n<tr>\n<td><code>PULocationID<\/code><\/td>\n<td>string<\/td>\n<td>The TLC cab zone where the cab meter started working.<\/td>\n<\/tr>\n<tr>\n<td><code>DOLocationID<\/code><\/td>\n<td>string<\/td>\n<td>The TLC cab zone where the cab meter was deactivated.<\/td>\n<\/tr>\n<tr>\n<td><code>tpep_dropoff_datetime<\/code><\/td>\n<td>string<\/td>\n<td>The date and time the meter was deactivated.<\/td>\n<\/tr>\n<tr>\n<td><code>tpep_pickup_datetime<\/code><\/td>\n<td>string<\/td>\n<td>The date and time when the meter started to work.<\/td>\n<\/tr>\n<tr>\n<td><code>trip_distance<\/code><\/td>\n<td>double<\/td>\n<td>The trip distance (in miles) reported by the cab meter.<\/td>\n<\/tr>\n<tr>\n<td><code>total_amount<\/code><\/td>\n<td>double<\/td>\n<td>The total amount of money charged to the passenger, 
not including the cash tip.<\/td>\n<\/tr>\n<tr>\n<td><code>extra<\/code><\/td>\n<td>double<\/td>\n<td>Other surcharges and additional charges. Currently, this includes only the $0.50 and $1 rush hour and nighttime fares.<\/td>\n<\/tr>\n<tr>\n<td><code>fare_amount<\/code><\/td>\n<td>double<\/td>\n<td>Time-and-distance combined fare calculated by the meter.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 id=\"actual-preprocessing\">Actual preprocessing<\/h2>\n<p>A series of preprocessing calculations, such as type conversion, column addition, and outlier deletion, which are often used in data analysis, are performed on the prepared data.<\/p>\n<p>First, prepare a wrapper for speed measurement.\n<code>_evaluate()<\/code> is described later.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">from<\/span> time <span style=\"color:#f92672\">import<\/span> time\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#f92672\">from<\/span> functools <span style=\"color:#f92672\">import<\/span> wraps\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">timer<\/span>(func):\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@wraps<\/span>(func)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">wp<\/span>(<span style=\"color:#f92672\">*<\/span>args, <span style=\"color:#f92672\">**<\/span>kargs):\n<\/span><\/span><span style=\"display:flex;\"><span>t <span style=\"color:#f92672\">=<\/span> time()\n<\/span><\/span><span style=\"display:flex;\"><span>ret <span style=\"color:#f92672\">=<\/span> func(<span 
style=\"color:#f92672\">*<\/span>args, <span style=\"color:#f92672\">**<\/span>kargs)\n<\/span><\/span><span style=\"display:flex;\"><span>print(<span style=\"color:#e6db74\">f<\/span><span style=\"color:#e6db74\">&#34;<\/span><span style=\"color:#e6db74\">{<\/span>func<span style=\"color:#f92672\">.<\/span>__name__<span style=\"color:#e6db74\">}<\/span><span style=\"color:#e6db74\"> : <\/span><span style=\"color:#e6db74\">{<\/span>(time() <span style=\"color:#f92672\">-<\/span> t)<span style=\"color:#e6db74\">:<\/span><span style=\"color:#e6db74\">.5g<\/span><span style=\"color:#e6db74\">}<\/span><span style=\"color:#e6db74\"> [sec]&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> ret\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> wp\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">evaluate<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">if<\/span> hasattr(df, <span style=\"color:#e6db74\">&#34;_evaluate&#34;<\/span>):\n<\/span><\/span><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>_evaluate()\n<\/span><\/span><\/code><\/pre><\/div><h3 id=\"loading-data\">Loading data<\/h3>\n<p>First, read the data.\nImport pandas and then use <code>read_csv<\/code> to read the data.\nDefine a function, and call it later to measure the data.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><\/code><\/pre><\/div><div class=\"highlight\"><pre tabindex=\"0\" 
style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">file_read<\/span>(fn, args<span style=\"color:#f92672\">=<\/span>{}):\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>read_csv(fn, <span style=\"color:#f92672\">**<\/span>args)\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span>print(df<span style=\"color:#f92672\">.<\/span>shape)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><h3 id=\"data-processing\">Data processing<\/h3>\n<p>Remove data with missing values.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">drop_na<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df<span style=\"color:#f92672\">.<\/span>dropna(how<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;all&#34;<\/span>, inplace<span style=\"color:#f92672\">=<\/span><span style=\"color:#66d9ef\">True<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>The date and time of boarding and alighting are read as strings, so 
they should be converted to dates.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">txt_to_date<\/span>(df, low):\n<\/span><\/span><span style=\"display:flex;\"><span>df[low] <span style=\"color:#f92672\">=<\/span> pd<span style=\"color:#f92672\">.<\/span>to_datetime(df[low])\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>Let&rsquo;s look at the distribution grouped by passenger count (printing is not part of the performance evaluation, so it is omitted).\nSome rows have zero passengers, so we remove them.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># At least one passenger in the cab<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_passenger_c<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df_ <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;passenger_count&#34;<\/span>)<span style=\"color:#f92672\">.<\/span>size()\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df_)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span 
style=\"color:#f92672\">=<\/span> df[df[<span style=\"color:#e6db74\">&#34;passenger_count&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">0<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>Extract the year, month, day, and hour information from the ride date data and add columns.\nThe distribution of the year and month of the ride contains incorrect values, so we remove them.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># correct ride year\/month<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_pu_date<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df[<span style=\"color:#e6db74\">&#39;year&#39;<\/span>] <span style=\"color:#f92672\">=<\/span> df[<span style=\"color:#e6db74\">&#39;tpep_pickup_datetime&#39;<\/span>]<span style=\"color:#f92672\">.<\/span>dt<span style=\"color:#f92672\">.<\/span>year\n<\/span><\/span><span style=\"display:flex;\"><span>df[<span style=\"color:#e6db74\">&#39;month&#39;<\/span>] <span style=\"color:#f92672\">=<\/span> df[<span style=\"color:#e6db74\">&#39;tpep_pickup_datetime&#39;<\/span>]<span style=\"color:#f92672\">.<\/span>dt<span style=\"color:#f92672\">.<\/span>month\n<\/span><\/span><span style=\"display:flex;\"><span>df[<span style=\"color:#e6db74\">&#39;date&#39;<\/span>] <span style=\"color:#f92672\">=<\/span> df[<span style=\"color:#e6db74\">&#39;tpep_pickup_datetime&#39;<\/span>]<span 
style=\"color:#f92672\">.<\/span>dt<span style=\"color:#f92672\">.<\/span>day\n<\/span><\/span><span style=\"display:flex;\"><span>df[<span style=\"color:#e6db74\">&#39;hour&#39;<\/span>] <span style=\"color:#f92672\">=<\/span> df[<span style=\"color:#e6db74\">&#39;tpep_pickup_datetime&#39;<\/span>]<span style=\"color:#f92672\">.<\/span>dt<span style=\"color:#f92672\">.<\/span>hour\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>df_ <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;year&#34;<\/span>)<span style=\"color:#f92672\">.<\/span>size()\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df_)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#39;year&#39;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#ae81ff\">2022<\/span>) <span style=\"color:#f92672\">|<\/span> (df[<span style=\"color:#e6db74\">&#39;year&#39;<\/span>] <span style=\"color:#f92672\">==<\/span> <span style=\"color:#ae81ff\">2023<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>df_ <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>groupby(<span style=\"color:#e6db74\">&#34;month&#34;<\/span>)<span style=\"color:#f92672\">.<\/span>size()\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df_)\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#39;month&#39;<\/span>] <span style=\"color:#f92672\">&gt;=<\/span> <span style=\"color:#ae81ff\">1<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#39;month&#39;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">12<\/span>)]\n<\/span><\/span><span 
style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>df_ <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>groupby([<span style=\"color:#e6db74\">&#34;year&#34;<\/span>, <span style=\"color:#e6db74\">&#34;month&#34;<\/span>])<span style=\"color:#f92672\">.<\/span>size()\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df_)\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>Convert the difference between the disembarkation time and the ride time to minutes and add a column.\nRemove non-positive or too long ride times.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># realistic ride time in minutes<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_ride_time<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df[<span style=\"color:#e6db74\">&#34;ride_time&#34;<\/span>] <span style=\"color:#f92672\">=<\/span> (df[<span style=\"color:#e6db74\">&#34;tpep_dropoff_datetime&#34;<\/span>] <span style=\"color:#f92672\">-<\/span> df[<span style=\"color:#e6db74\">&#34;tpep_pickup_datetime&#34;<\/span>])<span style=\"color:#f92672\">.<\/span>dt<span style=\"color:#f92672\">.<\/span>seconds <span style=\"color:#f92672\">\/<\/span> <span style=\"color:#ae81ff\">60<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;ride_time&#34;<\/span>] <span 
style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">0<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;ride_time&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">180<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>Remove rows whose ride distance or fare is non-positive or unrealistically large.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># realistic distances<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_trip_distance<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;trip_distance&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">0<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;trip_distance&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">250<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># Realistic fares<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span 
style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_total_amount<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;total_amount&#34;<\/span>] <span style=\"color:#f92672\">&gt;<\/span> <span style=\"color:#ae81ff\">0<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;total_amount&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">1000<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>Calculate the latitude and longitude from the IDs of the boarding and alighting points.\nThe relationship between the ID and the point can be checked as follows:<\/p>\n<blockquote>\n<p><a href=\"https:\/\/d37ci6vzurychx.cloudfront.net\/misc\/taxi+_zone_lookup.csv\">https:\/\/d37ci6vzurychx.cloudfront.net\/misc\/taxi+_zone_lookup.csv<\/a><\/p>\n<\/blockquote>\n<p>The columns are added by merging the conversion table created based on the latitude and longitude calculated from:<\/p>\n<blockquote>\n<p><a href=\"https:\/\/d37ci6vzurychx.cloudfront.net\/misc\/taxi_zones.zip\">https:\/\/d37ci6vzurychx.cloudfront.net\/misc\/taxi_zones.zip<\/a><\/p>\n<\/blockquote>\n<p>Remove data outside New York City from the latitude and longitude information.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># Find latitude and longitude from IDs<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span 
style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">add_coordinate<\/span>(df, ID_df):\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>merge(ID_df<span style=\"color:#f92672\">.<\/span>rename(columns<span style=\"color:#f92672\">=<\/span>{<span style=\"color:#e6db74\">&#34;longitude&#34;<\/span>: <span style=\"color:#e6db74\">&#34;start_lon&#34;<\/span>, <span style=\"color:#e6db74\">&#34;latitude&#34;<\/span>: <span style=\"color:#e6db74\">&#34;start_lat&#34;<\/span>}),\n<\/span><\/span><span style=\"display:flex;\"><span>                  left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;PULocationID&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;LocationID&#34;<\/span>, how<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;left&#34;<\/span>)<span style=\"color:#f92672\">.<\/span>drop(<span style=\"color:#e6db74\">&#34;LocationID&#34;<\/span>, axis<span style=\"color:#f92672\">=<\/span><span style=\"color:#ae81ff\">1<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df<span style=\"color:#f92672\">.<\/span>merge(ID_df<span style=\"color:#f92672\">.<\/span>rename(columns<span style=\"color:#f92672\">=<\/span>{<span style=\"color:#e6db74\">&#34;longitude&#34;<\/span>: <span style=\"color:#e6db74\">&#34;end_lon&#34;<\/span>, <span style=\"color:#e6db74\">&#34;latitude&#34;<\/span>: <span style=\"color:#e6db74\">&#34;end_lat&#34;<\/span>}),\n<\/span><\/span><span style=\"display:flex;\"><span>                  left_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;DOLocationID&#34;<\/span>, right_on<span style=\"color:#f92672\">=<\/span><span style=\"color:#e6db74\">&#34;LocationID&#34;<\/span>, how<span style=\"color:#f92672\">=<\/span><span
style=\"color:#e6db74\">&#34;left&#34;<\/span>)<span style=\"color:#f92672\">.<\/span>drop(<span style=\"color:#e6db74\">&#34;LocationID&#34;<\/span>, axis<span style=\"color:#f92672\">=<\/span><span style=\"color:#ae81ff\">1<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>    evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    <span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#75715e\"># Check if it is in NY<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">in_NY<\/span>(df):\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;start_lon&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#f92672\">-<\/span><span style=\"color:#ae81ff\">71.47<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;start_lon&#34;<\/span>] <span style=\"color:#f92672\">&gt;=<\/span> <span style=\"color:#f92672\">-<\/span><span style=\"color:#ae81ff\">79.45<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;start_lat&#34;<\/span>] <span style=\"color:#f92672\">&gt;=<\/span> <span style=\"color:#ae81ff\">40.29<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;start_lat&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">45<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;end_lon&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span
style=\"color:#f92672\">-<\/span><span style=\"color:#ae81ff\">71.47<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;end_lon&#34;<\/span>] <span style=\"color:#f92672\">&gt;=<\/span> <span style=\"color:#f92672\">-<\/span><span style=\"color:#ae81ff\">79.45<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> df[(df[<span style=\"color:#e6db74\">&#34;end_lat&#34;<\/span>] <span style=\"color:#f92672\">&gt;=<\/span> <span style=\"color:#ae81ff\">40.29<\/span>) <span style=\"color:#f92672\">&amp;<\/span> (df[<span style=\"color:#e6db74\">&#34;end_lat&#34;<\/span>] <span style=\"color:#f92672\">&lt;=<\/span> <span style=\"color:#ae81ff\">45<\/span>)]\n<\/span><\/span><span style=\"display:flex;\"><span>    evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    <span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">check_in_NY<\/span>(df, ID_df):\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> add_coordinate(df, ID_df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> in_NY(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    <span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><\/code><\/pre><\/div><p>As described above, a series of functions has been prepared to read the data, convert types, add columns, and remove outliers.<\/p>\n<ol>\n<li>Read the file<\/li>\n<li>Convert date columns from string to datetime<\/li>\n<li>Preprocessing<\/li>\n<li>Remove missing values<\/li>\n<li>Check the number of passengers<\/li>\n<li>Check the distribution with groupby<\/li>\n<li>Select rides with at least 1 passenger<\/li>\n<li>Check the time of boarding<\/li>\n<li>Get the year, month, date and time from the date
data and add a column<\/li>\n<li>Group by year and check -&gt; select only the relevant year<\/li>\n<li>Group by month and check -&gt; select only Jan-Dec<\/li>\n<li>Group by year and month and check the distribution<\/li>\n<li>Check the boarding time<\/li>\n<li>Take the difference between the time of disembarkation and the time of embarkation, convert to minutes, and add a column (<code>dt.total_seconds<\/code> is not supported by FireDucks)<\/li>\n<li>Select realistic ride time data<\/li>\n<li>Check the ride distance<\/li>\n<li>Select realistic distance data<\/li>\n<li>Check fare<\/li>\n<li>Select realistic fare data<\/li>\n<li>Select NY City data<\/li>\n<li>Merge the boarding and alighting location IDs with the latitude and longitude table<\/li>\n<li>Select the data where the longitude and latitude of both the boarding and alighting points are in NY City<\/li>\n<\/ol>\n<h2 id=\"execution-time-in-pandas\">Execution time in pandas<\/h2>\n<p>First, let&rsquo;s check the execution time with pandas.\nA 24-core Xeon server (Intel(R) Xeon(R) Gold 6226 CPU x 2, 256GB main memory) was used for the measurement.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> os\n<\/span><\/span><span style=\"display:flex;\"><span>data_path <span style=\"color:#f92672\">=<\/span> <span style=\"color:#e6db74\">&#34;data_sets&#34;<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span>fn <span style=\"color:#f92672\">=<\/span> os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(data_path, <span style=\"color:#e6db74\">&#34;taxi_all.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>ID_file <span style=\"color:#f92672\">=<\/span> os<span style=\"color:#f92672\">.<\/span>path<span style=\"color:#f92672\">.<\/span>join(data_path, <span
style=\"color:#e6db74\">&#34;ID_to_coordinate.csv&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>need_cols <span style=\"color:#f92672\">=<\/span> [<span style=\"color:#e6db74\">&#39;tpep_pickup_datetime&#39;<\/span>, <span style=\"color:#e6db74\">&#39;tpep_dropoff_datetime&#39;<\/span>, <span style=\"color:#e6db74\">&#39;passenger_count&#39;<\/span>, <span style=\"color:#e6db74\">&#39;trip_distance&#39;<\/span>,\n<\/span><\/span><span style=\"display:flex;\"><span>             <span style=\"color:#e6db74\">&#39;PULocationID&#39;<\/span>, <span style=\"color:#e6db74\">&#39;DOLocationID&#39;<\/span>, <span style=\"color:#e6db74\">&#39;total_amount&#39;<\/span>, <span style=\"color:#e6db74\">&#39;improvement_surcharge&#39;<\/span>, <span style=\"color:#e6db74\">&#39;extra&#39;<\/span>, <span style=\"color:#e6db74\">&#39;fare_amount&#39;<\/span>, <span style=\"color:#e6db74\">&#39;RatecodeID&#39;<\/span>]\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#a6e22e\">@timer<\/span>\n<\/span><\/span><span style=\"display:flex;\"><span><span style=\"color:#66d9ef\">def<\/span> <span style=\"color:#a6e22e\">Preprocessing<\/span>():\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> file_read(fn, {<span style=\"color:#e6db74\">&#34;usecols&#34;<\/span>: need_cols})\n<\/span><\/span><span style=\"display:flex;\"><span>    ID_df <span style=\"color:#f92672\">=<\/span> file_read(ID_file)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> txt_to_date(df, <span style=\"color:#e6db74\">&#34;tpep_pickup_datetime&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> txt_to_date(df, <span
style=\"color:#e6db74\">&#34;tpep_dropoff_datetime&#34;<\/span>)\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> drop_na(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_passenger_c(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_pu_date(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_ride_time(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_trip_distance(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_total_amount(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    df <span style=\"color:#f92672\">=<\/span> check_in_NY(df, ID_df)\n<\/span><\/span><span style=\"display:flex;\"><span>    evaluate(df)\n<\/span><\/span><span style=\"display:flex;\"><span>    <span style=\"color:#66d9ef\">return<\/span> df\n<\/span><\/span><span style=\"display:flex;\"><span>\n<\/span><\/span><span style=\"display:flex;\"><span>df <span style=\"color:#f92672\">=<\/span> Preprocessing()\n<\/span><\/span><\/code><\/pre><\/div><p>The execution time with pandas is shown in the table below.\nThere are two <code>file_read<\/code> entries because both the taxi data and the location data used for merging are read, and two <code>txt_to_date<\/code> entries because both the ride time and the drop-off time are converted.\nThe execution time shows that it took more than one minute to read the file.\nIn addition, <code>check_pu_date<\/code> and <code>add_coordinate<\/code>, which include adding columns and merge processing, take more than 30 seconds, and the implemented preprocessing takes 186 seconds to complete.<\/p>\n<h2 id=\"execution-time-with-fireducks\">Execution time with FireDucks<\/h2>\n<p>Next, measure the execution time
when the imported library is switched from pandas to FireDucks.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-shell\" data-lang=\"shell\"><span style=\"display:flex;\"><span>pip install fireducks\n<\/span><\/span><\/code><\/pre><\/div><p>Because FireDucks is compatible with pandas, the preprocessing scripts above can be used as-is simply by importing FireDucks instead.<\/p>\n<div class=\"highlight\"><pre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"><code class=\"language-python\" data-lang=\"python\"><span style=\"display:flex;\"><span><span style=\"color:#f92672\">import<\/span> fireducks.pandas <span style=\"color:#66d9ef\">as<\/span> pd\n<\/span><\/span><\/code><\/pre><\/div><p>Note that FireDucks does not immediately execute methods when they are called.\nTherefore, it is necessary to run <code>_evaluate()<\/code> to measure the execution time of each method.<\/p>\n<p>The following table compares the execution times of pandas and FireDucks for the same preprocessing calculations.<\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align:left\">Function<\/th>\n<th style=\"text-align:right\">pandas [sec]<\/th>\n<th style=\"text-align:right\">FireDucks [sec]<\/th>\n<th style=\"text-align:right\">Speed-up ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align:left\"><code>file_read<\/code><\/td>\n<td style=\"text-align:right\">72.19<\/td>\n<td style=\"text-align:right\">3.52<\/td>\n<td style=\"text-align:right\">20.49<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>file_read<\/code><\/td>\n<td style=\"text-align:right\">0.003<\/td>\n<td style=\"text-align:right\">0.01<\/td>\n<td style=\"text-align:right\">0.38<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>txt_to_date<\/code><\/td>\n<td
style=\"text-align:right\">9.07<\/td>\n<td style=\"text-align:right\">19.10<\/td>\n<td style=\"text-align:right\">0.48<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>txt_to_date<\/code><\/td>\n<td style=\"text-align:right\">8.57<\/td>\n<td style=\"text-align:right\">20.57<\/td>\n<td style=\"text-align:right\">0.42<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>drop_na<\/code><\/td>\n<td style=\"text-align:right\">3.13<\/td>\n<td style=\"text-align:right\">0.70<\/td>\n<td style=\"text-align:right\">4.47<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>check_passenger_c<\/code><\/td>\n<td style=\"text-align:right\">3.21<\/td>\n<td style=\"text-align:right\">1.80<\/td>\n<td style=\"text-align:right\">1.79<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>check_pu_date<\/code><\/td>\n<td style=\"text-align:right\">27.37<\/td>\n<td style=\"text-align:right\">0.99<\/td>\n<td style=\"text-align:right\">27.64<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>check_ride_time<\/code><\/td>\n<td style=\"text-align:right\">7.02<\/td>\n<td style=\"text-align:right\">2.00<\/td>\n<td style=\"text-align:right\">3.51<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>check_trip_distance<\/code><\/td>\n<td style=\"text-align:right\">3.24<\/td>\n<td style=\"text-align:right\">0.91<\/td>\n<td style=\"text-align:right\">3.55<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>check_total_amount<\/code><\/td>\n<td style=\"text-align:right\">3.11<\/td>\n<td style=\"text-align:right\">0.93<\/td>\n<td style=\"text-align:right\">3.59<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>in_NY<\/code><\/td>\n<td style=\"text-align:right\">20.75<\/td>\n<td style=\"text-align:right\">2.71<\/td>\n<td
style=\"text-align:right\">7.65<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align:left\"><code>Preprocessing<\/code><\/td>\n<td style=\"text-align:right\">186.02<\/td>\n<td style=\"text-align:right\">54.92<\/td>\n<td style=\"text-align:right\">3.39<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>file_read<\/code> process took more than 70 seconds to complete in pandas, but only about 3.5 seconds with FireDucks, which is more than 20 times faster.\nThe execution times of other time-consuming processes (<code>check_pu_date<\/code>, <code>add_coordinate<\/code>) were also significantly reduced.<\/p>\n<p>The computation time for <code>txt_to_date<\/code>, on the other hand, increased with FireDucks.\nThis is because the <code>to_datetime()<\/code> function is not supported by FireDucks at the time of writing.\nHowever, even when a function without acceleration support, such as <code>to_datetime()<\/code>, is called, FireDucks does not return an error because it performs the calculation by calling the pandas function.<\/p>\n<p>The total computation time for the preprocessing calculations in this article was 55 seconds with FireDucks, compared to 186 seconds with pandas, which is about 3.4 times faster.\nOf the 55 seconds, about 40 seconds were spent on processing that has not yet been accelerated (the two <code>txt_to_date<\/code> calls), so further acceleration is expected in the future.<\/p>"}]}}