{"@attributes":{"version":"2.0"},"channel":{"title":"PyVideo.org - Parallel Programming","link":"https:\/\/pyvideo.org\/","description":{},"lastBuildDate":"Fri, 26 Oct 2018 00:00:00 +0000","item":[{"title":"Concurrency in Python - concepts, frameworks and best practices","link":"https:\/\/pyvideo.org\/pycon-de-2018\/concurrency-in-python-concepts-frameworks-and-best-practices.html","description":"<h3>Description<\/h3><p>Have you run into situations where concurrent execution could speed up\nyour Python code? Are you using a GUI toolkit?<\/p>\n<p>This talk gives you the background to use concurrency in your code\nwithout shooting yourself in the foot - which is quite easy if you don't\nunderstand how concurrent execution differs from linear execution!<\/p>\n<p>The presentation starts by explaining some concepts like concurrency,\nparallelism, resources, atomic operations, race conditions and\ndeadlocks.<\/p>\n<p>Then we discuss the commonly used approaches to concurrency:\nmultithreading with the <tt class=\"docutils literal\">threading<\/tt> module, multiprocessing with the\n<tt class=\"docutils literal\">multiprocessing<\/tt> module, and event loops (which include the\n<tt class=\"docutils literal\">asyncio<\/tt> framework). Each of these approaches has its typical use\ncases, which are explained.<\/p>\n<p>You can implement concurrency on a number of abstraction levels. The\nlowest level consists of primitives like locks, events, semaphores and\nso on. A higher abstraction level is using queues, typically with worker\nthreads or processes. Even higher abstraction levels are active objects\n(hiding primitives or queues behind an API; this includes &quot;actors&quot; if\nyou have heard of them), the thread and process pools in\n<tt class=\"docutils literal\">concurrent.futures<\/tt> and the <tt class=\"docutils literal\">asyncio<\/tt> framework. 
Finally, you can\n&quot;outsource&quot; concurrency by leaving it to a message broker, which is a\ndistinct process that receives and distributes messages.<\/p>\n<p>The talk closes with some tips and best practices, mainly:<\/p>\n","pubDate":"Fri, 26 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-26:\/pycon-de-2018\/concurrency-in-python-concepts-frameworks-and-best-practices.html","category":["PyCon DE 2018","Parallel Programming","Programming","Python"]},{"title":"Strongly typed datasets in a weakly typed world","link":"https:\/\/pyvideo.org\/pycon-de-2018\/strongly-typed-datasets-in-a-weakly-typed-world.html","description":"<h3>Description<\/h3><p>We at Blue Yonder use Pandas quite a lot during our daily data science\nand engineering work. This choice, together with Python as the underlying\nprogramming language, gives us flexibility, a feature-rich interface, and\naccess to a large community and ecosystem. When it comes to preserving\nthe data and exchanging it with different software stacks, we rely on\nParquet Datasets \/ Hive Tables. During the write process, there is a\nshift from a rather weakly typed world to a strongly typed one. For\nexample, Pandas may convert integers to floats for many operations\nwithout asking, but Parquet files and the schema information stored\nalongside them dictate very precise types. The type situation may get\neven more &quot;colorful&quot; when datasets are written by multiple code\nversions or different software solutions over time. This then results in\nimportant questions regarding type compatibility.<\/p>\n<p>This talk will first present an overview of types at different layers\n(like NumPy, Pandas, Arrow and Parquet) and the transitions between these\nlayers. 
The second part of the talk will present examples of type\ncompatibility we have seen and why and how we think they should be handled.\nAt the end there will be a Q+A, which can be seen as the start of a\npotentially longer RFC process to align different software stacks (like\nHive and Dask) to handle types in a similar way.<\/p>\n","pubDate":"Fri, 26 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-26:\/pycon-de-2018\/strongly-typed-datasets-in-a-weakly-typed-world.html","category":["PyCon DE 2018","Algorithms","Big Data","Data Science","Parallel Programming"]},{"title":"Big Data Systems Performance: The Little Shop of Horrors","link":"https:\/\/pyvideo.org\/pycon-de-2018\/big-data-systems-performance-the-little-shop-of-horrors.html","description":"<h3>Description<\/h3><p>The confusion around terms such as NoSQL, Big Data, Data Science,\nSpark, SQL, and Data Lakes often creates more fog than clarity. However,\nclarity about the underlying technologies is crucial to designing the\nbest technical solution in any field relying on huge amounts of data,\nincluding data science and machine learning, but also more traditional\nanalytical systems such as data integration, data warehousing,\nreporting, and OLAP.<\/p>\n<p>In my presentation, I will show that often at least three dimensions are\ncluttered and confused in discussions when it comes to data management:\nfirst, buzzwords (labels &amp; terms like &quot;big data&quot;, &quot;AI&quot;, &quot;data lake&quot;);\nsecond, data design patterns (principles &amp; best practices like\nselection push-down, materialization, indexing); and third, software\nplatforms (concrete implementations &amp; frameworks like Python, DBMS,\nSpark, and NoSQL systems).<\/p>\n<p>Only by keeping these three dimensions apart is it possible to create\ntechnically sound architectures in the field of big data analytics.<\/p>\n<p>I will show concrete examples, which through a simple redesign and wise\nchoice of the right tools and 
technologies, run up to 1000 times\nfaster. This in turn triggers tremendous savings in terms of development\ntime, hardware costs, and maintenance effort.<\/p>\n","pubDate":"Thu, 25 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-25:\/pycon-de-2018\/big-data-systems-performance-the-little-shop-of-horrors.html","category":["PyCon DE 2018","Algorithms","Big Data","Data Science","Infrastructure","Parallel Programming","Programming","Python","Science"]},{"title":"Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy","link":"https:\/\/pyvideo.org\/pycon-de-2018\/fulfilling-apache-arrows-promises-pandas-on-jvm-memory-without-a-copy.html","description":"<h3>Description<\/h3><p>Apache Arrow established a standard for columnar in-memory analytics to\nredefine the performance and interoperability of most Big Data\ntechnologies in early 2016. Since then, implementations in Java, C++,\nPython, GLib, Ruby, Go, JavaScript and Rust have been added. Although\nApache Arrow (<tt class=\"docutils literal\">pyarrow<\/tt>) is already known to many Python\/Pandas users\nfor reading Apache Parquet files, its main benefit is the cross-language\ninteroperability. With Feather and PySpark, you can already benefit from\nthis in Python and R\/Java via the filesystem or network. While they\nimprove data sharing and remove serialization overhead, data still needs\nto be copied as it is passed between processes.<\/p>\n<p>In the 0.23 release of Pandas, the concept of ExtensionArrays was\nintroduced. They allow the extension of Pandas DataFrames and Series\nwith custom, user-defined types. The most prominent example is\n<tt class=\"docutils literal\">cyberpandas<\/tt>, which adds an IP dtype that is backed by the appropriate\nrepresentation using NumPy arrays. These ExtensionArrays are not limited\nto arrays backed by NumPy but can use arbitrary storage as long as\nthey fulfill a certain interface. 
Using Apache Arrow, we can implement\nExtensionArrays that are of the same dtype as the built-in types of\nPandas but whose memory management is not tied to Pandas' internal\nBlockManager. On the other hand, Apache Arrow has a much wider set\nof efficient types that we can also expose as an ExtensionArray. These\ntypes include a native string type as well as arbitrarily nested types\nsuch as <tt class=\"docutils literal\">list of \u2026<\/tt> or <tt class=\"docutils literal\">struct of (\u2026, \u2026, \u2026)<\/tt>.<\/p>\n<p>To show the real-world benefits of this, we take the example of a data\npipeline that pulls data from a relational store, transforms it and then\npasses it into a machine learning model. A typical setup nowadays most\nlikely involves a data lake that is queried with a JVM-based query\nengine. The machine learning model is then normally implemented in\nPython using popular frameworks like CatBoost or TensorFlow.<\/p>\n<p>While these query engines sometimes provide Python clients, their\nperformance is normally not optimized for large result sets. In the\ncase of a machine learning model, we will do some feature\ntransformations and possibly aggregations with the query engine but feed\nas many rows as possible into the model. This then leads to result\nsets of more than a million rows. In contrast to the Python clients,\nthese engines often come with efficient JDBC drivers that can cope with\nresult sets of this size, but then the conversion from Java objects to\nPython objects in the JVM bridge will slow things down again. 
In our\nexample, we will show how to use Arrow to retrieve a large result in the\nJVM and then pass it on to Python without running into these\nbottlenecks.<\/p>\n","pubDate":"Thu, 25 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-25:\/pycon-de-2018\/fulfilling-apache-arrows-promises-pandas-on-jvm-memory-without-a-copy.html","category":["PyCon DE 2018","Algorithms","Big Data","Data Science","Parallel Programming"]},{"title":"Cython to speed up your Python code","link":"https:\/\/pyvideo.org\/pycon-de-2018\/cython-to-speed-up-your-python-code.html","description":"<h3>Description<\/h3><p><a class=\"reference external\" href=\"http:\/\/cython.org\">Cython<\/a> is not only a very fast and comfortable\nway to talk to native code and libraries, it is also a widely used tool\nfor speeding up Python code. The Cython compiler translates Python code\nto C or C++ code, and applies many static optimisations that make Python\ncode run visibly faster than in the interpreter. But even better, it\nsupports static type annotations that allow direct use of C\/C++ data\ntypes and functions, which the compiler uses to convert and optimise the\ncode into fast, native C. The tight integration of all three languages,\nPython, C and C++, makes it possible to freely mix Python features like\ngenerators and comprehensions with C\/C++ features like native data\ntypes, pointer arithmetic or manually tuned memory management in the\nsame code.<\/p>\n<p>This talk by a core developer introduces the Cython compiler by\ninteractive code examples, and shows how you can use it to speed up your\nreal-world Python code. You will learn how you can profile a Python\nmodule and use Cython to compile and optimise it into a fast binary\nextension module. 
All of that, without losing the ability to run it\nthrough common development tools like code checkers or coverage test\ntools.<\/p>\n","pubDate":"Wed, 24 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-24:\/pycon-de-2018\/cython-to-speed-up-your-python-code.html","category":["PyCon DE 2018","Big Data","Infrastructure","Jupyter","Parallel Programming"]},{"title":"Pyccel, a Fortran static compiler for scientific High-Performance Computing","link":"https:\/\/pyvideo.org\/pycon-de-2018\/pyccel-a-fortran-static-compiler-for-scientific-high-performance-computing.html","description":"<h3>Description<\/h3><p><em>Pyccel<\/em> is a new <strong>static compiler<\/strong> for Python that uses <strong>Fortran<\/strong>\nas its backend language while enabling High-Performance Computing (<strong>HPC<\/strong>)\ncapabilities.<\/p>\n<p>Fortran is a computer language for scientific programming that is\ntailored for efficient run-time execution on a wide variety of\nprocessors. Although the <em>2003<\/em> and <em>2008<\/em> standards added major\nimprovements like <em>OOP, Coarrays, Submodules, do concurrent<\/em>, etc.,\nthey are not covered by all available compilers. Moreover, Fortran\ndevelopers still suffer from a lack of <strong>meta-programming<\/strong> compared\nto <strong>C++<\/strong> developers. Therefore, it is more and more difficult for applied\nmathematicians and computational physicists to write applications at the\n<em>state of the art<\/em> (targeting CPUs, GPUs, MICs) while implementing\ncomplicated algorithms or numerical schemes.<\/p>\n<p>Pyccel can be used in two cases:<\/p>\n<p>In order to achieve the second point, we developed an internal DSL for\n<em>types<\/em> and <em>macros<\/em>. The latter is used to map statements based on\n<em>mpi4py<\/em>, <em>scipy.linalg.blas or lapack<\/em> onto the appropriate calls in\nFortran. 
Moreover, two parsers, for <em>OpenMP<\/em> and <em>OpenACC<\/em>, were added\ntoo, allowing for explicit parallelism through the use of pragmas.<\/p>\n<p>Last but not least, Pyccel is an extension of <strong>SymPy<\/strong>. Actually, it\nconverts Python code to symbolic expressions\/trees, from a Full Syntax\nTree (<em>RedBaron<\/em>), then annotates the new AST using types or different\nsettings provided by the user.<\/p>\n<p>In this talk, after a brief description of Pyccel, I will show different\napplications including Finite Elements (1D, 2D, 3D), Semi-Lagrangian\nschemes (4D), Kronecker linear solvers, diagnostics for 5D kinetic\nsimulations and Machine Learning for Partial Differential Equations.<\/p>\n","pubDate":"Wed, 24 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-24:\/pycon-de-2018\/pyccel-a-fortran-static-compiler-for-scientific-high-performance-computing.html","category":["PyCon DE 2018","Artificial Intelligence","Algorithms","Astronomy","Parallel Programming","Programming","Python","Science"]},{"title":"Scalable Scientific Computing using Dask","link":"https:\/\/pyvideo.org\/pycon-de-2018\/scalable-scientific-computing-using-dask.html","description":"<h3>Description<\/h3>","pubDate":"Wed, 24 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-24:\/pycon-de-2018\/scalable-scientific-computing-using-dask.html","category":["PyCon DE 2018","Algorithms","Big Data","Data Science","Parallel Programming","Python"]},{"title":"Selinon - dynamic distributed task flows","link":"https:\/\/pyvideo.org\/pycon-de-2018\/selinon-dynamic-distributed-task-flows.html","description":"<h3>Description<\/h3><p>Have you ever tried to define and process complex workflows for data\nprocessing? If the answer is yes, you might have struggled to find the\nright framework for that. You've probably come across Celery - a popular\ntask flow management tool for Python. 
Celery is great, but it does not\nprovide enough flexibility and dynamic features needed badly in complex\nflows. As we discovered all the limitations, we decided to implement\nSelinon.<\/p>\n<p>Selinon enhances Celery task flow management and allows you to create\nand model task flows in your distributed environment that can\ndynamically change behavior based on computed results in your cluster,\nautomatically resolve tasks that need to be executed in case of\nselective task runs, automatic tracing mechanism and many others.<\/p>\n","pubDate":"Wed, 24 Oct 2018 00:00:00 +0000","guid":"tag:pyvideo.org,2018-10-24:\/pycon-de-2018\/selinon-dynamic-distributed-task-flows.html","category":["PyCon DE 2018","Big Data","Infrastructure","Parallel Programming","Programming","Python"]}]}}