Python Script Node

The Python Script node's start-up speed varies with the size of its input data table. I am sure this is because, even when using the columnar table backend, knio.input_tables[0].to_pandas() is called to generate the table schema description inside the node (top left corner of the expanded node view).

Is there any way around this? It effectively slows down the execution of the Python node even if the underlying code is very quick. Passing 4-5 million records requires a start-up time of 30 seconds to 1 minute, while the user's code runs in less than 2-3 seconds.

@pma99car welcome to the KNIME forum. One option I see is to use PyArrow directly for your operations instead of converting to pandas in the first place. Another option might be to split the data and use a loop, or even streaming, in order to speed it up. Sometimes a Cache node in front of the operation can also help.

I tried the same operation with all three options, and Polars beat PyArrow, although that would require some additional Python packages to be installed.

Also if you have large data you want to process you can take a look at this article:

Collect and Restore — or how to handle many large files and resume loops

2 Likes

Hi,

Thank you for the reply. I don’t think I explained my issue very well.

My python script doesn’t contain any pandas functions, nor is pandas imported at all.

My point was that if I write a simple script, e.g. import knime.scripting.io as knio followed by print("1"), the time taken to reach the print statement varies considerably depending on the size of the input table.

I believe this is because the node itself uses pandas to generate the table schema summary. I believe this because the table summary shows "knio.input_tables[0].to_pandas()".

So there is a lot of overhead that cannot be omitted by the user.

@pma99car I don’t think I fully understand. You can have Python nodes without data being loaded. You could also import data directly rather than using a KNIME table, for example by loading CSV or Parquet files.

You could try using a Cache node with domain calculation before loading data into the Python node.

Also, there is the function to_pyarrow(), which might be faster.

Also you should check how much RAM you have assigned to KNIME.

Hi,

I created this workflow to explore the timing of the Python nodes, especially the overhead of the scripting interface.

Setup: Random Data → 5400 rows * 5 cols, 100 iterations, measure runtime
3 Python nodes:

  1. Just start Python and import knime.scripting
  2. Load knime.scripting plus pass the data through
  3. Load knime.scripting plus convert the input data into a dataframe

Result:

Median runtimes:

  • load knime scripting only: 60 ms
  • pass through: 220 ms
  • convert to pandas: 2,000 ms
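For reference, this kind of measurement can be sketched with the standard library alone. The workload lambda below is a placeholder, not the actual node code:

```python
import statistics
import time

def measure(fn, iterations=100):
    """Return the median runtime of fn in milliseconds."""
    runtimes = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        runtimes.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(runtimes)

# Placeholder workload standing in for the node body being measured
median_ms = measure(lambda: sum(range(10_000)))
print(f"median runtime: {median_ms:.3f} ms")
```

Using the median rather than the mean keeps one-off warm-up spikes (interpreter start, first import) from skewing the result.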

4 Likes

Thanks for this, but it doesn’t quite address the issue I am trying to highlight.

Let’s say I have the following Python code that returns the first row of a data table:

import knime.scripting.io as knio
import pyarrow as pa
in_tbl = knio.input_tables[0].to_pyarrow()
first_row = in_tbl.slice(0, 1)
knio.output_tables[0] = knio.Table.from_pyarrow(first_row)

There is no pandas import at all. Our KNIME workflow is set to the columnar backend.

Now I create two python nodes that both contain that script. The first python node receives a large dataset (5,000,000 rows). The second python node receives a small dataset (10,000 rows).

The time taken to start the Python script (i.e. the green ball bouncing back and forth, before the progress bar begins) is significantly longer on the first node than on the second. Both scripts run equally quickly once the green ball stops and the progress bar appears.

I believe this is because pandas is being called by the node, not by the script; see the image above, the text next to “Input Table 1”. Obviously, converting to pandas takes longer on a larger table.

Yeah, that delay happens because of .to_pandas(). You could speed it up by passing just a data sample or the schema instead of the full table.

Thanks for confirming David.

My script itself isn’t as simple as the one I posted. Unfortunately it needs the full dataset passed into the Python node. The script itself is very quick (aggregations using DuckDB on a PyArrow table); it’s just a shame we still do the to_pandas first.

Yeah, I see. I thought this line was just a hint on how to access the data.

1 Like

Hi all!

Sorry to chime in so late. Thanks for all your investigations and questions.
Let me try to clear some of the confusion:

  • if you don’t call to_pandas(), then pandas should not be imported and no time is “wasted” converting the data to a pandas dataframe
  • using the Columnar Backend is definitely a good idea, because then the Python Script can read the data directly that the previous node writes, without any conversions needed
  • KNIME performs asynchronous writing in the background, so a node turning green does not mean that all data has been written to disk yet
  • the Python script needs to wait until the previous node has finished writing its output to disk
  • writing 5,000,000 rows (by default the Data Generator produces 4 double columns plus a string column containing “Cluster_1”, which amounts to ~1.5 GB of data) simply takes a while. And this is what the Python Script node is waiting for

So it’s not the performance of the Python Script node that degrades with larger data; it simply starts processing later, because the data needs to be available before it can begin.

Does that help?
Cheers,
Carsten

2 Likes