
Update scoring api. #6575

Merged
dbczumar merged 98 commits into mlflow:branch-2.0 from tomasatdatabricks:tomas-update-scoring-api
Sep 27, 2022

Conversation

@tomasatdatabricks
Contributor

This PR makes the following changes to the scoring API:

Headers: the format flags used in MLflow 1.0 to specify the pandas encoding (split, records, or numpy split) are no longer recognized.
Input data:
The input is a dictionary with exactly one of the dataframe_split, dataframe_records, instances, or inputs fields.
Output data:
The output is a dictionary with a predictions field that contains the jsonized model output. The output itself remains unchanged.

Implementation-wise, we can no longer use the pandas.read_json function, since the JSON is nested. Instead, we parse the JSON using the built-in json library and turn the input into a pandas DataFrame. This means that the type hints are not applied until the data is parsed, and we have to cast the types explicitly.
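The parsing path described above can be sketched as follows. This is a minimal illustration, not the actual MLflow implementation: the payload shapes are assumed from the field names listed in the PR description, and the helper name is hypothetical.

```python
import json

import pandas as pd


def parse_scoring_payload(body, dtypes=None):
    """Parse a scoring request body with the built-in json module, then build a
    DataFrame and apply dtypes explicitly (since pandas.read_json no longer
    does the type coercion for us on the nested format)."""
    payload = json.loads(body)
    keys = {"dataframe_split", "dataframe_records", "instances", "inputs"}
    present = keys & payload.keys()
    if len(present) != 1:
        raise ValueError(f"Payload must contain exactly one of {sorted(keys)}")
    (key,) = present
    if key == "dataframe_split":
        # expects {"columns": [...], "data": [...]} as in pandas' split orient
        df = pd.DataFrame(**payload[key])
    elif key == "dataframe_records":
        df = pd.DataFrame.from_records(payload[key])
    else:
        # "instances" / "inputs": tensor-style input
        df = pd.DataFrame(payload[key])
    if dtypes:
        # type hints are applied only after parsing, hence the explicit cast
        df = df.astype(dtypes)
    return df


body = '{"dataframe_split": {"columns": ["a", "b"], "data": [[1, 2], [3, 4]]}}'
df = parse_scoring_payload(body, dtypes={"a": "float64"})
```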

Signed-off-by: tomasatdatabricks [email protected]

from mlflow.exceptions import MlflowException


def infer_and_parse_json_output(decoded_input):
Contributor Author

NOTE: I am not sure what the client is for or what the intended API is. I tried to preserve the existing behavior as closely as possible, but maybe we should change it?

Collaborator

when mlflow.pyfunc.spark_udf() is called with env_manager="virtualenv" or env_manager="conda", MLflow starts a model server and uses this client to send requests to it in order to perform inference; this is the long-awaited "restore model environments in spark UDF" feature that we talked about ~ 3 years ago :). It isn't a public API, so changes are okay as long as Spark UDF continues to work.

cc @WeichenXu123

Collaborator

Yes. The scoring_server.client module is used in spark_udf with env_manager="virtualenv" or env_manager="conda".

if len(schema.inputs) == 1:
    dtypes = schema.numpy_types()[0]
if pandas_orient == "records":
    if not isinstance(decoded_input, list):
Contributor Author

NOTE: pandas is able to interpret almost anything as the records format, so the input does not have to be a list for pandas's sake. However, from our perspective, it is better to be more restrictive so there is less variance in the input encoding and it is easier to parse the input without pandas, e.g. when parsing logs.
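The restriction can be sketched like this (the helper name and error type are assumptions for illustration, not the exact MLflow code): pandas would happily build a frame from looser shapes, but the server rejects non-list records input up front.

```python
import pandas as pd


def parse_records(decoded_input):
    """Accept only a JSON list of records, even though pandas itself would
    accept other shapes; keeping the encoding narrow makes the payload
    parseable without pandas (e.g. when scanning logs)."""
    if not isinstance(decoded_input, list):
        raise TypeError("dataframe_records must be a list of records")
    return pd.DataFrame.from_records(decoded_input)


ok = parse_records([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
try:
    # pandas could interpret a bare dict as records, but we reject it
    parse_records({"a": 1, "b": 2})
    rejected = False
except TypeError:
    rejected = True
```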

@dbczumar dbczumar self-requested a review August 29, 2022 06:23
local_model_path_on_executor,
workers=1,
install_mlflow=False,
install_mlflow=(mlflow_home is None),
Collaborator

Previously, we would never install MLflow. Now, it seems like we'll almost always install MLflow, since users don't typically set mlflow_home. Is this necessary? cc @WeichenXu123

Contributor Author

Oh, nice catch, it's inverted. I want to install mlflow only when MLFLOW_HOME is set. MLFLOW_HOME (not a very descriptive name in retrospect) is meant for development, so that you can test your local changes when running stuff in a conda environment. For example, in this case, the tests were failing because the conda environment was installing mlflow from pip and hence was expecting the old input.
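The intended (un-inverted) condition can be sketched as follows; the helper name is hypothetical, but the variable names mirror the diff above.

```python
import os


def should_install_mlflow(env=os.environ):
    """Install the local dev copy of mlflow into the restored environment only
    when MLFLOW_HOME points at a checkout; otherwise rely on the
    pip-installed mlflow. This is the corrected form of the inverted
    condition discussed above."""
    mlflow_home = env.get("MLFLOW_HOME")
    return mlflow_home is not None
```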

Collaborator

@WeichenXu123 WeichenXu123 Sep 5, 2022

Yes, I think we only need to set install_mlflow to true when running tests.

Comment on lines +1033 to +1034
# Used in test to force install lcoal version of mlflow when starting a model server
mlflow_home = os.environ.get("MLFLOW_HOME")
Collaborator

The impact of this appears to extend beyond tests: https://github.com/mlflow/mlflow/pull/6575/files#r959175274.

Suggested change
# Used in test to force install lcoal version of mlflow when starting a model server
mlflow_home = os.environ.get("MLFLOW_HOME")
# Used in test to force install local version of mlflow when starting a model server
mlflow_home = os.environ.get("MLFLOW_HOME")

Contributor Author

Commented above; hopefully that explains it. We can also rename MLFLOW_HOME to something like MLFLOW_DEV_PROJECT_HOME in a followup.

Collaborator

Thanks!


response_body = response["Body"].read().decode("utf-8")
return infer_and_parse_json_input(response_body)
return ScoringServerResponse.from_raw_json(response_body)
Collaborator

@dbczumar dbczumar Aug 31, 2022

This breaks the API contract of BaseDeploymentClient.predict(): https://mlflow.org/docs/latest/python_api/mlflow.deployments.html#mlflow.deployments.BaseDeploymentClient.predict, which states that the API returns a pandas DataFrame, Series, or numpy array (i.e. pyfunc output, though list should also be added to the docs).

The goal of the original implementation was to convert the response JSON into the corresponding pyfunc output object. Can we preserve that, e.g. by adding / using a utility for parsing the response as a pyfunc output object?

Contributor Author

Ah, this is a hard one. We could certainly preserve the behavior, but I am not sure if we should, because:

  1. We wrap the response in a dictionary so we can return additional data. I think we should enable access to any additional output via the client interface.
  2. It's actually impossible to know whether the response should be a pandas DataFrame, a Series, or a numpy array. Most clients will always return a pandas DataFrame. infer_and_parse_json returns a DataFrame unless the input is explicitly a tensor input, i.e. a dict with an instances or inputs field. Unfortunately, our current server does not make that distinction and just jsonizes whatever the model returned, and it is a bit late to change that now.

Therefore, I would propose we update the interface to return a pyfunc response object that can access the prediction as a DataFrame or numpy array per the user's request and will have access to all response fields. Wdyt? Is that too big a change this late? We could also preserve the original behavior by default and add a flag for the new behavior.
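A minimal sketch of the response object proposed above (the class and method names here are hypothetical, not from the PR): the raw dict stays accessible, including any extra fields, while convenience accessors give the prediction as a DataFrame or numpy array.

```python
import json

import numpy as np
import pandas as pd


class PredictionsResponse(dict):  # hypothetical name for the proposed object
    """Wraps the scoring-server response dict; extra fields remain reachable
    via normal dict access, and the prediction can be materialized as either
    a DataFrame or an ndarray on request."""

    @classmethod
    def from_json(cls, json_str):
        parsed = json.loads(json_str)
        if not isinstance(parsed, dict) or "predictions" not in parsed:
            raise ValueError("response must be a dict with a 'predictions' field")
        return cls(parsed)

    def to_dataframe(self):
        return pd.DataFrame(data=self["predictions"])

    def to_numpy(self):
        return np.asarray(self["predictions"])


resp = PredictionsResponse.from_json('{"predictions": [1.5, 2.5], "extra": "kept"}')
```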

Comment on lines +12 to +32
class ScoringServerResponse(dict):
    """Predictions returned from the MLflow scoring server. The predictions are presented as a
    dictionary with convenience methods to access the predictions parsed as a higher-level type."""

    @classmethod
    def from_raw_json(cls, json_str):
        try:
            parsed_response = json.loads(json_str)
        except Exception as ex:
            raise MlflowException(f"Model response is not a valid json. Error {ex}")
        if not isinstance(parsed_response, dict) or "predictions" not in parsed_response:
            raise MlflowException(
                "Invalid predictions data. Prediction object must be a dictionary "
                "with 'predictions' field."
            )
        return ScoringServerResponse(parsed_response)

    def get_predictions_dataframe(self):
        """Get the predictions returned from the server as a pandas.DataFrame. This method will
        fail if the returned predictions cannot be converted to a DataFrame."""
        return pd.DataFrame(data=self["predictions"])
Collaborator

@dbczumar dbczumar Aug 31, 2022

Following from https://github.com/mlflow/mlflow/pull/6575/files#r959181856, is this class necessary? It doesn't support the full set of pyfunc outputs. Is there a utility method that we can add or extend for parsing the response as a pyfunc output object?

Contributor Author

Responded above.

Collaborator

@dbczumar dbczumar left a comment

@tomasatdatabricks Awesome stuff! Left some small comments and questions. Do we have time allocated to update our docs for this, such as https://mlflow.org/docs/latest/models.html#deploy-mlflow-models? Looks like a lot of this content would be out of date after merging this PR.

Can we add docs updates to the current PR or file a separate PR that we can merge at the same time as this one?

@tomasatdatabricks
Contributor Author

Thanks for the review @dbczumar ! I am planning to update the docs, but I am still fixing tests :/ The PR grew unexpectedly in scope as I worked on it, because it impacts more things than I anticipated. So I was hoping to get all tests green before bringing everything up, getting agreement, and updating the docs at the end. I will be working on this full time next sprint; it's my last sprint before I go on parental leave, so I have to finish it.

tomasatdatabricks and others added 13 commits September 2, 2022 06:47
…field and nest dataframe inputs explicit fields

Signed-off-by: tomasatdatabricks <[email protected]>
Co-authored-by: Corey Zumar <[email protected]>
Signed-off-by: tomasatdatabricks <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@dbczumar dbczumar force-pushed the tomas-update-scoring-api branch from 9e70640 to c348b17 Compare September 23, 2022 20:53
Signed-off-by: dbczumar <[email protected]>
@dbczumar
Collaborator

@mlflow-automation autoformat

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks @tomasatdatabricks!

@dbczumar
Collaborator

Test failures are caused by sqlparse upgrade and unrelated to this PR. Merging!

@dbczumar dbczumar merged commit 1d78f30 into mlflow:branch-2.0 Sep 27, 2022
@harupy harupy mentioned this pull request Oct 27, 2022
31 tasks

Labels

area/scoring MLflow Model server, model deployment tools, Spark UDFs rn/breaking-change Mention under Breaking Changes in Changelogs.

5 participants