[BUG] Internal Server Error on delete #6339

@jaelynlitz

Description

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.19.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
  • Python version: 3.9
  • yarn version, if running the dev UI:

Describe the problem

After using MLflow for about a year, we have hit this hard-to-reproduce bug several times: an experiment tab shows an INTERNAL_SERVER_ERROR and stops displaying the runs stored in that experiment.

The issue seems to be triggered by deleting one or more runs from an experiment, but it does not happen on every delete. When it does happen, it is usually on the first delete attempted in that experiment. The delete request hangs longer than usual and then returns the INTERNAL_SERVER_ERROR.

Screen Shot 2022-07-26 at 4 30 40 PM

Besides a full fix, I would also be interested to learn if it's possible to retrieve the runs that were stored in the broken experiment.
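
On the retrieval question, one thing that may be worth trying is listing runs in every lifecycle stage through the Python client rather than the UI. This only helps if the server's run-search endpoint still answers for the affected experiment, which this report does not establish. The sketch below is a guess at how that could look, not a confirmed recovery path. `list_runs_by_stage` is a hypothetical helper name, and the experiment name is the one from the repro script in this report.

```python
def list_runs_by_stage(client, experiment_id, view_type, page_size=100):
    """Group run IDs by lifecycle stage, following pagination tokens.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    by_stage = {}
    token = None
    while True:
        page = client.search_runs(
            experiment_ids=[experiment_id],
            run_view_type=view_type,
            max_results=page_size,
            page_token=token,
        )
        for run in page:
            by_stage.setdefault(run.info.lifecycle_stage, []).append(run.info.run_id)
        # search_runs returns a paged list; its token is empty on the last page.
        token = getattr(page, "token", None)
        if not token:
            return by_stage


def main():
    # Imported here so the helper above does not need MLflow to be importable.
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment
    experiment = client.get_experiment_by_name("break server")
    runs = list_runs_by_stage(client, experiment.experiment_id, ViewType.ALL)
    for stage, run_ids in runs.items():
        print(stage, len(run_ids))
```

Calling `main()` against the live tracking server prints a count per lifecycle stage. Run IDs reported as `deleted` could then be passed to `MlflowClient.restore_run`. If the search call itself returns a 500, the backend store would have to be inspected directly instead.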

Steps to reproduce the bug

This is only inconsistently reproducible (I apologize, I know that's not very helpful). I logged 50-100 runs in a new experiment; each run had 5 metrics with ~200 values. I then tried deleting a random number of runs, sometimes a single run and sometimes a whole page. Occasionally the experiment broke, but most of the time the delete completed without problems.

Code to generate data required to reproduce the bug

import random

import boto3  # not used directly in this script
import mlflow

MLFLOW_EXPERIMENT_NAME = "break server"

# Create the experiment on first use, then make it the active experiment.
if not mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME):
    mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

# One run: a single parameter plus five metrics with 200 values each.
with mlflow.start_run():
    mlflow.log_param("number", 1)

    randomlist = random.sample(range(10, 500), 200)
    for x in randomlist:
        for name in ("rand1", "rand2", "rand3", "rand4", "rand5"):
            mlflow.log_metric(name, x)

I then ran this 50-100 times to fill the experiment with runs to delete.
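
The manual steps above could be automated roughly as follows. This is only a sketch. It assumes that deleting through `MlflowClient.delete_run` exercises the same server code path as deleting from the UI, which this report has not verified. The helper names (`populate`, `delete_random_runs`) are made up for illustration.

```python
import random

METRIC_NAMES = ("rand1", "rand2", "rand3", "rand4", "rand5")


def populate(mlflow, n_runs, rng):
    """Log n_runs runs into the active experiment.

    `mlflow` is the imported mlflow module (passed in so it can be swapped out).
    """
    for _ in range(n_runs):
        with mlflow.start_run():
            mlflow.log_param("number", 1)
            # Same shape as the script above: 200 values for each of 5 metrics.
            for x in rng.sample(range(10, 500), 200):
                for name in METRIC_NAMES:
                    mlflow.log_metric(name, x)


def delete_random_runs(client, run_ids, rng, k):
    """Delete up to k randomly chosen runs and return the IDs that were deleted."""
    chosen = rng.sample(run_ids, min(k, len(run_ids)))
    for run_id in chosen:
        client.delete_run(run_id)
    return chosen
```

With a real server, `populate(mlflow, 50, random.Random())` would be called after `mlflow.set_experiment("break server")`. The run IDs for `delete_random_runs` would come from `MlflowClient().search_runs(...)`. Seeding the `random.Random` instance makes the choice of deleted runs repeatable between attempts.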

Is the console panel in DevTools showing errors relevant to the bug?

Screen Shot 2022-07-26 at 5 19 29 PM

Error: Promised response from onMessage listener went out of scope 5 [background.js:841:170](moz-extension://d6a7add6-de11-4239-898f-0a287b30c3e7/dist/background.js)
XHR failed 
Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), setRequestHeader: setRequestHeader(e, t), overrideMimeType: overrideMimeType(e), statusCode: statusCode(e), abort: abort(e), state: state(), always: always(), catch: catch(e), … }
    abort: function abort(e)
    always: function always()
    catch: function catch(e)
    done: function add()
    fail: function add()
    getAllResponseHeaders: function getAllResponseHeaders()
    getResponseHeader: function getResponseHeader(e)
    overrideMimeType: function overrideMimeType(e)
    pipe: function pipe()
    progress: function add()
    promise: function promise(e)
    readyState: 4
    responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
    setRequestHeader: function setRequestHeader(e, t)
    state: function state()
    status: 500
    statusCode: function statusCode(e)
    statusText: "Internal Server Error"
    then: function then(e, r, o)
    <prototype>: Object { … }
[main.895f3836.chunk.js:1:15852](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    error https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    l https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    fireWith https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    S https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    t https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
Object { xhr: {…} }
    xhr: Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), … }
        abort: function abort(e)
        always: function always()
        catch: function catch(e)
        done: function add()
        fail: function add()
        getAllResponseHeaders: function getAllResponseHeaders()
        getResponseHeader: function getResponseHeader(e)
        overrideMimeType: function overrideMimeType(e)
        pipe: function pipe()
        progress: function add()
        promise: function promise(e)
        readyState: 4
        responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
        setRequestHeader: function setRequestHeader(e, t)
        state: function state()
        status: 500
        statusCode: function statusCode(e)
        statusText: "Internal Server Error"
        then: function then(e, r, o)
        <prototype>: Object { … }
    <prototype>: Object { … }
[main.895f3836.chunk.js:1:8885](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1

Does the network panel in DevTools contain failed requests relevant to the bug?

Screen Shot 2022-07-26 at 5 24 17 PM
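
To capture the failing request outside the browser, the delete could also be sent straight to the tracking server's REST API. The sketch below uses only the standard library and assumes the server exposes `POST /api/2.0/mlflow/runs/delete` without authentication. The function names and the placeholder base URL are illustrative, not taken from this report. Since the 500 response body shown above is a generic HTML page, the underlying traceback would most likely have to be read from the server's own logs at the time of the call.

```python
import json
import urllib.error
import urllib.request


def build_delete_request(base_url, run_id):
    """Build the POST request for the runs/delete REST endpoint."""
    url = base_url.rstrip("/") + "/api/2.0/mlflow/runs/delete"
    body = json.dumps({"run_id": run_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def delete_and_report(base_url, run_id, timeout=120):
    """Send the delete and return (status, body_text), including for HTTP errors."""
    request = build_delete_request(base_url, run_id)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # A 500 lands here; the body is whatever the server sent back.
        return err.code, err.read().decode("utf-8", "replace")
```

Timing each call to `delete_and_report("https://<tracking-server>", run_id)` would also show whether the long hang described earlier happens at the HTTP level or only in the UI.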

Labels

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • bug: Something isn't working
  • has-closing-pr: This issue has a closing PR
