NewConnectionError /datasets2/dataset_2.pq #1159

@eddiebergman

Hi, I'm having difficulties downloading datasets and I need some help debugging it if possible.

Essentially, I'm trying to run the simple task below on a server but keep hitting errors, and I'm not sure why the same code works locally. This affects every user of our cluster who wants to use automl_benchmark efficiently: we have to download the data locally and scp it over to the cluster, which can cause problems with pandas versions when unpickling, a known issue (see #918 for one example). All other web traffic is handled without issue, so I thought there might be some misspecification in openml-python.

import openml
openml.tasks.get_task(2).get_data()

The error I eventually run into, which is more of a logged warning (emitted from the urllib3 source code), is:

WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None))
after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff83f3d2e20>:
Failed to establish a new connection: [Errno 110] Connection timed out')': /dataset2/dataset_2.pq

Since it works locally, I tried to find where openml makes its calls and came across def _send_request(...), so I added some print debug statements, which give me the output below. All normal URL requests seem to be handled fine, but some web request is happening outside of def _send_request(...) where things go wrong, and it seems to get url="/dataset2/dataset_2.pq", which is not a valid URL.

HIHIHIHIHIHIHIHIHIHI  # In function
5  # n_retries
{}  # params
https://www.openml.org/api/v1/xml/data/2  # url
1  # response pre checksum
<Response [200]>
2 # response after checksum
<Response [200]>

<Response [200]> # response at returning


HIHIHIHIHIHIHIHIHIHI
5
{}
https://www.openml.org/api/v1/xml/data/features/2
1
<Response [200]>
2
<Response [200]>
<Response [200]>

HIHIHIHIHIHIHIHIHIHI
5
{}
https://www.openml.org/api/v1/xml/data/qualities/2
1
<Response [200]>
2
<Response [200]>
<Response [200]>

HIHIHIHIHIHIHIHIHIHI
5
{}
https://api.openml.org/data/v1/download/1666876/anneal.arff
1
<Response [200]>
2
<Response [200]>
<Response [200]>

# Now it hangs
WARNING:urllib3.connectionpool:Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff83ff2f6d0>: Failed to establish a new connection: [Errno 110] Connection timed out')': /dataset2/dataset_2.pq
WARNING:urllib3.connectionpool:Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff83ff24100>: Failed to establish a new connection: [Errno 110] Connection timed out')': /dataset2/dataset_2.pq
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff83f3d2e20>: Failed to establish a new connection: [Errno 110] Connection timed out')': /dataset2/dataset_2.pq
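A note on the path in these warnings: as far as I understand, urllib3's connection pool logs only the path portion of a request, because the host belongs to the pool itself, so "/dataset2/dataset_2.pq" may simply be the tail of a full URL on some other host. The warning also names an HTTPConnection object, which suggests plain HTTP rather than HTTPS. A minimal sketch of the split using only the standard library; the host storage.example.invalid is purely hypothetical, since the real host does not appear in the warning:

```python
from urllib.parse import urlsplit

# Hypothetical full URL: the real host is NOT shown in the warning,
# so "storage.example.invalid" is a placeholder for illustration only.
full_url = "http://storage.example.invalid/dataset2/dataset_2.pq"

parts = urlsplit(full_url)

print(parts.scheme)  # http
print(parts.netloc)  # storage.example.invalid
print(parts.path)    # /dataset2/dataset_2.pq  (the only piece the warning shows)
```

If this reading is right, the question becomes which host the request goes to and whether the cluster can reach it.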

I tried searching for import requests, "pq" and "dataset", but this didn't help me find where the extra call is made. Any advice would be greatly appreciated.
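One way to locate the extra call without searching the source might be to attach a logging handler to the urllib3.connectionpool logger (the name shown in the warnings above) that prints the Python call stack whenever a warning is emitted. This is only a sketch using the standard library; StackTraceHandler is a name made up here and is not part of openml or urllib3:

```python
import logging
import traceback


class StackTraceHandler(logging.Handler):
    """Hypothetical helper: capture the call stack when a record is emitted."""

    def __init__(self):
        # Only react to WARNING and above, the level of the retry messages.
        super().__init__(level=logging.WARNING)
        self.stacks = []

    def emit(self, record):
        # The handler runs inside the logging call, so the current stack
        # includes every frame that led to the warning.
        stack = "".join(traceback.format_stack())
        self.stacks.append(stack)
        print(f"{record.levelname}:{record.name}: {record.getMessage()}")
        print(stack)


handler = StackTraceHandler()
logging.getLogger("urllib3.connectionpool").addHandler(handler)
```

With the handler installed before calling openml.tasks.get_task(2).get_data(), each retry warning should be followed by the chain of frames that triggered it, which should point at the code issuing the request.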

openml            0.12.2

Best,
Eddie
