Minicrawl dataset #18842

@alexey-milovidov

Download the front pages of several million websites with curl.
Record all metadata, such as headers, redirects, TLS version, and cipher, as well as the data (HTTP body).
Create a dataset from it. The dataset will enable research similar to https://w3techs.com/

See also: https://commoncrawl.org/
See also: https://www.rukv.ru/ (created and abandoned by Aleksey Tutubalin)
