Improve url detection #166
Conversation
The previous implementation relied on regular expressions run through a shell command. That was sufficient as an MVP, but it was not accurate enough for corner cases. After doing some more research, it seemed that an HTML parsing library would be a better fit for this purpose. As such, the shell command has been scrapped in favor of a more robust approach for detecting URLs.
- Implement skeleton for extract_urls
- Detect HTML and Markdown files
- Use bs4 for parsing HTML
- Convert Markdown for bs4 parsing
- Remove use of urlin.txt and urlout.txt
- Remove unnecessary global vars
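For illustration, here is a minimal sketch of what a parser-based approach along those lines might look like. The function name extract_urls and the dispatch on file extension follow the bullet points above; the bs4 and markdown library calls are standard, but the surrounding structure and file names are assumptions for this sketch, not the PR's actual code.

```python
from pathlib import Path

import markdown
from bs4 import BeautifulSoup


def extract_urls(path):
    """Return all href URLs found in an HTML or Markdown file.

    Markdown files are first converted to HTML so that a single
    bs4-based extraction path handles both formats.
    """
    text = Path(path).read_text(encoding="utf-8")
    suffix = Path(path).suffix.lower()

    if suffix in {".md", ".markdown"}:
        # Convert Markdown to HTML so bs4 can parse it
        html = markdown.markdown(text)
    elif suffix in {".html", ".htm"}:
        html = text
    else:
        return []  # unsupported file type

    soup = BeautifulSoup(html, "html.parser")
    # Collect the href attribute of every anchor tag
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # Example usage (hypothetical file name)
    for url in extract_urls("post.markdown"):
        print(url)
```

Parsing the document tree avoids the false positives and missed matches that a regex over raw text tends to produce, since only real anchor tags are considered.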
Preview of script output:
Wow, this looks like a huge improvement. Thanks again for doing all this work @huangsam.
My pleasure @mattmakai. It was great meeting you in person at PyCon 2018. I recall that you sent me an invitation to share my creation via http://twiliovoices.com - is it still possible?
Yes! Go ahead and submit the form that's linked to from that website. I'll respond back via my Twilio email next week. Have a great weekend. |