Improve url detection #166
Conversation
The previous implementation relied on regular expressions run through a shell command. That was sufficient as an MVP, but it was not accurate enough for corner cases. After doing some more research, it seemed that an HTML parsing library would be a better fit for this purpose. As such, the shell command has been scrapped in favor of a more robust approach for detecting URLs.
- Implement skeleton for extract_urls
- Detect HTML and Markdown files
- Use bs4 for parsing HTML
- Convert Markdown for bs4 parsing
- Remove use of urlin.txt and urlout.txt
- Remove unnecessary global vars
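For illustration, here is a minimal sketch of what a parser-based approach along those lines might look like. The function name extract_urls and the dispatch on file extension follow the bullet points above; the bs4 and markdown library calls are standard, but the surrounding structure and file names are assumptions for this sketch, not the PR's actual code.

```python
from pathlib import Path

import markdown
from bs4 import BeautifulSoup


def extract_urls(path):
    """Return all href URLs found in an HTML or Markdown file.

    Markdown files are first converted to HTML so that a single
    bs4-based extraction path handles both formats.
    """
    text = Path(path).read_text(encoding="utf-8")
    suffix = Path(path).suffix.lower()

    if suffix in {".md", ".markdown"}:
        # Convert Markdown to HTML so bs4 can parse it
        html = markdown.markdown(text)
    elif suffix in {".html", ".htm"}:
        html = text
    else:
        return []  # unsupported file type

    soup = BeautifulSoup(html, "html.parser")
    # Collect the href attribute of every anchor tag
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # Example usage (hypothetical file name)
    for url in extract_urls("post.markdown"):
        print(url)
```

Parsing the document tree avoids the false positives and missed matches that a regex over raw text tends to produce, since only real anchor tags are considered.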
Preview of script output:
Wow, this looks like a huge improvement. Thanks again for doing all this work @huangsam.
My pleasure @mattmakai. It was great meeting you in person at PyCon 2018. I recall that you sent me an invitation to share my creation via http://twiliovoices.com - is it still possible?
Yes! Go ahead and submit the form that's linked to from that website. I'll respond back via my Twilio email next week. Have a great weekend. |