I'm trying to write a script that opens a file, finds all the URLs in it, and writes just the matches to a new file.
With the code below I only get the first match. The file I'm parsing is basically one line containing multiple URLs:
"This is a a random string of urls http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here , http://www.wordpress.com:8080,"
import re

# Raw strings keep the backslashes in the Windows paths literal
with open(r"C:\Users\username\Desktop\test.txt") as f:
    Lines = f.readlines()

file_to_write = open(r"C:\Users\username\Desktop\output.txt", "w")

pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
matches = []
for line in Lines:
    # re.search returns only the first match in each line
    m = re.search(pattern, line)
    if m:
        matches.append(m.group(0))

print(matches)
file_to_write.write("\n".join(matches))
file_to_write.close()
Now, if I replace the regex with something simpler, like '(https?://.*):(\d*)', I get all the URLs, but they are not split onto separate lines; they all come back joined together as one match.
I'm not sure how to modify the script or the regex to capture the base:port of every URL and write each one on its own line.
Current output with the regex '(https?://.*):(\d*)':
http://www.yandex.ru:8080, http://www.hao123.com:8080, antoher bit here , http://www.wordpress.com:8080,http://www.gmw.cn:8080, http://www.tumblr.com:8080/test/etete/eete, http://www.paypal.com:8080
Desired Output:
http://www.yandex.ru:8080
http://www.hao123.com:8080
http://www.wordpress.com:8080
http://www.gmw.cn:8080
http://www.tumblr.com:8080
http://www.paypal.com:8080
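One possible way to get the desired output above is to swap re.search for re.findall, which returns every non-overlapping match in a string rather than only the first. The sketch below is a minimal illustration, not a drop-in fix: the pattern and the names URL_WITH_PORT and extract_urls are assumptions made for this example, and the pattern only handles URLs of the form scheme://host:port as in the sample data.

```python
import re

# Assumed pattern for this sketch: scheme, then a host made of characters
# that are not whitespace, comma, slash or colon, then a numeric port.
URL_WITH_PORT = re.compile(r"https?://[^\s,/:]+:\d+")


def extract_urls(text):
    """Return every scheme://host:port substring found in text, in order."""
    # With no capturing groups in the pattern, findall returns whole matches.
    return URL_WITH_PORT.findall(text)


sample = (
    "This is a a random string of urls http://www.yandex.ru:8080, "
    "http://www.hao123.com:8080, another bit here , "
    "http://www.wordpress.com:8080,"
    "http://www.tumblr.com:8080/test/etete/eete"
)

urls = extract_urls(sample)
output_text = "\n".join(urls)  # one URL per line
print(output_text)
```

In the original script, the same idea would mean extending matches with the findall result for each line and writing "\n".join(matches) to the output file. Because the host part excludes "/" and ":", any path after the port (as in the tumblr URL) is left out of the match.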