regex - Python match all URL's in a file and list each on new line in file

Question

posted Jan 25, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to write a script that opens a file, searches it for all URLs, and writes a new file containing only the matches. With the code below, I only get the first match. The file I'm parsing is basically one line with multiple URLs: "This is a a random string of urls http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here , http://www.wordpress.com:8080,"

import re

with open(r"C:\Users\username\Desktop\test.txt") as f:
    Lines = f.readlines()
file_to_write = open(r"C:\Users\username\Desktop\output.txt", "w")
pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
matches = []
for line in Lines:
   m = re.search(pattern, line)
   if m:
     matches.append(m.group(0))
   print(matches)
   file_to_write.write("\n".join(matches))

Now, if I replace the regex with something simpler like '(https?://.*):(\d*)', I get all the matches, but they are not separated onto lines; they are all joined together on one line.

I'm not sure how to modify the script or the regex to capture every URL's base:port and put each one on a new line.

Current output with the regex '(https?://.*):(\d*)':

http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here , http://www.wordpress.com:8080,http://www.gmw.cn:8080, http://www.tumblr.com:8080/test/etete/eete, http://www.paypal.com:8080

Desired Output:

http://www.yandex.ru:8080
http://www.hao123.com:8080
http://www.wordpress.com:8080
http://www.gmw.cn:8080
http://www.tumblr.com:8080
http://www.paypal.com:8080

1 Reply

深蓝 · Answer 1 · 2021-01-25T05:30:10+0000

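re.search returns only the first match in the string it is given, and your file is a single line, so you end up with one URL. You can try re.findall instead (with the pattern you already have), which returns every non-overlapping match: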

>>> import re
>>>
>>> s = 'This is a a random string of urls http://www.yandex.ru:8080, http://www.hao123.com:8080, another bit here, http://www.wordpress.com:8080,'
>>> pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
>>> urls = re.findall(pattern, s)
>>> urls
['http://www.yandex.ru:8080', 'http://www.hao123.com:8080', 'http://www.wordpress.com:8080']
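
To see the two calls side by side, here is a minimal sketch that runs both on the sample line from the question. The variable names `line`, `first`, and `all_matches` are only illustrative and do not come from the original script.

```python
import re

# Pattern from the question, written as a raw string so the backslashes survive.
pattern = r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'

# Sample line from the question (illustrative variable name).
line = ('This is a a random string of urls http://www.yandex.ru:8080, '
        'http://www.hao123.com:8080, another bit here , '
        'http://www.wordpress.com:8080,')

# re.search stops at the first match and returns a single match object.
first = re.search(pattern, line)
print(first.group(0))    # http://www.yandex.ru:8080

# re.findall keeps scanning and returns a list of all matched strings
# (full matches here, because the pattern has no capturing groups).
all_matches = re.findall(pattern, line)
print(len(all_matches))  # 3
```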

You can then use the list named urls as you see fit. For example, to write the URLs to a file, you can use (as you already have) file_to_write.write('\n'.join(urls)). For illustration:

>>> print('\n'.join(urls))
http://www.yandex.ru:8080
http://www.hao123.com:8080
http://www.wordpress.com:8080
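
For the file-to-file case in the question, the whole thing might look like the sketch below. `extract_urls` is a hypothetical helper name. The demo writes to a temporary directory rather than the Desktop paths so that it runs anywhere, and it assumes the input file is UTF-8 text.

```python
import re
import tempfile
from pathlib import Path

# Pattern from the question, compiled once (raw string keeps the backslashes).
URL_PATTERN = re.compile(
    r'https?://(?:w{1,3}\.)?[^\s.]+(?:\.[a-z]+)*(?::\d+)?(?![^<]*(?:</\w+>|/?>))'
)

def extract_urls(src, dst):
    """Hypothetical helper: copy every URL found in src to dst, one per line."""
    text = Path(src).read_text(encoding="utf-8")
    urls = URL_PATTERN.findall(text)  # all matches, not just the first
    # Join with newlines and end the file with a newline if anything matched.
    Path(dst).write_text("\n".join(urls) + ("\n" if urls else ""), encoding="utf-8")
    return urls

# Demo with made-up file locations inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text(
        "This is a a random string of urls http://www.yandex.ru:8080, "
        "http://www.hao123.com:8080, another bit here , "
        "http://www.tumblr.com:8080/test/etete/eete, http://www.paypal.com:8080",
        encoding="utf-8",
    )
    found = extract_urls(src, dst)
    written = dst.read_text(encoding="utf-8")

print(written, end="")
```

With the real files, the call would be something like extract_urls(r"C:\Users\username\Desktop\test.txt", r"C:\Users\username\Desktop\output.txt"). Note that the pattern stops after the port, so a URL with a path such as http://www.tumblr.com:8080/test/etete/eete is written as http://www.tumblr.com:8080, which is what the desired output in the question asks for.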
