0
votes

I'm trying to extract a specific string value from a text file (file1.txt) and then send an HTTP GET request to the extracted string (a URL). The HTTP response should be saved as a new HTML file in the directory. The string I'm trying to extract is the value of a specific key.

For example: "display_url":"test.com" (extract "test.com" and then send the HTTP request to it)

My txt file content:

{"created_at":"Thu Nov 15 11:35:00 +0000 2018","id":15292802,"id_str":325802","text":"test8 https://t.co/ZtCsuk7Ek2 #osining","source":"\u003ca href=\"http://twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":961508561217052675,"id_str":"961508561217052675","name":"Online S","screen_name":"osectraining","location":"Israel","url":"https://www.test.co.il","description":"test","translator_type":"none","protected":false,"verified":false,"followers_count":2,"friends_count":51,"listed_count":0,"favourites_count":0,"statuses_count":7,"created_at":"Thu Feb 08 07:54:39 +0000 2018","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"1B95E0","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/961510231346958336/d_KhBeTD_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/961508561217052675/1518076913","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"osectraining","indices":[33,46]}],"urls":[{"url":"https://t.co/ZtCsuk7Ek2","expanded_ur
l":"http://test.com","display_url":"test.com","indices":[7,30]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1542281700508"}

My code:

import re
with open('file1') as f:
found = []
for line in f.readlines():
    found += re.findall(r'"display_url":\s(\w+)\s', line)
print(found) 
And what have you tried so far? Please post your code as it is. - Matt Morgan
Does your indentation actually look like what you posted? If not, you should fix it; it matters in Python. - Matt Morgan

2 Answers

1
votes

Please note that indentation is critical in Python. It's not clear to me if you have made a mistake in your code indentation, or just a mistake in formatting your posted question. Having said that...

You need to do four things to accomplish the task:

  1. Read file1.txt from disk.
  2. Parse the contents of the file to find the display_url.
  3. Call the URL to get a response.
  4. Write the response to disk.

Your code attempts to do steps 1 and 2, but there are a few problems. The first issue is that your text file has an error in it. The value in this key-value pair is missing its opening quotation mark: "id_str":325802" should read "id_str":"325802".
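
One way to locate an error like this is to let the JSON parser report it. The sketch below uses a shortened excerpt of the file (three keys only, an assumption for illustration) and a hypothetical helper named try_parse; json.JSONDecodeError carries the offset at which parsing failed.

```python
import json

# Shortened excerpt of the file content (an assumption for illustration).
# In `broken`, the quotation marks around the "id_str" value are unbalanced,
# as in the question's file.
broken = '{"id":15292802,"id_str":325802","text":"test8"}'
fixed = '{"id":15292802,"id_str":"325802","text":"test8"}'


def try_parse(text):
    """Return (data, None) on success or (None, error) if text is not valid JSON."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, err


data, err = try_parse(broken)
if err is not None:
    # err.pos is the character offset at which the parser gave up
    print(f"Invalid JSON at offset {err.pos}: {err.msg}")
    print(broken[max(0, err.pos - 20):err.pos + 20])

data, err = try_parse(fixed)
if err is None:
    print(data["id_str"])
```

Printing a few characters around err.pos is usually enough to spot the stray quotation mark by eye.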

If you fix that, you then need to fix the indentation of your code so that f is available when you try to use it. Finally, I don't think the regex approach is really the way to go here, since the file contains JSON and Python can parse that directly.
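
For reference, here is a small sketch of why the posted pattern finds nothing even once the indentation is fixed. The adjusted pattern is only a suggestion for comparison, not something from the question, and the input is a short fragment of the file content.

```python
import re

# Short fragment of the file content from the question
line = '"expanded_url":"http://test.com","display_url":"test.com","indices":[7,30]'

# Pattern from the question: \s demands whitespace after the colon, and the
# quotation marks around the value are never matched, so nothing is found.
original = re.findall(r'"display_url":\s(\w+)\s', line)

# Adjusted pattern (a suggestion): optional whitespace, then capture
# everything between the value's quotation marks.
adjusted = re.findall(r'"display_url":\s*"([^"]+)"', line)

print(original)  # []
print(adjusted)  # ['test.com']
```

Even the adjusted pattern depends on the exact spacing and quoting of the text, which is why parsing the JSON is the more robust route.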

You can read the file and parse it to a Python dictionary easily. Finding the information you want requires that you know the structure of the JSON. Here is one way you could do it:

import json

# Step 1: read the whole file into a single string
with open('./file1.txt', 'r') as f:
    text = f.read()

# Step 2: parse the JSON, then walk down to the first entry of entities -> urls
dictionary = json.loads(text)
entities = dictionary.get('entities')
urls = entities.get('urls')[0]
display_url = urls.get('display_url')
print(display_url)

Now you need to figure out steps 3 and 4, which are really the easy part compared to step 2.
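
As a starting point, steps 3 and 4 might look roughly like the sketch below, which uses only the standard library. The helper names to_url and save_response, the output file name response.html, and the choice of http as the scheme are all assumptions; display_url has no scheme of its own, so using the expanded_url value from the same entry may be the better choice.

```python
from pathlib import Path
from urllib.request import urlopen


def to_url(display_url, scheme="http"):
    """Prefix a bare host such as 'test.com' with a scheme (http is an assumption)."""
    if "://" in display_url:
        return display_url
    return f"{scheme}://{display_url}"


def save_response(url, out_path, timeout=10):
    """Send a GET request to url and write the raw response body to out_path."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read()
    # Write bytes so the page's own encoding is preserved as-is
    Path(out_path).write_bytes(body)
    return len(body)


# Example usage (performs a real network request, so it is left commented out):
# save_response(to_url("test.com"), "response.html")
```

urlopen raises urllib.error.URLError (or its subclass HTTPError) when the request fails, so real code would probably want a try/except around the call.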

0
votes

From your text, it looks like your file contains JSON data, so you can load it as JSON instead of reading lines and then easily get the value of display_url. For example:

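import json

# Parse the whole file as JSON (the file is named file1.txt in the question)
with open('file1.txt') as f:
    data = json.load(f)

# Collect the display_url of every entry under entities -> urls
urls = [x["display_url"] for x in data["entities"]["urls"]]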
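
If some records might lack the "entities" or "urls" keys, a slightly more defensive variant could look like the sketch below. The sample string is a shortened, hypothetical stand-in with the same nesting as the file in the question, and display_urls is an illustrative helper name.

```python
import json

# Shortened, hypothetical sample with the same nesting as the question's file
sample = (
    '{"entities":{"hashtags":[],"urls":[{"url":"https://t.co/ZtCsuk7Ek2",'
    '"expanded_url":"http://test.com","display_url":"test.com"}]}}'
)


def display_urls(data):
    """Collect every display_url, tolerating missing 'entities' or 'urls' keys."""
    urls = data.get("entities", {}).get("urls", [])
    return [u["display_url"] for u in urls if "display_url" in u]


print(display_urls(json.loads(sample)))  # ['test.com']
print(display_urls({}))                  # []
```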