Exploring the world through data

Join me as I embark on a journey to uncover a valuable tourism and travel dataset through web scraping and data pre-processing. This article is divided into two major sections.

The first section focuses on obtaining relevant data from the popular social media platform, Reddit, using the PRAW (Python Reddit API Wrapper) library. The second section delves into the pre-processing of our scraped data and extracting travel locations using the libraries Spacy and GeoPy.

You're welcome to start with the section that piques your interest the most.

11 minute read

NOTE: The code implementation discussed in this article can be found here.

Part 1: Scraping data from Reddit

Web scraping is the process of extracting data from a website. In this section, we will use the Python Reddit API Wrapper (PRAW) library to scrape the top 1000 posts from r/travel, r/solotravel, and other subreddits. Here are the steps to do so.

Step 1: Install the PRAW library
To use PRAW, we first need to install the library. This can be done easily by running the command `pip install praw` in the terminal.

Step 2: Obtain a Reddit API key
To access data on Reddit, we need to obtain a Reddit API key. This can be done by creating a new application on the Reddit website from here. Once done, we'll be provided with a client ID and a client secret, which we will need in the next step.

Step 3: Create a Reddit instance
In a Python script, import the PRAW library and create a new instance of the Reddit class using our client ID and client secret as follows:


import praw

# create a reddit instance and login to client
reddit = praw.Reddit(client_id = "your_client_id",
					client_secret = "your_client_secret",
					username = "your_username",
					password = "your_password",
					user_agent = "your_user_agent",
					check_for_async = False)
		

Step 4: Scraping the data
It's worth noting that the Reddit API limits the rate at which we make calls to access Reddit data. Fortunately, PRAW automatically handles rate limits so we don't need to worry about exceeding the rate limit and getting blocked. To extract the relevant information from the top 1000 posts from the specified subreddits, we will define some helper functions.


from tqdm import tqdm

def get_info(post, pbar=None):
	"""
	Extract the information we need from a Reddit post.

	Parameters
	----------
	post: The post from which to extract information.
	pbar: A progress bar to update the number of iterations.

	Returns
	-------
	list: A list containing the post's title, subreddit, creation time,
	number of comments, score, upvote ratio, and description.
	"""
	# update iterations if a progress bar is provided
	if pbar: pbar.update(1)
	return [post.title, post.subreddit, post.created_utc, post.num_comments,
			post.score, post.upvote_ratio, post.selftext]


def extract(N=1000, sub_name=None):
	"""
	Extract the top N posts from a given subreddit.

	Parameters
	----------
	N: The number of posts to extract.
	sub_name: The name of the subreddit from which to extract posts.

	Returns
	-------
	list: A list containing the information from the top N posts.
	"""
	assert sub_name is not None

	subreddit = reddit.subreddit(sub_name)

	with tqdm(total=N) as pbar:
		data = [get_info(post, pbar) for post in subreddit.top(limit=N)]

	return data
		

The function `get_info` returns the relevant information we need from a Reddit post, and the function `extract` extracts information for N posts from the specified subreddit.
Now with the code below, we can successfully build our dataset of Reddit posts.


import pandas as pd

# Names of subreddits to extract data from
sub_names = ['travel', ... , 'solotravel']
dataset = []

# Number of posts to extract
N = 1000

# Extract posts
for sub_name in sub_names:
	print(f"Extracting data from r/{sub_name}...")
	dataset.extend(extract(N, sub_name))

# Convert extracted data to a Pandas dataframe
columns = ['title', 'subreddit', 'timestamp', 'num_comments',
		   'upvotes', 'upvote_ratio', 'description']
travel_dataset = pd.DataFrame(dataset, columns=columns)
		

That's all! With these simple steps, we have successfully extracted our Reddit travel dataset. As always, be sure to check Reddit's policies before scraping.

Part 2: Preprocessing the Reddit dataset

We will start by extracting locations from our dataset. We will use the Named Entity Recognition (NER) model from the spaCy library. NER is an NLP task concerned with identifying and classifying named entities such as people, locations, organizations, and others in unstructured text. The model will identify entities in our text, and we will only keep the entities labeled GPE (Geo-Political Entity), which represents countries, cities, and states.


import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    # Process the text using the spacy model
    doc = nlp(text)
    # Extract entities labelled 'GPE'
    entities = {entity.text for entity in doc.ents if entity.label_ == 'GPE'}
    return entities
		

Now we need to map all the locations using the GeoPy library. We will first create a pool of all the different locations extracted with the NER model above, and then use GeoPy to map them to their corresponding geo-code objects. This way we ensure that we do not make duplicate calls to GeoPy, since each call is time-consuming. Moreover, to handle the rate limits, we will use the `RateLimiter` from `geopy.extra.rate_limiter`. The geo-code objects allow us to get the desired location's latitude, longitude, altitude, and full address, which includes the country of that location.


# create a pool of all the different locations
all_locations = set()
for i in range(travel_dataset.shape[0]):
    all_locations = all_locations.union(extract_entities(travel_dataset.loc[i, 'title']))

# put the unique locations into a dataframe for geocoding
df_all_locations = pd.DataFrame({'location': sorted(all_locations)})
		

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# create a geo-code object
geolocator = Nominatim(user_agent="specify_your_app_name_here")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# extract geo-codes for all the locations (tqdm.pandas() enables progress_apply)
tqdm.pandas()
df_all_locations['geo_code'] = df_all_locations['location'].progress_apply(geocode, language='en')
		

The Nominatim object we created is a geocoding service that allows us to convert a location name or address into the corresponding geographic coordinates. It returns a geo-code object for each location, from which we can extract the address, latitude, longitude and the altitude. This is done in the code snippet below.


df_all_locations['address'] = df_all_locations['geo_code'].apply(lambda loc: loc.address if loc else None)
df_all_locations['latitude'] = df_all_locations['geo_code'].apply(lambda loc: loc.latitude if loc else None)
df_all_locations['longitude'] = df_all_locations['geo_code'].apply(lambda loc: loc.longitude if loc else None)
df_all_locations['altitude'] = df_all_locations['geo_code'].apply(lambda loc: loc.altitude if loc else None)

df_all_locations = df_all_locations.drop(columns=['geo_code'])
			

We will now process each row to extract the location, country, longitude, latitude, altitude, and the month and year of the Reddit post. Below, we assume `location_mapping` is the geocoded dataframe we built above (`df_all_locations`), and `location_to_idx` maps each location name to its row index in it. This can be done as follows:


import datetime as dt

# helper lookups into the geocoded locations dataframe
location_mapping = df_all_locations
location_to_idx = {loc: idx for idx, loc in enumerate(location_mapping['location'])}

def process_row(row):
	"""
	Processes a single row of the DataFrame.

	Parameters
	----------
	row: A single row of the DataFrame

	Returns
	-------
	cur_location: location of post
	country: country of post
	longitude: longitude of post
	latitude: latitude of post
	altitude: altitude of post
	month: month of post
	year: year of post
	"""
	locations = extract_entities(row.title)

	cur_location = None
	country = None
	longitude = None
	latitude = None
	altitude = None

	# Extract info from the first location only
	for location in locations:
		idx = location_to_idx[location]

		# skip locations that could not be geocoded
		if location_mapping.loc[idx].isnull().sum(): continue

		# the country is the last part of the full address
		country = location_mapping.loc[idx, 'address'].split(',')[-1]
		country = country.strip().lower()

		cur_location = location.strip().lower()
		longitude = location_mapping.loc[idx, 'longitude']
		latitude = location_mapping.loc[idx, 'latitude']
		altitude = location_mapping.loc[idx, 'altitude']

		break

	d = dt.datetime.fromtimestamp(row.timestamp)

	return cur_location, country, longitude, latitude, altitude, d.month, d.year

# process the 'title' and 'timestamp' columns to get the location, country, 
# coordinates, and month and year of the post
df = travel_dataset.copy(deep=True)
columns = ['location', 'country', 'longitude', 'latitude', 'altitude', 'month', 'year']
df[columns] = df.apply(process_row, axis='columns', result_type='expand')
		

With a short analysis, we can see that the altitude of all the locations is 0, which is not accurate. As a result, we will drop the altitude column, since it does not provide any useful information. Additionally, since we have already extracted the month and year from the timestamp, we no longer need the timestamp column. The description column and rows with empty locations are also removed, as they are not needed for the analysis.
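
As a quick sanity check (not from the original notebook, just one way to confirm the claim), we can verify that every geocoded altitude is zero:

# every geocoded altitude comes back as 0
print((df['altitude'] == 0).all())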


# drop altitude because all of them are 0, which is not likely. Did no one visit the mountains?
df = df.drop(columns=['timestamp', 'description', 'altitude'])

# drop rows with null locations
df = df.loc[df['location'].notnull(), :].reset_index(drop=True)
		

Now we will handle locations that appear in our dataset under different names.

First, we check for any places that might be abbreviated by listing all the locations whose names are at most 3 characters long.


locations = set()
for _, row in df.iterrows():
    if len(row.location) <= 3:
        locations.add(row.location)

print(locations)
		

Upon running the above code, we see the following list of names:
{'got', 'ia', 'nyc', 'nz', 'sea', 'sf', 'st', 'uae', 'uk', 'usa', 'wa', 'èze'}

Through prior knowledge and some research online, we can find out that NYC stands for New York City, NZ for New Zealand, UAE for the United Arab Emirates, UK for the United Kingdom, USA for the United States of America, and Èze is a commune in the Alpes-Maritimes department in southeastern France. We will replace these abbreviations with their full forms together with the other renames later in this section. The remaining entries are not actual locations and were wrongly classified as such, so we remove them.


remove_values = ['got', 'ia', 'sea', 'sf', 'st', 'wa']
df = df.loc[~df['location'].isin(remove_values)].reset_index(drop=True)
			

Next, we use the fuzzywuzzy library to identify locations that have very similar spellings and replace them with the correct names.


from fuzzywuzzy import fuzz

places = set()

for _, row in df.iterrows():
	places.add(row.location)
	places.add(row.country)

places = list(places)

# print pairs of places with very similar spellings
for i in range(len(places)):
	for j in range(i+1, len(places)):
		similarity = fuzz.ratio(places[i], places[j])
		if similarity > 80:
			print(f"{places[i]}, \t {places[j]}")
			

Now we will focus on locations that contain extra words and therefore might not have been caught by the FuzzyWuzzy check. For example, consider 'Colorado Springs' and 'Colorado': the pair clearly overlaps, but the whole extra word keeps the similarity score low, so it wasn't flagged before.

For every pair of locations, we check if one of them is a subset of the other (as sketched below). If that is the case, we print the pair and decide manually which name to keep. We make the following replacements in the code snippet below:
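
A minimal sketch of that subset check, reusing the `places` list built in the FuzzyWuzzy step (the exact code in the notebook may differ):

# print pairs where one place name is contained in the other
for i in range(len(places)):
	for j in range(len(places)):
		if i != j and places[i] in places[j]:
			print(f"{places[i]}, \t {places[j]}")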


replace_list = [('south east iceland', 'south iceland'), 
('puerto rico -', 'puerto rico'), 
('bosnia i herzegovina', 'bosnia and herzegovina'), 
('the virgin islands', 'british virgin islands'), 
('british virgin islands -', 'british virgin islands'), 
('grece', 'greece'), 
('usa', 'united states'), 
('america', 'united states'), 
('the united kingdom', 'united kingdom'), 
('uk', 'united kingdom'), 
('the united arab emirates', 'united arab emirates'), 
('uae', 'united arab emirates'), 
('new york city', 'new york'),
('porto 💜', 'porto'),
('funland', 'finland'),
('the czech republic', 'czech republic'),
('japan-', 'japan'),
('namib', 'namibia'),
('brugge', 'bruges'),
('my ireland', 'ireland'),
('nz', 'new zealand'),
('nyc', 'new york'),
]


for item in replace_list:
    df.replace(to_replace=item[0], value=item[1], inplace=True)
			

After further analysis, we can see that some locations marked as United Arab Emirates had their country listed as United States, which is clearly incorrect. Also, some locations are named "corona" or "coronavirus", which are not real places. We fix both issues below.


# some rows have a mismatch with location being UAE and country being USA
for i in range(df.shape[0]):
	if df.loc[i, 'location'] == 'united arab emirates' and df.loc[i, 'country'] == 'united states':
		df.loc[i, 'country'] = 'united arab emirates'

# remove locations containing corona/coronavirus in them
remove_values = ['corona', 'coronavirus']
df = df.loc[~df['location'].isin(remove_values)].reset_index(drop=True)
			

With the country information for each location, we can now manually add a 'continent' column to our dataset. We can obtain a list of continents and their countries from the internet and use it to categorize each location.


def get_continent(row):
	asian_countries = ['afghanistan', ..., 'vietnam', 'yemen']
	european_countries = ['albania', ..., 'united kingdom']
	north_american_countries = ['antigua and barbuda', ..., 'united states']
	south_american_countries = ['argentina', ..., 'venezuela']
	african_countries = ['algeria', ..., 'zimbabwe']
	australian_countries = ['australia', ..., 'vanuatu']

	if row.country in asian_countries:
		return 'asia'
	elif row.country in european_countries:
		return 'europe'
	elif row.country in north_american_countries:
		return 'north america'
	elif row.country in south_american_countries:
		return 'south america'
	elif row.country in african_countries:
		return 'africa'
	elif row.country in australian_countries:
		return 'australia'
	print(f'{row.country} \t not present!')
	return 'antarctica'

df['continent'] = df.apply(get_continent, axis='columns', result_type='expand')
			

NOTE: The above code snippet shows a shortened list of countries for the purposes of this article. Please refer to the notebook link to view the entire code.

Finally, we convert the 'month' and 'year' columns to integers.


df = df.astype({'year': int, 'month': int})
			

Hurray! We've successfully completed the pre-processing stage as well.

In conclusion, scraping data from Reddit using PRAW and pre-processing it with NLP techniques such as NER, together with geocoding via GeoPy, can provide valuable insights into the kinds of places people visit. The techniques we discussed can be applied to a wide range of topics, from understanding the spread of news and information to tracking the popularity of products and services.

Alright, with all that hard work we put into scraping and pre-processing the data, I had to give you something cool to play with. I present to you an interactive 3D scatter plot built from the same data we just processed! You can rotate the globe and see the data in action. Check out the code for it in the notebook mentioned at the beginning; a rough sketch follows below.
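
The notebook has the full plot, but here is a minimal sketch of how such a globe could be drawn from the `latitude`, `longitude`, `location`, and `continent` columns we built. The library choice here (Plotly Express) is my assumption, not necessarily what the notebook uses.

import plotly.express as px

# plot each post's location on an interactive, rotatable globe
fig = px.scatter_geo(df,
                     lat='latitude',
                     lon='longitude',
                     color='continent',
                     hover_name='location',
                     projection='orthographic')
fig.show()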

More Sources