Search for rental properties with Streamlit — pt 2

Conrado Bio
4 min read · Jan 12, 2023


Photo by Naomi Hébert on Unsplash

Hi everyone, I’m back with part 2 of this web application using Streamlit.

In part 1, I wrote about the project structure and the key points of this project.

Now, in this second part, let's dive deeper into the code.

I will focus on the ET part of the pipeline: extraction and transformation.

I forgot to share the app link in the last post. The app is currently hosted on Streamlit Cloud, but by the end of this series of articles, it will be on EC2 or DigitalOcean.

The link to the Streamlit app is below; I hope you enjoy it.

Streamlit app

Extract

I used Beautiful Soup and Selenium to extract data from the Vivareal website.

I want to collect the following information from each property:

description, address, area, number of bedrooms, number of bathrooms, number of parking spaces, rent price, condo fee, and amenities.

I won't go into the details of each part of this code, as you will find much better content on the internet.

This is part of the script that extracts elements from the website.

Extracting elements

The script collects each element for the properties, saves them to a list of dictionaries, and converts that list to a Pandas DataFrame, which is saved to CSV.

To track the extractions and enable future analysis, I added a column with the extraction date to this DataFrame.
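The parsing step could look roughly like the sketch below. The CSS class names and the column names are illustrative assumptions on my part (Vivareal's real markup uses its own classes, and in the actual pipeline the HTML comes from a Selenium-driven browser); the shape of the flow — parse cards, build a list of dictionaries, convert to a DataFrame with an extraction-date column, save to CSV — follows what the article describes.

```python
from datetime import date

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML with made-up class names; the real page is fetched with
# Selenium and has Vivareal's own markup.
sample_html = """
<div class="property-card">
  <span class="description">Apartamento para alugar</span>
  <span class="address">Avenida Joao Scarparo Netto, 240 - Loteamento Center Santa Genebra, Campinas - SP</span>
  <span class="area">70</span>
  <span class="bedrooms">2</span>
  <span class="bathrooms">1</span>
  <span class="parking">1</span>
  <span class="price">R$ 2.500</span>
  <span class="condo">R$ 450</span>
</div>
"""

def parse_properties(html: str) -> pd.DataFrame:
    """Parse property cards into a DataFrame, one row per listing."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.property-card"):
        def text(cls: str) -> str:
            el = card.select_one(f"span.{cls}")
            return el.get_text(strip=True) if el else r"\N"  # \N marks missing values
        rows.append({
            "Descricao": text("description"),
            "Endereco": text("address"),
            "Area": text("area"),
            "Quartos": text("bedrooms"),
            "Banheiros": text("bathrooms"),
            "Vagas": text("parking"),
            "Aluguel": text("price"),
            "Condominio": text("condo"),
        })
    df = pd.DataFrame(rows)
    # Extraction date, so later analyses can compare snapshots over time.
    df["data_extracao"] = date.today().isoformat()
    return df

df = parse_properties(sample_html)
df.to_csv("properties.csv", index=False)
```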

Okay, now we have the data extracted from the website.

The second part is transforming the data.

Transform

Sample of extracted data:

Data after the extraction.

There is a lot of work to be done to transform this data.

As you can see, several columns contain elements like ‘\N’ and ‘R$’, and we can split ‘Endereço’ (address) into street, neighborhood, and city.

So let’s start.

To clean the rent price, I had to remove the ‘R$’ prefix, the thousands separator (‘.’), and other stray text.
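A minimal sketch of that cleanup with pandas string methods, assuming raw values like “R$ 2.500” (the “/mês” suffix in the sample is my own illustrative guess at the “other text”):

```python
import pandas as pd

# Raw prices as scraped: currency symbol, thousands separator, stray text,
# and the '\N' placeholder for missing values.
raw = pd.Series(["R$ 2.500", "R$ 1.200/mês", r"\N"])

clean = (
    raw.str.replace(r"R\$\s*", "", regex=True)   # drop the currency symbol
       .str.replace(".", "", regex=False)        # drop the thousands separator
       .str.extract(r"(\d+)", expand=False)      # keep only the leading digits
)
clean = pd.to_numeric(clean, errors="coerce")    # '\N' and friends become NaN
```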

Another column to transform was the address column (‘Endereco’).

This column is a little trickier to clean.

The addresses are like this:

‘Avenida João Scarparo Netto, 240 — Loteamento Center Santa Genebra, Campinas — SP’

But some addresses contain only the neighborhood, and others only the city name, so handling these exceptions in this field was a bit complex.

I want a view per neighborhood and per city, so I used regular expressions to handle it.

The result is two new columns with the neighborhood (‘Bairro’) and the city (‘Cidade’).
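One way to sketch that split, assuming addresses follow the “Street, number - Neighborhood, City - STATE” pattern from the example, with the street/number and neighborhood parts optional (the exact regex in the project may differ):

```python
import re

def split_address(endereco: str):
    """Return (bairro, cidade) from a Vivareal-style address string.

    Handles three shapes seen in the data:
      "Street, 240 - Neighborhood, City - SP"
      "Neighborhood, City - SP"
      "City - SP"
    """
    # Drop the trailing two-letter state code ("... - SP") if present.
    endereco = re.sub(r"\s*-\s*[A-Z]{2}$", "", endereco)
    # The last " - "-separated segment holds "Neighborhood, City" or just "City".
    tail = [p.strip() for p in endereco.split(" - ")][-1]
    if "," in tail:
        bairro, cidade = (p.strip() for p in tail.rsplit(",", 1))
    else:
        bairro, cidade = None, tail  # only the city is available
    return bairro, cidade
```

Applied with `df.apply`, this yields the two new ‘Bairro’ and ‘Cidade’ columns.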

In the transformation part, I also want to do some feature engineering.

Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.

So I want to know the price per area (an important metric for judging whether an apartment is well priced for its size), the total rent price (rent price plus condo fee), and to categorize the price into 4 clusters (Economic, Medium, Medium High, High).

This clustering is very important for better analysis of the properties.

DISCLAIMER: it would be possible to build these clusters with a machine learning algorithm such as k-means, and doing so would be better for analysis, but the focus of this project is deploying a useful application with Streamlit.
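Those three features can be sketched as below on already-cleaned numeric columns. The article doesn't say how the four price bands are cut, so I assume simple quartiles via `pd.qcut` as a stand-in; the sample values are made up:

```python
import pandas as pd

# Made-up, already-cleaned listings to illustrate the feature engineering.
df = pd.DataFrame({
    "Aluguel":    [1200, 2500, 3800, 7000],   # rent (R$)
    "Condominio": [300,  450,  600,  1200],   # condo fee (R$)
    "Area":       [45,   70,   90,   160],    # m2
})

# Total rent = rent price plus condo fee.
df["Aluguel_Total"] = df["Aluguel"] + df["Condominio"]
# Price per area, to judge whether a listing is well priced for its size.
df["Preco_m2"] = (df["Aluguel_Total"] / df["Area"]).round(2)
# Four price bands; quartile bins here are an assumption, not the
# article's exact rule.
df["Categoria"] = pd.qcut(
    df["Aluguel_Total"], q=4,
    labels=["Economic", "Medium", "Medium High", "High"],
)
```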

Ok, after this feature engineering process, our dataset is almost ready to be loaded into MongoDB.

But there is one more important transformation to make.

We need to find the latitude and longitude of the address.

This article is getting pretty long, so I'll write about that transformation next week.

That’s all folks, if you liked it, please share this article.

Thank you.
