Data acquisition, web scraping, and the KDD process: a practical study with COVID-19 data in Brazil

Are you interested in data science, and are you aware of the situation of COVID-19 in Brazil? In this post, I will explore the concept of data acquisition and demonstrate how to use a simple web scraping technique to obtain official data on the progress of the disease. With some adaptation of the code, the strategy used here can be extended to data from other countries or even to other applications. So put on your mask, wash your hands, and start reading right now!


I must start this text by offering my sincerest condolences to all victims of COVID-19. It is in their memory that we must seek motivation for the development of science, without forgetting that science exists only because there is a humanity that sustains it. As for data science, it is through it that we can reduce fake news and contribute to a more informed, sensible, and thoughtful society.

I started this data analysis project when I realized, a few weeks ago, that there was little stratified information on the number of COVID-19 cases in Brazil in official sources, such as that of the World Health Organization (WHO) or the panel maintained by Johns Hopkins University. By stratified, we mean a set of data that can be observed in different layers. For a country of continental dimensions, such as Brazil, knowing the total number of cases in the country is merely informative. For such information to be useful, it must be stratified by region, state, or city.

To provide an overview of the problem and to direct the development of the analysis, I have set out below some objectives for this study, which I hope the reader will adopt as well:

  • Acquire an overview of the KDD process and understand where data acquisition fits into it.
  • Explore data acquisition techniques so that they can easily be applied to similar problems.
  • Obtain, from official databases, the number of cases and deaths by COVID-19 in Brazil, stratified by state.

In the following sections, we will explore each of these items to build the necessary concepts and implement them in a Python development environment. Ready?

KDD – Knowledge Discovery in Databases

Knowledge discovery in databases is the process by which someone seeks comprehensible, valid, and relevant patterns in databases. It is an interactive and iterative process, which comprises many of the techniques and methods that are known today as data science.

Before exploring the main stages of the KDD process, however, it is worth differentiating three key terms: data, information, and knowledge. Each step of KDD is concerned with processing one of them, as the diagram in Fig. 1 illustrates using the problem we are working on.

Figure 1. From data to knowledge: information processing in KDD.

In short, data are all chains of symbols, structured or not, related to some phenomenon or event. Data, by themselves, have no meaning. For example, the number 4.08 million is just a cardinal number. It becomes scary only after we assign it the sense that this is the number of confirmed cases of COVID-19. In other words, we have information only when we assign meaning to data.

It is from the manipulation of information that we acquire knowledge, which is closely related to what we call “cognitive processes” — perception, contextualization, and learning — and, therefore, intelligence.

Once these three concepts are understood, we can now proceed to the three fundamental phases of the KDD process. These are nothing more than sets of tasks respectively dedicated to obtaining and exploring data, information, and knowledge, as illustrated in Fig. 2, below:

Figure 2. The KDD process and its phases of (i) data acquisition, (ii) information processing, and (iii) knowledge extraction.

The KDD process consists of steps, which in turn are composed of tasks. Data mining tasks, for example, are part of the information processing step. In contrast, tasks related to data visualization, analytics, and decision making are part of the last step, which aims at knowledge extraction.

In this article, we will focus only on the data acquisition task, fundamental to any KDD project. In general, there is no predefined method for obtaining data. Medium- and long-term projects usually demand a data model, as well as a robust architecture for data collection and storage. Short-term projects, in turn, mostly require immediately available and accessible sources.

With these fundamental concepts in mind, we are ready to move towards our purpose: the acquisition of official data on the number of COVID-19 cases in Brazil.

Defining a data acquisition strategy

As we have seen, a data acquisition strategy only makes sense if it is built from the perspective of a complete KDD process. That is, before we start collecting data out there, we need a clearly defined purpose. What is the problem that we are trying to solve or elucidate?

The CRISP-DM reference model, shown in Fig. 3, can be used as a compass when going through a KDD process. Do you remember that we previously defined KDD as an interactive and iterative process?

Interaction means mutual or shared action between two or more agents. It can refer either to the people holding data and information or to the stages of the process. The short arrows in both directions between the first steps of Fig. 3 represent this necessary interaction. I dare say it is the biggest challenge for newcomers to the world of KDD or data science: there is an initial illusion that these steps follow a continuous flow, which results in a feeling of doing something wrong whenever we have to go back. On the contrary: by understanding the business, we become able to understand the data, which gives us new insights about the business, and so on.

Iteration, in turn, is associated with the repetition of the same process. The larger arrows in Fig. 3 form a circle that symbolizes this repetition. It starts at the first step and, after development, returns to it. A KDD project will never be stagnant.

Figure 3. Phases of the KDD process according to the CRISP-DM reference model. (Adapted from Wikimedia Commons.)

The highlighted boxes in Fig. 3 represent the steps related to the data acquisition process. According to CRISP-DM, the first step (understanding the business) is fundamental to the success of the entire project and comprises (i) the definition of the business objectives; (ii) the assessment of the current situation, including inventories and risk analyses, as well as what data and information are already available; (iii) the determination of the data mining objectives, along with the definition of the success criteria; and (iv) the elaboration of a project plan.

The second step (understanding the data) comprises (i) collecting the first data samples; (ii) describing them in detail; (iii) the initial exploration of the data; and (iv) an assessment of the quality of that data. Remember that these two initial steps take place interactively and, ideally, involve domain and business specialists.

The third step begins with the validation of the project plan by the sponsor and is characterized by the preparation of the data. In general, work in this phase is more costly, so changes made here should be less drastic. The main tasks of this step include the selection and cleaning of data, the construction and adaptation of the format, and the merging of data. All of these tasks are commonly referred to as data pre-processing.

Your data acquisition strategy will start to take shape as soon as you have the first interactions between understanding the business (or the problem) and understanding the available data. For that, I consider essentially two possible scenarios:

  1. There is control over the data source: you can somehow manage and control the sources that generate the data (sensors, meters, people, databases, etc.). Take, for example, the activity of a head nurse: he can supervise, audit, establish protocols, and record all patient data in his ward. It is important to note that the control is over the origin or registration of the data, and not over the event to which the data is related.
    This scenario usually happens in projects of high complexity or duration, where a data model is established, and there are means to ensure its consistency and integrity.
  2. There is no control over the data source: this is the most common situation in projects of short duration or of occasional interest, as well as in long-running projects, analyses, or studies that use or demand data from sources other than their own. Whatever the initiative, the data acquisition strategy must be more robust, given that any change beyond your control can increase the cost of the analysis or even render the project unfeasible.
    Our own study illustrates this scenario: how to obtain the stratified number of COVID-19 cases in Brazil if we have no control over the production and dissemination of these data?

Expect to find the first scenario in corporations and institutions, or when leading longitudinal research or manufacturing processes. Otherwise, you will likely have to deal with the second one. In this case, your strategy may consist of subscriptions, data assignment agreements, or even techniques that search for and compile such data from the Internet, provided that the data are properly licensed. Web scraping is one of these techniques, as we will see next.

Data acquisition via web scraping: obtaining data with the official panel of the Brazilian Ministry of Health

With the objective established (the stratified number of COVID-19 cases in Brazil), my first action was to look for possible sources. I consulted the Brazilian Ministry of Health web portal in mid-April 2020. At the time, I did not find consolidated and easily accessible information, so I increased the granularity of possible sources: state and municipal health departments. The challenge, in this case, was the heterogeneity in the way they presented the data. A public health worker I consulted informed me that the data were consolidated in the national bulletins and that specific demands should be requested institutionally. My first strategy, then, was to devise a way to automatically collect such bulletins (in PDF format) and start extracting the desired data; part of the result is available in this GitHub repository.

At the beginning of May 2020, though, I had an unpleasant surprise: the tables I had used to extract the data (for those who are curious, the PDFplumber library is an excellent tool) were replaced by figures, in the news bulletins, which made my extraction method unfeasible.

I made a point of going through the above paragraphs in detail for a simple reason: to demonstrate, once again, the interactive process inherent to the data acquisition step in KDD. Besides, I wanted to highlight the risks and uncertainties we face when we take on a project in which we have no control over the data source. In these situations, it is always good to have alternatives, establish partnerships or agreements, and seek to build your own dataset.

While I was looking for a new strategy, the Brazilian Ministry of Health started making the data available in a single CSV file. Days later, it changed the format to XLSX and began changing the file name every day. In the following paragraphs, I will detail how I adapted my process and code to this new situation.

By the way, the automatic retrieval of data and information from pages and content made available on the Internet is called web scraping. This technique can be implemented effortlessly through Python libraries, as shown in Fig. 4.

Figure 4. The steps of a simplified web scraping process using the Google Chrome inspection tool and the Python language.

We finally come to the practical, and perhaps the most anticipated, part of this text. Following the steps defined in Fig. 4, we start by (1) visiting the Coronavirus panel and (2) identifying that our interest is in the data made available by clicking on the “Arquivo CSV” button in the upper right corner of the page. When we activate the Google Chrome page inspection tool (3), we find the code corresponding to the item on the HTML page, as shown in Fig. 5. Remember that the inspection tool is activated through the Inspect menu, which appears when we right-click on any element of the page, or with the shortcut Ctrl+Shift+I.

Figure 5. Coronavirus portal from Brazilian Ministry of Health seen through the Google Chrome Inspection tool.

The next step (4) is to access this data from our Python code, which can run in a Jupyter notebook, in a script, or in an IDE. In any case, the idea is to emulate the access to the desired portal and the interaction with the web page by using the Selenium library.

In the lines below, I start a virtual display and open the portal URL in a Chrome instance driven by Selenium:

# Initial statements
from pyvirtualdisplay import Display
from selenium import webdriver

# Parameters definition
url = ''  # address of the panel (blank in the source; fill it in before running)
## The line below can be omitted in some environments:
chromeDriverPath = '~/anaconda3/envs/analytics3/'

# Starts the virtual display:
display = Display(visible=0, size=(800,600))
display.start()

# Opens the Chrome emulator in the desired URL:
driver = webdriver.Chrome()
driver.get(url)

# Reads and gets the page content encoded in UTF-8:
page = driver.page_source.encode('utf-8')

Once we read the page, we proceed to step (5) of Fig. 4, where we iterate until we are sure that the data is in the desired form. To do so, we can start by checking the size of the loaded page (if it is empty, something went wrong) and its first bytes:

# What is the length of the loaded page?
print(len(page))

# What is the data type?
print(type(page))

# What is the content of the first positions of the byte stream?
print(page[:200])
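
The same checks can be wrapped in a small helper, so that they run on every execution instead of being inspected by hand. This is only a sketch: the function name check_page, the 200-character preview, and the sample page are my own illustrative choices, and the sample stands in for the bytes returned by driver.page_source.encode('utf-8').

```python
def check_page(page: bytes, preview: int = 200) -> dict:
    """Summarize a downloaded page so that obvious failures stand out."""
    if not isinstance(page, (bytes, bytearray)):
        raise TypeError("expected the UTF-8 encoded page as bytes")
    text = page.decode("utf-8", errors="replace")
    return {
        "length": len(page),                    # size of the byte stream
        "is_empty": len(page) == 0,             # an empty page means the load failed
        "looks_like_html": "<html" in text[:1000].lower(),
        "preview": text[:preview],              # first characters, for eyeballing
    }


# Hypothetical stand-in for the page read through Selenium:
sample_page = "<!DOCTYPE html><html><body><app-root></app-root></body></html>".encode("utf-8")
report = check_page(sample_page)
print(report["length"], report["is_empty"], report["looks_like_html"])
```

If is_empty is true or looks_like_html is false, it is probably worth reloading the page before trying to interact with it.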

Most of the time, the next step would be to explore the HTML content using tools known as parsers (the BeautifulSoup library is an excellent parser for Python). In our case, since the data we are interested in are not on the web page but result from an action on this page, we will continue using only the Selenium methods to emulate the click on the button, automatically downloading the desired file to the default system folder:

from selenium.webdriver.common.by import By

## Path obtained from inspection in Chrome:
xpathElement = '/html/body/app-root/ion-app/ion-router-outlet/app-home/ion-content/div[1]/div[2]/ion-button'

## Element corresponding to the "Arquivo CSV" button
## (find_element_by_xpath was removed in recent Selenium releases):
dataDownloader = driver.find_element(By.XPATH, xpathElement)

## Download the file to the default system folder:
dataDownloader.click()
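
As an aside, when the content of interest does sit in the HTML itself, a parser is enough and no click needs to be emulated. The sketch below uses html.parser from the Python standard library as a stand-in for BeautifulSoup, only to keep the example free of external dependencies. The ButtonFinder class and the sample HTML are made up for illustration and do not reproduce the real page.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Collect the visible text of every <button> or <ion-button> element."""

    BUTTON_TAGS = ("button", "ion-button")

    def __init__(self):
        super().__init__()
        self._inside = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BUTTON_TAGS:
            self._inside = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag in self.BUTTON_TAGS:
            self._inside = False

    def handle_data(self, data):
        # Text found between the opening and closing tags of a button
        if self._inside:
            self.labels[-1] += data


# Made-up fragment, loosely shaped like the structure seen in the inspector:
sample_html = """
<ion-content>
  <div><ion-button> Arquivo CSV </ion-button></div>
  <div><button>Sobre</button></div>
</ion-content>
"""

finder = ButtonFinder()
finder.feed(sample_html)
labels = [label.strip() for label in finder.labels]
print(labels)
```

A parser like this only sees what is already in the page source, which is why it does not help with a file that is generated when the button is clicked.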

The next step is to verify that the file was downloaded correctly. Since its name is not standardized, we list the XLSX files in the download folder with the glob library and pick the most recent one:

import os
import glob

## Getting the last XLSX file name:
list_of_files = glob.glob('/home/tbnsilveira/Downloads/*.xlsx') 
latest_file = max(list_of_files, key=os.path.getctime)
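
One caveat with the lines above is that max() raises an error when the list is empty, which happens if the download has not finished yet. A slightly more defensive variant could look like the sketch below. The function name latest_download, the timeout values, and the use of the modification time instead of os.path.getctime are my own assumptions, and the temporary folder only simulates a download directory.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def latest_download(folder: Path, pattern: str = "*.xlsx",
                    timeout: float = 30.0, poll: float = 0.5) -> Optional[Path]:
    """Return the newest file matching `pattern`, or None if none shows up in time."""
    deadline = time.monotonic() + timeout
    while True:
        # Chrome keeps unfinished downloads as *.crdownload, which do not match *.xlsx
        candidates = [p for p in folder.glob(pattern) if p.is_file()]
        if candidates:
            return max(candidates, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Simulated download folder with two files and controlled timestamps:
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name, stamp in [("older.xlsx", 1_000_000_000), ("newer.xlsx", 1_000_000_100)]:
        path = folder / name
        path.write_bytes(b"")
        os.utime(path, (stamp, stamp))  # set access and modification times
    found = latest_download(folder, timeout=1.0)
    found_name = found.name if found else None
    missing = latest_download(folder, pattern="*.csv", timeout=0.0)

print(found_name, missing)
```

In the real workflow, the call would point to the actual download folder, for example latest_download(Path.home() / "Downloads").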

From this point on, the web scraping task is completed, at the same time that we enter the data preparation phase (if necessary, review Fig. 3).

Let’s consider that we are interested in the number of cases per state relative to its population, as well as in the lethality rate (number of deaths over the number of cases) and the mortality rate (number of deaths over the number of people). The lines of code below perform the pre-processing tasks needed to produce the desired information.

## Reading the data
import pandas as pd
covidData = pd.read_excel(latest_file)

## Getting the last registered date
## (reconstructed: the source line was truncated; 'data' is the date column):
lastDay = covidData['data'].max()

## Selecting the dataframe with only the last day 
## and whose data is consolidated by state
covidLastDay = covidData[(covidData['data'] == lastDay) &
                         covidData.estado.notna() &
                         covidData.municipio.isna() &
                         covidData.populacaoTCU2019.notna()]

## Selecting only the columns of interest:
covidLastDay = covidLastDay[['regiao','estado','data','populacaoTCU2019','casosAcumulado','obitosAcumulado']]
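
To make the logic of each condition explicit, the same selection can be sketched with the standard library only. The column names are the ones used above, but the rows are made up (they are not real figures), and I am assuming that rows with no state are national totals and rows with a municipality are city-level records.

```python
import csv
import io

# Made-up sample in the column layout used above (values are NOT real figures)
sample_csv = """regiao,estado,municipio,data,populacaoTCU2019,casosAcumulado,obitosAcumulado
Brasil,,,2020-05-21,1000,90,9
Brasil,,,2020-05-22,1000,100,10
Norte,AM,,2020-05-21,400,50,5
Norte,AM,,2020-05-22,400,60,6
Norte,AM,Manaus,2020-05-22,200,40,4
Sudeste,SP,,2020-05-22,600,40,4
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# ISO-formatted dates sort correctly as plain text
last_day = max(row["data"] for row in rows)

state_rows = [
    row for row in rows
    if row["data"] == last_day       # keep only the last registered day
    and row["estado"]                # drop rows with no state (assumed national totals)
    and not row["municipio"]         # drop city-level rows
    and row["populacaoTCU2019"]      # drop rows without a population figure
]

print([(row["estado"], row["casosAcumulado"]) for row in state_rows])
```

Only one row per state survives for the last day, which is the shape the rate calculations need.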

The pre-processing step is completed by generating some additional features, from the pre-existing data:

## Copying the dataframe before handling it:
normalCovid = covidLastDay.copy()

## Contamination rate (% of cases by the population of each state)
normalCovid['contamRate'] = (normalCovid['casosAcumulado'] / normalCovid['populacaoTCU2019']) * 100
## Fatality rate (% of deaths by the number of cases)
normalCovid['lethality_pct'] = (normalCovid['obitosAcumulado'] / normalCovid['casosAcumulado']) * 100
## Mortality rate (% of deaths by the population in each state)
normalCovid['deathRate'] = (normalCovid['obitosAcumulado'] / normalCovid['populacaoTCU2019']) * 100
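
A quick worked example helps to check these formulas by hand. The numbers below are hypothetical and chosen only to keep the arithmetic simple: 20,000 cases and 1,000 deaths in a population of 4,000,000 give a contamination rate of 0.5%, a lethality of 5%, and a mortality of 0.025%. The function name covid_rates is my own. The keys mirror the column names used above.

```python
import math


def covid_rates(cases: int, deaths: int, population: int) -> dict:
    """Compute the three percentages used in the article for a single state."""
    if cases <= 0 or population <= 0:
        raise ValueError("cases and population must be positive")
    return {
        "contamRate": cases / population * 100,     # % of the population infected
        "lethality_pct": deaths / cases * 100,      # % of cases that ended in death
        "deathRate": deaths / population * 100,     # % of the population that died
    }


# Hypothetical figures, for illustration only
rates = covid_rates(cases=20_000, deaths=1_000, population=4_000_000)
print({name: round(value, 4) for name, value in rates.items()})

# Sanity check: mortality equals contamination times lethality (as fractions)
consistent = math.isclose(
    rates["deathRate"],
    rates["contamRate"] * rates["lethality_pct"] / 100,
)
print(consistent)
```

The last check is a useful guard: if the three columns do not satisfy this relation, one of them was probably computed from the wrong denominator.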

From this point, we can query the pre-processed dataset:

normalCovid.query("obitosAcumulado > 1000")

Here we end the data acquisition phase. If we follow the CRISP-DM methodology, the next steps could be the construction of models, analyses, or visualizations. To illustrate a possible conclusion of this KDD process, Fig. 6 presents a four-quadrant graph relating the percentage of cases to the lethality rate in the different states of Brazil (the code to generate it is available on GitHub).

Figure 6. Relationship between the number of COVID-19 cases and lethality, by states in Brazil, according to data released by the Ministry of Health on 05/22/2020.

The graph above is the result of the entire process we have covered so far: the data was obtained via web scraping, the information was generated from the manipulation of that data, and the knowledge was acquired by structuring the information. Amazonas (AM) is the state with the highest number of infections per inhabitant, as well as the highest lethality rate. São Paulo (SP) is the state with the highest cumulative number of deaths, but with a relatively low number of infections considering its entire population.
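
For readers who want to reproduce the idea of a four-quadrant reading without the plot, here is a minimal sketch. I am assuming that the quadrants are split at the median of each axis, which may differ from the thresholds used in the actual figure. The state labels and values are hypothetical.

```python
from statistics import median

# Hypothetical (cases %, lethality %) pairs, for illustration only
states = {
    "A": (0.10, 2.0),
    "B": (0.50, 6.0),
    "C": (0.40, 3.0),
    "D": (0.05, 7.0),
}

# Assumed thresholds: the median of each axis
cases_cut = median(values[0] for values in states.values())
lethality_cut = median(values[1] for values in states.values())


def quadrant(cases_pct: float, lethality_pct: float) -> str:
    """Name the quadrant of a point relative to the two median lines."""
    horizontal = "high cases" if cases_pct > cases_cut else "low cases"
    vertical = "high lethality" if lethality_pct > lethality_cut else "low lethality"
    return f"{horizontal}, {vertical}"


quadrants = {name: quadrant(*values) for name, values in states.items()}
for name, label in quadrants.items():
    print(name, "->", label)
```

States that fall in the "high cases, high lethality" quadrant are the ones that would deserve the closest attention in a follow-up analysis.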

Final considerations

In this article, I tried to offer an overview of the KDD process and of how to implement data acquisition using a web scraping technique, especially in the case of small projects or analyses.

You may have wondered if it wouldn’t be much easier just to visit the web portal and simply download the data available there. Yes, it would! However, a data science project expects data acquisition to be as automatic as possible, leaving room to move forward with predictive analyses whose models require daily or constant updating. Also, understanding complex topics and gaining confidence in our technique becomes much more effective, and perhaps more fun, with relatively straightforward problems.

Finally, I must remind you to always act ethically in the acquisition and manipulation of data. Automating the collection of data in the public domain, at no cost to its administration, is characterized as a completely legal activity. However, if you are in doubt as to when the use of such algorithms is allowed or not, do not hesitate to consult experts or even the data owners.

If you have followed the article so far, I would love to hear your opinion. In addition to the poll below, feel free to leave a comment or write to me with any questions, criticisms or suggestions.
