EDA on Google Play Store Apps

Google Play Store apps and reviews

Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We’ll look for insights in the data to devise strategies to drive growth and retention.

Let’s take a look at the data, which consists of two files:

  • apps.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
  • user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

Load the Dataset

Drop Duplicate values from Datasets

Count total number of apps
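
The three steps above (load, drop duplicates, count) can be sketched with pandas as follows. This is only an illustration: the inline rows are made-up stand-ins for apps.csv and use a small subset of columns, and the variable names are assumptions. With the real file, the read call would point at apps.csv instead.

```python
import io

import pandas as pd

# Made-up rows standing in for apps.csv (the real file has 13 columns).
SAMPLE_CSV = """App,Category,Rating,Installs,Price
Photo Editor,ART_AND_DESIGN,4.1,"10,000+",0
Photo Editor,ART_AND_DESIGN,4.1,"10,000+",0
Coloring Book,ART_AND_DESIGN,3.9,"500,000+",0
Pro Calculator,TOOLS,4.6,"1,000+",$2.99
"""

# Load the dataset. With the real data: pd.read_csv("apps.csv")
apps_with_duplicates = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Drop rows that exactly repeat an earlier row.
apps = apps_with_duplicates.drop_duplicates()

# Count the apps that remain.
print("Total number of apps:", len(apps))
```

Here `drop_duplicates()` compares all columns. Passing `subset=["App"]` would instead treat rows with the same app name as duplicates, which may or may not be what the analysis needs.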

Data cleaning

Data cleaning is one of the most essential subtasks of any data science project. Although it can be a very tedious process, its worth should never be underestimated.

By looking at a random sample of the dataset rows (from the above task), we observe that some entries in the columns like Installs and Price have a few special characters (+ , $) due to the way the numbers have been represented.

This prevents the columns from being purely numeric, making it difficult to use them in subsequent mathematical calculations. Ideally, as their names suggest, we would want these columns to contain only digits from [0–9].

Hence, we now proceed to clean our data. Specifically, the special characters , and + present in the Installs column and $ present in the Price column need to be removed.
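
One way to strip these characters is a plain string replacement on each column. The sketch below uses a tiny made-up frame in place of the real data; the helper names `chars_to_remove` and `cols_to_clean` are just illustrative.

```python
import pandas as pd

# Made-up values in the same format as the Installs and Price columns.
apps = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "1,000+"],
    "Price": ["0", "0", "$2.99"],
})

chars_to_remove = ["+", ",", "$"]
cols_to_clean = ["Installs", "Price"]

for col in cols_to_clean:
    for char in chars_to_remove:
        # regex=False makes "+" and "$" literal characters, not regex syntax.
        apps[col] = apps[col].str.replace(char, "", regex=False)

print(apps)
```

After this step the values look numeric but are still stored as strings.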

Correcting data types

From the previous task we noticed that Installs and Price were categorized as the object data type (and not int or float, as we would like). This is because these two columns originally had mixed input types: digits and special characters. The pandas documentation has more on data types.

The four features that we will be working with most frequently henceforth are Installs, Size, Rating and Price. While Size and Rating are both float64 (i.e. purely numerical data types), we still need to work on Installs and Price to make them numeric.
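
Once the special characters are gone, the conversion itself is short. A minimal sketch, assuming the cleaned values shown here (they are made up for illustration):

```python
import pandas as pd

# Cleaned but still string-typed values, as left by the previous step.
apps = pd.DataFrame({
    "Installs": ["10000", "500000", "1000"],
    "Price": ["0", "0", "2.99"],
})

# Convert the string columns to numeric types.
apps["Installs"] = pd.to_numeric(apps["Installs"])
apps["Price"] = pd.to_numeric(apps["Price"])

print(apps.dtypes)
```

`pd.to_numeric` picks an integer type for Installs and a float type for Price here. On real data, passing `errors="coerce"` would turn any leftover unparseable value into NaN instead of raising an error.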

Exploring app categories

With more than 1 billion active users in 190 countries around the world, Google Play continues to be an important distribution platform to build a global audience. For businesses to get their apps in front of users, it’s important to make them more quickly and easily discoverable on Google Play. To improve the overall search experience, Google has introduced the concept of grouping apps into categories.

This brings us to the following questions:

  • Which category has the highest share of (active) apps in the market?
  • Is any specific category dominating the market?
  • Which categories have the fewest apps?

We will see that there are 33 unique app categories present in our dataset. Family and Game apps have the highest market prevalence. Interestingly, Tools, Business and Medical apps are also at the top.
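
These questions come down to counting rows per category. A small sketch on made-up category labels (the real dataset has 33 categories, the toy frame below has three):

```python
import pandas as pd

# Made-up category labels for illustration only.
apps = pd.DataFrame({
    "Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"],
})

# Number of distinct categories.
num_categories = apps["Category"].nunique()

# Apps per category, sorted from most to fewest.
num_apps_in_category = apps["Category"].value_counts()

# The same counts expressed as a share of all apps.
category_share = apps["Category"].value_counts(normalize=True)

print("Number of categories:", num_categories)
print(num_apps_in_category)
```

The top of `num_apps_in_category` answers the first two questions and its tail answers the third.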

Distribution of app ratings

After having witnessed the market share for each category of apps, let’s see how all these apps perform on average. App ratings (on a scale of 1 to 5) impact the discoverability and conversion of apps as well as the company’s overall brand image. Ratings are a key performance indicator of an app.

From our research, we found that the average rating across all app categories is 4.17. The histogram plot is skewed to the left, indicating that the majority of the apps are highly rated with only a few exceptions in the low-rated apps.
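
The average and the direction of the skew can both be checked numerically. The ratings below are made up to mimic the shape described (mostly high, a few low), so the numbers will not match the 4.17 found on the real data:

```python
import pandas as pd

# Made-up ratings: mostly high, with a couple of low outliers.
ratings = pd.Series([4.5, 4.3, 4.7, 4.4, 4.6, 4.2, 3.0, 1.5], name="Rating")

# Mean rating (missing values, if any, are skipped by default).
avg_app_rating = ratings.mean()

# Sample skewness: a negative value means a longer tail on the left.
skewness = ratings.skew()

print("Average app rating =", round(avg_app_rating, 2))
print("Skewness =", round(skewness, 2))
```

A negative skewness is the numeric counterpart of a left-skewed histogram: most of the mass sits at high ratings and the tail stretches toward low ones.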

Size and price of an app

Let’s now examine app size and app price. For size, if the mobile app is too large, it may be difficult and/or expensive for users to download. Lengthy download times could turn users off before they even experience your mobile app. Plus, each user’s device has a finite amount of disk space. For price, some users expect their apps to be free or inexpensive. These problems compound if the developing world is part of your target market, especially because of internet speeds, earning power and exchange rates.

How can we effectively come up with strategies to size and price our app?

  • Does the size of an app affect its rating?
  • Do users really care about system-heavy apps or do they prefer lightweight apps?
  • Does the price of an app affect its rating?
  • Do users always prefer free apps over paid apps?

We find that the majority of top rated apps (rating over 4) range from 2 MB to 20 MB. We also find that the vast majority of apps price themselves under $10.
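
Both observations can be reproduced with simple filters. The sketch below uses made-up rows and assumes Size is in MB and Price is already numeric; it is not the plotting code from the notebook.

```python
import pandas as pd

# Made-up rows; None marks a missing rating or size.
apps = pd.DataFrame({
    "Rating": [4.5, 4.2, 3.8, 4.7, None],
    "Size": [5.0, 18.0, 60.0, None, 12.0],   # assumed to be in MB
    "Price": [0.0, 0.99, 0.0, 4.99, 29.99],  # assumed to be in USD
})

# Keep only apps where both rating and size are known.
rated = apps.dropna(subset=["Rating", "Size"])

# Top rated apps: rating over 4.
top_rated = rated[rated["Rating"] > 4]
print("Size range of top rated apps (MB):",
      top_rated["Size"].min(), "to", top_rated["Size"].max())

# Fraction of all apps priced under $10.
share_under_10 = (apps["Price"] < 10).mean()
print("Share of apps under $10:", share_under_10)
```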

Price Vs Rating

Paid apps Vs Free Apps

Popularity of paid apps vs free apps

For apps in the Play Store today, there are five types of pricing strategies: free, freemium, paid, paymium, and subscription. Let’s focus on free and paid apps only. Some characteristics of free apps are:

  • Free to download.
  • Main source of income often comes from advertisements.
  • Often created by companies that have other products and the app serves as an extension of those products.
  • Can serve as a tool for customer retention, communication, and customer service.

Some characteristics of paid apps are:

  • Users are asked to pay once for the app to download and use it.
  • The user can’t really get a feel for the app before buying it.

Are paid apps installed as much as free apps? It turns out that paid apps have a relatively lower number of installs than free apps, though the difference is not as stark as I would have expected!
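
A quick numeric version of this comparison is to group installs by pricing type. The sketch assumes a Type column holding "Free" or "Paid" (an assumption, since the column list is not given above) and uses made-up install counts:

```python
import pandas as pd

# Made-up install counts; the Type column name is an assumption.
apps = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000, 10_000, 1_000],
})

# Median installs per pricing type. The median is less sensitive than the
# mean to a handful of apps with extremely high install counts.
median_installs = apps.groupby("Type")["Installs"].median()

print(median_installs)
```

Because install counts span several orders of magnitude, a box plot of this comparison is usually easier to read on a log scale.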


Sentiment analysis of user reviews

Mining user review data to determine how people feel about your product, brand, or service can be done using a technique called sentiment analysis. User reviews for apps can be analyzed to identify if the mood is positive, negative or neutral about that app. For example, positive words in an app review might include words such as ‘amazing’, ‘friendly’, ‘good’, ‘great’, and ‘love’. Negative words might be words like ‘malware’, ‘hate’, ‘problem’, ‘refund’, and ‘incompetent’.

By plotting sentiment polarity scores of user reviews for paid and free apps, we observe that free apps receive a lot of harsh comments, as indicated by the outliers on the negative y-axis. Reviews for paid apps appear never to be extremely negative. This may indicate something about app quality, i.e., paid apps being of higher quality than free apps on average. The median polarity score for paid apps is a little higher than for free apps, which is consistent with our previous observation.
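
Before plotting, the reviews have to be joined to the app details so each review carries its app's pricing type. A minimal sketch on made-up data; the column names App, Type and Sentiment_Polarity are assumptions about how the two files are laid out:

```python
import pandas as pd

# Made-up stand-ins for apps.csv and user_reviews.csv.
apps = pd.DataFrame({"App": ["A", "B"], "Type": ["Free", "Paid"]})
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B"],
    "Sentiment_Polarity": [0.5, -0.9, 0.2, 0.4, 0.3],
})

# Attach each review to its app, then drop reviews without a polarity score.
merged = apps.merge(reviews, on="App", how="inner")
merged = merged.dropna(subset=["Sentiment_Polarity"])

# Compare the two groups: median score and most negative review.
polarity = merged.groupby("Type")["Sentiment_Polarity"]
median_polarity = polarity.median()
min_polarity = polarity.min()

print(median_polarity)
print(min_polarity)
```

The `merged` frame is also what a box plot of Sentiment_Polarity against Type would be drawn from.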

In this notebook, we analyzed over ten thousand apps from the Google Play Store. We can use our findings to inform our decisions should we ever wish to create an app ourselves.

Plot the graph of sentiment analysis


Special thanks to DataCamp Engineering.

Thank you for taking the time to read this work.




Data analyst and data science enthusiast
