Example

For testing purposes, the Yelp Dataset was downloaded; it contains 6,990,280 reviews, 150,346 businesses, and 200,100 photos from 11 metropolitan areas. Apache Spark was used for data processing. The goal was to determine which business attributes influence the rating, on a scale of 1 to 5 stars, that customers give to a venue. As a first step, sample rows were displayed together with the column names of each table in the downloaded dataset.
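The first step described above can be sketched in PySpark as follows. This is a minimal sketch, not the code used in the study: it assumes the archive was unpacked into a local directory and that the files keep the names under which the Yelp Dataset is distributed. The helper names `dataset_paths` and `show_samples` and the `data/yelp` directory are hypothetical and introduced only for illustration.

```python
from pathlib import Path

# File names as distributed in the Yelp Dataset archive (assumed unchanged).
YELP_FILES = {
    "review": "yelp_academic_dataset_review.json",
    "business": "yelp_academic_dataset_business.json",
    "checkin": "yelp_academic_dataset_checkin.json",
    "user": "yelp_academic_dataset_user.json",
}


def dataset_paths(base_dir):
    """Map each table name to the path of its JSON file under base_dir."""
    base = Path(base_dir)
    return {name: str(base / filename) for name, filename in YELP_FILES.items()}


def show_samples(spark, base_dir, n=5):
    """Load each table, then print its column names and its first n rows."""
    frames = {}
    for name, path in dataset_paths(base_dir).items():
        # Each file holds one JSON object per line, which read.json expects.
        df = spark.read.json(path)
        print(name, df.columns)
        df.show(n, truncate=True)
        frames[name] = df
    return frames
```

Given a session created with `SparkSession.builder.appName("yelp").getOrCreate()`, calling `show_samples(spark, "data/yelp")` would print the column names and a few rows for the review, business, check-in, and user tables, and return the loaded DataFrames for later steps.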

Figure 1. Table with sample data on customer reviews.

Figure 2. Table with sample data regarding businesses.

Figure 3. Table with sample data regarding check-ins.

Figure 4. Table with sample data on users.

A Random Forest model was trained to identify relationships between business attributes and customer reviews expressed as 1–5 star ratings. The model was built with PySpark. Due to hardware limitations, only 5% of the data was used, which gave approximately 350,000 samples.

Figure 5. Code with parameters for training the Random Forest model.
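The sampling and training step can be approximated by the sketch below. It is not a reconstruction of the code in Figure 5: the choice of a regressor rather than a classifier, the values of `numTrees` and `maxDepth`, the seed, and the short attribute list are all assumptions made for illustration. Only the 5% sampling fraction comes from the text. The function name `train_random_forest` is hypothetical.

```python
SAMPLE_FRACTION = 0.05  # share of reviews kept, as stated in the text

# Illustrative subset of simple-valued business attributes (an assumption).
# Attributes stored as nested dictionaries would need flattening first.
ATTRIBUTES = ("WiFi", "Alcohol", "BikeParking", "RestaurantsTakeOut")


def train_random_forest(reviews, businesses, attributes=ATTRIBUTES, seed=42):
    """Fit a Random Forest that predicts review stars from business attributes."""
    # Imported here so the constants above can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.sql import functions as F

    # Keep roughly 5% of the reviews; sample() returns an approximate fraction.
    sampled = reviews.sample(fraction=SAMPLE_FRACTION, seed=seed).select(
        "business_id", F.col("stars").cast("double").alias("label")
    )

    # Pull nested attribute values out as string columns; nulls become "missing".
    features = businesses.select(
        "business_id",
        *[
            F.coalesce(F.col(f"attributes.{a}").cast("string"), F.lit("missing")).alias(a)
            for a in attributes
        ],
    )
    data = sampled.join(features, on="business_id", how="inner")

    # Encode each categorical attribute as a numeric index, then assemble a vector.
    indexers = [
        StringIndexer(inputCol=a, outputCol=f"{a}_idx", handleInvalid="keep")
        for a in attributes
    ]
    assembler = VectorAssembler(
        inputCols=[f"{a}_idx" for a in attributes], outputCol="features"
    )
    # Placeholder hyperparameters, not the ones used in the study.
    forest = RandomForestRegressor(
        featuresCol="features", labelCol="label", numTrees=50, maxDepth=8, seed=seed
    )
    return Pipeline(stages=indexers + [assembler, forest]).fit(data)
```

Applied to the full set of 6,990,280 reviews, a fraction of 0.05 corresponds to about 349,514 rows, which matches the figure of approximately 350,000 samples given above.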

In total, more than 100 business attributes were examined; which of them apply depends on the type of business. They include features such as:

Alcohol
Atmosphere
Allowed age groups
Option to bring your own alcohol
Takeout food
WiFi
Smoking allowed
Car parking
Bicycle parking
Recommended visiting days

Atmosphere 66%, BusinessParking 24%, BestDays 8%

Figure 6. Percentage impact of the attributes that had the greatest significance.

The results indicate that the atmosphere in the venue had the greatest impact on customer ratings (66%). Car parking had a much lower, yet still significant, impact (24%). The days of the week recommended by customers for visits accounted for only 8%. For the remaining attributes, either no correlation was detected or its level was below 1%.
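Percentages of this kind can be derived from the feature importances of a trained forest. The sketch below assumes the raw importances are already available as a plain list (in PySpark they could be read from a fitted Random Forest model via `featureImportances.toArray()`). The function name `importance_percentages` is hypothetical, and the input values below 1% are placeholders rather than results from the study.

```python
def importance_percentages(names, importances, min_percent=1.0):
    """Return (name, percent) pairs sorted by importance, dropping small ones."""
    total = sum(importances)
    if total <= 0:
        return []
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    result = []
    for name, value in ranked:
        # Normalise to a share of the total and express it as a percentage.
        percent = round(100.0 * value / total, 1)
        if percent >= min_percent:
            result.append((name, percent))
    return result


# Illustrative input: the three leading shares follow Figure 6; the values
# below 1% are hypothetical placeholders, not results from the study.
names = ["Atmosphere", "BusinessParking", "BestDays", "WiFi", "Alcohol", "BikeParking"]
importances = [0.66, 0.24, 0.08, 0.007, 0.007, 0.006]

for name, percent in importance_percentages(names, importances):
    print(f"{name}: {percent}%")
```

The `min_percent` threshold mirrors the cut-off used in the text, where attributes with an impact below 1% are not reported individually.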