Math Model to Pick Best Store Locations

The major goal of our recent research is to provide simple recommendations on where to place new points of sale or service based on Twitter data analysis.
A scientifically sound technical solution for fast, high-quality social data analysis could become a valuable contribution to the decision-making process. Moreover, social media data offers a unique opportunity to estimate what people are interested in, what they want, and how big the demand for a particular product or service is.
Let’s compare the traditional approach to identifying the next best store location with the innovative ‘social media data’ approach by InData Labs.
Traditional approach:
The traditional approach is usually based on comparing demographic data from a population census with the number of existing infrastructure objects in a particular region. Assuming that some ideal ratio between the number of people and the number of infrastructure objects exists, researchers can determine where the areas with unsatisfied demand are.
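As a minimal sketch of this ratio-based reasoning, the snippet below flags regions where the number of residents per store exceeds a target ratio. The region names, the figures and the target ratio are all made up for illustration and do not come from the research.

```python
def underserved_regions(population, stores, target_ratio):
    """Return (region, residents-per-store) pairs that exceed target_ratio.

    Regions with no stores at all are always reported.
    """
    flagged = []
    for region, residents in population.items():
        count = stores.get(region, 0)
        ratio = float("inf") if count == 0 else residents / count
        if ratio > target_ratio:
            flagged.append((region, ratio))
    # Most underserved regions first
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical census and infrastructure counts
population = {"North": 120_000, "Central": 90_000, "East": 60_000}
stores = {"North": 10, "Central": 30, "East": 0}

print(underserved_regions(population, stores, target_ratio=5_000))
```

Note that nothing in this calculation knows how people actually move through a region, only where they are registered.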
Why have we decided that this approach is not good enough? The statistical data (population size and population density) used in the traditional approach doesn’t provide any valuable insight into such crucial factors as accessibility, visibility and foot traffic.
Another traditional method is social research. Such studies do provide information about foot traffic, but their main shortcoming is that a single study cannot provide enough information to understand historical, seasonal or temporary factors, while multiple studies are time-consuming and costly for a company.
Social Media Data approach:
Social networks provide decision-makers with a continuous flow of customer data, both historical and real-time. Analysts can slice and dice the information collected over different periods of time to identify trends and customer behaviour patterns. Because it contains an abundance of geotagged messages, social data can be successfully applied to urban planning and spatial analysis. This fast, up-to-date and data-driven approach raises the quality of the decision-making process.
InData Labs approach to social data analysis of best store location:
Our data processing flow consists of three stages:
- Preprocessing – extraction of structured data
- Data processing – categorization and sentiment analysis
- Mapping of posts distribution
In the research we used tweets from Singapore for the period from 5 to 18 March 2015. We accessed the data via Twitter’s streaming API, with a subscription filtered by the author’s location. All in all, we analysed about 200,000 messages.
Notably, the sample we analysed contained only messages whose exact location of origin could be retrieved. This means that we knew the coordinates of each tweet and could easily pin it to the map.
The heat map of tweet density below reflects different levels of social activity across Singapore. The most active zones are orange; yellow and green zones are characterized by less social activity; zones without colour are passive ones with no social media activity at all.
Tweet density in Singapore
Preprocessing – extraction of structured data
Besides geolocation, the content of tweets is of high importance for the research. For that reason we analysed both the content and the origin of the tweets. The pie charts below show what the content of tweets in our sample looks like:
As you can see, the vast majority of tweets in our sample were new messages (not replies), mostly self-generated (not shared), without URLs.
Our mathematical model aims to perform high-quality semantic and sentiment analysis. It is therefore important to prepare the sample data so that it consists of text messages written by real people who express their thoughts and ideas on social media.
At this stage we clean all tweets, extracting hashtags, links, smileys and replies with the twitter-text-python library. As input, the model gets messages in CSV format: geolocation (lon, lat) and the text (raw_text). The output of this stage is a JSON file that contains a set of dictionaries: {'point': [lon, lat], 'raw_text': '…', 'tags': […], 'links': […], …}.
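The preprocessing step can be sketched roughly as follows. The research used twitter-text-python for extraction; this sketch swaps in a few simple regular expressions from the standard library so that it runs without extra packages, and they are only an approximation of what that library does. The field names 'reply_to' and 'text', as well as the sample row, are assumptions made for illustration.

```python
import csv
import io
import json
import re

TAG_RE = re.compile(r"#(\w+)")
LINK_RE = re.compile(r"https?://\S+")
REPLY_RE = re.compile(r"^@(\w+)")
MENTION_RE = re.compile(r"@\w+")
# A deliberately small emoticon pattern; real tweets need a broader one
SMILEY_RE = re.compile(r"[:;=]-?[)(DPp]")


def preprocess(csv_text):
    """Turn CSV rows (lon, lat, raw_text) into a list of dictionaries."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["raw_text"]
        links = LINK_RE.findall(raw)
        # Strip links first so that URL fragments are not read as hashtags
        no_links = LINK_RE.sub(" ", raw)
        reply = REPLY_RE.match(no_links.strip())
        clean = no_links
        for pattern in (TAG_RE, SMILEY_RE, MENTION_RE):
            clean = pattern.sub(" ", clean)
        records.append({
            "point": [float(row["lon"]), float(row["lat"])],
            "raw_text": raw,
            "tags": TAG_RE.findall(no_links),
            "links": links,
            "reply_to": reply.group(1) if reply else None,  # assumed field
            "text": " ".join(clean.split()),                # assumed field
        })
    return records


# Hypothetical input row
sample = (
    "lon,lat,raw_text\n"
    '103.85,1.29,"@anna Great laksa at the market :) #food http://t.co/abc"\n'
)
print(json.dumps(preprocess(sample), indent=2))
```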
Data processing – categorization and sentiment analysis
At this stage we perform semantic and sentiment analysis of each message to determine its topic, its emotional tone and its level of subjectivity.
Although useful for trend monitoring, hashtags turned out to bring little value for topic identification and sentiment analysis in our research.
The bar plot below shows the most popular hashtags from our sample.
Most hashtags recur over time, while others are used only once and are usually connected with a certain event such as a concert, a holiday or a sports competition. For example, within our research we detected a one-day spike of the #otrasg hashtag, related to the “On The Road Again Singapore” concert by the boy band One Direction. Only the concert audience was aware of the hashtag, while for everyone else it was just a meaningless set of symbols.
Some people use simple hashtags like #food, #dinner, #lunch or #coffee to define the topic of their daily messages. Alas, the vast majority (91%) of messages carry no hashtag at all. That’s why the mathematical model we have created ignores hashtags.
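A check like the one behind these hashtag figures could look roughly like the sketch below, which counts hashtags and measures the share of messages without any. The sample messages are invented, so the share it prints has nothing to do with the 91% measured in the research.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"#(\w+)")


def hashtag_stats(messages):
    """Return (Counter of lower-cased hashtags, share of untagged messages)."""
    counts = Counter()
    untagged = 0
    for text in messages:
        tags = [tag.lower() for tag in TAG_RE.findall(text)]
        if tags:
            counts.update(tags)
        else:
            untagged += 1
    share = untagged / len(messages) if messages else 0.0
    return counts, share


# Hypothetical messages
messages = [
    "Best chicken rice ever #food #lunch",
    "Stuck in traffic again",
    "Need caffeine #coffee",
    "Movie night with friends",
    "Queue at the ATM is huge",
]
counts, share = hashtag_stats(messages)
print(counts.most_common(3))
print(f"{share:.0%} of messages carry no hashtag")
```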
For sentiment analysis at this stage we used the python-textblob library. Taking the JSON file as input, the processing algorithm produces a CSV file that contains the location, the topic category (in the research we focused on 3 topics: Food, Entertainment, Finance) and the emotional tone (negative/neutral/positive), along with the original message.
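To make the shape of this step concrete, here is a minimal sketch that turns preprocessed records into CSV rows. It is a stand-in, not the model used in the research: topics are assigned by a hypothetical keyword lookup (the text does not say how topics were determined), and a tiny word list replaces python-textblob so that the example runs with the standard library alone. The column names are assumptions as well.

```python
import csv
import io

# Hypothetical keyword lists; the research does not publish its own
TOPIC_KEYWORDS = {
    "Food": {"lunch", "dinner", "coffee", "restaurant", "laksa"},
    "Entertainment": {"concert", "movie", "show", "party"},
    "Finance": {"atm", "bank", "loan", "cash"},
}
POSITIVE = {"great", "love", "tasty", "awesome"}
NEGATIVE = {"bad", "awful", "hate", "slow"}


def tokens(text):
    """Lower-case words with surrounding punctuation stripped."""
    return [word.strip(".,!?").lower() for word in text.split()]


def topic_of(text):
    """Return the first topic whose keywords overlap the message, else None."""
    words = set(tokens(text))
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return None


def polarity_of(text):
    """Crude lexicon score in [-1, 1]: (positive - negative) / matched words."""
    words = tokens(text)
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def tone_of(polarity):
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def to_csv(records):
    """Write one row per message that falls into a known topic."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["lon", "lat", "topic", "tone", "raw_text"])
    for rec in records:
        topic = topic_of(rec["raw_text"])
        if topic is None:
            continue
        lon, lat = rec["point"]
        tone = tone_of(polarity_of(rec["raw_text"]))
        writer.writerow([lon, lat, topic, tone, rec["raw_text"]])
    return out.getvalue()


# Hypothetical preprocessed records
records = [
    {"point": [103.85, 1.29], "raw_text": "Great coffee near the office!"},
    {"point": [103.82, 1.30], "raw_text": "The ATM queue is awful"},
    {"point": [103.90, 1.35], "raw_text": "Rainy day"},
]
print(to_csv(records))
```

With python-textblob installed, `TextBlob(text).sentiment` returns a polarity and a subjectivity score that could replace `polarity_of` here.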
Mapping of the messages distribution
At this final stage we build heat maps for each topic category. These maps help visualize the best locations for different infrastructure objects. Here is a short description of how it works.
We create two spatial empirical distribution functions, f1 and f2: f1 for existing infrastructure objects and f2 for the messages from the corresponding topic category. On the heat map we show the difference between the two distribution functions (f2 – f1). Similar function values form levels of attractiveness that are represented by different colors on the map. Places where the function (f2 – f1) reaches its maximum represent the highest level of attractiveness and are colored in red.
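The difference f2 – f1 can be illustrated with a small sketch. The text does not say how the two functions are estimated, so this example assumes the simplest option, a normalized 2D histogram on a regular grid. The bounding box, the grid size and all coordinates are toy values.

```python
def grid_density(points, bounds, size):
    """Share of points falling into each cell of a size x size grid.

    bounds is (min_lon, min_lat, max_lon, max_lat); points outside are skipped.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    grid = [[0.0] * size for _ in range(size)]
    kept = 0
    for lon, lat in points:
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue
        # Clamp so that points on the upper edge land in the last cell
        col = min(int((lon - min_lon) / (max_lon - min_lon) * size), size - 1)
        row = min(int((lat - min_lat) / (max_lat - min_lat) * size), size - 1)
        grid[row][col] += 1
        kept += 1
    if kept:
        grid = [[value / kept for value in row] for row in grid]
    return grid


def attractiveness(tweets, objects, bounds, size):
    """Cell-wise difference f2 - f1 between tweet and object densities."""
    f2 = grid_density(tweets, bounds, size)
    f1 = grid_density(objects, bounds, size)
    return [[f2[r][c] - f1[r][c] for c in range(size)] for r in range(size)]


# Hypothetical coordinates inside a toy bounding box
bounds = (0.0, 0.0, 2.0, 2.0)
tweets = [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4), (1.5, 1.5)]
shops = [(1.5, 1.5), (1.6, 1.4)]

diff = attractiveness(tweets, shops, bounds, size=2)
# The cell with the largest difference is the most attractive one
best = max(((r, c) for r in range(2) for c in range(2)),
           key=lambda rc: diff[rc[0]][rc[1]])
print(diff, best)
```

In this toy data most tweets fall into a cell with no shops, so that cell gets the highest value, while the cell holding both shops ends up negative.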
Difference between all tweets density and infrastructure objects (shops in this case). Sentiments are ignored:
Difference between density of tweets related to Finance category and infrastructure objects (banking offices and ATMs in this case). Sentiments are ignored:
Difference between density of tweets related to Food category and infrastructure objects (restaurants and cafes in this case). Sentiments are analyzed:
For some topic categories we apply sentiment analysis and consider only positive or only negative messages. When building a heat map for restaurants, we first select all the messages relevant to the food topic and then evaluate the emotional tone and the strength of emotion to determine the weight of each message. This approach is based on the assumption that the strength of emotion indicates the strength of demand for a product or service.
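One way to read this weighting is sketched below: each message of the chosen tone contributes to its grid cell in proportion to the absolute value of its polarity. The weighting formula, the polarity scale from -1 to 1 and the sample data are assumptions; the text states only that tone and strength of emotion determine the weight.

```python
def weighted_density(messages, bounds, size, tone="positive"):
    """Grid density where each message counts with weight |polarity|.

    messages are (lon, lat, polarity) tuples; only messages whose polarity
    sign matches tone ("positive" or "negative") are used.
    """
    sign = 1 if tone == "positive" else -1
    min_lon, min_lat, max_lon, max_lat = bounds
    grid = [[0.0] * size for _ in range(size)]
    total = 0.0
    for lon, lat, polarity in messages:
        if polarity * sign <= 0:
            continue  # wrong tone or neutral
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue
        col = min(int((lon - min_lon) / (max_lon - min_lon) * size), size - 1)
        row = min(int((lat - min_lat) / (max_lat - min_lat) * size), size - 1)
        weight = abs(polarity)
        grid[row][col] += weight
        total += weight
    if total:
        grid = [[value / total for value in row] for row in grid]
    return grid


# Hypothetical food-related messages: (lon, lat, polarity)
food_messages = [
    (0.5, 0.5, 0.8),   # strongly positive
    (1.5, 1.5, 0.2),   # mildly positive
    (1.5, 0.5, -0.6),  # negative, skipped when tone="positive"
]
grid = weighted_density(food_messages, (0.0, 0.0, 2.0, 2.0), size=2)
print(grid)
```

The resulting grid can take the place of the unweighted message density f2 before the infrastructure density f1 is subtracted.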
The research conducted by InData Labs showed that social data analysis can support the decision-making process by providing insights into people’s thoughts and wishes tied to certain locations. Discover how businesses small and large can leverage the information in our blog post.