Natural Blogarithm
https://natural-blogarithm.com/
Recent content on Natural Blogarithm (Hugo -- gohugo.io, Copyright 2021)
Making my Photo Collection Searchable by Keywords
https://natural-blogarithm.com/post/photo-keyword-search/
Tue, 20 Jul 2021 00:00:00 +0000
<script src="https://natural-blogarithm.com/post/photo-keyword-search/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>Since I had to cancel most of my travel plans this year due to the Corona
pandemic, I spent a lot of time browsing through and editing photos from previous
vacations. While this activity gave me at least a bit of a holiday feeling, it also
gave me an idea for a nice little computer vision project.</p>
<p>My photo collection is organised into a number of chronologically ordered folders,
with each vacation or event residing in a separate directory. This structure also
largely dictates how I usually browse through my photo collection: looking at the pictures
from each vacation separately.</p>
<p>What if, instead, it were possible to <em>browse through the photos in a
different way</em>? For example, wouldn’t it be nice to look at all the pictures of
beaches I have visited during all my vacations? Or the restaurants I have eaten at?
Or the bottles of beer I drank?</p>
<p>That should make for an interesting and novel way to look at my photos.</p>
<div id="neural-networks-for-object-dection" class="section level1">
<h1>Neural Networks for Object Detection</h1>
<p>After thinking about this idea for a while I realised that it should be easy to
implement this functionality with a neural network that was trained for
<a href="https://en.wikipedia.org/wiki/Object_detection">object recognition</a> tasks. Such a
network could be used to make predictions on all the images in my photo collection and
the resulting predictions (i.e. <em>“What does the model think is shown in the photo?”</em>)
could be used to assign keywords to each picture.</p>
<p>However, training such a model from scratch requires two things that are not readily
available to me:</p>
<ul>
<li>a large labeled set of <em>training data</em></li>
<li>lots of <em>computational power</em> (typically in the form of specialised GPU or TPU hardware)</li>
</ul>
<p>While the second point may be solved by spending a bit of money on renting infrastructure
on AWS or Google Cloud Platform, the first point poses a much bigger problem.</p>
<p>I could potentially go through my photos and try to label them manually, but
this would somewhat defeat the purpose of this project. Furthermore, it is
questionable whether I could generate a training dataset of sufficient size and
quality this way in a reasonable amount of time.</p>
<p>Fortunately, there are already models out there that have been trained on huge
datasets, some of which are available through the
<a href="https://pytorch.org/vision/stable/models.html">PyTorch torchvision model zoo</a>.</p>
<p>For our project we will use the <a href="https://arxiv.org/pdf/1512.03385.pdf">ResNet-152 model</a>,
which was trained on the <em>ImageNet</em> dataset containing around 1.28
million images of 1,000 different object classes.</p>
<p>With the pre-trained model, the keywords we will be able to assign to our images
are restricted to the labels that were used during the original training. However,
this is not necessarily a severe restriction as the original dataset contains a vast variety
of different objects such as ambulances, kimonos, iPods, toilet seats or
stingrays (the full list can be found <a href="https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a">here</a>).
With this diverse list of keywords we should still be able to make some interesting
queries against our photo collection.</p>
<p>Another restriction we should be aware of is that the performance of the ResNet model
on the ImageNet dataset may not translate to my personal photo collection, especially
if the images in the two datasets are fundamentally different in some way (e.g.
different angles, lighting etc.).</p>
<p>Also, most images in the ImageNet dataset seem to show only one particular
object, while many of my photos contain a composition of several
objects or people. It will be interesting to see how the model performs in these
situations.</p>
</div>
<div id="implementation" class="section level1">
<h1>Implementation</h1>
<p>To implement this project I decided to split it into two components:</p>
<ul>
<li><em>Generating the keyword database</em>: Loading the model, predicting on the images, writing out the predicted labels</li>
<li><em>The frontend</em>: Loading the database and querying it in different ways, visualising the results</li>
</ul>
<p>Splitting the application into those two components should facilitate future development
in case I decide to dedicate more time to this project.</p>
<div id="creating-the-keyword-database-a-python-command-line-application" class="section level2">
<h2>Creating the Keyword Database: A Python Command Line Application</h2>
<p>To extract the labels from the model for each photo I decided to write a <em>Python
command line application</em> with a simple overall logic:</p>
<ul>
<li>Read the pre-trained model from the PyTorch model zoo</li>
<li>Load in and transform the images from the target directory</li>
<li>Extract the labels (predictions) from the model for each image</li>
<li>Write the labels to a file</li>
</ul>
<p>One important thing to note here is that in order to run the images through the
ResNet model <em>they need to be transformed in a specific way</em>. For example, the images
in my photo collection tend to have a relatively high resolution (e.g. 3000x4000
pixels) while the ResNet model expects the images to have a resolution of 224x224.
In addition to that, the color channels need to be normalised to make sure they
match the statistics the model saw during the training process.</p>
<p>The following Python code combines the required pre-processing steps into one transformation that can
be applied to the images when they are loaded and before they are passed to the
model (see <a href="https://pytorch.org/vision/stable/transforms.html">here</a> for more
information about these transformations):</p>
<pre class="text"><code>from torchvision import transforms
data_transform = transforms.Compose(
[
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
]
)</code></pre>
<p>The application takes several command line parameters such as the directory in which
to look for the images or the number of labels to extract for each image. The full
code for the CLI application along with the definition of the parameters can be
found <a href="https://github.com/jakobludewig/image_keyword_generator/blob/main/build_keyword_database.py">here</a>.</p>
<p>The application can be executed from the terminal by running the following command:</p>
<pre class="bash"><code>python build_keyword_database.py --imagedir=images --numtoplabels=5</code></pre>
<p>An example of the output of the application is shown below:</p>
<div style="text-align:center">
<img src="images/example_output_cli.png" width ="75%" />
</div>
<p>As can be seen from the screenshot above, it takes around 11 minutes on my machine
to process a sample of around 4,100 images from my photo collection.</p>
<p>When the process has finished, a Pandas dataframe containing the extracted
labels for each photo is <em>written out as a pickle file</em>. We will show
how this file can be queried in the next section.</p>
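<p>As a small illustration of this hand-off between the two components, the round trip through the pickle file can be sketched as follows (the column names used here are assumptions for illustration, not necessarily the ones used by the actual application):</p>

```python
import pandas as pd

# Hypothetical structure of the keyword database: one row per
# (image, label) pair with the predicted probability.
labels_df = pd.DataFrame(
    {
        "filename": ["IMG_0001.jpg", "IMG_0001.jpg"],
        "label": ["seashore", "sandbar"],
        "probability": [0.91, 0.05],
    }
)

# The CLI application writes the database out as a pickle file ...
labels_df.to_pickle("keyword_database.pkl")

# ... and the frontend loads it back for querying
db = pd.read_pickle("keyword_database.pkl")
print(db.shape)  # → (2, 3)
```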
</div>
</div>
<div id="querying-the-database-a-jupyter-notebook" class="section level1">
<h1>Querying the Database: A Jupyter Notebook</h1>
<p>The structure of the dataframe we created above looks as follows:</p>
<div style="text-align:center">
<img src="images/database_example.png" width ="80%"/>
</div>
<p>Each row contains the filename of an image and a keyword that it was tagged with.
For each image file there are five keywords and they are sorted by the probability
that the neural network assigned to them.</p>
<p>To query this database and analyse the results of the neural network a bit more
in-depth I wrote a Jupyter Notebook file to serve as the “frontend” for the
application.</p>
<p>The core of this frontend will be a function that displays an image along with
the labels that were predicted by the neural network. An example output of this
function is shown below:</p>
<div style="text-align:center">
<img src="images/example_plot_with_labels.png" width ="80%"/>
</div>
<p>On the left side it shows the specified image and on the right side the <em>labels
along with their predicted probabilities</em> from the neural network. In this case
the neural network assigns a probability of 85.5 % that the image is showing a
macaw (I am not sure whether the bird is actually a macaw but at least it does
not seem to be far off).</p>
<p>The function is implemented as a simple Matplotlib subplot where the left-hand plot uses
the <em>imshow</em> function and the right box is filled with a <em>text</em> object:</p>
<pre class="text"><code># Imports assumed here: plt is matplotlib.pyplot, PILImage is a PIL image type
import matplotlib.pyplot as plt
from PIL.Image import Image as PILImage

def plot_image_with_labels(image: PILImage, labels: dict, full_path: bool = False) -> None:
    """Plot an image alongside its associated labels

    Args:
        image (PILImage): The image to plot
        labels (dict): Dictionary containing the labels with their predicted probability
        full_path (bool): Flag indicating whether the full path or just the filename should be displayed
    """
    fig = plt.figure(figsize=(16, 12))
    # Left-hand plot: the image itself, spanning four of the six subplot columns
    ax1 = fig.add_subplot(1, 6, (1, 4))
    ax1.set_axis_off()
    ax1.imshow(image)
    # Right-hand box: a text object with the filename and the predicted labels
    ax2 = fig.add_subplot(1, 6, (5, 6))
    if full_path:
        labels_text = "File: " + image.filename + "\n"
    else:
        labels_text = "File: " + "[...]/" + image.filename.split("/")[-1] + "\n"
    labels_text = labels_text + "\n".join(
        [k + ": " + str(round(100 * v, 1)) + " %" for k, v in labels.items()]
    )
    ax2.set_axis_off()
    ax2.text(0, 0.5, labels_text, fontsize=16)</code></pre>
<p>Apart from this function the frontend code mostly consists of functions
that implement different ways of querying the database and combining the images and
the labels in a convenient format for plotting. The notebook to reproduce the
results that will be presented in the next section can be found <a href="https://github.com/jakobludewig/image_keyword_generator/blob/main/Frontend.ipynb">here</a>.</p>
</div>
<div id="results" class="section level1">
<h1>Results</h1>
<p>I explored different ways to query the database through the Jupyter notebook to get an
idea of how well the neural network does at detecting objects in my photo collection.
In the next sections I will present some of the results I have seen.</p>
<div id="querying-the-top-images-for-a-given-label" class="section level2">
<h2>Querying the Top Images for a Given Label</h2>
<p>As described in the beginning of this blog post my main motivation for this project
was to query my photo collection for all images containing a given object. This
functionality is implemented in the Jupyter notebook by querying the keyword
database with a specified label and visualising the images that have the highest predicted probability for that given label.</p>
<p>Below are some examples for the results we get with this approach.</p>
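<p>The underlying query can be sketched in a few lines of Pandas (the dataframe layout and column names here are assumptions for illustration):</p>

```python
import pandas as pd

# Toy version of the keyword database described above
db = pd.DataFrame(
    {
        "filename": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
        "label": ["seashore", "seashore", "macaw", "seashore"],
        "probability": [0.91, 0.40, 0.05, 0.75],
    }
)

def top_images_for_label(db: pd.DataFrame, label: str, n: int = 4) -> pd.DataFrame:
    """Return the n images with the highest predicted probability for a label."""
    matches = db[db["label"] == label]
    return matches.sort_values("probability", ascending=False).head(n)

print(top_images_for_label(db, "seashore")["filename"].tolist())
# → ['a.jpg', 'c.jpg', 'b.jpg']
```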
<div id="top-images-for-label-seashore-coast-seacoast-sea-coast" class="section level3">
<h3>Top Images for Label: <em>“seashore, coast, seacoast, sea-coast”</em></h3>
<div style="text-align:center">
<img src="images/example_beach_1.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beach_2.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beach_3.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beach_4.png" width ="100%" />
</div>
</div>
<div id="top-images-for-label-beer-bootle" class="section level3">
<h3>Top Images for Label: <em>“beer bottle”</em></h3>
<div style="text-align:center">
<img src="images/example_beer_1.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beer_2.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beer_3.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_beer_4.png" width ="100%" />
</div>
</div>
<div id="top-images-for-label-restaurant-eating-house-eating-place-eatery" class="section level3">
<h3>Top Images for Label: <em>“restaurant, eating house, eating place, eatery”</em></h3>
<div style="text-align:center">
<img src="images/example_restaurant_1.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_restaurant_2.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_restaurant_3.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_restaurant_4.png" width ="100%" />
</div>
</div>
<div id="top-images-for-label-cock" class="section level3">
<h3>Top Images for Label: <em>“cock”</em></h3>
<div style="text-align:center">
<img src="images/example_cock_1.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_cock_2.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_cock_3.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_cock_4.png" width ="100%" />
</div>
<p>Overall the results look pretty convincing, the only exceptions perhaps being the
third and fourth picture in the <em>beer bottle</em> category, which actually show
a can of beer and glasses of beer respectively.</p>
<p>Also, the third photo in the <em>cock</em>
category seems to show a duck rather than a cock (although it should be
said that the model only assigns a 17 % probability to this image
showing a cock).</p>
<p>It’s also interesting to see that the neural network does not seem to have a big issue
with the poor lighting in the last picture in that category and successfully identifies
the chicken in the shadow.</p>
</div>
</div>
<div id="querying-labels-for-random-images" class="section level2">
<h2>Querying Labels for Random Images</h2>
<p>As mentioned before, the results for the keywords we presented above look
convincing. However, the way we queried the database above is designed to give us
images for which the neural network has high confidence in its predictions and might
therefore have a tendency to produce clear cases.</p>
<p>To get a more general impression of the model’s predictions it might
therefore be interesting to <em>browse the photo collection more randomly</em>.
Below we show the results for a more or less random selection of photos from
my collection:</p>
<div style="text-align:center">
<img src="images/example_jeepney.png" width ="100%" />
</div>
<p>Here we can see that the neural network is <em>mistaking the colorful Jeepney for a
fire truck or a school bus</em>. The ImageNet dataset has no category for this
specific type of Filipino passenger bus, which is why the model seems to fall back on the closest thing it has seen during training.</p>
<div style="text-align:center">
<img src="images/example_coconut_truck.png" width ="100%" />
</div>
<p>For the second image the top-ranking label is “banana” even though there are no
bananas in the picture. It might be interesting to analyse where in the picture
the neural network has identified a banana. It would probably not be surprising if it
mistook the yellow paint on the truck door for a banana.</p>
<p>Another interesting case is the picture of the butterfly below:</p>
<div style="text-align:center">
<img src="images/example_butterfly.png" width ="100%" />
</div>
<p>While the ImageNet dataset contained instances of various types of butterflies
the <em>top ranking label for this image is “table lamp”</em> (also note that
“lampshade” ranks 3rd).</p>
<p>While it is pretty easy for the human eye to identify the
butterfly in the image, it is plausible that the ResNet model could be confused: if
one squints, the shot actually slightly resembles a table lamp, especially
given the particular lighting in this image. Again, it would be interesting to do a
bit of troubleshooting to see how the neural network comes to its decision.</p>
</div>
<div id="querying-images-with-lowest-prediction-variance" class="section level2">
<h2>Querying Images with Lowest Prediction Variance</h2>
<p>We can find some more cases in which the neural network has difficulties
making a decision by taking a somewhat more systematic approach: we can query the
database for those images whose predicted probabilities have the lowest
variance. For these images the predicted probabilities lie close together,
which indicates cases where the model is uncertain
about what it sees:</p>
<div style="text-align:center">
<img src="images/example_low_var_1.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_low_var_2.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_low_var_3.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_low_var_4.png" width ="100%" />
</div>
<div style="text-align:center">
<img src="images/example_low_var_5.png" width ="100%" />
</div>
<p>We can see that for these cases the model seems to be pretty far off: basically none
of the objects it detects actually appear in any of the images.</p>
<p>However, this failure of the model is understandable in the sense that the
images shown above do not necessarily have a clearly identifiable, single object
in them. They look like somewhat random shots for which even a human might have
problems coming up with a single label that best describes them. What is more, at
least two of the images are of poor quality and suffer from bad lighting or
blurriness.</p>
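<p>For completeness, this variance-based query can be sketched with plain Python (the data layout, a mapping from filename to the top five predicted probabilities, is an assumption for illustration):</p>

```python
from statistics import pvariance

# Hypothetical predicted probabilities of the top five labels per image
predictions = {
    "clear_case.jpg":   [0.85, 0.05, 0.04, 0.03, 0.03],
    "unclear_case.jpg": [0.12, 0.11, 0.10, 0.09, 0.08],
}

# Sort images by the variance of their label probabilities, lowest first.
# A low variance means the probability mass is spread almost evenly
# across the labels, i.e. the model is uncertain about what it sees.
by_uncertainty = sorted(predictions, key=lambda f: pvariance(predictions[f]))
print(by_uncertainty[0])  # → unclear_case.jpg
```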
</div>
</div>
<div id="conclusions-next-steps" class="section level1">
<h1>Conclusions & Next Steps</h1>
<p>In this blog post we presented a simple way to make a photo collection searchable
by the objects shown in the photos. To do this we used a pre-trained neural network model and
overall were able to get good results.</p>
<p>However, we also pointed out some limitations of this approach that seem to result
from the fact that we took a network that was trained on a specific dataset with a
fixed list of labels and applied it to a completely different collection of images.</p>
<p>The application itself is a very basic and hacky implementation consisting of a
command line application and a Jupyter notebook frontend. I might revisit this
project in the future to improve on it. In particular it would be nice to
dockerise the application and re-implement its functionality in a proper web application
framework.</p>
</div>
Guest Post about Fraud Detection on AWS Blog
https://natural-blogarithm.com/post/aws-guest-post-fraud-detection/
Thu, 13 May 2021 00:00:00 +0000
<script src="https://natural-blogarithm.com/post/aws-guest-post-fraud-detection/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<br>
<div style="text-align:center">
<img src="images/thumbnail.png" width ="40%"/>
</div>
<p>As this blog is growing more and more into a portfolio for my data science work I
did not want to miss out on the opportunity to (shamelessly) promote my guest post
on the AWS blog from November:</p>
<p><a href="https://aws.amazon.com/blogs/startups/using-ml-to-detect-survey-fraud/">Advancement in Fraud Detection: ML in Online Survey Research</a></p>
<p>The fraud detection solution described in the blog post was part of a larger
anti-fraud initiative at my previous company. Since the solution was highly
successful in combating fraud and also made heavy use of AWS solutions (in particular
<a href="https://aws.amazon.com/sagemaker/">Amazon SageMaker</a>)
our contacts at AWS asked us whether we would like to share our
experience in a short post on their blog.</p>
<p>One of the main technical requirements that we needed to meet in this project was
that the model endpoint had to accept a JSON payload as input. This payload had to
be transformed into a numeric input before passing it to the model which was
achieved by deploying the XGBoost model in conjunction with a JSON parser
model in a <a href="https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html">SageMaker PipelineModel</a>.
Some boilerplate code for this implementation can be found
<a href="https://github.com/DaliaResearch/sagemaker_json_to_xgboost">here</a>.</p>
<p>In the future I plan to add more functionality to this implementation, such as unit
tests. So if this is something that might be useful for you, stay tuned for
upcoming blog posts!</p>
Calculating Variance: Welford's Algorithm vs NumPy
https://natural-blogarithm.com/post/variance-welford-vs-numpy/
Thu, 06 May 2021 00:00:00 +0000
<script src="https://natural-blogarithm.com/post/variance-welford-vs-numpy/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>In a past project at my previous job our team was building a real-time fraud detection
algorithm. The logic of the algorithm was based on the classical outlier definition:
flag everything that exceeds the population mean by roughly two standard deviations.</p>
<p>The requirements for the response time of the API that would host the algorithm
were quite restrictive (in the order of milliseconds) and the number of observations
which we had to calculate the mean and variance for would range in the thousands.</p>
<p>While carrying out the calculations for these kinds of sample sizes is extremely
fast, storing and retrieving the datasets can become a bottleneck. Initially we
evaluated different solutions, such as storing all the observations in an in-memory
database like Redis, but were not quite satisfied with the performance and the
architectural overhead it introduced.</p>
<div id="welfords-algorithm-for-calculating-variance" class="section level1">
<h1>Welford’s Algorithm for Calculating Variance</h1>
<p>A colleague suggested we should go with a different approach and introduced us
to <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm">Welford’s algorithm</a>.
Welford’s algorithm is an <em>online algorithm</em> which means that it can update the
estimation of the sample variance by processing one observation at a time.</p>
<p>This approach has the distinct advantage that it does not require you to have
all the observations available to update the estimations of mean and variance
when a new observation comes in.</p>
<p>To make this a bit clearer, below is the formula you would normally use
to calculate the variance from a set of observations <span class="math inline">\(x_1,\dots, x_n\)</span>:</p>
<p><span class="math display">\[s^2_n = \frac{\sum_{i=1}^n (x_i - \bar{x}_n)^2}{n-1}\]</span></p>
<p>Here <span class="math inline">\(\bar{x}_n\)</span> denotes the sample mean <span class="math inline">\(\frac{1}{n}\sum_{i=1}^n x_i\)</span>.</p>
<p>To calculate the variance using this formulation it is necessary to have all the
observations <span class="math inline">\(x_1,\dots,x_n\)</span> available at any time. In fact you have to go through
the whole dataset twice: once to calculate the mean <span class="math inline">\(\bar{x}_n\)</span> and then another time
to calculate the variance itself. Therefore this formula is also referred to as the
<em>two-pass algorithm</em>.</p>
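<p>In code, the two-pass formula translates almost literally (a plain-Python sketch):</p>

```python
def two_pass_variance(xs: list) -> float:
    """Sample variance via the standard two-pass formula."""
    n = len(xs)
    mean = sum(xs) / n  # first pass: the sample mean
    # second pass: the sum of squared differences, divided by n - 1
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

print(two_pass_variance([2.0, 4.0, 6.0]))  # → 4.0
```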
<p>The idea of Welford’s algorithm is to decompose the calculations of the mean and
the variance into a form that can easily be updated whenever a new observation
comes in.</p>
<p>For the mean this transformation is produced by simply rewriting the formula
for <span class="math inline">\(\bar{x}_n\)</span> as the weighted average between the old mean <span class="math inline">\(\bar{x}_{n-1}\)</span>
and the new observation <span class="math inline">\(x_n\)</span>:</p>
<p><span class="math display">\[\bar{x}_n = \frac{\sum_{i=1}^n x_i}{n} = \frac{\sum_{i=1}^{n-1} x_i}{n} + \frac{x_n}{n} = \frac{(n-1)\bar{x}_{n-1} + x_n}{n} = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}\]</span></p>
<p>The decomposition for the variance is slightly more complicated. Looking only at
the numerator (the <em>sum of squares of differences</em>) of the formula for <span class="math inline">\(s^2_{n}\)</span>
from above, we can derive it in the following way:</p>
<span class="math display">\[\begin{aligned}
&\sum_{i=1}^n (x_i - \bar{x}_n)^2 = \sum_{i=1}^n (x_i - \underbrace{(\bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n})}_{\text{decomposition of }\bar{x}_n})^2 = \sum_{i=1}^n ((x_i - \bar{x}_{n-1}) - (\frac{x_n - \bar{x}_{n-1}}{n}))^2\\\\
=& \sum_{i=1}^n (x_i - \bar{x}_{n-1})^2 - \underbrace{2 \sum_{i=1}^n (x_i - \bar{x}_{n-1})(\frac{x_n - \bar{x}_{n-1}}{n})}_{=2 \frac{(x_n - \bar{x}_{n-1})}{n}\sum_{i=1}^n(x_i - \bar{x}_{n-1})} + \underbrace{\sum_{i=1}^n (\frac{x_n - \bar{x}_{n-1}}{n})^2}_{=n \cdot (\frac{x_n - \bar{x}_{n-1}}{n})^2 = \frac{(x_n - \bar{x}_{n-1})^2}{n}} \\\\
=& \underbrace{\sum_{i=1}^{n-1} (x_i - \bar{x}_{n-1})^2}_{=(n-2) \cdot s_{n-1}^2} + \underbrace{(x_n - \bar{x}_{n-1})^2 + \frac{(x_n - \bar{x}_{n-1})^2}{n}}_{=\frac{(n+1)(x_n - \bar{x}_{n-1})^2}{n}} - 2 \frac{(x_n - \bar{x}_{n-1})}{n}\underbrace{\sum_{i=1}^n(x_i - \bar{x}_{n-1})}_{=(n \cdot \bar{x}_n - n\cdot\bar{x}_{n-1})} \\\\
=& (n-2) \cdot s_{n-1}^2 + \frac{(n+1)(x_n - \bar{x}_{n-1})^2}{n} - 2 \frac{(x_n - \bar{x}_{n-1})}{n}\underbrace{(n \cdot \bar{x}_n - n\cdot\bar{x}_{n-1})}_{=n \cdot (\bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n} - \bar{x}_{n-1})} \\\\
=& (n-2) \cdot s_{n-1}^2 + \frac{(n+1)(x_n - \bar{x}_{n-1})^2}{n} - 2 \frac{(x_n - \bar{x}_{n-1})^2}{n} \\\\
=& (n-2) \cdot s_{n-1}^2 + \frac{(n-1)(x_n - \bar{x}_{n-1})^2}{n}
\end{aligned}\]</span>
<p>To get the variance we just need to divide the formula above by <span class="math inline">\((n-1)\)</span> which yields:</p>
<p><span class="math display">\[s^2_n = \frac{n-2}{n-1} s^2_{n-1} + \frac{(x_n - \bar{x}_{n-1})^2}{n}\]</span></p>
<p>The derivation above shows that the updating formula is algebraically equivalent
to the standard formula for variance presented in the beginning of this section.</p>
<p>However, the new formulation has the distinct advantage that it does not
require all of the observations to be available at computation time but instead
can be used to sequentially update our estimation of the variance as new data
comes in. At any point in time our algorithm only depends on four values as input:</p>
<ul>
<li>the observation <span class="math inline">\(x_n\)</span></li>
<li>the total count of observations <span class="math inline">\(n\)</span></li>
<li>the current estimate of the mean <span class="math inline">\(\bar{x}_{n-1}\)</span></li>
<li>the current estimate of the variance <span class="math inline">\(s_{n-1}^2\)</span></li>
</ul>
<p>In practice it is preferable to keep track of the sum of squares of differences
<span class="math inline">\(M_{2,n} = \sum_{i=1}^n (x_i - \bar{x}_n)^2\)</span> instead of the variance <span class="math inline">\(s_n^2\)</span> directly
due to numerical stability considerations. Doing so results in the slightly
different formulation which is presented in the Wikipedia article and which we will
be using in our analysis as well.</p>
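<p>A sketch of this formulation in Python, mirroring the update formulas derived above:</p>

```python
def welford_variance(xs) -> float:
    """Online (Welford) estimate of the sample variance.

    Tracks the running mean and the sum of squares of differences M2
    instead of the variance itself, for numerical stability.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean           # x_n - mean_{n-1}
        mean += delta / n          # updating formula for the mean
        m2 += delta * (x - mean)   # M2 update: delta * (x_n - mean_n)
    return m2 / (n - 1)            # sample variance s_n^2

print(welford_variance([2.0, 4.0, 6.0]))  # → 4.0
```

<p>Note that each observation is consumed exactly once; only the count, the running mean and M2 need to be stored between updates.</p>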
</div>
<div id="discrepancies-between-welfords-algorithm-and-standard-formula" class="section level1">
<h1>Discrepancies Between Welford’s Algorithm and Standard Formula</h1>
<p>The fact that Welford’s algorithm and the standard formula for variance estimation
are algebraically equivalent does of course not guarantee that they will always
produce the same values when they are implemented in a piece of software as many
potential sources of error can occur. An obvious example that comes to mind is
human error in implementing the solution, but there are also more subtle things
such as numerical issues that arise when doing arithmetic with limited floating
point precision.</p>
<p>While testing the results of the real-time outlier detection algorithm against
an offline implementation that was previously run manually by data analysts some
discrepancies between the two solutions were detected.</p>
<p>We were able to trace back the differences in the results to discrepancies in the
estimates for the variances. Since for our practical purposes the differences were
negligible (relative errors on the scale of maybe up to <span class="math inline">\(1-5\%\)</span>) and tended to
result in a slightly more conservative behaviour of the online algorithm (which
was acceptable in our use case) we decided not to analyse the issue any further.</p>
</div>
<div id="simulation-study" class="section level1">
<h1>Simulation Study</h1>
<p>We assumed that the discrepancies we saw originated from inherent
differences between the variance estimation in the two algorithms (the offline
solution used the standard variance formula). However, after doing some superficial
research into the issue it seemed unlikely that, with the data we had at hand,
the two estimates should differ from each other as much as they did in our test.</p>
<p>Intuitively it seemed more likely that other differences between the two
implementations were the real cause of the discrepancies we witnessed (e.g. the
way the data was stored and retrieved, or a subtle bug in either of the
implementations).</p>
<p>Since I don’t have access to the original data or the implementation anymore I decided
to do a little simulation study using real world datasets to see if I could
replicate discrepancies similar to the numbers we saw in our analysis.</p>
<p>As done previously in the <a href="https://natural-blogarithm.com/post/r-regression-predictor-order/">linear regression post</a>
we will use datasets from the
<a href="https://www.openml.org/search?type=data">OpenML dataset repository</a>. Instead of
R we will be using Python this time as this is the language the original
implementation was done in.</p>
<p>To run the simulation we will use <a href="https://metaflow.org/">Metaflow</a> which is a
framework for automating data science workflows, originally developed by Netflix
but now released as open source software.</p>
<p>Metaflow has a lot of nice features, such as seamless integration with AWS
services, but for our simulation study the most important advantages are the
built-in data management, which makes the analysis of our results much easier,
as well as the automatic parallelisation, which greatly speeds up the runtime
of our simulation.</p>
<p>The code for the Metaflow implementation can be found <a href="https://github.com/jakobludewig/python_projects/blob/main/2021-05-05%20Welfords%20Algorithm%20Analysis/welford_sim_flow.py">here</a>.
Using an appropriate Python environment, the simulation can be started by simply running
the following command in the shell:
<pre class="bash"><code>python welford_sim_flow.py run --random_seed=4214511 --num_datasets=500 --max_obs=10000000 --max_cols=20 --max_cols_large_datasets=5 --n_def_large_datasets=100000 --max-num-splits=500</code></pre>
<p>The definitions of the command line parameters can be looked up
in the <a href="https://github.com/jakobludewig/python_projects/blob/main/2021-05-05%20Welfords%20Algorithm%20Analysis/welford_sim_flow.py#L29">source file of the flow</a>.</p>
<p>The Metaflow run will output status messages with information about the progress
of the simulation:</p>
<div style="text-align:center">
<img src="images/metaflow_output.png" width ="75%" alt ="Example status messages of simulation running in Metaflow"/>
</div>
<p>The individual steps of the process, their execution in parallel as well as the
transfer of the data between them are completely managed by Metaflow.</p>
<p>Metaflow will persist the results of the simulation run (in our case in a local
folder <code>.metaflow/</code> in the same directory as the code files) which can easily be
accessed from a Jupyter Notebook using the
<a href="https://docs.metaflow.org/metaflow/client">Metaflow client API</a>. You can see how
this is done in the
<a href="https://github.com/jakobludewig/python_projects/blob/main/2021-05-05%20Welfords%20Algorithm%20Analysis/Analyse%20Simulation%20Results.ipynb">Jupyter Notebook file</a> containing the code to reproduce the results presented
below.</p>
<p>For our analysis we sampled 500 datasets from OpenML, ranging from 2 to 4,898,431
observations, with a median of 400 observations per dataset.</p>
<p>The plot below shows the distribution of the sample sizes of the datasets on a
log (base 10) scale:</p>
<div style="text-align:center">
<img src="images/dist_obs_dataset.png" width ="75%" alt ="Distribution of the dataset sample sizes used in the simulation"/>
</div>
<p>For each dataset a subset of the numerical columns was sampled and for each of these
columns we calculated the variance twice: once using the implementation from NumPy
(representing the standard variance calculation formula) and once going through
the data sequentially with Welford’s algorithm.</p>
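<p>To make the comparison concrete, here is a minimal Python sketch of the two calculations (the data here is made up; the actual simulation code lives in the flow linked above):</p>

```python
import numpy as np

def welford_variance(xs):
    # Single-pass variance via Welford's updates, tracking the running
    # count n, mean, and sum of squared deviations from the mean (M2).
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n  # population variance, matching NumPy's default ddof=0

rng = np.random.default_rng(42)
data = rng.lognormal(size=10_000)

v_numpy = np.var(data)              # "standard" formula via NumPy
v_welford = welford_variance(data)  # sequential Welford pass
rel_error = abs(v_welford - v_numpy) / v_numpy
```

On well-behaved data like this, the two estimates typically agree to many more digits than the discrepancies discussed below.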
<p>Overall we produced 7935 pairs of variances this way and measured the deviation
between the two by calculating the relative errors (relative to the NumPy
variance estimates).</p>
<p>The distribution of the relative errors is shown in the plot below (note the log
scale on the x-axis):</p>
<div style="text-align:center">
<img src="images/dist_rel_error.png" width ="75%" alt ="Distribution of the relative error between the two variance estimations"/>
</div>
<p>To make sure all the relative errors could be calculated and plotted we shifted
both variance estimates and the relative error numbers by a value of <span class="math inline">\(10^{-20}\)</span>.
Therefore the values in the bucket of relative errors of <span class="math inline">\(10^{-20}\)</span>
correspond to cases in which the two variance estimates matched exactly. Those
cases made up around 30 % of all our simulation results.</p>
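<p>A sketch of how such a shift can be implemented (the exact code in the linked notebook may differ):</p>

```python
EPS = 1e-20  # shift so exact matches land in a bucket a log axis can show

def shifted_relative_error(v_welford, v_numpy, eps=EPS):
    # Shift both estimates and the resulting relative error by eps, so that
    # a relative error of zero (an exact match) becomes eps instead of
    # being undrawable on a log scale.
    v_w, v_n = v_welford + eps, v_numpy + eps
    return abs(v_w - v_n) / v_n + eps
```

With this convention, two identical estimates produce exactly `EPS`, which is the leftmost bucket in the histogram above.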
<p>The maximum relative error we observed was on the order of magnitude of <span class="math inline">\(10^{-7}\)</span>,
the mean and median relative errors turned out to be around <span class="math inline">\(4.9\cdot10^{-10}\)</span> and
<span class="math inline">\(2.9\cdot 10^{-16}\)</span> respectively.</p>
<p>The relative errors we obtained in our simulation are therefore much lower than the
ones we observed during the test of the fraud detection algorithm which motivated
this analysis (the relative errors we saw there were on the scale of <span class="math inline">\(\approx10^{-2}\)</span>).</p>
<p>It could well be that none of the datasets included in our simulation had the
same characteristics as the data that produced the larger relative errors in
our previous analysis. The real-world data that was used in that test tended to
be approximately log-normally distributed. However, running a number of simulated
observations from various log-normal distributions through our simulation setup
described above also failed to replicate relative errors on that scale.</p>
<p>So while we cannot rule out that the original discrepancies were caused by inherent
differences in the variance calculations, it now seems much less likely.</p>
<p>As already mentioned, I no longer have access to the code or the data, but I
would guess that the differences were caused by subtle differences in the
implementations. In particular, the fact that the state variables required for the
Welford updates (<span class="math inline">\(M_{2,n}, \bar{x}_n, n\)</span>) were stored in a MySQL table with
limited floating point precision may have been an issue.</p>
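<p>We can at least illustrate how persisting rounded state could inflate the error. The rounding scheme below is purely hypothetical, since the original setup is no longer available for inspection:</p>

```python
import numpy as np

def welford_variance_rounded(xs, decimals):
    # Welford's algorithm, but with the persisted state (mean, M2) rounded
    # after every update, loosely mimicking state stored in a database
    # column with limited precision. Purely illustrative.
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        mean, m2 = round(mean, decimals), round(m2, decimals)
    return m2 / n

rng = np.random.default_rng(7)
data = rng.lognormal(size=10_000)
exact = np.var(data)
coarse = welford_variance_rounded(data, decimals=3)
rel_error = abs(coarse - exact) / exact
```

Even this mild rounding pushes the relative error well above what two pure float64 implementations would produce, which makes a storage-precision explanation plausible.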
</div>
<div id="theoretical-analyses" class="section level1">
<h1>Theoretical Analyses</h1>
<p>Aside from the results of our simulations it is worthwhile to look at some
more theoretical analyses of the behaviour of the variance estimation methods.</p>
<p>In particular, two sources are worth looking at: <a href="http://cpsc.yale.edu/sites/default/files/files/tr222.pdf">this paper by Chan et al. from 1983</a>
(which is also referenced in the Wikipedia article) and this
<a href="https://www.johndcook.com/blog/2008/09/28/theoretical-explanation-for-numerical-results/">blog post by John Cook</a>.</p>
<p>Both articles investigate which algorithm best approximates the true, known variance of
simulated samples (whereas our analysis focused on whether they differ from
each other without making any claims which of the two is more precise).</p>
<p>The Chan paper presents an overview of different algorithms for calculating the
variance and points out some potential problems with some of them. In particular
they derive theoretical bounds for the errors one can expect with each
of the methods depending on the sample size, machine precision and the condition
number of the dataset <span class="math inline">\(\kappa\)</span> (a measure of how sensitive the variance estimation
is to the data it encounters). They also validate their theoretical bounds in a
simulation study (using sample sizes of <span class="math inline">\(N=64\)</span> and <span class="math inline">\(N=4096\)</span>).</p>
<p>Interestingly, their findings suggest that Welford’s algorithm yields comparatively
higher error rates than most of the other candidates (out of which the one referred to as
<em>two-pass with pairwise summation</em> in the paper might be the closest match for the variance
estimation from NumPy).</p>
<p>However, some of the findings and recommendations in the paper seem a bit dated,
as most modern toolboxes for numerical calculations apply many of the recommended
improvements by default (e.g. pairwise summation to avoid rounding errors).
It might therefore be difficult to translate their findings to modern
implementations.</p>
<p>A more up-to-date analysis of the accuracy of Welford’s algorithm is the blog post
by John Cook (which is actually part three in a series of posts). In it he compares
how well Welford’s algorithm and the standard formula (which he refers to as
the <em>direct formula</em>) approximate the theoretical variance of some simulated observations.
Based on his C++ implementation he concludes that both formulas perform similarly
well, even giving Welford’s algorithm a slight edge.</p>
<p>In his analysis he is also able to generate datasets for which the two algorithms
deviate both from the theoretical variance and from each other up to a
certain number of digits. However, the cases he presents consist of relatively small numbers
(sampled from a uniform distribution between zero and one) shifted by huge values
(up to <span class="math inline">\(10^{12}\)</span>). These results are interesting and theoretically
valuable, but they do not help explain the discrepancies we saw
in our outlier detection scenario, where the data tended to be much more well-behaved.</p>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>In our analysis we compared the results of two different ways of calculating the
variance based on a simulation with 500 datasets. We concluded that the discrepancies
between them lie in a range which makes them interchangeable for most practical
applications. In particular we were not able to reproduce any discrepancies on the
order of magnitude that were seen in a previous implementation which motivated
this post, hinting that these results were probably caused by some other
implementation difference.</p>
</div>
Modelling the Relationship Between Restaurant Review Texts and Point Ratings
https://natural-blogarithm.com/post/restaurant-reviews-modelling/
Tue, 27 Apr 2021 00:00:00 +0000https://natural-blogarithm.com/post/restaurant-reviews-modelling/
<script src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>In our <a href="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/">last post</a>
we used sentiment analysis to find out whether there is a difference in the
relationship between the wording of restaurant review texts and the associated
point score between Stockholm and Berlin.</p>
<p>Today we will revisit this subject using a different approach. We will fit
statistical models to our data to formalise the relationship between the review
texts and the associated point ratings.</p>
<p>We can then use the predictions and coefficients of these models to answer the
following question:</p>
<p><em>Is a review text of a five point rating from Berlin likely to receive a lower rating in Stockholm?</em></p>
<p>A positive answer to this question can be regarded as supporting evidence for
our subjective observation that there are many review texts in Stockholm
corresponding to reviews with less than five points that would have received
the full five point rating in Berlin.</p>
<div id="modelling-the-target-variable" class="section level1">
<h1>Modelling the Target Variable</h1>
<p>For this analysis we will fit
<a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression models</a>
which are used to classify observations into two distinct classes. In our case
these two classes will be the five point reviews (which we will also refer to as
the <em>positive class</em>) and the reviews of less than five points (the <em>negative class</em>).</p>
<p>It should be noted that for data on a fixed scale that has a meaningful ordering
(such as the point ratings we are dealing with) it would be more appropriate to use
an <a href="https://en.wikipedia.org/wiki/Ordinal_regression">ordinal regression model</a>.
However, since for our analysis we are only interested in the distinction between
five point ratings and ratings with less than five points, we will go with the
simpler logistic regression model.</p>
</div>
<div id="building-numerical-features-from-the-review-texts" class="section level1">
<h1>Building Numerical Features from the Review Texts</h1>
<p>One of the key ingredients in our model building process is the way we transform the
review texts to numerical data that can be used as predictor variables in our
models. For our analysis we will fit three types of models whose main differences
lie in how this transformation is done.</p>
<p>For the first model we will use a simple <em>bag of words</em> type transformation to
generate the features. Using this approach the text data is converted into a
vector in which each entry corresponds to a word (or combination of words) from a
predefined list (the <em>dictionary</em>). The entry will be zero if a given word is not present
in the review text and non-zero if it is (e.g. it could
be <span class="math inline">\(2\)</span> if the word occurs twice in the text).</p>
<p>Arranging these vectors into rows will produce the <em>document term matrix (DTM)</em>
in which each row corresponds to one review text and the columns to a word or
word combination. This matrix along with the target variable described in the
previous section will be used to fit our logistic regression model.</p>
<p>The process to create the DTM is a bit more complicated than described above and
involves, for example, the removal of certain unimportant words or words that occur
too rarely to make any difference in the model. In our implementation (which can
be found
<a href="https://github.com/jakobludewig/r_projects/blob/main/2021-03-17%20Restaurant%20Reviews%20Analysis/03%20-%20Modelling.R">here</a>)
we used the <code>text2vec</code> package and followed the steps outlined in its
<a href="https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html">vignette</a>.</p>
<p>We deviated from the recommendations in the vignette with respect to one important aspect:
Instead of the <em>TF-IDF</em> transformation we applied the simpler <em>L1-normalisation</em> that
converts the word counts into the relative frequencies with which a given word occurs
in a text.</p>
<p>The reason for that is that the TF-IDF transformation includes factors that are
calculated from the whole dataset. Therefore its results will vary depending on which dataset
the transformation is applied on. This is an undesirable property for our purposes
as in the next section we will fit two separate models: one on the data from
Berlin and one on the data from Stockholm. We will then use both models to predict
on the data from Berlin.</p>
<p>However, this can only be done if the features we put into the models have been
calculated in the same way for both datasets. This is not the case for the TF-IDF
normalisation which will produce different values for the same review text, depending
on which dataset it is in.</p>
<p>The L1 normalisation on the other hand can be applied independently on each
observation and therefore will produce compatible predictor variables.</p>
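<p>A minimal sketch of the L1 normalisation for a single row of word counts:</p>

```python
# L1-normalising one review's word counts into relative frequencies.
# Unlike TF-IDF, this uses no corpus-level statistics, so a review text
# gets the same feature values no matter which dataset it appears in.
counts = [2, 1, 0, 1]                    # word counts for a single review
total = sum(counts)
rel_freqs = [c / total for c in counts]  # [0.5, 0.25, 0.0, 0.25]
```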
</div>
<div id="regularisation-for-feature-selection" class="section level1">
<h1>Regularisation for Feature Selection</h1>
<p>The model we will fit is a <em>logistic regression model with lasso regularisation (L1)</em>.
This means that during the training process an additional penalty term is introduced
into the objective function that will prevent the model from fitting too many
coefficients. As a result we will end up with a model with fewer (non-zero)
coefficients, which will be easier to interpret and analyse. This is a big advantage,
as our models will tend to have many potential predictor variables (the models in
the next section will have 2893 potential predictors).</p>
<p>How much the model is being penalised for fitting non-zero coefficients (i.e.
the strength of the regularisation) will be automatically chosen as a hyperparameter
during the training process using cross validation.</p>
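<p>As a rough Python equivalent of this setup (the post itself uses the R <code>glmnet</code> package; the data below is synthetic), a cross-validated lasso logistic regression can be sketched as:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # stand-in for the normalised document term matrix
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Cs=10 candidate penalty strengths, evaluated with 5-fold cross validation;
# the liblinear solver supports the L1 (lasso) penalty.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
model.fit(X, y)
n_nonzero = int(np.sum(model.coef_ != 0))  # the lasso zeroes out weak predictors
```

The regularisation strength is chosen automatically via cross validation, mirroring what the post describes, and the fitted model keeps only a subset of non-zero coefficients.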
</div>
<div id="separate-models-for-each-city" class="section level1">
<h1>Separate Models for Each City</h1>
<p>Our first approach to formalise the relationship between the review text and
the point ratings in the two cities will be to fit one model for each city separately.</p>
<p>To be precise the models for each city will fit the linear relationship between the
<a href="https://en.wikipedia.org/wiki/Logit">log-odds</a> <span class="math inline">\(l\)</span> of an observation being a
five point review and the predictor variables. This relationship can be written down
as:</p>
<p><span class="math display">\[l = log(\frac{p}{1-p}) = \beta_{0} + \beta_{1}\cdot X_1 + \dots + \beta_{n}\cdot X_n \]</span></p>
<p>In this formulation</p>
<ul>
<li><span class="math inline">\(p\)</span> refers to the probability of an observation being a five point review</li>
<li><span class="math inline">\(\beta_{0}\)</span> refers to the intercept of the model (which incorporates the “average” probability of a five point review)</li>
<li><span class="math inline">\(X_{i}\)</span> is the value for the i-th word or n-gram as encoded in the document term matrix</li>
<li><span class="math inline">\(\beta_{i}\)</span> is the coefficient for variable <span class="math inline">\(X_{i}\)</span></li>
</ul>
<p>Our main quantity of interest will be the values that our models predict for <span class="math inline">\(p\)</span>
which can be recovered by a simple transformation of the log-odds.</p>
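<p>That transformation is the standard logistic (inverse logit) function, which in code looks like this:</p>

```python
import math

def prob_from_log_odds(l):
    # invert l = log(p / (1 - p))  ->  p = 1 / (1 + exp(-l))
    return 1.0 / (1.0 + math.exp(-l))

prob_from_log_odds(0.0)  # log-odds of zero correspond to p = 0.5
```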
<p>The models for each city share the same model formulation but since we will fit
them on different datasets (the data from the respective city) they will have
different coefficients.</p>
<div id="effect-of-the-intercept" class="section level2">
<h2>Effect of the Intercept</h2>
<p>In our
<a href="https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/">first post about the restaurant reviews data</a>
we saw that there is a big difference in the proportion of five point reviews
between the two cities: around 47 % in Stockholm vs about 64 % in Berlin.</p>
<p>This will present a problem for our separate models approach, which can be seen when
passing some example review data through each of the two models:</p>
<pre><code>## # A tibble: 6 x 3
## review_text Berlin_model Stockholm_model
## <chr> <dbl> <dbl>
## 1 "This is a great restaurant, I really enjoyed th… 0.902 0.742
## 2 "One of the best restaurants I have ever been. A… 0.947 0.940
## 3 "The staff is super friendly and the food is mag… 0.927 0.760
## 4 "If your looking for the best food in town, this… 0.842 0.791
## 5 "Had a brilliant time here with my family. The f… 0.889 0.785
## 6 "" 0.677 0.432</code></pre>
<p>For most of these example reviews the predictions for a five point review from
the Berlin model are considerably higher than those from the Stockholm model (on average around 12 %)
and we might be inclined to read this as substantial evidence for our
research hypothesis.</p>
<p>The problem becomes clear when we look at the last example review which
is just an empty text string. Even for this data the Berlin model predicts a
24.5 % higher probability for a five point review. This property of our
models is certainly undesirable when we want to analyse how a given review text
might result in different point ratings.</p>
<p>The reason for this issue is that logistic regression models are fit in such a
way that the average prediction will be approximately calibrated to the proportion
of the positive class. Since this number is higher in Berlin than in Stockholm, so
will be the predictions of the Berlin model, even on an empty text. This issue
manifests itself in the intercept term <span class="math inline">\(\beta_0\)</span> in our models and makes the
comparison of the predictions difficult.</p>
<p>We can remedy the situation by artificially giving the five point reviews in
Stockholm more weight during the training process so that their proportion will
be the same as in Berlin.</p>
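<p>One simple way to derive such a weight is to solve for the factor that lifts Stockholm’s share of five point reviews to Berlin’s level (the exact weighting scheme used in the linked modelling code may differ):</p>

```python
# Case weight w for Stockholm's five point reviews so that their weighted
# share matches Berlin's share of five point reviews.
p_berlin = 0.64     # share of five point reviews in Berlin (earlier post)
p_stockholm = 0.47  # share of five point reviews in Stockholm

# Solve  w * p_s / (w * p_s + (1 - p_s)) = p_b  for w:
w = (p_berlin * (1 - p_stockholm)) / ((1 - p_berlin) * p_stockholm)

weighted_share = w * p_stockholm / (w * p_stockholm + (1 - p_stockholm))
```

With these proportions the positive class in Stockholm gets roughly double weight, which aligns the two models’ baseline prediction levels.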
<p>By doing so the average predictions of the two models will have the same level
and the difference between their predictions will have a more meaningful interpretation
for our analysis:</p>
<pre><code>## # A tibble: 6 x 3
## review_text Berlin_model_corre… Stockholm_model_co…
## <chr> <dbl> <dbl>
## 1 "This is a great restaurant, I really… 0.902 0.863
## 2 "One of the best restaurants I have e… 0.947 0.953
## 3 "The staff is super friendly and the … 0.927 0.883
## 4 "If your looking for the best food in… 0.842 0.865
## 5 "Had a brilliant time here with my fa… 0.889 0.870
## 6 "" 0.677 0.672</code></pre>
<p>With the reweighting the two models now give very similar predictions on the empty
review texts. Also the big differences we saw previously on some of the other
example texts have now gone down substantially and in some cases even reversed.</p>
</div>
<div id="differences-in-predictions-between-the-two-models" class="section level2">
<h2>Differences in Predictions Between the Two Models</h2>
<p>Now that we have two comparable models we query them to try and answer our initial
question about whether five point ratings from Berlin are less likely to receive
a five point score in Stockholm.</p>
<p>For this we simply run the five point reviews from Berlin through both of the
models in the same way as we did for the example reviews above and analyse the
differences in their predictions.</p>
<p>We can get a visual impression of the overall difference by plotting the resulting
predictions against each other in a scatter plot:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/scatterplot_predictions_resampling-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Scatter Plot of predictions from Stockholm model against Berlin model"/>
</div>
<p>Each point in this plot corresponds to a five point review from Berlin. Its
location on the x-axis is determined by the predicted probability for a five
point rating according to the Berlin model and its position on the y-axis
corresponds to the probability for a five point rating assigned by the
Stockholm model.</p>
<p>This means that for all observations that are plotted below the blue identity
line the Berlin model is predicting a higher probability than the Stockholm model.</p>
<p>It is a bit hard to make out, but the concentration of points in the upper right
quadrant of the plot might suggest that overall the average predictions from the
Berlin model are higher than those from the Stockholm model.</p>
<p>And indeed this is the case: The average difference across all the five point
reviews from Berlin turns out to be about 6.6 %, which seems to validate our
subjective impression.</p>
</div>
<div id="analysing-the-coefficients" class="section level2">
<h2>Analysing the Coefficients</h2>
<p>Other interesting patterns in the plot are the vertical and horizontal lines of
points that meet around the coordinate <span class="math inline">\((67 \%, 67 \%)\)</span>. These are formed by
observations for which either the Berlin model (vertical line) or the Stockholm
model (horizontal line) only predict the intercept probability. This means
that the corresponding review texts did not contain any words that have a
non-zero coefficient in the respective model.</p>
<p>In general it will be interesting to have a closer look at the coefficients of
the two models to understand how they arrive at their different assessments and
which pieces of texts are informing their prediction.</p>
<p>First let’s look at the words and phrases that have high positive or negative coefficients
in both models:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/unnamed-chunk-1-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Coefficients with high impact in both models"/>
</div>
<p>The plot above shows the words and phrases which have a high positive (left
panel) or a high negative (right panel) coefficient in both models. They have
been selected by sorting the coefficients for both models separately and
calculating the average rank between the two models.</p>
<p>Note that certain stop words (such as <em>the</em>, <em>we</em>, <em>had</em>, <em>to</em>, etc.) were
removed from the text before generating the text features. Therefore expressions
such as <em>wait_long_time</em> refer to pieces of texts that originally read something
like <em>“we had to wait a long time”</em> or similar.</p>
<p>All of the words and phrases on the two lists intuitively make sense: you would
expect their presence in a review text to have a strong positive or negative
impact on the probability of a five point review.</p>
<p>This is a good sense check for our model. The only two entries on the list which
might raise our eyebrows are the words <em>however</em> and <em>otherwise</em>. But it is
plausible to assume that they are frequently used in qualifying statements in which
users explain why they did not give the full five point score
(e.g. <em>“We had a good night however the noise level in the restaurant was uncomfortable”</em>).</p>
<p>On the left side we see a lot of overwhelmingly positive words as well as
expressions of fulfilled expectations (<em>not_disappointed</em>, <em>never_disappointed</em>).
Also phrases of intent to return to the restaurant seem to have a high positive
impact on the probability of a five point review (<em>not_last</em>, <em>definitely_back</em>).</p>
<p>On the negative side we can see some specific criticism of certain aspects of
the restaurant experience (<em>wait_long_time</em>, <em>unprofessional</em>) or the food
(<em>dry</em>, <em>lukewarm</em>) mixed in with some more general negative expressions (<em>bad</em>,
<em>poor</em>, <em>worst</em>, etc.)</p>
<p>Now let’s look at the coefficients on which the models disagree. The lists below
contain words and phrases that have either highly positive or highly negative coefficients in
the Berlin model but values of zero (indicating no effect) in the Stockholm model:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/coefficients_comparison_resampling_3-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Coefficients with high impact in Berlin model and no impact in Stockholm model"/>
</div>
<p>One thing that stands out on the list on the left side is the presence of words
and phrases referring to the taste of the dishes ( <em>can_taste</em>, <em>tasted_great</em>,
<em>dishes_tasty</em>). Another cluster can be formed around expressions involving the
word <em>recommend</em> (<em>clear_recommendation</em>, <em>can_recommend</em>, <em>absolute_recommendation</em>).</p>
<p>On the negative side we have a lot of phrases that refer to the point rating the
user is giving (<em>one_star</em>, <em>two_stars</em>, <em>4_stars</em>, <em>star_deduction</em>, <em>deduction</em>)
for which a strong negative impact on the probability for a five point rating is
very plausible.</p>
<p>A lot of the other negative factors that might lead to a low rating in Berlin seem
to revolve around the service (<em>unfriendly</em>, <em>overwhelmed</em>, <em>behavior</em>,
<em>stressed</em>).</p>
<p>A curious entry on the list is the word <em>mushrooms</em> which could be interpreted in
different ways (maybe there is a high frequency of bad mushrooms in Berlin restaurants
or a tendency to ignore customers’ requests to exclude mushrooms from their meal).</p>
<p>Finally let’s look at the coefficients that have high impact in the Stockholm model
but no impact in the Berlin model:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/coefficients_comparison_resampling_5-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Coefficients with high impact in Stockholm model and no impact in Berlin model"/>
</div>
<p>The list of positive coefficients in Stockholm seems to consist of a mixed bag
of positive expressions with no clear unifying topic.</p>
<p>The entry <em>sickly_good</em> looks like it might be the result of an improper translation
of a Swedish idiom.</p>
<p>In addition to that it is not immediately clear why the word <em>dissatisfied</em> pops
up on the list of the positive coefficients.</p>
<p>On the negative side the entries <em>took_long</em>, <em>small_portions</em> and (curiously)
<em>cutlery</em> describe very specific aspects of the restaurant experience that seem
to be important for users in Stockholm. Apart from that the list contains a lot
of generic negative statements such as <em>awful</em>, <em>not_recommend</em>, or <em>not_worth</em>.</p>
</div>
</div>
<div id="limitations-of-the-bag-of-words-features" class="section level1">
<h1>Limitations of the Bag of Words Features</h1>
<p>In our analysis of the coefficients above we have found some interesting differences
between the two models and we might be tempted to generalise them into statements
outside the realm of model analysis.</p>
<p>While these generalisations are always extremely difficult and require a thorough
understanding of the data generating process, they are especially difficult to make
with the models we have built as we will see below.</p>
<p>For example, we could be inclined to translate the fact that the coefficients for
<em>tasted_great</em>, <em>dishes_tasty</em>, and <em>can_taste</em> are highly positive in the
Berlin model but zero in the Stockholm model into a bold statement such as the
following:</p>
<blockquote>
<p>People in Stockholm do not care about the taste of their food (but people in Berlin do).</p>
</blockquote>
<p>Even if the sample of restaurant reviews that we used to train our model was in theory
good enough to make these kinds of inferences (which it probably is not in our case
and in general rarely ever is) we should remind ourselves about how exactly
we constructed the features we used in our model from the raw review texts.</p>
<p>As described above, with the simple bag of words features we are just counting
how frequently a word appears in a given text. The important point to stress here is
that a variable will only account for the presence of an <em>exact word or phrase</em>.</p>
<p>So while it may be true that the exact expression <em>great_taste</em> is not important
in the Stockholm model, it may well be the case that other positive descriptions
of food taste are important. And in fact the expressions <em>tastiest</em> and
<em>fresh_tasty</em> are examples of coefficients that have positive values in the Stockholm
model while they are zero in the Berlin model.</p>
<p>Therefore we have to be aware that the validity of our analysis is largely
influenced by the <em>compatibility of the vocabularies</em> used in the reviews
from the two cities. If there is a tendency to describe the same issues with
different words or expressions this will pose a problem.</p>
<p>This issue is especially relevant in our case since we are working with review
texts that were machine translated from Swedish and German to English. Words
with very similar or identical meaning could be translated to different words in
English as can be seen in the example below:</p>
<div style="text-align:center">
<img src="images/google_translate_example.png" width ="75%" alt ="Example of words with the same original meaning being translated to different English words"/>
</div>
<p>Even though the two original words basically have identical meaning they will
be encoded in separate variables with no way for the model to make any connections
between them.</p>
<p>Ideally we would want to use features in our model that go beyond the simple
counting of words in a text and are powerful enough to encode
<em>semantic information and relationships</em> in different pieces of text.</p>
<p>This will make the comparison of the two models more meaningful and also
improve the predictive performance of the models themselves as the data can be
used more efficiently.</p>
</div>
<div id="unified-model" class="section level1">
<h1>Unified Model</h1>
<p>Luckily, these kinds of features do exist and we will use them in our final model
in the next section. But first we will revisit our modelling approach.</p>
<p>While intuitively it makes sense to build two separate models, each describing
the rating process in the two cities separately, it may not be the most elegant
or standard approach.</p>
<p>An alternative way to approach our problem is to formulate one single, unified
model which we train on all the data and explicitly use the coefficients of the
model to describe the differences between the two cities.</p>
<p>Such a model could be formulated in the following way:</p>
<p><span class="math display">\[l = \beta_{0} + \beta_{0,city} \cdot X_{city} + \beta_{1}\cdot X_1 + \dots + \beta_{n}\cdot X_n + \beta_{1,city}\cdot X_1 \cdot X_{city} + \dots + \beta_{n,city}\cdot X_n \cdot X_{city}\]</span></p>
<p>The coefficients and variables <span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_{i}\)</span> and <span class="math inline">\(X_i\)</span> have the same
meaning as in our previous formulation.</p>
<p>The key difference compared to our previous model is the introduction of the
city dummy variable <span class="math inline">\(X_{city}\)</span>. For the data from Berlin this variable will have
a value of zero and therefore the model will reduce to our previous model’s
formulation.</p>
<p>For data from Stockholm on the other hand the value of <span class="math inline">\(X_{city}\)</span> will be <span class="math inline">\(1\)</span> which
means we will use that data to fit the additional coefficients as offsets from
the Berlin data.</p>
<p>The city specific intercept <span class="math inline">\(\beta_{0,city}\)</span> will describe the difference in the
overall incidence of five point reviews between Stockholm and Berlin. Therefore this
variable will capture the difference we have previously accounted for by
reweighting, which we can now drop (we will just have to make sure we use the same
intercept when measuring the prediction differences).</p>
<p>The other additional coefficients <span class="math inline">\(\beta_{i,city}\)</span> describe the interaction between
the city dummy <span class="math inline">\(X_{city}\)</span> and our word feature variables <span class="math inline">\(X_i\)</span>. This means they
will describe how the effect of a given word on the probability for a five point
review differs between Stockholm and Berlin. For example, if the word <span class="math inline">\(i\)</span> has a
more positive effect on the rating for reviews from Stockholm than for those from
Berlin, the coefficient <span class="math inline">\(\beta_{i,city}\)</span> will be positive.</p>
<p>This formulation also makes the analysis of coefficients much simpler as we can
simply look at the offsets <span class="math inline">\(\beta_{i,city}\)</span>.</p>
<p>The coefficients <span class="math inline">\(\beta_{i}\)</span> are also called the <em>main effects</em> and the coefficients
<span class="math inline">\(\beta_{i,city}\)</span> are referred to as the <em>(city-)interaction effects</em>.</p>
<p>Fitting a model as described above is a bit more complex, especially when working
with the <code>glmnet</code> package as we will have to build the design matrix which encodes
the data for the logistic regression model manually. If you are interested in how
this is done exactly please <a href="https://github.com/jakobludewig/r_projects/blob/a1be58194933d0d6e6a8daf41db2d55db29edc72/2021-03-17%20Restaurant%20Reviews%20Analysis/03%20-%20Modelling.R#L168">check out the code here</a>.</p>
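<p>Purely to illustrate the column layout described above, here is a minimal Python
sketch of how one row of such a design matrix could be assembled (the actual
analysis was done in R with <code>glmnet</code>; the <code>design_row</code> helper
and the toy feature values are made up for this example):</p>

```python
def design_row(features, is_stockholm):
    """Build one row of the design matrix for the unified model.

    Column layout follows the formula above: intercept, city dummy,
    main-effect features, then city-interaction features.
    """
    x_city = 1 if is_stockholm else 0
    interactions = [x_city * x for x in features]
    return [1, x_city] + list(features) + interactions

# A Berlin review (city dummy 0): the interaction columns are all zero,
# so the row reduces to the Berlin-only formulation.
print(design_row([2, 0, 1], is_stockholm=False))  # [1, 0, 2, 0, 1, 0, 0, 0]
# A Stockholm review (city dummy 1): the interactions copy the features.
print(design_row([2, 0, 1], is_stockholm=True))   # [1, 1, 2, 0, 1, 2, 0, 1]
```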
<div id="results-from-the-unified-model" class="section level2">
<h2>Results From the Unified Model</h2>
<p>As we have done before we will estimate the difference in the text to rating
relationship by looking at the differences in predictions on the five point
ratings from Berlin.</p>
<p>To obtain these predictions we will first use the Berlin coefficients and then the
Stockholm coefficients. For both predictions we will keep the Berlin specific
intercept as otherwise we will reproduce the difference that is caused by the higher
prevalence of five point ratings in Berlin.</p>
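<p>To make this concrete, the following Python sketch shows the prediction logic
with and without the Stockholm interaction offsets while always keeping the same
intercept. All coefficient values are made up for illustration and are not the
fitted ones:</p>

```python
import math

def sigmoid(logit):
    """Map a logit to a probability."""
    return 1 / (1 + math.exp(-logit))

def predict(features, beta0, betas, city_offsets, use_city_offsets):
    """Predict P(five points), optionally adding the Stockholm interaction
    offsets while always using the same (Berlin) intercept."""
    logit = beta0 + sum(b * x for b, x in zip(betas, features))
    if use_city_offsets:
        logit += sum(d * x for d, x in zip(city_offsets, features))
    return sigmoid(logit)

# Toy coefficients, made up for illustration:
beta0, betas, offsets = -1.0, [0.8, -0.3], [0.4, 0.0]
x = [1, 1]
p_berlin = predict(x, beta0, betas, offsets, use_city_offsets=False)
p_stockholm = predict(x, beta0, betas, offsets, use_city_offsets=True)
```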
<p>Using these predictions the scatter plot from before would look as follows:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/scatterplot_predictions_unified_model-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Scatter Plot of Stockholm predictions against Berlin predictions from unified model"/>
</div>
<p>The plot now seems much more symmetric and visually it seems like the difference
we have seen previously is gone. And in fact, if we calculate the average of the
prediction differences it now turns out to be around
-0.3 % (while with the two separate models
it was around 6.6 %).</p>
<p>So how come that our second model is not replicating the findings of our first model?</p>
<p>If we look at the coefficients of our model we can see that only around
5 % of
the interaction terms <span class="math inline">\(\beta_{i,city}\)</span> (i.e. the ones describing the Stockholm
specific effects) have non-zero values, compared to
26 % of
the main effects <span class="math inline">\(\beta_{i}\)</span>.</p>
<p>Given this imbalance we should not be surprised that the predictions for the
two cities are so similar: There are just very few ways (i.e. non-zero
coefficients) in which a prediction using the Stockholm coefficients can differ
from the predictions for Berlin.</p>
<p>What is more, the average magnitude of the non-zero Stockholm coefficients is
around half that of the Berlin coefficients (the main effects). That means that
even if the predictions differ, we can expect the differences to be relatively small.</p>
<p>One reason why we see so few non-zero coefficients for the interaction terms
may be the regularisation that we use in our model. As already mentioned above, the
regularisation will limit the number of non-zero coefficients in our model. In
general, the coefficients that help most to improve the predictive performance of
the model, as measured on the whole dataset, will be favoured.</p>
<p>Our dataset, however, is still biased towards observations from Berlin with a
ratio of roughly two to one. This means that it will be more difficult for the
coefficients corresponding to the interaction terms to obtain non-zero values,
simply because they can only improve the prediction performance on a smaller
portion of the data (i.e. the observations from Stockholm).</p>
<p>So we have some reason to believe that the way we trained our unified model may
not really be suitable to answer the question that we are trying to answer. One
way to potentially improve its behaviour would be to remove the imbalance between
Berlin and Stockholm in a similar way that we have done in the first model to
calibrate the incidence of five point reviews in the Stockholm sample to that
of the Berlin sample (i.e. through reweighting the data).</p>
<p>Alternatively we could decrease the strength of the regularisation to allow more
Stockholm specific coefficients to obtain non-zero values, or even turn the
regularisation off altogether.</p>
</div>
</div>
<div id="unified-model-with-embeddings" class="section level1">
<h1>Unified Model with Embeddings</h1>
<p>In fact we will do the latter but first we will replace our simple bag of words
features with the more complex <em>embeddings</em> features which we already hinted at
in the previous section.</p>
<p>In a nutshell the idea behind embeddings is to represent words as vectors of a
given dimension <span class="math inline">\(n\)</span>. This can be done in different ways but the general idea is
to construct them in such a manner that words that are semantically similar to
each other are represented by vectors that lie close to each other in the <span class="math inline">\(n\)</span>
dimensional vector space.</p>
<p>So for example, we would expect word pairs such as <em>tasty</em> and <em>delicious</em> or
<em>fantastic</em> and <em>wonderful</em> to have a small distance from each other in the
<span class="math inline">\(n\)</span>-dimensional vector space in which we embed them.</p>
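<p>This notion of closeness can be made concrete with the cosine similarity of two
vectors. The following Python sketch uses tiny made-up three-dimensional vectors
purely for illustration (real embeddings have hundreds of dimensions):</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: 1 for parallel vectors, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, made up for illustration only:
emb = {
    "tasty":     [0.9, 0.8, 0.1],
    "delicious": [0.8, 0.9, 0.2],
    "bus":       [0.1, 0.0, 0.9],
}

# Semantically similar words should score close to 1,
# unrelated words noticeably lower.
sim_close = cosine_similarity(emb["tasty"], emb["delicious"])
sim_far = cosine_similarity(emb["tasty"], emb["bus"])
```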
<p>This should help us overcome some of the issues we discussed about our bag of
words features above as now the model will be able to pick up on the semantic
similarities between those words: these words will tend to have similar values in
the entries of the vectors that represent them.</p>
<p>Given embeddings for every word in our text an embedding for a whole piece of text
(e.g. our review texts) can be generated by aggregating all its word embeddings.
These aggregated vectors will serve as the inputs for our model.</p>
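<p>A minimal Python sketch of this aggregation step, using mean pooling over
made-up two-dimensional word vectors (the <code>text_embedding</code> helper is
hypothetical and only illustrates the idea):</p>

```python
def text_embedding(tokens, word_embeddings):
    """Aggregate word vectors into one text vector by averaging the
    embeddings of all known words (a common, simple pooling choice)."""
    vectors = [word_embeddings[t] for t in tokens if t in word_embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy word vectors, made up for illustration; "very" is out of vocabulary
# and simply contributes nothing to the average.
word_embeddings = {"good": [1.0, 0.0], "pasta": [0.0, 1.0]}
print(text_embedding(["very", "good", "pasta"], word_embeddings))  # [0.5, 0.5]
```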
<p>This was obviously a very superficial introduction to embeddings, so if you are
interested in more details you can take a look at a more thorough introduction
such as <a href="http://web.stanford.edu/class/cs224n/">this lecture</a>.</p>
<p>In it you will also find descriptions of some fascinating behaviour that these
embeddings can exhibit (for example, see slide number 21 of
<a href="http://web.stanford.edu/class/cs224n/slides/cs224n-2021-lecture02-wordvecs2.pdf">this document</a>).
In the slides you can also find an explanation of how the embeddings can be interpreted
as complex transformations of input vectors that are very similar to our bag of
words vectors.</p>
<div id="embeddings-from-a-pre-trained-neural-network" class="section level2">
<h2>Embeddings from a Pre-Trained Neural Network</h2>
<p>For our model we will not construct our own embeddings as this tends to be a
computationally expensive process. Instead we will extract the embeddings from a
pre-trained neural network, in this case the
<a href="https://huggingface.co/bert-base-uncased">bert-base-uncased</a> model which is a
powerful transformer model that was trained on a huge corpus of English language
data.</p>
<p>Specifically the BERT model was trained to recover masked words in a sentence or to
predict whether a given sentence was followed by another. While this goal is not
necessarily related to the classification problem we are trying to solve, pre-trained
neural networks tend to perform well even when applied to other tasks (a concept
called <em>transfer learning</em>). The reason behind this is that sufficiently complex
networks tend to learn features that represent general concepts about their input
data and can therefore be useful in many other scenarios (e.g. neural networks
used in image classification tasks learn features that correspond to concepts
like edges or round shapes).</p>
<p>The embeddings we are going to extract correspond to the values of the nodes in
the last hidden layer of the pre-trained network. We will again use them as the
inputs to our logistic regression model. But since the final layer in a neural
network for binary classification essentially is a logistic regression model (with
the inputs being the values of the nodes in the next-to-last layer) we are
effectively building a neural network (even if we are not re-fitting any of the
lower layers of the pre-trained BERT network we are using).</p>
<p>This interpretation of our new model also gives us an additional justification to
turn off the regularisation as it will be more in line with the concepts of neural
networks.</p>
<p>Accessing the pre-trained model and the extraction of the embeddings will be
done through the <a href="https://github.com/flairNLP/flair">flair framework</a>
that we already used in our
<a href="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/">sentiment analysis post</a>.</p>
<p>We will construct our model in the same way as the unified model from before with
main effects corresponding to the overall estimates and interaction effects with
the city dummy variable to describe the differences in the estimates for the
Stockholm data. However, instead of the word frequency counts we will use the
768-dimensional embeddings extracted from the neural network as our input features <span class="math inline">\(X_i\)</span>.</p>
</div>
<div id="results-from-the-embeddings-model" class="section level2">
<h2>Results from the Embeddings Model</h2>
<p>Using our new neural-network powered logistic regression model our scatter plot of
predictions turns out as follows:</p>
<img src="https://natural-blogarithm.com/post/restaurant-reviews-modelling/index_files/figure-html/scatterplot_predictions_embeddings-1.png" width="90%" style="display: block; margin: auto;" />
<div style="text-align:center">
<img src="https://natural-blogarithm.com/./images/invisible_pixel.png" alt = "Scatter Plot of Stockholm predictions against Berlin predictions from embeddings model"/>
</div>
<p>Compared to our previous model the predictions show much more variance resulting
in a much more widely distributed scatter plot. This makes it slightly harder to
work out if there is a pattern. But taking a closer look at the top right part of
the plot one can get the impression that the average prediction difference between
Berlin and Stockholm might be positive.</p>
<p>And indeed that is the case: on average the Berlin predictions are
around 4.2 %
higher than the ones from Stockholm. This puts the results of our third model
between the differences we saw from our first model
(6.6 %) and our second model
(-0.3 %).</p>
</div>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>Now that we have built three different types of models, each of them producing
different results, we need to ask ourselves which of them we should trust.</p>
<p>The statistical aphorism “All models are wrong (but some are useful)” comes to
mind and indeed choosing the “correct” model is not a trivial task.</p>
<p>Of the three the first one surely is the easiest to understand and interpret which
is a quality that should not be underestimated, especially in an explorative setting
such as ours.</p>
<p>However, we have identified some clear limitations of the features of that model
which have raised serious doubts about the validity of some of its results.</p>
<p>The same criticism can be applied to the second model since it is basically
using the same features. On top of that we have seen that the regularisation may
play an undue role in this model, potentially hiding actually existing differences
between the rating processes in the two cities.</p>
<p>Our third model, using the complex embeddings features certainly looks like the
most powerful model and this seems to be backed up by its higher performance
as measured during the model validation phase. This performance, however, was not
substantially higher than that of the other models, and it comes at the cost of
interpretability.</p>
<p>Finally it could be argued that each model represents another perspective
on the data, with none of them being necessarily <em>right</em> or <em>wrong</em>. Therefore
in the end one could even consider combining these different perspectives in some
kind of <em>ensemble approach</em>, taking the average of the three results as our final
result (putting the difference in predictions somewhere around the 3 % mark).</p>
<p>In the end we could put a lot more effort into finding the right model to try
and get an exhaustive and perfectly reliable answer to our initial research
question. For now we will leave it at that and be content with the answers we have
found and the things we have learned in the process of finding them.</p>
</div>
Visualising the Berlin Train Network
https://natural-blogarithm.com/post/berlin-traffic-matrix-vis/
Wed, 14 Apr 2021 00:00:00 +0000https://natural-blogarithm.com/post/berlin-traffic-matrix-vis/
<script src="https://natural-blogarithm.com/post/berlin-traffic-matrix-vis/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>A few years ago I came across this
<a href="https://fronkonstin.com/2016/08/22/the-breathtaking-1-matrix/">very concise but intriguing post</a> on Antonio
Sánchez Chinchón’s blog which features a beautiful matrix visualisation technique (as
well as a Calle 13 quote).</p>
<p>Ever since then I had it on my to-do list to build a similar visualisation but could
never quite decide on which matrix I wanted to visualise. A few days ago I thought
that Berlin’s U- and S-Bahn network would be a nice candidate for such a visualisation.</p>
<div id="graphs-and-adjacency-matrices" class="section level1">
<h1>Graphs and Adjacency Matrices</h1>
<p>So how do we turn the Berlin train network into a matrix? For this, let’s take
a look at the classic route map of the Berlin train network:</p>
<div style="text-align:center">
<img src="images/berlinroutemap.png" width ="75%" alt ="Route Map of Berlin's U- and S-Bahn Network (Source: https://sbahn.berlin/fahren/liniennetze/)"/>
</div>
<p>In mathematical terms we can interpret the train network as a
<a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a> structure.
Simply put, a graph is a collection of points (the <em>nodes</em>) and connections
between those points (the <em>edges</em> of the graph). In our case the nodes are the
train stations and the edges are the direct connections between those
stations (direct meaning reachable in just one stop).</p>
<p>Now graphs can be described using a so-called <a href="https://en.wikipedia.org/wiki/Adjacency_matrix">adjacency matrix</a> which indicates
which nodes of the graph are connected. For a network of <span class="math inline">\(n\)</span> nodes the matrix
would consist of <span class="math inline">\(n\)</span> rows and <span class="math inline">\(n\)</span> columns (so a total of <span class="math inline">\(n \times n\)</span>
entries). The entry in the <span class="math inline">\(i\)</span>-th row and the <span class="math inline">\(j\)</span>-th column would indicate whether
or not there is a connection between the <span class="math inline">\(i\)</span>-th node and the <span class="math inline">\(j\)</span>-th node in the
graph. If there is a connection this value will be <span class="math inline">\(1\)</span> and <span class="math inline">\(0\)</span> otherwise.</p>
<p>So for example, a matrix that describes the connection between the four Berlin train
stations <em>Hauptbahnhof</em>, <em>Friedrichstraße</em>, <em>Hackescher Markt</em>, and
<em>Unter den Linden</em> would look as follows:</p>
<p><span class="math display">\[
\left(\begin{array}{cc}
0 & 1 & 0 & 0\\
1 & 0 & 1 & 1\\
0 & 1 & 0 & 0\\
0 & 1 & 0 & 0\\
\end{array}\right)
\]</span>
The four stations are ordered along the rows/columns in the same way
as we listed them above. Therefore we can find out which stations are connected
to <em>Hauptbahnhof</em> by simply going through the entries in the first row. We can
see that <em>Hauptbahnhof</em> is only connected to <em>Friedrichstraße</em> (the second entry
is a one, the remaining ones are zero). We can also see that <em>Friedrichstraße</em>
is connected to all other stations in this network (all entries are one except for
the second which corresponds to the station itself) and <em>Hackescher Markt</em> and
<em>Unter den Linden</em> are only connected to <em>Friedrichstraße</em> (only second entry
is a one in rows three and four).</p>
<p>For our visualisation we do not care about the direction of the train lines
but only if there is a connection between two respective stations. Therefore our
adjacency matrix will always be symmetric (i.e. you can flip it on the diagonal
and retain the same matrix).</p>
<p>In the Berlin traffic network there can be more than one line connecting certain
stations and for our visualisation we will want to draw multiple lines in this
case. In terms of the adjacency matrix that means that instead of values of one
we will put a value equal to the number of direct connections between those two
stations.</p>
<p>For example, there are four lines between <em>Hauptbahnhof</em> and <em>Friedrichstraße</em>.
Therefore we would put a <span class="math inline">\(4\)</span> in the second column of the first row and the first
column of the second row.</p>
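<p>The weighted adjacency matrix for this four-station example can be built
programmatically from a list of connections. Below is a small Python sketch (the
blog's actual code is written in R; the <code>adjacency_matrix</code> helper is
made up for this example):</p>

```python
def adjacency_matrix(stations, connections):
    """Build a symmetric adjacency matrix where entry (i, j) counts the
    number of direct lines between station i and station j."""
    index = {name: i for i, name in enumerate(stations)}
    n = len(stations)
    matrix = [[0] * n for _ in range(n)]
    for a, b, n_lines in connections:
        i, j = index[a], index[b]
        matrix[i][j] = n_lines
        matrix[j][i] = n_lines  # undirected network: keep it symmetric
    return matrix

stations = ["Hauptbahnhof", "Friedrichstraße",
            "Hackescher Markt", "Unter den Linden"]
connections = [
    ("Hauptbahnhof", "Friedrichstraße", 4),   # four lines between these two
    ("Friedrichstraße", "Hackescher Markt", 1),
    ("Friedrichstraße", "Unter den Linden", 1),
]
for row in adjacency_matrix(stations, connections):
    print(row)
```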
</div>
<div id="visualising-the-adjacency-matrix-or-the-graph" class="section level1">
<h1>Visualising the Adjacency Matrix or the Graph?</h1>
<p>So now we have learned that there is a direct connection between the adjacency
matrix and its corresponding graph structure and this connection also applies to
how we can visualise them.</p>
<p>If we take another look at the visualisation of the 20x20 matrix from Antonio’s
blog post we can now interpret it as the visualisation of a graph with
20 nodes in which each node is linked to every other node (and since the diagonal
elements are one every node is connected to itself as well).</p>
<p>One key component of his visualisation is the arrangement of the nodes of the
graph in a circle which we will also use for our visualisation of the Berlin
train network.</p>
<p>Since the visualisation of the adjacency matrix is equivalent to the visualisation
of a graph we won’t have to bother with creating a matrix but can rely on tools
that work directly with graph structures. Luckily in R such tools exist in the
form of the great <a href="https://www.data-imaginist.com/2017/introducing-tidygraph/">tidygraph</a> and <a href="https://github.com/thomasp85/ggraph">ggraph</a> packages written by Thomas Lin Pedersen.</p>
<p>A reproduction of the 20x20 matrix visualisation using these tools would look as
follows:</p>
<p><img src="https://natural-blogarithm.com/post/berlin-traffic-matrix-vis/index_files/figure-html/example_blogpost-1.png" width="768" style="display: block; margin: auto;" /></p>
<p>The main difference between this representation and the Chord diagram used in
Antonio’s blogpost is that in the <code>ggraph</code> version each node in the graph is
represented by a point on the circle whereas in the Chord diagram a segment of
the circle is used. This makes for the more pointy look of our version.</p>
<p>Our visualisation of the Berlin traffic network will probably look much less
symmetric and regular since not all stops are connected with each other. We can
also expect the points on the circle to be much more condensed as there are way
more than twenty train stations in Berlin (in fact there are 316 U- and S-Bahn
stations).</p>
</div>
<div id="getting-the-gtfs-data" class="section level1">
<h1>Getting the GTFS Data</h1>
<p>To build our visualisation we will need some data to build our graph with. Luckily
this data is available through the
<a href="https://daten.berlin.de/datensaetze/vbb-fahrplandaten-gtfs">Berlin Open Data project</a>
in the
<a href="https://developers.google.com/transit/gtfs/">General Transit Feed Specification format</a>.</p>
<p>This dataset consists of multiple CSV files that each describe certain aspects
of the Berlin traffic network (e.g. stations, routes, stops, time tables etc.).
The data requires quite a lot of transformations before we can use it to build
our graph. For example, there are multiple stations that refer to the
Alexanderplatz station but in our graph we would like to have only one, so we
will need to map them first. If you are interested in the transformation steps
you can check out the code
<a href="https://github.com/jakobludewig/r_projects/tree/main/2021-04-16%20Berlin%20Train%20Network%20Visualisation">here</a>.</p>
</div>
<div id="arranging-the-nodes-along-the-circle" class="section level1">
<h1>Arranging the Nodes Along the Circle</h1>
<p>One thing that we still need to decide on is the order with which the nodes
(stations) will be arranged on the circle.</p>
<p>This choice will heavily influence the look of our final visualisation but
it is also highly subjective as there is no “canonical” way of mapping the
stations to positions on the circle. One of the simplest choices which might be a
logical starting point would be to order the stations alphabetically. With this
choice our visualisation would look as follows:</p>
<div style="text-align:center">
<img src="images/alphabetical.png" width ="75%" alt ="Berlin traffic network graph visualisation with alphabetical node order"/>
</div>
<p>In this graph the point in the top center of the circle corresponds to the station
<em>Adenauerplatz</em> and from that point on the stations are ordered alphabetically in
clockwise order up until the point left of the top center which corresponds to
<em>Zwickauer Damm</em>. The colors for the edges correspond to the official colors of the
train lines.</p>
<p>This ordering is probably quite close to a random node order as generally there
will be very little relationship between the station names and how they are
connected to each other (with some few exceptions such as the stations
<em>Strausberg Bhf</em>, <em>Strausberg Nord</em>, and <em>Strausberg Stadt</em>).</p>
<p>In fact the result we get from a random node ordering does not differ much
visually from the alphabetical one:</p>
<div style="text-align:center">
<img src="images/random.png" width ="75%" alt ="Berlin traffic network graph visualisation with random node order"/>
</div>
<p>We could also think about ways to incorporate some geographical information about
the stations into the mapping. While it is not possible to retain all information
when mapping the two-dimensional coordinates onto the one-dimensional circle we
can at least incorporate some of it by ordering the stations according to their
distance from the geographical center of Berlin.</p>
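<p>A minimal Python sketch of this ordering idea, using made-up planar coordinates
and plain Euclidean distance (for real longitude/latitude data a haversine
distance would be more appropriate):</p>

```python
import math

def order_by_distance(stations, center):
    """Sort stations by Euclidean distance from a reference point."""
    def distance(item):
        name, (x, y) = item
        return math.hypot(x - center[0], y - center[1])
    return [name for name, _ in sorted(stations.items(), key=distance)]

# Toy coordinates, made up for illustration:
stations = {"A": (0.0, 1.0), "B": (3.0, 4.0), "C": (1.0, 1.0)}
print(order_by_distance(stations, center=(0.0, 0.0)))  # ['A', 'C', 'B']
```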
<p>This means that points that are close to each other on the circle (except for
the last and first one) lie on circles around the city center whose radii are
close to each other. Of course this could still mean that nodes that are next to
each other could represent stations that lie far away from each other
geographically (e.g. stations that lie on the opposite side of a circle with the
same radius). However, we should definitely see a pattern where few if any
nodes from the upper right side of the circle will be connected to the upper left
side of the circle as this would correspond to very long distances between train
stations.</p>
<div style="text-align:center">
<img src="images/distance_center.png" width ="75%" alt ="Berlin traffic network graph visualisation with geographical node order"/>
</div>
<p>Indeed we can see that this geographical ordering results in a very different pattern
than what we have seen before with almost no lines going through the center of
the circle. We can also see that some of the furthest connections are made by
S-Bahn lines (e.g. the purple S7 or green S8) which makes sense as their route
takes them further outside the city where the distance between stops tends to be longer.</p>
</div>
<div id="a-more-beautiful-plot" class="section level1">
<h1>A More Beautiful Plot</h1>
<p>In the end, the choice of node ordering is really a subjective one. For the final
visualisation I decided to go with a random ordering as I personally found it
visually the most appealing and the idea of this visualisation was to create something
nice to look at and not necessarily something that is effective at communicating
information (I don’t think anyone would use this kind of graph to navigate the
city).</p>
<p>To give the visualisation some of the same hypnotic and mesmerizing quality as
the visualisation from Antonio’s blog post I decided to choose a similar look.
Replacing the different edge colors with white and plotting everything on a blue
background results in a much less busy but high contrast visual:</p>
<div style="text-align:center">
<img src="images/random_beautified.png" width ="75%" alt ="Final version of the visualisation"/>
</div>
</div>
<div id="an-animated-version" class="section level1">
<h1>An Animated Version</h1>
<p>One final thing we can do is build an animated version of the graph using the
<a href="https://gganimate.com/articles/gganimate.html">gganimate</a> package which seamlessly
integrates with <code>ggraph</code>.</p>
<p>The following animation starts out with the nodes in the position corresponding
to their geographical location on the map, will move them to their (random) position
on the circle and then have the connections between the nodes appear:</p>
<div style="text-align:center">
<img src="images/animation_combined.gif" width ="75%" alt ="Animated version of the visualisation"/>
</div>
<p>Much more can be done with this data visualisation, especially with the animated
version. For example, I would have liked to keep the edges between the nodes
intact while moving the nodes from the circle back to their geographical positions
but was not quite able to get this working.</p>
<p>If you are interested in how the visualisations in this post were created
or would like to build your own ones from the data you can check out
the code <a href="https://github.com/jakobludewig/r_projects/tree/main/2021-04-16%20Berlin%20Train%20Network%20Visualisation">here</a>.</p>
</div>
Sentiment Analysis for Restaurant Reviews in Stockholm and Berlin
https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/
Thu, 25 Mar 2021 00:00:00 +0000https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/
<script src="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>In our <a href="https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/">previous blog post</a>
we posed two questions about restaurant reviews from Stockholm and Berlin. Using
some actual restaurant review data from Google Maps we were able to find an answer
to the first question about whether there is a difference in the point rating
distribution between the two cities.</p>
<p>In this post we will focus on the second question which concerns the relationship
between the review texts and the point score and how this relationship might differ
between the two cities.</p>
<p>As a reminder, in Stockholm we encountered some review texts such as the ones below:</p>
<blockquote>
<p>“Extremely tasty, fresh spicy burgers and fantastic sweet potato fries.”</p>
</blockquote>
<blockquote>
<p>“Very good food but the best thing is the staff, they are absolutely wonderful here!!”</p>
</blockquote>
<blockquote>
<p>“Very good pasta, wonderful service !!! Very cozy place, everything is perfect👍”</p>
</blockquote>
<p>From our experience with restaurant reviews from Berlin we would have expected those
reviews to be associated with a full five point score. However, all the reviews
above had a rating of four or less.</p>
<p>This observation led us to formulate the following question:</p>
<p><em>Is there a difference in the relationship between the wording of the review
texts and the associated point ratings between Stockholm and Berlin?</em></p>
<p>This question is pretty vague and cannot really be answered using data analysis
techniques. Therefore in this blog post we will concentrate on answering a more
refined and specific version of this question:</p>
<p><em>Is there a difference in the sentiments in the restaurant review texts between Stockholm and Berlin?</em></p>
<p>If we can find a way to quantify the sentiment of a given text we can use this
measure to compare the review texts from both cities. In particular this will
allow us to check if there is a tendency for more positive wording in Stockholm
reviews compared to the ones from Berlin. The existence of such a pattern would
provide evidence to support our subjective observations.</p>
<div id="sentiment-analysis" class="section level1">
<h1>Sentiment Analysis</h1>
<p>To obtain such a measure we will use tools from the field of
<a href="https://en.wikipedia.org/wiki/Sentiment_analysis">Sentiment Analysis</a> which
aims to quantify the sentiment that is expressed in a piece of text (single words,
sentences or collections of sentences).</p>
<p>The typical output of these algorithms is a numerical value that gauges the strength
of the sentiment of a given text. Examples for these sentiments can be the simple
distinction into positive and negative sentiments or more refined emotional
categories such as joy, anger or frustration.</p>
<p>For our analysis we will focus on classifying the review texts into the
two opposing categories of <em>negative</em> and <em>positive</em> sentiments. The tools we will
be using will produce a low score for a text which is estimated to have a negative
sentiment and high scores for a text which is supposed to express a positive sentiment.</p>
<p>For example, in the context of restaurant reviews we would expect the first
statement below to receive a high score while the second one should receive
a low score:</p>
<blockquote>
<p>“This is a great restaurant, very friendly staff and amazing food.”</p>
</blockquote>
<blockquote>
<p>“Do not go to this restaurant, the food tastes disgusting and the service is horrible!”</p>
</blockquote>
<p>Sentiment analysis methods come in many different implementations and levels of
complexity. In this blog post we will introduce and apply three different methods
and compare their results.</p>
<p>For this we will use the same data as in our previous blog post but will now
include the reviews texts that come with the point rating. The R and Python code
to replicate the results from this analysis can be found
<a href="https://github.com/jakobludewig/r_projects/blob/main/2021-03-17%20Restaurant%20Reviews%20Analysis/02%20-%20Sentiment%20Analysis.R">here</a>.</p>
</div>
<div id="lexicographical-approach" class="section level1">
<h1>Lexicographical Approach</h1>
<p>One very basic approach to generate a sentiment score from a piece of text is to
assign a numeric value to each word (e.g. -1 to the word <em>bad</em> and +3
to the word <em>amazing</em>) and then simply aggregate all these individual scores.</p>
<p>An example for such a calculation is shown below:</p>
<p><span class="math display">\[\underbrace{\text{This }}_{0} \underbrace{\text{is }}_{0} \underbrace{\text{a }}_{0} \underbrace{\text{great }}_{+3} \underbrace{\text{restaurant}}_{0}, \underbrace{\text{very }}_{0} \underbrace{\text{friendly }}_{+2} \underbrace{\text{staff }}_{0} \underbrace{\text{and }}_{0} \underbrace{\text{amazing }}_{+4} \underbrace{\text{food}}_{0} .\]</span>
The overall sentiment score for this sentence would then be obtained by calculating
the sum or the average of the individual scores.</p>
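<p>A minimal sketch of this lookup-and-aggregate scheme could look like the following. Note that the tiny lexicon here is made up for illustration and is not the actual AFINN word list:</p>

```python
# A minimal lexicon-based sentiment scorer (illustrative toy lexicon,
# not the real AFINN lexicon).
TOY_LEXICON = {"great": 3, "friendly": 2, "amazing": 4, "bad": -1}

def lexicon_scores(text):
    """Look up a score for every word; unknown words get 0."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [TOY_LEXICON.get(w, 0) for w in words]

def sentiment_score(text, aggregate="mean"):
    """Aggregate the per-word scores into one number per text."""
    scores = lexicon_scores(text)
    if aggregate == "sum":
        return sum(scores)
    return sum(scores) / len(scores)  # mean over all words

sentence = "This is a great restaurant, very friendly staff and amazing food."
print(lexicon_scores(sentence))   # [0, 0, 0, 3, 0, 0, 2, 0, 0, 4, 0]
print(sentiment_score(sentence))  # about 0.82 (9 points over 11 words)
```

The choice of aggregation (sum versus mean) matters quite a bit, as we will see below.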
<p>Defining the sentiment scores for the individual words is not a trivial task but
luckily there are a number of <em>sentiment lexicons</em> available that provide this kind of
score so we don’t have to create them ourselves. The
<a href="https://www.tidytextmining.com/sentiment.html">Text Mining with R book</a>
describes how some of these lexicons can be accessed using their
<a href="https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html">tidytext</a>
package.</p>
<p>For our analysis we will use the
<a href="http://www2.imm.dtu.dk/pubdb/edoc/imm6006.pdf">AFINN lexicon</a> which was built on
Twitter data and provides a score ranging from -5 (negative sentiments)
to +5 (positive sentiments) for 2,477
words. Below is an example of those scores from the AFINN lexicon:</p>
<pre><code>## # A tibble: 10 x 2
## word value
## <chr> <dbl>
## 1 stabs -2
## 2 threaten -2
## 3 debonair 2
## 4 disparaging -2
## 5 jocular 2
## 6 fun 4
## 7 resolves 2
## 8 dirtiest -2
## 9 goddamn -3
## 10 joyous 3</code></pre>
<p>One interesting thing to note is that the AFINN lexicon is skewed towards negative
scores: around 64.5%
of the entries in the lexicon have a negative score.</p>
<p>The authors of the <em>Text Mining with R book</em> also provide a process with which
the sentiment scores can be calculated through simple dataframe transformations.
We used this approach to calculate a score for each review as the average score
of the words within it (assigning zero to words that were not contained in the
dictionary).</p>
<p>Taking the average of these scores within each point rating category and breaking
it down by city yields the following distribution:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/index_files/figure-html/afinn_score_dist-1.png" width="90%" style="display: block; margin: auto;" /></p>
<p>The overall pattern for the sentiment score distribution is the same for both
cities: Higher point ratings are associated with higher sentiment scores. This
finding makes intuitive sense and is a good face validity check of this method.</p>
<p>When we compare the scores between the two cities within each point rating group
we can see that the Stockholm scores are more polarised in each category: They are
lower for the two bottom categories (ratings of one and two points) and higher
for the top two categories (four and five points) than the scores from Berlin.</p>
<p>Going back to our analysis question this finding seems to support our subjective
observations: On average the review texts in the four and three point categories
in Stockholm express more positive sentiments than the ones in Berlin. The average
sentiment score in the four point category for reviews from Stockholm is even higher
than the score for five point reviews in Berlin. It is therefore plausible that
in Stockholm we are likely to come across four point reviews that we would assume
to be five point reviews (based on our experience from Berlin).</p>
<div id="issues-with-the-lexicographical-approach" class="section level2">
<h2>Issues with the Lexicographical Approach</h2>
<p>The lexicographical approach is very simple to understand and implement. However,
its simplicity also results in some limitations. One obvious drawback is the handling
of more complex sentence structures such as negations or the presence of words
that amplify or diminish the meaning of other words.</p>
<p>For example, let’s look at some example texts and the score the lexicographical
approach assigns to them:</p>
<pre><code>## # A tibble: 3 x 2
## review_text score_afinn
## <chr> <dbl>
## 1 This restaurant is good 0.75
## 2 This restaurant is really good 0.6
## 3 This restaurant is not good 0.6</code></pre>
<p>The first text has a score of 0.75 while the second one, which is an even stronger
positive statement (thanks to the introduction of the word “really”), only has a
score of 0.6.</p>
<p>What is even more problematic is that the negation in the third example is not
picked up and we get the same score as for the second example. Arguably the third
example should receive a much lower sentiment score. Intuitively a score of -0.75
would make sense here since the statement is a direct negation of the first text.</p>
<p>Another issue with the lexicographical approach is that it does little more than
to provide a list of word scores and it remains up to us to decide how to aggregate
them into an overall score for a piece of text.</p>
<p>If we simply sum them up we might get undesirable results due to unequal text
lengths or the presence of redundant information.</p>
<p>For example, which of the two reviews below should get a higher score?</p>
<blockquote>
<p>This restaurant is good.</p>
</blockquote>
<blockquote>
<p>This restaurant is good. The food is good, the service is good, the prices are good.</p>
</blockquote>
<p>By simply summing up the scores the first one would receive a score of 3. The
second one would get a score of 12 even though arguably it does not express a
stronger sentiment but just repeats redundant information.</p>
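<p>This effect is easy to reproduce with a toy scorer. We assume a single scored word, “good”, with its AFINN value of +3 (all other words score zero):</p>

```python
# Toy demonstration of how summed word scores reward repetition.
# Only one word carries a score here: "good" (+3, its AFINN value).
def sum_score(text):
    lexicon = {"good": 3}
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(lexicon.get(w, 0) for w in words)

short = "This restaurant is good."
redundant = ("This restaurant is good. The food is good, "
             "the service is good, the prices are good.")
print(sum_score(short))      # 3
print(sum_score(redundant))  # 12 (four occurrences of "good")
```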
<p>Taking the average (as we have done in our analysis) might do a better job in
this scenario as it would produce a score of 0.75 for both reviews. But that
approach introduces other issues since it penalises the score of reviews that
contain a lot of text without any sentiment-bearing words, such
as objective descriptions of the restaurant:</p>
<blockquote>
<p>This restaurant is good.</p>
</blockquote>
<blockquote>
<p>This restaurant is good. The tables have white tablecloth and there is a painting on the wall.</p>
</blockquote>
<p>The second review would receive a much lower score while intuitively there is
probably little to no difference in the sentiments they express.</p>
<p>For our analysis this can become a problem if there is a tendency in either city
to write more of this kind of descriptive text. In fact, the reviews from Berlin
contain around 22 % more words than the reviews from Stockholm. Therefore
the difference we are seeing above may well be an artifact caused by the differences
in text lengths.</p>
</div>
</div>
<div id="sentimentr-r-package" class="section level1">
<h1><em>sentimentr</em> R Package</h1>
<p>We will now present a slightly more sophisticated method to arrive at a sentiment
score which is implemented in the <a href="https://github.com/trinker/sentimentr">sentimentr</a>
package. At the core this method is still using a lexicon lookup to assign scores
to each word. However, on top of that it applies a logic to identify <em>negations</em>
or <em>amplifiers/deamplifiers</em> (like “very” or “hardly”) to adapt the score of the
words.</p>
<p>The details of the algorithm are somewhat complex and can be found
<a href="https://github.com/trinker/sentimentr#the-equation">here</a>. In a simplified mental
model we can assume that this method will perform adjustments such as multiplying
the score of the word “good” with <span class="math inline">\(-1\)</span> if it is preceded by the word “not”.</p>
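<p>As a loose illustration of this mental model, a toy scorer that flips or amplifies the score of a word based on the word preceding it might look like the sketch below. This is our own simplified stand-in, not the actual (considerably more elaborate) <code>sentimentr</code> equation, and the word lists are made up:</p>

```python
# A crude sketch of valence-shifter handling: negators flip the sign of
# the following sentiment word, amplifiers scale it up. Simplified
# stand-in for illustration only, not the real sentimentr algorithm.
LEXICON = {"good": 3}
NEGATORS = {"not"}
AMPLIFIERS = {"really": 1.5, "very": 1.5}

def shifted_scores(text):
    words = text.lower().split()
    scores = []
    for i, w in enumerate(words):
        s = LEXICON.get(w, 0)
        if s and i > 0:
            prev = words[i - 1]
            if prev in NEGATORS:
                s = -s                    # "not good" -> negative score
            elif prev in AMPLIFIERS:
                s = s * AMPLIFIERS[prev]  # "really good" -> stronger score
        scores.append(s)
    return sum(scores) / len(scores)      # average over all words

print(shifted_scores("This restaurant is good"))         # 0.75
print(shifted_scores("This restaurant is really good"))  # 0.9
print(shifted_scores("This restaurant is not good"))     # -0.6
```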
<p>We can therefore expect this method to produce better scores for the example
cases we described above:</p>
<pre><code>## # A tibble: 3 x 2
## review_text score_sentimentr
## <chr> <dbl>
## 1 This restaurant is good 0.375
## 2 This restaurant is really good 0.604
## 3 This restaurant is not good -0.335</code></pre>
<p>And indeed the scores seem to do a better job at measuring the sentiments now.
Both the negations as well as the amplification are picked up in a sensible
fashion.</p>
<p>In addition to that the algorithm contains mechanisms to reduce the impact of
text that does not contain any sentiment on the overall score.</p>
<p>When we use this approach instead of the simple lexicographical approach we obtain
the following score distribution:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/index_files/figure-html/sentimentr_score_dist-1.png" width="90%" style="display: block; margin: auto;" /></p>
<p>We still observe the same qualitative pattern as with the simple lexicographical
approach: The point score is positively correlated with the sentiment score
within each city.</p>
<p>Also the scores from Stockholm are still more pronounced than the Berlin scores
in each category but the relative difference is much lower than with our previous
calculations. This might be an indication that the differences we saw previously
were partly caused by the inability of the lexicographical approach to handle
the negations and amplifiers correctly or the presence of more descriptive text
in the Berlin data.</p>
</div>
<div id="a-neural-network-model" class="section level1">
<h1>A Neural Network Model</h1>
<p>The two methods we have presented so far can be regarded as <em>heuristic</em> as both
the scoring of the individual words in the lexicon as well as the structure and
coefficients of the equation in the <code>sentimentr</code> package are first and
foremost based on human assessment.</p>
<p>We will now turn to <em>supervised machine learning methods</em> as an alternative
paradigm to produce a sentiment score. This approach still requires human assessment
in the form of labels that determine whether a text expresses a negative or positive
sentiment. However, once these labels have been established it is up to the model’s
training algorithm to learn the relationship between the sentiment label
and the input text from the data it processes.</p>
<p>For our review data we do not have the sentiment labels available to train our
own model (in fact these labels are what we are trying to obtain) but luckily there
are frameworks available that provide pre-trained models for sentiment analysis.
One such framework is the
<a href="https://github.com/flairNLP/flair">flair package</a> which provides easy access to
state-of-the-art machine learning models for NLP (Natural Language Processing)
tasks.</p>
<p>In many areas of machine learning research the state-of-the-art models nowadays
tend to be <em>neural network models</em> and this is definitely the case for
NLP problems.</p>
<p>Neural networks have the ability to
<a href="http://neuralnetworksanddeeplearning.com/chap4.html">approximate any kind of function</a>
which makes them very suitable for representing the complex structures that can arise
in textual data. In our case we are interested in describing the functional
relationship between the raw input text and the associated sentiment. If we
allow the model enough complexity and training time it will build representations
of the input texts that will enable it to identify structures that indicate
negative or positive sentiments (such as negations or amplifiers). This means it
will derive rules similar to those encoded in the equation of the <code>sentimentr</code>
package by itself.</p>
<p>While the power of neural networks is indisputable they do have some
disadvantages. One of them is the fact that it is extremely difficult to understand
how the network arrives at its predictions. It may build decision rules similar
to the heuristics we have described above but translating these rules into a form
that a human can understand is not necessarily possible.</p>
<p>Another drawback is that neural networks tend to be extremely computationally
expensive which is why in practice specialised machines are often used when
working with neural networks.</p>
<p>As a matter of fact the computational requirements made it impractical for us to
use the more advanced <em>transformer-based neural network</em> in the flair package and
instead we had to use the somewhat simpler <em>recurrent neural network (RNN)</em> which
allowed for reasonable computation times on my machine.</p>
<p>The model we used was trained on a dataset consisting of movie and product reviews
where each text was assigned a negative or positive label. Therefore the
output from the model is the <em>estimated probability</em> that a given piece of text
expresses a negative or positive sentiment. This is an important difference
from our previous approaches where the score was an <em>expression of the strength</em>
of the sentiment. We will need to keep this in mind when interpreting the results
of the model.</p>
<p>For our analysis we took the output from the model and scaled it such that a
value of -1 corresponds to a 100 % probability of a negative sentiment and +1
to a 100 % probability of a positive sentiment. A score of 0 means that the
model is undecided about which sentiment to assign.</p>
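<p>This rescaling is a simple linear map. Assuming the classifier reports a predicted label together with its confidence (a probability of at least 0.5), a small helper of our own making (not part of flair) could look like this:</p>

```python
def signed_sentiment(label, probability):
    """Map a (label, probability) classifier output to a score in [-1, 1].

    probability is the model's confidence in the predicted label (>= 0.5).
    +1 means 100 % sure positive, -1 means 100 % sure negative, 0 undecided.
    """
    # 2p - 1 maps p in [0.5, 1] onto [0, 1]; the label then sets the sign.
    magnitude = 2 * probability - 1
    return magnitude if label == "POSITIVE" else -magnitude

print(signed_sentiment("POSITIVE", 1.0))  # 1.0
print(signed_sentiment("NEGATIVE", 1.0))  # -1.0
print(signed_sentiment("POSITIVE", 0.5))  # 0.0 (model undecided)
```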
<p>Using this approach the resulting distribution of sentiment scores looks as follows:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/index_files/figure-html/flair_score_dist-1.png" width="90%" style="display: block; margin: auto;" /></p>
<p>As in the cases before the distribution for both cities follow the same pattern
in the sense that higher point ratings are associated with higher sentiment
scores (and vice versa for lower point ratings).</p>
<p>Again we focus our attention on comparing the sentiment scores between the two
cities to answer our initial analysis question for this blog post: We can see
that the scores from the neural network reproduce the positive difference between
the scores from Stockholm and Berlin in the four point category. While the
difference seems to be relatively smaller than what we saw with the previous methods
this can still be regarded as evidence to support our subjective observations that
motivated this analysis.</p>
<div id="interpreting-the-neural-network-scores" class="section level2">
<h2>Interpreting the Neural Network Scores</h2>
<p>Apart from that the scores from the neural network differ quite a bit from the
previous methods: For one, the sentiment scores from Berlin now have a higher
magnitude than the ones from Stockholm (the four point category being the only
real exception as mentioned above).</p>
<p>Also the distributions for both cities are now much more symmetric in the sense
that in the low point categories the scores have almost the same magnitude as in
the high point categories (just with reversed signs). The previous distributions
were much less symmetric with skews towards higher scores.</p>
<p>To understand those differences we need to remind ourselves that the metric
produced from the neural network is not a measurement of the strength of
sentiment but reflects the <em>certainty of the model</em> about whether a given text
expresses a positive or negative sentiment.</p>
<p>Therefore the fact that the absolute magnitude of the scores in the one point and
five point category are almost the same does not imply that the sentiments expressed
in those categories are equally strong. It simply means that the model is approximately
equally sure that the cases in those rating groups reflect negative and positive
sentiments respectively. Intuitively a relationship between this measure of
certainty and the strength of the sentiments is plausible but we have no
justification to simply assume it exists.</p>
<p>In general the model seems to be quite confident in its predictions, which can
be confirmed by looking at a histogram of its predictions:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-sentiment-analysis/index_files/figure-html/flair_score_hist-1.png" width="90%" style="display: block; margin: auto;" /></p>
<p>The distribution of the predicted probabilities is heavily focused on the extreme
ends of the spectrum for both cities.</p>
<p>As we already mentioned above it is difficult to understand how a neural network
arrives at its predictions. One thing we can do though is to look at some of its
predictions and the associated input text to get an idea of what it might be
doing. This might be especially useful to identify cases with which the model
seems to struggle.</p>
<p>For example, the following reviews all got a highly negative score but obviously
seem to express rather positive sentiments:</p>
<blockquote>
<p>“I can only recommend it”</p>
</blockquote>
<blockquote>
<p>“Great for business lunches, but also for dining out with friends. On request there are always vegetarian delicacies and I don’t know of any place in Berlin where a steak is cooked better to the point. The service is second to none. I can only recommend it!”</p>
</blockquote>
<blockquote>
<p>It’s very nice .. Amazingly cheap 🤷♂</p>
</blockquote>
<blockquote>
<p>The upper hammer! Unfortunately not cheap, but seriously, the evening was worth every penny.</p>
</blockquote>
<p>The occurrence of the phrase “I can only recommend it” in the first and second
review makes you wonder whether there might be an issue with that particular
phrase that leads to the low scores.</p>
<p>The special characters in the third example might have caused issues in the
preprocessing step before passing the data to the model.</p>
<p>The results for the fourth example may be affected by the improper translation of
the positive German slang word “Oberhammer” but should probably still result in a
much higher score.</p>
<p>It is also possible to find some examples which are off in the opposite direction,
such as the following review which received a score of 0.95:</p>
<blockquote>
<p>Have to wait a long time for food.</p>
</blockquote>
<p>We also spotted some examples where mild criticism within a generally positive
review seems to be overemphasised, leading to a low score:</p>
<blockquote>
<p>“Good treatment fresh premises … The pizza was good but the Pepperonin was strong.”</p>
</blockquote>
<blockquote>
<p>“Convincing cuisine at fair prices. Tell the cook: Please spicy, then it will be even tastier! The lamb was better than the chicken by the way …”</p>
</blockquote>
<p>These two reviews have a score of around 0 and -0.38 respectively which seems
too low compared to some of the other predictions of the model.</p>
<p>It should be pointed out that the cases above should not be viewed
as representative examples of the model’s performance in general. While performing
the face validity checks on the model’s predictions those cases were definitely
the exception rather than the norm. We should also keep in mind that the model
was not trained on restaurant reviews specifically and might therefore be
insensitive to a lot of the specific lingo (such as descriptions of taste or
atmosphere).</p>
<p>However, we should use these examples as a reminder that even state-of-the-art
neural network models are not infallible, especially when we are applying a
pre-trained model on data that is not necessarily comparable with the data it
was trained on.</p>
</div>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>In this blog post we presented and applied three different methods to analyse the
sentiments in restaurant reviews from Stockholm and Berlin. Each of the methods
showed a difference in the sentiments within the four point rating
category which we can interpret as evidence to support our subjective findings
that motivated this analysis. However, we should be cautious not to lean on these
findings too much as all the methods we applied have their limitations, as we
pointed out during the course of the analysis.</p>
<p>In our next post we will further explore the relationship between review texts
and the point score by building and analysing models that predict the point rating
from a given review text.</p>
</div>
Comparing Restaurant Reviews in Stockholm and Berlin
https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/
Tue, 09 Mar 2021 00:00:00 +0000https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/
<script src="https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/index_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>Late last summer we spent a couple of weeks on vacation in beautiful Sweden. Back
then the Corona pandemic had not yet hit the country quite as hard and it was
still possible to go to restaurants without any restrictions. While looking online
for nice restaurants to go to we noted some differences in the user reviews from
what we were used to from Germany.</p>
<p>On the one hand we felt we encountered far fewer ratings that gave the full point
score. This does not necessarily mean that restaurant quality is worse in Sweden
than in Germany as there are many factors that make a direct comparison of these
ratings difficult (as we’ll see below). However, this finding made me curious to
investigate whether this difference can be substantiated in a larger dataset of
reviews or whether it might be just a subjective impression we got from our small
sample.</p>
<p>Another thing we noticed was an apparent difference in the wording of the review
texts and how they relate to the point score. For example, let’s look at some
actual reviews from Stockholm:</p>
<blockquote>
<p>“Extremely tasty, fresh spicy burgers and fantastic sweet potato fries.”</p>
<p>“Very good food but the best thing is the staff, they are absolutely wonderful here!!”</p>
<p>“Stockholm’s best kebab I think. The pizzas are also super here.”</p>
<p>“Very good pasta, wonderful service !!! Very cozy place, everything is perfect👍”</p>
</blockquote>
<p>While all of the review texts above seem to be overwhelmingly positive they were
all associated with a point rating of less than five (the full score). From our
personal experience with restaurant reviews from Germany (and Berlin specifically)
we would have expected these review texts to indicate a five point rating.</p>
<p>Again, all of our observations were obviously subjective and we were only looking
at a small, non-representative sample of restaurant reviews. Therefore I decided
to find out whether our observations could actually be backed up by an analysis
of a larger dataset of reviews.</p>
<p>To be precise, I would like to answer the following two questions:</p>
<ol style="list-style-type: decimal">
<li><em>Is there a difference in the distribution of point ratings of restaurants
when comparing Sweden and Germany?</em></li>
<li><em>Is there a difference in the relationship between the wording of the review
texts and the associated point rating?</em></li>
</ol>
<p>I will answer these two questions over the course of two blog posts. This first
post will describe how to obtain and process the review data. Once this is done
answering the first question will be straightforward.</p>
<p>Answering the second question will be a bit more tricky, partly due to its
somewhat vague formulation. It will require some Natural Language Processing (NLP)
and modeling techniques and will be done in a dedicated follow-up post.</p>
<div id="getting-the-data" class="section level1">
<h1>Getting the data</h1>
<p>First of all, we will need to collect a sample of restaurant reviews from both
countries that we can then analyse. For simplicity we will restrict our analysis
to the two capital cities Berlin and Stockholm. As a data source we will query
Google Maps through their public APIs.</p>
<div id="getting-a-random-sample-of-reviews" class="section level2">
<h2>Getting a Random Sample of Reviews</h2>
<p>For our analysis it would be preferable to obtain a random sample of restaurant
reviews from each of the cities as this will give us the most accurate picture
of the underlying population of reviews. The Google Maps API is not really designed
to provide this kind of access so we will have to come up with a process that
approximates a random sample as closely as possible. We will describe this process in this section.</p>
</div>
<div id="the-place-details-and-nearby-search-apis" class="section level2">
<h2>The <em>Place Details</em> and <em>Nearby Search</em> APIs</h2>
<p>The API endpoint that will provide the actual restaurant reviews is the
<a href="https://developers.google.com/maps/documentation/places/web-service/details">Place Details</a>
endpoint. As one of its inputs it requires a <code>place_id</code> which uniquely identifies a
place (in our case a restaurant) within Google Maps. Given this identifier the
API will return some information about the restaurant, among them up to five user
reviews.</p>
<p>To obtain the <code>place_id</code>s of a number of restaurants in Berlin and Stockholm we
will use a second endpoint, the
<a href="https://developers.google.com/maps/documentation/places/web-service/search#nearby-search-and-text-search-responses">Place Search</a>. This endpoint provides the <em>Nearby Search</em>
functionality that accepts a set of latitude and longitude coordinates and will
return a list of up to 20 restaurants in proximity of these coordinates.</p>
<p>So to kick off our process of sampling restaurant reviews all we need is a
number of coordinates in Berlin and Stockholm respectively.</p>
</div>
<div id="sampling-coordinates-within-the-cities" class="section level2">
<h2>Sampling Coordinates within the Cities</h2>
<p>Since we would like our final sample of reviews to be as random as possible it makes
sense to randomly select coordinates in the two cities. Possibly the simplest way
to do that would be to sample in the
<em><a href="https://wiki.openstreetmap.org/wiki/Bounding_Box">bounding box</a></em> of each
city which can be interpreted as a rectangular approximation of a city’s boundaries.</p>
<p>Using this approach we would get a sample of coordinates similar to the data we
can see below (note that the zoom levels of the plots are different):</p>
<p><img src="images/location_samples_bb_Stockholm.png" width="80%" style="display: block; margin: auto;" /><img src="images/location_samples_bb_Berlin.png" width="80%" style="display: block; margin: auto;" /></p>
<p>We can see that this approach seems to be quite inaccurate and inefficient as it
produces many points outside the actual city limits.</p>
<p>Therefore we will turn to a more precise method using polygonal city boundaries
which can be obtained from the
<a href="https://www.openstreetmap.org/">Open Street Map project</a>. Sampling coordinates
from this more complex geometric structure is not as straightforward as for the
rectangular bounding box but luckily it has already been implemented in the
<code>spsample</code> function from the R package <code>sp</code>.</p>
<p><img src="images/location_samples_bp_Stockholm.png" width="80%" style="display: block; margin: auto;" /><img src="images/location_samples_bp_Berlin.png" width="80%" style="display: block; margin: auto;" /></p>
<p>The results we get with this approach indeed seem to be preferable over the
rectangular bounding box.</p>
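<p>Conceptually, sampling uniformly within a polygon can be done by rejection sampling: draw candidate points from the bounding box and keep only those that fall inside the boundary polygon. The sketch below (not the actual <code>spsample</code> implementation) uses a standard ray-casting point-in-polygon test and a made-up triangular “city boundary”:</p>

```python
import random

def point_in_polygon(x, y, polygon):
    """Standard ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross the edge (x1,y1)-(x2,y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_in_polygon(polygon, n, rng=random):
    """Rejection sampling: draw from the bounding box, keep interior points."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # toy "city boundary"
pts = sample_in_polygon(triangle, 100)
print(len(pts))  # 100 points, all inside the polygon
```

For a triangle roughly half the candidate draws are rejected; for real city boundaries the acceptance rate depends on how much of the bounding box the polygon covers.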
<p>So now that we have a way to sample random locations within each of the
cities we have defined the full process for obtaining restaurant reviews:</p>
<ol style="list-style-type: decimal">
<li>Sample a number of coordinates for each city</li>
<li>For each of the coordinates query the <em>Nearby</em> API for restaurants to obtain a
number of <code>place_id</code>s</li>
<li>For each of the <code>place_id</code>s query the <em>Details</em> API to get the restaurant reviews</li>
</ol>
<p>Since this way of sampling may return the same review more than once we will have
to deduplicate the results afterwards.</p>
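<p>The deduplication step itself is straightforward. A sketch with made-up records (the field names mirror the review dataframe columns shown later in the post) could look like this:</p>

```python
# Deduplicating reviews collected from overlapping Nearby searches.
# The records and place_ids below are made up for illustration.
def deduplicate(reviews):
    """Keep the first occurrence of each (place_id, text) pair."""
    seen = set()
    unique = []
    for r in reviews:
        key = (r["place_id"], r["text"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

raw = [
    {"place_id": "id_1", "text": "Great food", "rating": 5},
    {"place_id": "id_1", "text": "Great food", "rating": 5},  # duplicate hit
    {"place_id": "id_2", "text": "Too noisy", "rating": 2},
]
print(len(deduplicate(raw)))  # 2
```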
</div>
<div id="language-of-reviews" class="section level2">
<h2>Language of Reviews</h2>
<p>One more thing to note is that the <em>Details</em> API has a parameter which
will control the language in which the reviews are returned. For our analysis it
would intuitively make sense to set this parameter to English as it will make it
easier to understand and process the data, especially for the second part of our
investigation in which we will look at the review texts specifically.</p>
<p>However, while experimenting with different settings we found out that this
parameter seems to do more than to simply return a one-to-one translation of the
same reviews. In fact, in some cases it would yield completely different reviews.</p>
<p>So for our analysis we decided to set this parameter to the native language of the
respective city. We did so in the hopes that we will obtain more reviews that were
written by actual residents of the city this way. This should somewhat mitigate
the influence of tourists on the ratings and allow us to get a clearer picture of
the actual difference between the two cities.</p>
<p>Another benefit of this approach is that it requires us to perform the
translation ourselves, which ensures that all review texts have gone
through the same translation step. This additional step will be done through the
<a href="https://cloud.google.com/translate/">Google Translation API</a>.</p>
</div>
<div id="implementation" class="section level2">
<h2>Implementation</h2>
<p>To actually query the Google Maps and Translation APIs we wrote simple one-liner
functions, such as the one shown below for the <em>Nearby</em> API:</p>
<pre class="r"><code>library(jsonlite) # provides fromJSON()

query_nearby_api <-
  function(x, y, api_key) {
    # x = longitude, y = latitude; the API expects "latitude,longitude"
    url <- paste0("https://maps.googleapis.com/maps/api/place/nearbysearch/json",
                  "?location=", y, ",", x,
                  "&rankby=distance&type=restaurant",
                  "&key=", api_key)
    list(fromJSON(url))
  }</code></pre>
<p>Now if we have our sampled coordinates in a dataframe where each row corresponds to
one set of coordinates we can query the API simply by wrapping the function above
in an <code>mcmapply</code> call:</p>
<pre class="r"><code>results_nearby$nearby_response <-
mcmapply(FUN = query_nearby_api,
results_nearby$x,results_nearby$y,
api_key,
mc.cores = num_cores)</code></pre>
<p>This has the added benefit of straightforward parallelisation though it should be
noted that you may run into <a href="https://developers.google.com/maps/premium/previous-licenses/webservices/quota#usage_limits">rate limits</a>
if you turn up the number of concurrent requests too much.</p>
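<p>The same fan-out pattern looks similar in other languages. Below is a sketch using Python’s <code>concurrent.futures</code>, where <code>query_nearby</code> is a stub standing in for the real HTTP request and the coordinates are purely illustrative; with real API calls you would keep the worker count low enough to respect the rate limits:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def query_nearby(coord):
    """Stub standing in for the real Nearby Search HTTP request."""
    x, y = coord
    return {"location": (x, y), "restaurants": []}  # placeholder response

# Illustrative sampled coordinates (longitude, latitude).
coords = [(13.40, 52.52), (18.07, 59.33)]

# Issue the requests concurrently; pool.map preserves the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    responses = list(pool.map(query_nearby, coords))

print(len(responses))  # 2
```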
<p>We applied a similar approach for all the APIs used in this analysis. If you are
interested in the details you can find the full R implementation of the data
collection process <a href="https://github.com/jakobludewig/r_projects/blob/main/2021-03-17%20Restaurant%20Reviews%20Analysis/01%20-%20Get%20Restaurant%20Reviews.R">here</a>.</p>
<p>Please note that if you want to replicate the results you will need to set up a
Google Maps API key and running the script will incur costs depending on the
amount of reviews you wish to query. Please check the latest API pricing of
Google Maps beforehand.</p>
<p>It should be mentioned that there are several R packages available which we
could have used for querying the Google Maps APIs such as
<a href="https://cran.r-project.org/package=mapsapi">mapsapi</a> or
<a href="https://github.com/SymbolixAU/googleway">googleway</a>. However, none of the packages
we found offered access to all the APIs we needed to use (Nearby, Places Details,
Translate) and since the APIs are pretty simple we decided to go with approach
outlined above. This also gives us more control over other aspects such as the
output format and the parallelisation.</p>
</div>
</div>
<div id="rating-distribution" class="section level1">
<h1>Rating Distribution</h1>
<p>Using the process described above we obtained a total of 32,376
restaurant reviews out of which 21,548
were from Berlin and 10,828
from Stockholm. As we can see our sample is not balanced even though we queried the same number
of coordinates for each city, which is probably due to fewer restaurants or a lower
restaurant density in Stockholm.</p>
<p>The review data comes out of our querying script in a handy dataframe format in
which each row corresponds to one review:</p>
<pre><code>## # A tibble: 32,376 x 9
## city_name text_translated rating country_code country_language name place_id
## <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 Berlin "Today I got k… 4 de de Hacı… ChIJHZN…
## 2 Berlin "I can only ag… 1 de de Suma… ChIJL3j…
## 3 Stockholm "Since Friday'… 1 se sv T.G.… ChIJt9T…
## 4 Berlin "Super!!! Shor… 5 de de Curr… ChIJRQC…
## 5 Berlin "Delicious, fr… 5 de de Brus… ChIJlx7…
## 6 Berlin "Since there a… 4 de de Asia… ChIJf3D…
## 7 Berlin "Quality for d… 5 de de Rist… ChIJIyF…
## 8 Stockholm "Thursdays are… 5 se sv Inte… ChIJo89…
## 9 Berlin "Better than t… 3 de de Gril… ChIJX9r…
## 10 Berlin "Sun, good Aug… 4 de de Zoll… ChIJiXM…
## # … with 32,366 more rows, and 2 more variables: place_rating <dbl>, text <chr></code></pre>
<p>With this data answering our first question about whether there is a difference in
the point rating distributions of Stockholm and Berlin can easily be done by
creating a simple visualisation of the two distributions:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/index_files/figure-html/rating_dist_plot-1.png" width="90%" style="display: block; margin: auto;" /></p>
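<p>A plot like this can be produced with a few lines of base R. The snippet below uses a tiny toy dataframe in place of our real review sample, so the numbers are purely illustrative:</p>
<pre class="r"><code># toy stand-in for the real review dataframe
reviews <- data.frame(
  city_name = c("Berlin", "Berlin", "Berlin", "Stockholm", "Stockholm"),
  rating    = c(5, 5, 4, 3, 5)
)

# share of each rating category per city
rating_shares <- prop.table(table(reviews$city_name, reviews$rating),
                            margin = 1)

# side-by-side bar chart of the two rating distributions
barplot(rating_shares, beside = TRUE,
        legend.text = rownames(rating_shares),
        xlab = "rating", ylab = "share of reviews")</code></pre>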
<p>So it seems that there is actually quite a difference in the distributions of the
two cities. In fact there are about 18 % more five point reviews in Berlin than in
Stockholm, and the difference seems to be spread out across all the remaining
rating categories. The average rating in Berlin is 4.1 vs 3.8 in Stockholm.</p>
<p>The differences in the distribution are highly significant, which can be confirmed
by running a Chi-squared test or (probably more appropriately) a Mann-Whitney U test.</p>
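<p>Both tests are available in base R. With two vectors of ratings (toy values below, standing in for the real samples) the calls would look like this:</p>
<pre class="r"><code># toy rating samples standing in for the real data
berlin    <- c(5, 5, 5, 4, 3, 5, 2, 5, 4, 5)
stockholm <- c(4, 3, 5, 2, 4, 3, 1, 5, 3, 4)

# Chi-squared test on the contingency table of rating counts
city    <- rep(c("Berlin", "Stockholm"), each = 10)
ratings <- factor(c(berlin, stockholm), levels = 1:5)
chisq.test(table(city, ratings))

# Mann-Whitney U test (called wilcox.test in R) on the raw ratings
wilcox.test(berlin, stockholm)</code></pre>
<p>Note that on small samples like this toy one <code>chisq.test</code> will warn that its approximation may be inaccurate; on the full sample of 32,376 reviews this is not an issue.</p>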
<div id="comparability-of-ratings-from-berlin-and-stockholm" class="section level2">
<h2>Comparability of Ratings from Berlin and Stockholm</h2>
<p>So we have answered our first question of whether the rating distributions of the
two cities differ.</p>
<p>But can we conclude from this that the restaurant quality in
Berlin is higher than in Stockholm? At least from my personal experience this is
not likely to be the case and there are also many theoretical arguments that could
be made against this conclusion.</p>
<p>One pretty simple and obvious one can be made by looking a bit closer at the rating
process. The point rating in Google Maps is a one-dimensional measure that aims to
summarise the user’s impression of a location or service in one single number.</p>
<p>However, it is plausible to assume that in reality a person’s opinion of a
restaurant is formed by many different aspects, such as the taste of the food,
the atmosphere, value for money, the quality of service etc.</p>
<p>For example, let’s assume the following simple model for the overall point rating a
user will give to a restaurant:</p>
<p><span class="math display">\[s = w_{food} \cdot s_{food} + w_{atmosphere} \cdot s_{atmosphere} + w_{price} \cdot s_{price} + w_{service} \cdot s_{service}\]</span></p>
<p>Here the <span class="math inline">\(w\)</span>s are weights that sum to one and indicate how much importance a user
attributes to a certain aspect of the restaurant experience. The <span class="math inline">\(s\)</span> variables hold
the sub scores (ranging from one to five) that a user would give to each aspect of the
restaurant experience. So in our simple model the overall score a user gives to a
restaurant is just a weighted sum of all those sub scores.</p>
<p>Let’s further assume that for different individuals the aspects above vary in
importance for their overall satisfaction with their restaurant experience.
This would translate into different sets of weights <span class="math inline">\(w\)</span> for different users in
the equation above, resulting in user-specific rating models:</p>
<p><span class="math display">\[s_{user1} = 0.5 \cdot s_{food} + 0.2 \cdot s_{atmosphere} + 0.2 \cdot s_{price} + 0.1 \cdot s_{service} \\
s_{user2} = 0.3 \cdot s_{food} + 0.2 \cdot s_{atmosphere} + 0.3 \cdot s_{price} + 0.2 \cdot s_{service}\]</span></p>
<p>Now what that means is that even if these two users agree in their assessment for
each of the sub scores for a given restaurant, they might still come up with
different overall scores for the same place. Accordingly the rating of the first
user may have only limited relevance for the second user when choosing a restaurant
(and vice versa).</p>
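<p>We can make this concrete with a quick calculation. The sub scores below are made up purely for illustration; both users agree on them exactly, yet their overall ratings differ:</p>
<pre class="r"><code># identical sub scores for one restaurant (illustrative values)
s <- c(food = 4, atmosphere = 5, price = 2, service = 5)

# the user-specific weights from the equations above
w_user1 <- c(food = 0.5, atmosphere = 0.2, price = 0.2, service = 0.1)
w_user2 <- c(food = 0.3, atmosphere = 0.2, price = 0.3, service = 0.2)

sum(w_user1 * s)  # user 1's overall rating: 3.9
sum(w_user2 * s)  # user 2's overall rating: 3.8</code></pre>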
<p>The same argument holds on the city level: If the rating models of the people in
Stockholm and Berlin differ too much from each other the ratings become incomparable.</p>
<p>And even if the rating models were the same in the sense that they share the
same values for the weights <span class="math inline">\(w\)</span>, that still does not guarantee that the overall
scores are comparable. For this we would also need the sub scores <span class="math inline">\(s\)</span> to be
identical for each user, which may very well not be the case. For example,
maybe people from Stockholm are (on average) more picky when it comes to the
atmosphere of a restaurant and may assign a score of four for that aspect where a
user from Berlin might assign a five point score.</p>
<p>The derivations above are of course only theoretical and hinge on many simplifying
and unproven assumptions. Establishing how the rating process works in reality is
probably not possible with the data we have available here.</p>
<p>However, the argument we made above was only meant to discourage us from jumping
to premature conclusions about the actual restaurant quality in the two cities
when comparing their rating distribution. We have presented one possible scenario
in which this comparison could lead to the wrong conclusion but are not claiming
this is necessarily the case. In reality it might still very well be that Berlin simply has
better restaurants than Stockholm! 😉</p>
</div>
</div>
<div id="entropy-considerations" class="section level1">
<h1>Entropy Considerations</h1>
<p>So even though the rating distributions from the two cities may not tell us which
city has the better restaurants, is there maybe something else we can learn from
them?</p>
<p>In fact we could think about which of the two distributions provides a more
realistic picture of the restaurant quality in the respective city, which would
have a direct impact on its usefulness for choosing a good restaurant.</p>
<p>As we have already discussed the Berlin distribution is heavily concentrated in
the five point category (around 64 % of all ratings are in that category) whereas
the Stockholm distribution is much more spread out among all categories.</p>
<p>We can think of the two distributions as more nuanced versions of two extreme
cases: the uniform distribution (Stockholm) and a polarised one that is only
concentrated in two categories (Berlin).</p>
<p>The plot below shows what these two distributions could look like:</p>
<p><img src="https://natural-blogarithm.com/post/restaurant-reviews-stockholm-vs-berlin/index_files/figure-html/polarised_v_uniform_plot-1.png" width="90%" style="display: block; margin: auto;" /></p>
<p>The fact that the Berlin distribution is closer to the polarised distribution and
the Stockholm distribution closer to the uniform can also be verified by
calculating their respective <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropies</a>:</p>
<pre><code>## # A tibble: 2 x 2
## city_name entropy
## <chr> <dbl>
## 1 Berlin 1.13
## 2 Stockholm 1.40</code></pre>
<p>Entropy is a concept from information theory that can be interpreted as a measure
of how much information is encoded in a distribution; for a fixed number of
categories it is maximised by the uniform distribution.</p>
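<p>The entropies above can be computed in a few lines. The values in the table appear to use the natural logarithm, under which a uniform distribution over the five rating categories would attain the maximum possible entropy of log(5) ≈ 1.61:</p>
<pre class="r"><code># Shannon entropy (in nats) of a vector of category counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # convention: 0 * log(0) = 0
  -sum(p * log(p))
}

entropy(rep(1, 5))          # uniform over 5 categories: log(5), ~1.61
entropy(c(2, 1, 2, 5, 90))  # heavily polarised: much lower</code></pre>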
<p>Now how does this relate to the usefulness in picking a restaurant in the respective
cities? This depends a bit on what we believe the actual landscape of restaurants
in the two cities looks like.</p>
<p>If for example you believe that the vast majority of Berlin restaurants are top
notch then such a heavily polarised distribution may actually provide you with
an accurate picture of that reality.</p>
<p>Similarly, if you think that the Stockholm restaurant scene is closer to a pretty
uniform mix of restaurants of all kinds of quality then the somewhat uniform
shape of the rating distribution we have seen for Stockholm might be doing a good job.</p>
<p>On the other hand, if you doubt either of the two assessments above you could conclude
that the respective distribution is not accurately reflecting reality and therefore
not doing a good job at helping you to choose a nice restaurant.</p>
<p>A related side note from my personal experience: In Berlin I have been asked on several
occasions by restaurant owners, doctors or Uber drivers to give them an online
review in case I was happy with the service. However, they asked me to only leave
a review if I was planning to give the full five point score as otherwise it would
hurt their overall rating.</p>
<p>Having been made aware of these circumstances has actually changed my rating
behaviour in general. I now tend to give the highest rating when I am at all
satisfied in order to not hurt the restaurant’s or service provider’s overall
standing. Therefore I am personally contributing to the polarised and potentially
less informative rating distribution in Berlin. If this nudge effect is present
in a sufficient number of people it could partly explain the high polarisation
of the distribution.</p>
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>In this blog post we described how to obtain restaurant review data from the
Google Maps APIs and compared the point rating distribution of Stockholm and Berlin.</p>
<p>Based on our sample of restaurant reviews we were able to verify our subjective
impression that there are proportionally more five point ratings in Berlin than in
Stockholm.</p>
<p>While we demonstrated that it will be difficult to draw any conclusions about which
city has the better restaurants with this data, we could still discuss which
distribution might be more informative to the user.</p>
<p>In the next blog post we will investigate whether there are discernible differences
in the relationship between the review texts and the associated point rating between
the two cities.</p>
</div>
Does Predictor Order Affect the Results of Linear Regression?
https://natural-blogarithm.com/post/r-regression-predictor-order/
Tue, 02 Mar 2021 00:00:00 +0000https://natural-blogarithm.com/post/r-regression-predictor-order/
<script src="https://natural-blogarithm.com/post/r-regression-predictor-order/index.en_files/header-attrs/header-attrs.js"></script>
<div id="TOC">
</div>
<p>In the last few weeks I encountered the following question in two separate
discussions with friends and colleagues: <em>When doing linear regression in R, will
your coefficients vary if you change the order of the predictors in your data?</em></p>
<p>My intuitive answer to this was a clear <em>“No”</em>. Mostly because the practical
implications of the alternative would be too severe. It would make linear regression
unusable in many research settings in which you rely on the fitted coefficients
to understand the relationship between the predictors and the independent
variable. After all, what would be the correct, “canonical” order of inputs that
you should use to estimate these effects?</p>
<p>The problem might be less relevant in other settings where you are more focused
on building models with high predictive performance, as is the case in many
business applications. In these scenarios you are likely to be using more
powerful, black-box models such as XGBoost or random forests to begin with.
However, even then it is common practice to use a regression model as a baseline to
better understand your data and benchmark your production models against.</p>
<p>So the issue seemed very relevant to me but I realised I did not have many arguments
handy to support my intuitive answer. Therefore I decided to spend some time to explore
the issue further and in the process refresh some of the numerical theory behind
how the linear regression problem is solved on a computer.</p>
<p>It turns out that over the course of my investigation, which I want to summarise
in this blog post, my simple <em>“No”</em> has become more of an <em>“It depends…”</em></p>
<p>Note that I will only be looking at R’s standard linear regression function as
implemented in the <code>lm</code> function from the <code>stats</code> package.</p>
<div id="a-first-look-at-the-problem" class="section level1">
<h1>A First Look at the Problem</h1>
<p>First of all, let’s try to get some more clarity about the exact question that
we will be trying to answer here. Let’s say we are fitting two linear regression
models on the good old <code>mtcars</code> dataset:</p>
<pre class="r"><code>> m1 <-
+ lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)
> m2 <-
+ lm(formula = mpg ~ qsec + hp + gear + cyl + drat + wt + disp, data = mtcars)</code></pre>
<p>These two models contain the same predictors, just in a different order.
Now if we compare the fitted coefficients of both models, can we expect them to
be the same?</p>
<p>So let’s put the coefficients of both models side by side and calculate their
<a href="https://en.wikipedia.org/wiki/Approximation_error">relative error</a>:</p>
<pre><code>## key coeff_m1 coeff_m2 rel_error
## 1 (Intercept) 18.58647751 18.58647751 1.338016e-15
## 2 cyl -0.50123400 -0.50123400 5.094453e-15
## 3 disp 0.01662624 0.01662624 1.043365e-15
## 4 hp -0.02424687 -0.02424687 1.430884e-16
## 5 drat 1.00091587 1.00091587 1.331049e-15
## 6 wt -4.33688975 -4.33688975 4.095923e-16
## 7 qsec 0.60667859 0.60667859 1.830002e-16
## 8 gear 1.04427289 1.04427289 6.378925e-16</code></pre>
<p>So while the way the values are displayed suggests that they might be identical,
we can see from the column <code>rel_error</code> that they are actually <em>not exactly the same</em>.</p>
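<p>The comparison above is easy to reproduce; the sketch below refits the two models and computes the relative error after aligning the coefficients by name:</p>
<pre class="r"><code>m1 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)
m2 <- lm(mpg ~ qsec + hp + gear + cyl + drat + wt + disp, data = mtcars)

c1 <- coefficients(m1)
c2 <- coefficients(m2)[names(c1)]  # align m2's coefficients to m1's order

# relative error of each coefficient pair
abs(c1 - c2) / abs(c1)</code></pre>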
<p>To make sure that this fluctuation is actually a result of the predictor permutation
and not a result of some potential non-determinism within the <code>lm</code> function,
let’s see what happens when we fit the exact same model twice:</p>
<pre class="r"><code>> all(coefficients(lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)) ==
+ coefficients(lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)))
## [1] TRUE</code></pre>
<p>It is reassuring to see that in this case the coefficients are exactly the same.</p>
<p>So here we have a first interesting finding: We can expect at least slight fluctuations
in the coefficients when we play around with the order of the predictor variables.</p>
<p>Later on we will get a better idea where these small fluctuations come from
when we look at how R solves the linear regression problem. However, at least
in our example above, fluctuations on an order of <span class="math inline">\(\approx 10^{-16}\)</span> will probably
have no practical relevance. They also happen to be on the same order of magnitude
as the machine precision for variables of type double.</p>
</div>
<div id="simulation-study" class="section level1">
<h1>Simulation Study</h1>
<p>So after our little experiment with the <code>mtcars</code> dataset, can we consider our
question answered? Not quite, as running two models on one single dataset can hardly
be sufficient to answer such a general question.</p>
<p>For getting a more solid answer we will need to take a look at the theory of
linear regression which we will do in a minute. But first let’s take an
intermediate step and scale our little <code>mtcars</code> experiment to a larger amount of
datasets. This will allow us to see if there are other, more severe cases of
fluctuating coefficients and will also serve as a nice transition into the more
theoretical parts of this post.</p>
<div id="getting-datasets-from-openml" class="section level2">
<h2>Getting Datasets from OpenML</h2>
<p>To do this simulation study we obtained a number of datasets from the great
<a href="https://www.openml.org/">OpenML</a> dataset repository using their provided <a href="https://cran.r-project.org/web/packages/OpenML/index.html">R package</a>. Overall
207 datasets with between 25 and 1000 observations and between 3 and 59 predictors
were obtained this way. This is obviously not a big sample of datasets to base
general statements on. However, it should provide us with enough educational
examples to get a better understanding and intuition of the kinds of issues you may
encounter when fitting linear models “in the wild”.</p>
<p>The details and results from the study can be reviewed and replicated from the
R Notebook <a href="https://github.com/jakobludewig/r_projects/tree/main/2021-02-11%20Predictor%20order%20in%20R%20linear%20regression">here</a>.</p>
<p>For each of the processed datasets we generated a number of predictor permutations
and fitted a model for each of them using the same target variable each time.
Overall we ended up with 3,764 fitted models with an average of 19.5 fitted
coefficients each.</p>
</div>
<div id="quantifying-the-coefficient-fluctuations" class="section level2">
<h2>Quantifying the Coefficient Fluctuations</h2>
<p>As we have already learned from our <code>mtcars</code> example above, we cannot expect the
coefficients across different fits to be exactly the same. So we need to come up
with a meaningful way to detect fluctuations that we would consider problematic.</p>
<p>For this we will use the <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">coefficient of variation (CV)</a> which is equal to
the standard deviation normalised by the mean. It is therefore a measure of
spread/variation of a set of observations but unlike the standard deviation it
is expressed on a unit-less scale relative to the mean. This will allow us to
compare the CVs of the different coefficients.</p>
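<p>As a minimal illustration, the CV makes two samples with the same absolute spread but different means comparable on a relative scale:</p>
<pre class="r"><code># coefficient of variation: standard deviation normalised by the mean
cv <- function(x) sd(x) / mean(x)

cv(c(99, 100, 101))   # sd = 1 around a mean of 100 -> CV = 0.01
cv(c(0.9, 1.0, 1.1))  # sd = 0.1 around a mean of 1 -> CV = 0.1</code></pre>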
<p>For every predictor in each dataset we calculated the CV across all the fitted
coefficients from the different models (i.e. predictor permutations). To
identify extreme cases of fluctuations we flagged those predictors
for which a CV of more than <span class="math inline">\(10^{-10}\)</span> was measured. This limit has been chosen
somewhat arbitrarily but should represent a quite conservative approach as it
limits the maximum deviation for any given coefficient to <span class="math inline">\(\sqrt{N-1} \cdot 10^{-10}\)</span>.
Here <span class="math inline">\(N\)</span> is the number of fits for that coefficient which in our case is between
2 and 20 depending on the dataset.</p>
</div>
<div id="simulation-results" class="section level2">
<h2>Simulation Results</h2>
<p>In our simulation around 14.3 % of the coefficients showed a CV value above the
chosen threshold of <span class="math inline">\(10^{-10}\)</span>, involving around 7.7 % of the datasets. The plot below also shows the overall distribution of the CV in a histogram on a log 10 scale.</p>
<p><img src="https://natural-blogarithm.com/post/r-regression-predictor-order/index.en_files/figure-html/cv_dist-1.png" width="864" style="display: block; margin: auto;" /></p>
<p>From the plot we can see that some of the coefficients fluctuate on a magnitude
of up to <span class="math inline">\(10^3\)</span>, way above our defined threshold of <span class="math inline">\(10^{-10}\)</span> (dashed red line).
Unlike the fluctuations on the <code>mtcars</code> dataset these kinds of fluctuations can
certainly not be ignored in practice.</p>
</div>
</div>
<div id="analysis" class="section level1">
<h1>Analysis</h1>
<p>So let’s try to do some troubleshooting to find out how these fluctuations come
about. First, let’s take a closer look at the datasets in which the problematic
coefficients occur:</p>
<pre><code>## # A tibble: 16 x 7
## dataset_id max_cv high_cv perc_coefficients_missi… dep_var n_obs n_cols
## <int> <dbl> <lgl> <dbl> <chr> <int> <int>
## 1 195 2.64 TRUE 0.0476 price 159 21
## 2 421 5321. TRUE 0.426 oz54 31 54
## 3 482 11.5 TRUE 0.0217 events 559 46
## 4 487 410. TRUE 0.268 response_1 30 41
## 5 513 3.91 TRUE 0.0217 events 559 46
## 6 518 1.02 TRUE 0.222 Violence_ti… 74 9
## 7 521 1284. TRUE 0.118 Team_1_wins 120 34
## 8 527 11.3 TRUE 0 Gore00 67 15
## 9 530 1.17 TRUE 0.0833 GDP 66 12
## 10 533 35.0 TRUE 0.0217 events 559 46
## 11 536 24.1 TRUE 0.0217 events 559 46
## 12 543 2462. TRUE 0.111 LSTAT 506 117
## 13 551 30.5 TRUE 0.188 Accidents 108 16
## 14 1051 70.4 TRUE 0.238 ACT_EFFORT 60 42
## 15 1076 194. TRUE 0.383 act_effort 93 107
## 16 1091 49.8 TRUE 0 NOx 59 16</code></pre>
<p>One thing that stands out is that for almost all of the problematic datasets there
seem to be coefficients that could not be fit (see column <code>perc_coefficients_missing</code>).
This suggests that the <code>lm</code> function ran into some major problems while trying
to fit some of the models on these datasets.</p>
<div id="underdetermined-problems" class="section level2">
<h2>Underdetermined Problems</h2>
<p>To understand how that happens let’s start by looking at a subset of the problematic
cases: some of the datasets (<code>dataset_id</code>s 421, 487, 1076) contain more predictors
than there are observations in the dataset (<code>n_pred</code> > <code>n_obs</code>). For linear
regression this poses a problem as it means that we are trying to estimate more
parameters than we have data for (i.e. the problem is <em>underdetermined</em>).</p>
<p>It is easy to understand the issue when thinking about a scenario in which we
have only one predictor <span class="math inline">\(x\)</span> and a target variable <span class="math inline">\(y\)</span>. Let’s say we try to fit a
regression model including an intercept (the offset from the y axis) and a slope
for <span class="math inline">\(x\)</span>:</p>
<p><span class="math display">\[y = a + b*x\]</span></p>
<p>Further, let’s assume we only have one observation <span class="math inline">\((x,y) = (1,1)\)</span> in our data
to estimate the two coefficients <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> (more coefficients than observations). It is easy to
see that there are infinitely many pairs <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> that will fit our data, some
of which are shown below:</p>
<p><img src="https://natural-blogarithm.com/post/r-regression-predictor-order/index.en_files/figure-html/unnamed-chunk-5-1.png" width="864" style="display: block; margin: auto;" /></p>
<p>So when you are asking R to fit a linear regression in this scenario you are giving
it somewhat of an impossible task. However, it will not throw an error (at least
not when using the default settings). It will produce a solution, in our particular
example the flat red line in the plot above:</p>
<pre class="r"><code>> lm(y ~ ., data = tibble(x = 1, y = 1))
##
## Call:
## lm(formula = y ~ ., data = tibble(x = 1, y = 1))
##
## Coefficients:
## (Intercept) x
## 1 NA</code></pre>
<p>So in the case of <code>n_pred > n_obs</code> we have an explanation for how the missing
coefficients occur: R is “running out of data” for the surplus coefficients and
will assign <code>NA</code> to them.</p>
<p>The <em>running out of data</em> part in the sentence above is a bit of a simplification.
However, the estimation of coefficients seems to actually happen sequentially.
At least that is the impression we get from a slightly more complicated example:</p>
<pre class="r"><code>> example_2predictors <-
+ tibble(x1 = c(1, 2),
+ x2 = c(-1, 1)) %>%
+ mutate(y = x1 + x2)
>
> lm(y ~ x1 + x2, data = example_2predictors)
##
## Call:
## lm(formula = y ~ x1 + x2, data = example_2predictors)
##
## Coefficients:
## (Intercept) x1 x2
## -3 3 NA
> lm(y ~ x2 + x1, data = example_2predictors)
##
## Call:
## lm(formula = y ~ x2 + x1, data = example_2predictors)
##
## Coefficients:
## (Intercept) x2 x1
## 1.5 1.5 NA</code></pre>
<p>We can see that at least in this example <code>lm</code> seems to prioritise the estimation
of the coefficients in the order that we supply it to the model formula and
assigns <code>NA</code> to the second predictor.</p>
<p>Even better, the example also provides a case of fluctuating coefficients:
The intercept in the first model is -3 and becomes 1.5 in the second one.
This intuitively makes sense as well: When we try to fit the model with too
little data to estimate all the predictors what <code>lm</code> ends up doing is fitting
a “reduced” model with fewer predictors. This limited set of predictors will
explain the data in a different way and therefore result in different coefficients.</p>
</div>
<div id="the-least-squares-problem" class="section level2">
<h2>The Least Squares Problem</h2>
<p>So now we have a pretty good understanding of where the fluctuating
coefficients originate in the case of underdetermined problems like the ones
above. The remaining datasets on the list above are not underdetermined, but upon
closer inspection we can see that they are suffering from the same underlying
issue.</p>
<p>For this it is necessary to understand the theoretical foundation of how linear
regression problems are solved numerically, the <em>least squares method</em>. A
good and concise derivation of it can be found in these <a href="http://pillowlab.princeton.edu/teaching/statneuro2018/slides/notes03b_LeastSquaresRegression.pdf">lecture notes</a>.</p>
<p>The least squares formulation for linear regression turns out to be as follows:</p>
<p><span class="math display">\[X^tX\beta = X^tY\]</span>
In the equation above <span class="math inline">\(Y\)</span> is the target variable we want to fit our data to and
<span class="math inline">\(\beta\)</span> is the vector holding the coefficients we want to estimate. <span class="math inline">\(X\)</span> is the
<em>design matrix</em> representing our data. Each of its <span class="math inline">\(n\)</span> rows corresponds to one
observation and each of its <span class="math inline">\(p\)</span> columns to one predictor.</p>
<p>Now in order for us to solve the least squares problem we need the matrix <span class="math inline">\(X^tX\)</span>
to be invertible. If that is the case we can simply multiply both sides of the
equation above by <span class="math inline">\((X^tX)^{-1}\)</span> and have a solution for our coefficients <span class="math inline">\(\beta\)</span>.</p>
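<p>For a well-conditioned problem we can mimic this solution directly with base R’s linear algebra functions and compare it to <code>lm</code>’s result. Note that solving the normal equations explicitly like this is numerically less stable than what <code>lm</code> actually does, so this sketch is purely illustrative:</p>
<pre class="r"><code># design matrix with an intercept column, built from mtcars
X <- cbind(1, as.matrix(mtcars[, c("cyl", "disp", "hp")]))
y <- mtcars$mpg

# solve the normal equations X^t X beta = X^t y
beta_normal <- solve(t(X) %*% X, t(X) %*% y)

# compare with lm's solution
beta_lm <- coefficients(lm(mpg ~ cyl + disp + hp, data = mtcars))
cbind(beta_normal, beta_lm)</code></pre>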
<p>However, for <span class="math inline">\(X^tX\)</span> to be invertible it needs to have <a href="https://en.wikipedia.org/wiki/Rank_(linear_algebra)"><em>full rank</em></a>. The rank is a
fundamental property of a matrix which has many different interpretations and
definitions. For our purposes it is only important to know that <span class="math inline">\(X^tX\)</span> has
full rank if its rank is equal to the number of its columns <span class="math inline">\(p\)</span> (which in
our case is the same as the number of predictors). By one of the properties of the
rank, the rank of <span class="math inline">\(X^tX\)</span> can be at most the minimum of <span class="math inline">\(n\)</span> and <span class="math inline">\(p\)</span>.</p>
<p>Now we can see why we ran into problems when we tried to fit a linear
regression model for the datasets that had fewer observations (<span class="math inline">\(n\)</span>) than predictors
(<span class="math inline">\(p\)</span>): The matrix <span class="math inline">\(X^tX\)</span> had rank <span class="math inline">\(< p\)</span> (i.e. not full rank) and was therefore
not invertible.</p>
<p>As a consequence the underlying least squares problem for these cases was not
solvable (at least not in a proper way as we will see in a bit).</p>
</div>
<div id="multicollinearity" class="section level2">
<h2>Multicollinearity</h2>
<p>With our understanding of the underdetermined problems and the issues they cause
in the least squares formulation we now have all the necessary tools to explain
all the other cases of fluctuating coefficients. As we have pointed out, the
remaining cases are not underdetermined (<span class="math inline">\(n \geq p\)</span>); instead they suffer from a
slightly more subtle issue called <em>multicollinearity</em>.</p>
<p>Multicollinearity arises when some of the columns in your data can be represented
as a <a href="https://en.wikipedia.org/wiki/Linear_combination"><em>linear combination</em></a> of
a subset of the other columns.</p>
<p>For example, the columns of the following dataset have multicollinearity:</p>
<pre><code>## # A tibble: 5 x 4
## y x1 x2 x3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -1 0
## 2 1 2 1 3
## 3 1 3 -1 2
## 4 0 4 1 5
## 5 0 5 -1 4</code></pre>
<p>If we know two of the three predictor columns we can always calculate the third
(e.g. <code>x3 = x1 + x2</code>). One of the columns is simply representing the same
data as the other two in a redundant fashion. We could remove any one of them
and still retain the same information in the data.</p>
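<p>We can verify the rank deficiency of this example directly: R’s <code>qr</code> function reports the numerical rank of a matrix, and for these three predictor columns it comes out below the column count:</p>
<pre class="r"><code># the three predictor columns of the example above
X <- cbind(x1 = c(1, 2, 3, 4, 5),
           x2 = c(-1, 1, -1, 1, -1),
           x3 = c(0, 3, 2, 5, 4))  # x3 = x1 + x2

qr(X)$rank  # 2: one of the three columns is redundant</code></pre>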
<p>It might not be clear at first sight why having additional columns that represent
the same data in a redundant way should pose problems but it becomes obvious when
we analyse it with the same tools that we used above.</p>
<p>One of the definitions for the rank of a matrix is the
<em>number of <a href="https://en.wikipedia.org/wiki/Linear_independence">linearly independent</a> columns</em>. But as we have just seen in a dataset
with multicollinearity some of the columns will be linear combinations of other
columns. Therefore if our matrix <span class="math inline">\(X\)</span> has multicollinearity in its columns its
rank will be less than the number of its columns <span class="math inline">\(p\)</span>. And since
<span class="math inline">\(rank(X^tX) = rank(X) < p\)</span> this means that <span class="math inline">\(X^tX\)</span> is not invertible.</p>
<p>Therefore we can trace the remaining cases of fluctuating coefficients back to
the <span class="math inline">\(n < p\)</span> case and explain them in the same way!</p>
<p>Multicollinearity can be a bit difficult to detect sometimes but it can be found
in all of the remaining datasets in which we have detected the coefficient
fluctuations (see the <a href="https://github.com/jakobludewig/r_projects/tree/main/2021-02-11%20Predictor%20order%20in%20R%20linear%20regression">accompanying R notebook document</a>). In the case of datasets 527 and 1091 we
have a special case where we have collinearity between the predictors and the
target variable which can also explain why there are no missing coefficients in
these cases.</p>
</div>
</div>
<div id="why-lm-still-returns-results-qr-decomposition" class="section level1">
<h1>Why <em>lm</em> still returns results: QR Decomposition</h1>
<p>So we have been able to explain all the cases of coefficient fluctuations by
violations of basic assumptions of linear regression that every undergraduate
textbook should warn you about:</p>
<ul>
<li>make sure you have at least as many observations as predictors you want to fit</li>
<li>make sure your data has no multicollinearity in it</li>
</ul>
<p>However, when we tried to fit a linear regression, R was able to provide us with
results (albeit with some coefficients missing) without any error or even a
warning.</p>
<p>How can it come to any solution if the inverse of <span class="math inline">\(X^tX\)</span> in the matrix formulation
above does not exist?</p>
<p>To understand this it is necessary to dive a bit deeper into the internal workings
of the <code>lm</code> function in R. I highly recommend doing so, as it will teach you a
lot about how R works internally and how it interfaces with other languages (the
actual numerical work is done by Fortran code dating back to the 1970s).
<a href="http://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html">This blog post</a>
offers a great guide to follow along if you are interested in those implementation
details.</p>
<p>The bottom line for us here is that to solve the linear regression problem R will
decompose our design matrix <span class="math inline">\(X\)</span> into the product of an orthogonal matrix <span class="math inline">\(Q\)</span> and
an upper triangular matrix <span class="math inline">\(R\)</span>. With this factorization it becomes quite straightforward to solve the overall
matrix equation for our coefficient vector <span class="math inline">\(\beta\)</span>.</p>
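<p>To make this concrete, here is a minimal sketch of that two-step solve using R’s <code>qr()</code> helpers on a full-rank toy model (since <code>Q</code> has orthonormal columns, only a triangular backsolve is needed):</p>

```r
# Solve the least squares problem "by hand" via X = Q R:
# R beta = Q^t y, which only requires a triangular backsolve
y <- mtcars$mpg
X <- model.matrix(mpg ~ wt + hp, data = mtcars)

qr_X <- qr(X)
beta <- drop(backsolve(qr.R(qr_X), crossprod(qr.Q(qr_X), y)))

# Agrees with lm() up to floating point noise
max(abs(beta - coef(lm(mpg ~ wt + hp, data = mtcars))))
```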
<p>Now without going into too much detail here, we can say that the QR decomposition
can be computed even when <span class="math inline">\(X\)</span> does not have full column rank, as the computations essentially proceed one column, and hence one coefficient, at a time.</p>
<p>So the QR algorithm will produce meaningful values for the first few coefficients, up
to the point where it aborts the calculation and returns <code>NA</code> for the remaining
coefficients. This is exactly the behaviour we saw in our little example
datasets above and in our simulation study.</p>
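<p>The <code>NA</code> pattern is easy to reproduce: duplicating a predictor makes the design matrix rank deficient, and the duplicate’s coefficient comes back as <code>NA</code>. A minimal sketch:</p>

```r
d <- mtcars
d$wt2 <- d$wt  # exact copy of an existing predictor

# The first columns get fitted normally, the aliased one returns NA
coef(lm(mpg ~ wt + wt2, data = d))
#> (Intercept)          wt         wt2
#>   37.285126   -5.344472          NA
```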
<p>I was a bit surprised to see that R just silently processes the data even if
it encounters these kinds of problems with the QR decomposition (I certainly
remembered the behaviour differently). The <code>lm</code> function actually provides a way
to change this behaviour: setting the <code>singular.ok</code> argument to <code>FALSE</code> makes
<code>lm</code> error out when it runs into a matrix without full rank, returning no results at all. The default for this parameter, however, is <code>TRUE</code>, resulting in the behaviour we witnessed in our analysis. Interestingly, the help text of the <code>lm</code> function mentions that this parameter defaulted to <code>FALSE</code> in R’s predecessor <em>S</em>.</p>
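<p>A quick sketch of the difference this parameter makes, again with an artificially duplicated predictor:</p>

```r
d <- mtcars
d$wt2 <- d$wt  # force a rank-deficient design matrix

# Default (singular.ok = TRUE): fits silently, aliased coefficient is NA
anyNA(coef(lm(mpg ~ wt + wt2, data = d)))
#> [1] TRUE

# singular.ok = FALSE: refuses to fit the rank-deficient model
lm(mpg ~ wt + wt2, data = d, singular.ok = FALSE)
#> Error in lm.fit(...) : singular fit encountered
```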
<p>The output of <code>lm</code> provides some information about the QR fit, most importantly
the rank of the matrix it encountered. So let’s extract this information and
append it to all the datasets in which we found fluctuations
in the coefficients:</p>
<pre><code>## # A tibble: 16 x 8
## dataset_id high_cv perc_coefficient… dep_var n_obs n_cols rank_qr singular_qr
## <int> <lgl> <dbl> <chr> <int> <int> <int> <lgl>
## 1 195 TRUE 0.0476 price 159 21 20 TRUE
## 2 421 TRUE 0.426 oz54 31 54 31 TRUE
## 3 482 TRUE 0.0217 events 559 46 45 TRUE
## 4 487 TRUE 0.268 respon… 30 41 30 TRUE
## 5 513 TRUE 0.0217 events 559 46 45 TRUE
## 6 518 TRUE 0.222 Violen… 74 9 7 TRUE
## 7 521 TRUE 0.118 Team_1… 120 34 30 TRUE
## 8 527 TRUE 0 Gore00 67 15 15 FALSE
## 9 530 TRUE 0.0833 GDP 66 12 11 TRUE
## 10 533 TRUE 0.0217 events 559 46 45 TRUE
## 11 536 TRUE 0.0217 events 559 46 45 TRUE
## 12 543 TRUE 0.111 LSTAT 506 117 104 TRUE
## 13 551 TRUE 0.188 Accide… 108 16 13 TRUE
## 14 1051 TRUE 0.238 ACT_EF… 60 42 32 TRUE
## 15 1076 TRUE 0.383 act_ef… 93 107 66 TRUE
## 16 1091 TRUE 0 NOx 59 16 16 FALSE</code></pre>
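<p>For reference, the rank and singularity columns above can be extracted from any fitted model along these lines (shown here on a small rank-deficient toy fit rather than the OpenML datasets):</p>

```r
d <- mtcars
d$wt2 <- d$wt  # aliased column makes the fit rank deficient
fit <- lm(mpg ~ wt + wt2, data = d)

fit$rank                             # rank the QR decomposition found
#> [1] 2
ncol(model.matrix(fit))              # number of columns in the design matrix
#> [1] 3
fit$rank < ncol(model.matrix(fit))   # TRUE means the QR fit was singular
#> [1] TRUE
```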
<p>We can see that almost all of the datasets with coefficient fluctuations
encountered a singular QR decomposition. The only two exceptions
(datasets 527 and 1091) are the ones mentioned above in which
there is collinearity between the target variable and the predictors. The defect
of those datasets is therefore of a slightly different nature (see details in the
R notebook), but they definitely represent ill-posed linear regression problems.</p>
<p>As a side note: the QR decomposition is only calculated approximately using numerical algorithms (in the case of R, <a href="https://en.wikipedia.org/wiki/Householder_transformation#QR_decomposition">Householder reflections</a> are used). The approximative nature of these calculations is likely the
reason for the minimal coefficient fluctuations we saw in the <code>mtcars</code> example and in our simulation study.</p>
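<p>That floating point noise is easy to observe directly: permuting the predictors of a well-behaved, full-rank model changes the coefficients only at roughly machine precision level. A minimal sketch:</p>

```r
# Same model, two different predictor orderings
c1 <- coef(lm(mpg ~ wt + hp + disp, data = mtcars))
c2 <- coef(lm(mpg ~ disp + hp + wt, data = mtcars))

# Tiny, but typically not exactly zero
max(abs(c1 - c2[names(c1)]))
```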
</div>
<div id="pivoting-in-fortran" class="section level1">
<h1>Pivoting in Fortran</h1>
<p>One final thing I came across while trying to make sense of the Fortran code
that does the actual QR decomposition is that it applies a
<em>pivoting strategy</em> to improve the <a href="https://en.wikipedia.org/wiki/Condition_number">numerical stability</a> of the problem.
You can find the code and some documentation about the pivoting
<a href="https://github.com/wch/r-source/blob/trunk/src/appl/dqrdc2.f">here</a>.</p>
<p>The application of the pivoting strategy implies that in certain cases the
Fortran function will try to improve the numerical stability of the QR
decomposition by moving some of the columns to the end of the matrix.</p>
<p>This behaviour is relevant for our analysis here as it means there is some
reordering of the columns going on which we do not have any control over in R.</p>
<p>This might influence our results: for the datasets in which we found no
fluctuations, it could simply be that our prescribed column ordering was
overridden in the Fortran code and therefore never had the chance to cause coefficient fluctuations.</p>
<p>Again the information whether pivoting was applied is returned in the output of
the <code>lm</code> function so let’s append this information to our dataset summary and
filter for all datasets on which pivoting was applied:</p>
<pre><code>## # A tibble: 18 x 7
## dataset_id max_cv high_cv perc_coefficients_m… rank_qr singular_qr pivoting
## <int> <dbl> <lgl> <dbl> <int> <lgl> <lgl>
## 1 195 2.64e+ 0 TRUE 0.0476 20 TRUE TRUE
## 2 199 6.72e-15 FALSE 0.143 6 TRUE TRUE
## 3 217 2.00e-14 FALSE 0.0357 27 TRUE TRUE
## 4 421 5.32e+ 3 TRUE 0.426 31 TRUE TRUE
## 5 482 1.15e+ 1 TRUE 0.0217 45 TRUE TRUE
## 6 494 0. FALSE 0.333 2 TRUE TRUE
## 7 513 3.91e+ 0 TRUE 0.0217 45 TRUE TRUE
## 8 518 1.02e+ 0 TRUE 0.222 7 TRUE TRUE
## 9 521 1.28e+ 3 TRUE 0.118 30 TRUE TRUE
## 10 530 1.17e+ 0 TRUE 0.0833 11 TRUE TRUE
## 11 536 2.41e+ 1 TRUE 0.0217 45 TRUE TRUE
## 12 543 2.46e+ 3 TRUE 0.111 104 TRUE TRUE
## 13 546 1.94e-13 FALSE 0.0769 24 TRUE TRUE
## 14 551 3.05e+ 1 TRUE 0.188 13 TRUE TRUE
## 15 703 1.17e-11 FALSE 0.104 302 TRUE TRUE
## 16 1051 7.04e+ 1 TRUE 0.238 32 TRUE TRUE
## 17 1076 1.94e+ 2 TRUE 0.383 66 TRUE TRUE
## 18 1245 8.26e-15 FALSE 0.115 23 TRUE TRUE</code></pre>
<p>As we can see, pivoting was only applied in datasets where we also encountered
a singular QR decomposition, which makes sense, as feeding a rank-deficient matrix
into the algorithm is bound to cause poor numerical stability. Among those datasets
there are a total of 6 in which pivoting was applied but no coefficient
fluctuations were detected. Whether the pivoting prevented the
fluctuations from happening, or whether another property of these datasets did, is hard
to say.</p>
<p>Either way we can conclude that the pivoting has not interfered with the results
of the vast majority of the datasets in which we did not see any fluctuations.</p>
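<p>For completeness, here is how the pivoting information can be inspected on a fitted model. A sketch with an artificially aliased column placed in the middle of the formula, so the reordering becomes visible:</p>

```r
d <- mtcars
d$wt2 <- d$wt  # aliased column sitting between two regular predictors
fit <- lm(mpg ~ wt + wt2 + hp, data = d)

# The pivot vector shows the column order actually used internally;
# the aliased column (number 3) has been moved to the end
fit$qr$pivot
#> [1] 1 2 4 3

# Pivoting was applied whenever this is FALSE
identical(fit$qr$pivot, seq_along(fit$qr$pivot))
#> [1] FALSE
```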
</div>
<div id="conclusion" class="section level1">
<h1>Conclusion</h1>
<p>We started this post with a seemingly trivial question which in the end took
us on quite a journey: a simulation study, revisiting the linear
algebra behind the linear regression problem, and a closer look at how the
resulting matrix equations are solved numerically in R (or rather in Fortran).</p>
<p>All the cases of coefficient fluctuations we could detect in our investigation
were caused by violations of basic assumptions of linear regression, not by the
permutation of the input variables per se.</p>
<p>Therefore our conclusion is that as long as you do your due diligence when fitting
regression models you should not have to worry about the order of your predictors
having any kind of impact on your estimated coefficients.</p>
</div>
Inaugural Post
https://natural-blogarithm.com/post/inaugural-post/
Wed, 10 Feb 2021 23:50:29 +0100https://natural-blogarithm.com/post/inaugural-post/
<p>This is the first ever post on our freshly set up blog <strong>Natural Blogarithm</strong>.</p>
<p>This post is merely intended as a smoke test to make sure everything is working
as expected. However, it is also a promise of future things to come, so stay tuned!</p>
<p>In the meantime, if you would like to find out more about the idea behind
Natural Blogarithm and the people who will be writing here, please check out the
<a href="https://natural-blogarithm.com/about">About</a> section.</p>
<p>Until then, <br>Stephan & Jakob</p>