Artificial Intelligence & the Shape of Large Data
Google AI is increasingly able to understand the differences and similarities between complex, large datasets
With vast datasets, understanding the shape of the data can matter. If you could characterise the similarities, alignment, or discontinuities within a large dataset, would that be useful? A recent post on the Google AI blog named Understanding the Shape of Large-Scale Data approaches this problem.
It proposes that unsupervised representation learning for graphs is an important problem. So how can it be approached?
Different graphs can map out a variety of relationships, and in different contexts they can mean different things.
A network of web pages, of connections between devices, of social ties, or of interactions between molecules may serve different goals and yield different insights.
The post describes one typical approach:
- First, formalise the data with a mathematical model of how items relate to each other: a graph.
- Then predict some property of each graph as an aggregate (i.e., one label per graph).
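The two steps above can be sketched in a few lines. This is a minimal illustration of my own, not code from the papers: a graph is held as an adjacency matrix, and a hand-crafted feature vector is aggregated per graph, exactly the kind of whole-graph summary one would then attach a single label to.

```python
import numpy as np

def graph_features(adj):
    """Aggregate a whole graph into one feature vector:
    node count, edge count, average degree, triangle count."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    edges = adj.sum() / 2.0                      # undirected edge count
    avg_degree = adj.sum(axis=1).mean()
    triangles = np.trace(adj @ adj @ adj) / 6.0  # closed 3-walks / 6
    return np.array([n, edges, avg_degree, triangles])

# A 3-node triangle graph as an adjacency matrix
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
print(graph_features(triangle))  # prints [3. 3. 2. 1.]
```

With features like these in hand, predicting one label per graph is a standard supervised learning task; the catch, as the post argues, is that this relies on labels and on features someone knew to engineer.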
This may work when the relationships are few and the dataset is small.
Yet what happens when the dataset grows large?
Labelling quickly becomes unmanageable.
“Ideally, one would want a way to represent graphs as vectors without costly labelling. The problem becomes harder with increasing graph size — in the molecule case humans possess some knowledge about their properties, however, reasoning about larger, more complex datasets becomes increasingly difficult.”
To deal with this, the team has published two papers.
- “Just SLaQ When You Approximate: Accurate Spectral Distances for Web-Scale Graphs” (published at WWW’20), which improves on the scalability of their earlier research.
- “DDGK: Learning Graph Representations for Deep Divergence Graph Kernels” (published at WWW’19).
On scalability, the first paper says:
“…practice, however, the applicability of these methods is often limited by the scalability of eigendecomposition itself: it takes cubic time to compute all eigenvalues and eigenvectors of a given graph.”
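To see why this matters, here is a toy comparison of my own, not the paper's implementation: the exact heat trace tr(exp(-tL)) of a graph Laplacian needs the full eigendecomposition the quote warns about, while a stochastic estimator needs only matrix-vector products. SLaQ itself uses stochastic Lanczos quadrature; the simpler Hutchinson-plus-Taylor sketch below stands in for the same idea.

```python
import numpy as np

def laplacian(adj):
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def heat_trace_exact(L, t=1.0):
    # O(n^3): full eigendecomposition, exactly the bottleneck the quote describes
    eigvals = np.linalg.eigvalsh(L)
    return np.exp(-t * eigvals).sum()

def heat_trace_hutchinson(L, t=1.0, num_probes=200, seed=None):
    # Unbiased estimate of tr(exp(-tL)) from random +/-1 probe vectors;
    # exp(-tL) v is computed via a truncated Taylor series, so only
    # matrix-vector products are needed and no eigendecomposition.
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)
        term, acc = v.copy(), v.copy()
        for k in range(1, 20):               # Taylor series of exp(-tL) v
            term = (-t / k) * (L @ term)
            acc += term
        total += v @ acc
    return total / num_probes
```

On a toy triangle graph the two agree closely; on a web-scale graph only the second is feasible, since it touches the matrix purely through products with vectors.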
They measured the runtime of approximation techniques on huge graphs with millions of nodes and billions of edges.
They managed to process a large dataset with 5 billion nodes in an hour.
- SLaQ allows us to compute principled representations for vast datasets.
- DDGK introduces a mechanism for automatically learning alignments between datasets.
One application they particularly aim at is recommendation systems. That is bread and butter for Google: finding what you need in a computationally efficient manner, presented in a neat, human-understandable format, tailored to you as much as possible.
They may also want to understand changes in time-varying graph datasets.
The code from both papers is available on GitHub, in the Google Research repository for graph embeddings.
Their experiments show how they can capture similarities and differences across graphs of different types (language, biology, and social interactions).
One example use of SLaQ is detecting changes in the structure of Wikipedia over time.
“For example, we use SLaQ to monitor anomalous changes in the Wikipedia graph structure. SLaQ allows us to discern meaningful changes in the structure of the page graph from trivial ones such as mass page renames. Our experiments show two orders of magnitude improvement in approximation accuracy, on average.”
It is really interesting to follow the articles from the Google AI blog, and I would of course recommend that you read the post and the aforementioned research papers in full.
So next time you see large data, you know what to say…
This is #500daysofAI and you are reading article 340. I am writing one new article about or related to artificial intelligence every day for 500 days. My focus for day 300–400 is about AI, hardware and the climate crisis.