Persona Detection

Privacy Oriented Persona Detection Through Social Communities
A method that can target any interest capable of forming a community on social platforms, built on an infrastructure that is completely privacy-preserving.
Our tool is designed as a pipeline of data-driven algorithms that presents a novel solution to the problems addressed in the previous chapters. This section explains the fundamental design ideas we propose for a privacy-oriented persona detection method built on public data.
Data-driven segmentation is mainly based on machine-learning models that rely on labeled data. Labeled data implies discrete categorization and the possibility of false labeling.
Existing solutions require vast amounts of data to target any interest capable of forming a community, with millions of labels created by human annotators. In contrast, our design takes an unsupervised approach.
FirstBatch A.I. is an unsupervised, continuous segmentation technology that can measure the involvement of any node in a graph with a given semantic vector.
The A.I. whitepaper linked below explains the technology in detail.
Full-version A.I. whitepaper


Communities are groups of interacting people who share the same interests. These interests can be a sports team or a specific technology, and they can be niche or general. The tendency of people with similar interests to group together is called homophily.
The concept of community is crucial for the decentralized web and next-generation marketing because communities possess two essential features.
  • Reaching out takes nearly zero effort, since a community is by definition a group of people sharing the same interests.
  • There is no demand for personal data, because knowing the interests is possible without recognizing individual behaviors.
Therefore, finding or building a valuable community for your mission is of utmost importance. Centralized social media platforms like Twitter are home to numerous communities with a wide variety of interests. Communities interact, grow and create within these platforms. Influencers represent communal ideas and interests that we cannot define with words like sports, cuisine, etc. However, to say that the ideas built on top of this notion are not yet mature would be an understatement.


Homophily is an idea that binds a group of people together, creating communities.
FirstBatch A.I. is built around three main ideas:
  • Vectors as context: Searching through a large, high-dimensional vector space
  • Understanding the driving force: Detecting ideas creating homophily within a community
  • Involvement: Calculating the likelihood of a node's participation in a subgraph, where edges mutate over time

Vectors as context

Representing textual data as vectors is crucial for contextual search. Fine-tuning large pre-trained models for specific tasks is a widely accepted practice. However, discrete categorization is intrinsic to this method, and fine-tuning requires supervised data. Therefore, our contextualization approach is to use attention-based pre-trained models as they are, producing raw vector outputs.
The problem with pre-trained vectors is that they can quickly become too general and are affected by many parameters. To avoid this generalization, we use entities of specific types with a small window size to generate entity-oriented semantic vectors.
We use RoBERTa-large as our base language model. Outputs of the last encoder layer are retrieved as contextual embeddings to achieve high accuracy and computational efficiency. Keyphrases are embedded with a context limited by a window size on both ends. A dynamic window size for contextual vectors increases precision when segmenting concepts that are built around other concepts.
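The windowing step described above can be illustrated with a minimal sketch. It only shows how a keyphrase might be cut out together with a limited context on both ends before being passed to the language model; it does not call RoBERTa. The function names `find_keyphrase` and `context_window` are hypothetical, and whitespace splitting is an assumed stand-in for the model's real tokenizer.

```python
def find_keyphrase(tokens, phrase):
    """Return (start, end) of the first occurrence of `phrase` in `tokens`, or None."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n
    return None


def context_window(tokens, start, end, window):
    """Return the keyphrase tokens[start:end] plus up to `window` tokens on each side."""
    left = max(0, start - window)            # clamp at the beginning of the text
    right = min(len(tokens), end + window)   # clamp at the end of the text
    return tokens[left:right]


# Assumption: whitespace tokenization stands in for the language model's tokenizer.
text = "we spent the whole weekend wave surfing near the old harbor with friends"
tokens = text.split()

span = find_keyphrase(tokens, ["wave", "surfing"])
snippet = context_window(tokens, span[0], span[1], window=2)
print(" ".join(snippet))
```

A dynamic window would simply vary the `window` argument per keyphrase, for example by entity type; how that size is chosen is not specified here.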

Understanding the driving force

Detecting the homophilic context within a community depends on finding inf-nodes: nodes with higher inter-cluster density and lower intra-cluster density, which essentially act as local celebrities. An inf-node can be any type of entity, depending on the data source.
FirstBatch's foci graph contains multiple semantic vectors directed to inf-nodes. Each semantic vector represents a certain context that the inf-node represents weakly or strongly, depending on frequency.
Using language models, FirstBatch can search through the foci space and calculate the cosine similarity between its semantic vectors and an arbitrary vector of the same dimensionality. In other words, FirstBatch can detect the inf-nodes that match a given context and treat them as its representatives.
Vector quantization through Voronoi cells is used to search within a million-scale vector space with a dimensionality of 768.
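As a rough illustration of this matching step, the sketch below ranks inf-nodes by the cosine similarity between a query vector and the semantic vectors attached to each node. The data layout (a dictionary from inf-node name to a non-empty list of vectors), the function `rank_inf_nodes`, the node names, and the choice to score a node by its best-matching vector are all assumptions made for the example, not a description of the FirstBatch implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_inf_nodes(foci, query, top_k=2):
    """Score each inf-node by its best-matching semantic vector and return the top_k."""
    scored = []
    for node, vectors in foci.items():
        best = max(cosine_similarity(v, query) for v in vectors)
        scored.append((node, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional foci space; real vectors would come from the language model.
foci = {
    "surf_magazine": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]],
    "chef_account": [[0.0, 0.2, 0.9]],
    "beach_blog": [[0.6, 0.6, 0.1]],
}
query = [1.0, 0.0, 0.0]  # stands in for the vector of a text prompt
top = rank_inf_nodes(foci, query)
print([node for node, _ in top])
```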
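The idea behind searching with Voronoi cells can be sketched as an inverted-file index: every vector is assigned to the cell of its nearest centroid, and a query only inspects the vectors in the few cells closest to it. The example below is a toy version under several assumptions: the centroids are given by hand (in practice they would typically be learned, for example with k-means), squared Euclidean distance is used (for unit-normalized vectors this gives the same ranking as cosine similarity), and the names `build_index`, `search` and `nprobe` are illustrative rather than taken from FirstBatch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cells(centroids, vector, n):
    """Indices of the n centroids closest to `vector`."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(centroids[i], vector))
    return order[:n]


def build_index(centroids, vectors):
    """Assign every vector id to the Voronoi cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        cells[nearest_cells(centroids, vec, 1)[0]].append(vec_id)
    return cells


def search(centroids, cells, vectors, query, nprobe=1, top_k=1):
    """Compare the query only against vectors stored in the nprobe closest cells."""
    candidates = []
    for cell in nearest_cells(centroids, query, nprobe):
        candidates.extend(cells[cell])
    candidates.sort(key=lambda vec_id: sq_dist(vectors[vec_id], query))
    return candidates[:top_k]


# Toy 2-dimensional data with two hand-picked centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.5, 0.2], "b": [1.0, 1.5], "c": [9.0, 9.5], "d": [11.0, 10.0]}

cells = build_index(centroids, vectors)
result = search(centroids, cells, vectors, [9.2, 9.4], nprobe=1, top_k=1)
print(cells, result)
```

Probing only a few cells trades a small amount of recall for a large reduction in comparisons, which is what makes a million-scale space searchable; raising `nprobe` moves the result closer to an exhaustive search.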


Involvement is an idea that draws heavily on the influential thesis developed by Mark Granovetter in "The Strength of Weak Ties". The paper proposes two significant concepts: triadic closure and local bridges.
Social graphs are temporal, meaning they change over time. Each new fragment of data moves a certain subgraph forward in time. What will happen in the future, however, is a matter of likelihood.
Once inf-nodes are detected, the involvement of a node is the likelihood that the node participates, if it does not already, in the communities represented by the selected inf-nodes. This is a significant part of the FirstBatch pipeline.
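The actual involvement algorithm is described in the whitepaper and is not reproduced here. Purely to make the notion concrete, the sketch below uses a simple triadic-closure heuristic as a stand-in: a node that is already connected to an inf-node scores 1.0 for it, and otherwise scores the Jaccard overlap between its neighbours and the inf-node's neighbours, since shared neighbours make a future tie more likely. The function `involvement`, the static undirected graph, and the averaging over inf-nodes are all assumptions; the temporal aspect of the graph is ignored.

```python
def jaccard(a, b):
    """Share of neighbours two nodes have in common (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def involvement(graph, node, inf_nodes):
    """Hypothetical score in [0, 1]; graph maps each node to its set of neighbours."""
    if not inf_nodes:
        return 0.0
    neighbours = graph.get(node, set())
    total = 0.0
    for inf in inf_nodes:
        if inf in neighbours:
            total += 1.0  # the node already participates in this community
        else:
            # Triadic closure: common neighbours suggest a likely future tie.
            total += jaccard(neighbours, graph.get(inf, set()))
    return total / len(inf_nodes)


# Toy undirected social graph; all names are made up for the example.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "surf_magazine"},
    "carol": {"alice", "surf_magazine"},
    "erin": {"surf_magazine"},
    "dave": {"chef_account"},
    "surf_magazine": {"bob", "carol", "erin"},
    "chef_account": {"dave"},
}

score_alice = involvement(graph, "alice", ["surf_magazine"])
score_dave = involvement(graph, "dave", ["surf_magazine"])
print(score_alice, score_dave)
```

Here alice is not connected to the inf-node but shares two of its three neighbours, so she scores higher than dave, who shares none.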

Finding Interests

FirstBatch can search for any interest that can be expressed in natural language. An operator can write a text prompt (as with GPT-3) and create interests that become available to FirstBatch users.
A use case for creating new interests detectable by the FirstBatch API
Let's say a DApp wants to create a specific interest like "wave surfing".
-> Flow:
  1. Text prompt about wave surfing to A.I.
  2. Textual context is vectorized
  3. Search through foci-space
  4. Detect communities around wave surfing
To measure an arbitrary node's involvement in "wave surfing", FirstBatch runs the involvement algorithm on the contextualized social graph, yielding a score.
More details are available in the whitepaper.