Neural Networks Explained: From Perceptrons to Vision Transformers

The architecture evolution powering modern computer vision and video AI

MachineFi Labs · 12 min read

Understanding neural networks means understanding the last seventy years of AI history compressed into a single, remarkably consistent idea: arrange simple mathematical units in layers, let them adjust their connections based on errors, and watch something that looks a lot like intelligence emerge. That idea has scaled from a single artificial neuron running on 1950s hardware to billion-parameter Vision Transformers processing live video streams — and the core logic has barely changed.

This guide traces every major architectural leap, explains the engineering intuition behind each one, and connects the evolution to what it means for computer vision and real-time video AI today.

What Is a Neural Network, Really?

Neural Network

A neural network is a computational graph composed of layers of parameterized units (neurons) connected by weighted edges. Each neuron computes a weighted sum of its inputs, applies a nonlinear activation function, and passes the result forward. The network learns by adjusting weights to minimize a loss function using gradient descent and backpropagation.

Strip away every layer of jargon and a neural network is doing one thing: function approximation. Given inputs (pixel values, audio samples, sensor readings), it produces outputs (class labels, bounding boxes, embeddings) by passing data through a series of learned linear transformations with nonlinearities in between. The learning part — backpropagation — computes how much each weight contributed to the prediction error and nudges it in the direction that reduces that error.
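To make this concrete, the forward pass of a single layer is just a matrix multiply, a bias, and a nonlinearity. A minimal NumPy sketch with toy, hand-picked numbers (not tied to any framework):

```python
import numpy as np

def relu(z):
    # Nonlinearity: without it, stacked layers collapse into one linear map.
    return np.maximum(0.0, z)

def layer_forward(x, W, b):
    # Each neuron: weighted sum of inputs plus bias, then the activation.
    return relu(W @ x + b)

# Toy example: 3 inputs feeding 2 neurons
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.2, -0.1, 0.4],
              [0.7,  0.3, -0.5]])
b = np.array([0.1, -0.2])

h = layer_forward(x, W, b)  # h ≈ [0.7, 0.0]: second neuron's sum was negative
```

Stacking more calls to `layer_forward` with different weight matrices is all "depth" means; learning is adjusting `W` and `b` to shrink the loss.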

What changes across architectures is not this core machinery but the structure of the computation graph: how many layers, how nodes connect to each other, whether connections are local or global, whether they're shared or independent. Each structural innovation was a response to a specific limitation of the architecture that came before it.

The Perceptron (1958): The First Artificial Neuron

Frank Rosenblatt's perceptron was a single neuron: a vector of input values multiplied by a weight vector, summed, and passed through a step function. If the sum exceeded a threshold, the neuron fired. The perceptron learning rule updated weights whenever the output was wrong.

It was genuinely revolutionary. But it had a fundamental limitation: a single perceptron could only learn linearly separable problems. Show it the XOR function — where the correct output depends on the combination of inputs, not any single one — and it fails completely.

Marvin Minsky and Seymour Papert's 1969 book Perceptrons formalized this limitation and effectively triggered the first AI winter. The fix was obvious in hindsight: stack multiple perceptrons in layers.

Multilayer Perceptrons and Backpropagation (1986)

The Multilayer Perceptron (MLP) added hidden layers between input and output. Each hidden layer could learn increasingly abstract representations of the data. The XOR problem, impossible for a single perceptron, is trivially solved by a two-layer MLP.
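To see why one hidden layer suffices, here is a two-layer network with hand-picked (not learned) weights that computes XOR exactly: one hidden unit computes OR, the other AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

def two_layer_xor(X):
    # Hidden layer: one unit computes OR, the other computes AND.
    h_or  = step(X[:, 0] + X[:, 1] - 0.5)
    h_and = step(X[:, 0] + X[:, 1] - 1.5)
    # Output: fire when OR is active but AND is not -> XOR.
    return step(h_or - h_and - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(two_layer_xor(X))  # [0 1 1 0]
```

In practice the weights are learned rather than hand-picked, but the point stands: the hidden layer builds intermediate features (here OR and AND) that make the final decision linearly separable.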

The real unlock wasn't the architecture — it was the learning algorithm. Rumelhart, Hinton, and Williams' 1986 paper popularizing backpropagation gave MLPs a practical way to train. The chain rule of calculus lets you compute how much each weight in the network contributed to the final prediction error, and gradient descent then adjusts every weight in the direction that reduces that error.

1986

The year backpropagation was popularized by Rumelhart, Hinton & Williams — making deep learning practically trainable for the first time

Source: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
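The chain-rule arithmetic at the heart of backpropagation can be checked numerically. Below is a toy sketch with made-up numbers: the analytic gradient of a single sigmoid neuron's squared-error loss (the same derivative backprop would compute for those weights) verified against a finite-difference approximation.

```python
import numpy as np

def loss(w, x, y):
    # One sigmoid neuron with squared error: L = (sigma(w.x) - y)^2
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (p - y) ** 2

def grad_backprop(w, x, y):
    # Chain rule: dL/dw = 2*(p - y) * p*(1 - p) * x
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return 2.0 * (p - y) * p * (1.0 - p) * x

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
y = 1.0

# Central finite differences: perturb each weight and measure the loss change.
eps = 1e-6
num = np.array([
    (loss(w + eps * np.eye(2)[i], x, y) - loss(w - eps * np.eye(2)[i], x, y)) / (2 * eps)
    for i in range(2)
])
assert np.allclose(grad_backprop(w, x, y), num, atol=1e-6)
```

Backpropagation applies exactly this chain-rule bookkeeping to every weight in every layer, reusing intermediate results so the whole gradient costs about as much as one forward pass.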

MLPs work well for tabular data and simple classification tasks. But they have a critical weakness for image data: they treat every input pixel as independent. A 224×224 RGB image has 150,528 input values. Connecting those to even a modest hidden layer of 1,024 neurons requires 154 million weight parameters — just in the first layer. Worse, the network has no way to exploit the fact that nearby pixels are related. It has to learn spatial structure from scratch every time, with no prior that a particular feature (an edge, a curve, a texture) looks the same wherever it appears in the image.

This is the problem CNNs were built to solve.

Convolutional Neural Networks (1989–2012): The Computer Vision Revolution

Yann LeCun's 1989 work on Convolutional Neural Networks (CNNs) introduced a fundamentally different connectivity pattern. Instead of fully connected layers, CNNs use convolutional filters: small, learned weight matrices (typically 3×3 or 5×5) that slide across the input image, computing a dot product at each position.

This design encodes two critical inductive biases:

Translation equivariance — The same filter detects the same feature (an edge, a corner, a texture element) wherever it appears in the image: shift the input and the resulting feature map shifts with it. A cat ear in the upper-left corner activates the same filter as a cat ear in the lower-right. (Pooling then adds a degree of outright translation invariance on top.)

Parameter sharing — Instead of a unique weight for every input-output connection, the same filter weights are reused at every spatial position. A single 3×3 filter slid across a 224×224 image uses just 9 shared parameters, where a locally connected layer with an independent 3×3 filter at each position would need 224×224×9 = 451,584.
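The arithmetic behind these counts (and the earlier dense-layer figure) checks out directly:

```python
# First fully connected layer on a 224x224 RGB image (from the MLP section)
dense_inputs = 224 * 224 * 3              # 150,528 input values
dense_weights = dense_inputs * 1024       # one 1,024-unit hidden layer
print(dense_weights)                      # 154,140,672 (~154M weights)

# One shared 3x3 filter vs. an independent 3x3 filter at every position
shared = 3 * 3                            # 9 parameters, reused everywhere
unshared = 224 * 224 * (3 * 3)            # a separate filter per pixel position
print(shared, unshared)                   # 9 vs. 451,584
```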

CNNs also introduced pooling layers (typically max pooling) that downsample feature maps, progressively reducing spatial resolution while increasing the receptive field of deeper features. Early layers detect low-level features like edges. Deeper layers combine those into textures, shapes, and eventually high-level semantic concepts.

The architecture worked well but remained difficult to train in deep configurations until 2012, when AlexNet — a deep CNN trained on GPUs — won the ImageNet competition by a margin that shocked the research community and triggered the modern deep learning era.

10.8%

AlexNet's error-rate improvement over the second-place entry in ImageNet 2012 — the result that launched the modern deep learning era

Source: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS.

The years that followed produced a rapid succession of CNN architectures, each pushing accuracy further: VGGNet (very deep, simple 3×3 stacks), GoogLeNet/Inception (parallel filter sizes), ResNet (residual skip connections enabling 100+ layer networks), EfficientNet (compound scaling of width, depth, and resolution).

CNNs remained the dominant computer vision architecture for a decade and remain extremely competitive today. If you're building real-time object detection in Python, you're almost certainly working with a CNN backbone like ResNet or EfficientDet.

Recurrent Neural Networks and LSTMs: Handling Sequences

Parallel to the CNN revolution in vision, a different architectural family was solving a different problem: data where order matters — text, speech, time-series sensor readings, and video.

Recurrent Neural Networks (RNNs) introduced a hidden state that persists across time steps. At each step, the network receives the current input and its own previous hidden state, allowing it to theoretically remember information from arbitrarily far back in the sequence.
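The recurrence fits in a single step function applied serially over the sequence. A minimal sketch with toy dimensions and random weights standing in for learned ones:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # New hidden state mixes the current input with the previous state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (the recurrence)
b = np.zeros(4)

h = np.zeros(4)                          # initial hidden state
for x_t in rng.normal(size=(10, 3)):     # a 10-step sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b)  # serial: each step waits for the last
```

The `for` loop is the whole story of the RNN's weakness: step t cannot be computed before step t−1, and gradients flowing back through many applications of `W_hh` shrink (or explode) exponentially.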

In practice, vanilla RNNs suffered from the vanishing gradient problem: gradients shrink exponentially as they're backpropagated through time, making it nearly impossible to learn long-range dependencies. Sepp Hochreiter and Jürgen Schmidhuber's Long Short-Term Memory (LSTM) architecture (1997) solved this with gated memory cells — a sophisticated mechanism to selectively remember and forget information across many time steps.

For video AI, RNNs and LSTMs were the first architectures capable of modeling temporal relationships: understanding that an action happening now is related to what happened 30 frames ago. But they process sequences serially — step by step — which makes them slow to train and difficult to parallelize on modern GPU hardware. This limitation set the stage for the biggest architectural shift in AI history.

The Transformer (2017): Attention Is All You Need

The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer architecture and changed AI permanently. The core innovation was the self-attention mechanism: a way for every element in a sequence to attend to every other element simultaneously, learning which relationships matter most.

In a Transformer, a sequence of input tokens is projected into three vectors per token: Query, Key, and Value. The attention score between any two tokens is the dot product of one token's Query with another's Key (normalized and softmaxed). These scores determine how much each token's Value contributes to the output representation at every position. The entire computation is parallelizable — unlike RNNs, there is no sequential dependency.
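The mechanism above can be sketched in a few lines of NumPy — a single attention head, no masking, with random matrices standing in for learned projections:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each token into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token scores every other token in one matrix multiply -- fully parallel.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # attention-weighted values

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)  # one output vector per input token
```

Note that nothing in the computation depends on token order — which is why real Transformers add position embeddings to the input.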

Transformers don't assume anything about the structure of the input. They don't assume adjacent tokens are more related than distant ones. They learn relationships purely from data — which makes them extraordinarily flexible but also means they need large amounts of data to learn what CNNs get for free from their spatial inductive biases.

This flexibility is exactly what enabled the large language models powering multimodal AI today. GPT, BERT, T5, and their descendants are all Transformer architectures applied to text sequences. The video-to-LLM gap exists in part because these text-native Transformers need additional bridging infrastructure to consume visual inputs.

CNN vs. Transformer: Core Architecture Trade-offs
Source: Compiled from Dosovitskiy et al. 2021, Liu et al. 2021 (Swin), and MLCommons MLPerf benchmarks

Vision Transformers (2020): Applying Attention to Images

The natural question after Transformers took over NLP was: what if we apply attention to images? Dosovitskiy et al.'s 2020 Vision Transformer (ViT) paper answered it.

ViT's approach is elegant in its simplicity. Split a 224×224 image into 16×16 patches (196 patches total for a standard ViT-B). Flatten each patch into a vector and linearly project it to a fixed embedding dimension. Add a learnable position embedding to each patch token. Prepend a special classification token. Feed the sequence of 197 tokens through a standard Transformer encoder.

That's it. No convolutions. No pooling. No spatial priors. Just patch embeddings processed by self-attention.
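The patch arithmetic is easy to verify. A minimal sketch of the tokenization step — the projection matrix is random here rather than learned, and the 512-dimensional embedding is an arbitrary choice (ViT-B actually uses 768):

```python
import numpy as np

def patchify(image, patch=16):
    # Split an HxWxC image into non-overlapping, flattened patches.
    H, W, C = image.shape
    n_h, n_w = H // patch, W // patch
    grid = image.reshape(n_h, patch, n_w, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)          # group pixels by patch
    return grid.reshape(n_h * n_w, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))

tokens = patchify(image)                  # 196 patches of 16*16*3 = 768 values
E = rng.normal(size=(768, 512)) * 0.02    # stand-in for the learned projection
embedded = tokens @ E                     # 196 patch embeddings
cls = np.zeros((1, 512))                  # prepended classification token
sequence = np.vstack([cls, embedded])     # 197 tokens enter the Transformer
```

From here the sequence is treated exactly like a sentence of word tokens: position embeddings are added and the standard encoder takes over.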

The results were striking: with enough pretraining data (ImageNet-21K or JFT-300M), ViT matched or exceeded the best CNNs on ImageNet classification. More importantly, ViT's accuracy continues improving as model size and data scale up — a property CNN architectures don't share as cleanly.

Subsequent variants addressed ViT's limitations. Swin Transformer introduced hierarchical processing and shifted window attention, bringing strong performance on dense prediction tasks (detection, segmentation) where ViT's fixed-resolution patch tokens struggled. DeiT demonstrated that ViT can train effectively on ImageNet-1K alone with strong data augmentation and knowledge distillation. More recent work (ViTDet, EVA, InternViT) has extended ViTs to billion-parameter regimes.

ViT vs. CNN: Accuracy and Model Size Trade-offs

The practical question for any team building a vision system isn't which architecture is theoretically superior — it's which one delivers the right accuracy at the right inference cost for your deployment target.

Vision Model Accuracy vs. Size (ImageNet-1K Top-1 Accuracy)
Source: Papers With Code ImageNet benchmark, torchvision model zoo, timm library benchmarks (2024)

Several patterns stand out from this comparison:

For edge and mobile deployment, CNN architectures (MobileNet, EfficientNet) remain dominant. Their convolutional operations are highly optimized for mobile NPUs, DSPs, and edge accelerators. A ViT running on a Raspberry Pi or NVIDIA Jetson Orin will typically be slower and consume more power than a well-optimized CNN of equivalent accuracy — a critical consideration for edge AI vs. cloud AI decisions and model optimization for edge deployment.

For server-side inference at scale, ViTs and hybrid architectures (Swin, ConvNeXt) increasingly win on accuracy per parameter, especially when pretrained on large datasets and fine-tuned. Their attention mechanisms also generalize better to out-of-distribution inputs — an important property for real-world video feeds with variable lighting, occlusion, and camera angles.

For video understanding specifically, Transformer architectures have significant advantages: temporal attention can span across frames without the architectural gymnastics required to extend CNNs to video (3D convolutions, optical flow inputs, two-stream networks). Models like VideoMAE, Video Swin, and TimeSformer demonstrate that ViTs, pretrained on large video datasets, substantially outperform CNN baselines on action recognition and temporal localization benchmarks. This matters directly for vision language models that need to process video clips rather than individual frames.

Why Architecture Evolution Matters for Video AI

The progression from perceptron to ViT isn't just computer science history — it directly shapes what's possible in real-time video AI systems today.

CNNs made it practical to run vision models on edge hardware. The shared-filter design means a ResNet-50 fits in about 100MB of RAM, enabling deployment on IP cameras and edge compute modules. The entire ecosystem of edge computing and on-device inference — including quantized models on NPUs — was built for CNN inference patterns. Understanding GPU vs. CPU inference for these workloads is essential for production deployments.

Transformers and ViTs enabled the vision language models that power modern multimodal AI. GPT-4V, Gemini, and Claude can answer open-ended questions about images because they combine a visual encoder (often a ViT) with a large language model Transformer through a learned projection. The same architecture that reads text can now reason about image patches — a breakthrough that makes it possible to ask natural-language questions about live video feeds without task-specific training.

The practical implication: building a production video AI system today means making deliberate choices about where in this architectural stack you sit. For a real-time safety alert system running on an edge device, a quantized EfficientDet CNN is probably your answer. For a system that needs to answer arbitrary questions about recorded video, a ViT-backed vision language model accessed through an API is the right layer.

The Through-Line: From Perceptron to Video AI

Every architecture discussed here was a response to a specific limitation. MLPs fixed the linearity problem. CNNs fixed the spatial inefficiency of MLPs on images. Transformers fixed the sequential bottleneck of RNNs. ViTs applied the Transformer's flexibility to vision. Each leap enabled new applications and new scale.

The current frontier is multimodal architectures that combine the spatial efficiency of CNNs (or patch-based ViTs) with the reasoning power of large Transformer language models. These are the vision language models powering the next generation of video AI — systems that don't just detect objects but understand scenes, answer questions, and generate natural-language reports from live feeds.

For teams building on top of APIs like Trio, the evolution means more capability accessible through simpler interfaces. You don't have to implement a ViT — but understanding why ViTs handle video better than CNNs helps you choose the right configuration, set the right expectations, and debug unexpected results when they inevitably occur.

MachineFi Labs

Engineering Team at MachineFi

The team behind Trio — the multimodal stream API that turns live video, audio, and sensor feeds into AI-ready intelligence.