
Pandas and NumPy Interview Questions for AI/ML Freshers: 2026 Guide

The 15 Pandas and 8 NumPy operations that recur in AI/ML fresher screening rounds in 2026, with answer patterns and key gotchas.

By FACE Prep Team, May 2026


Interviewers screening AI/ML freshers in 2026 lean on the same twenty-three operations: fifteen from Pandas, eight from NumPy.

That number comes from pattern-matching across the coding tasks and conceptual questions that appear in screening rounds for data analyst, ML engineer, and AI developer roles at IT services firms with AI tracks and product startups alike. The questions change in difficulty and framing between companies, but the underlying operations stay consistent. Knowing them cold — not just recognising them, but being able to explain the edge cases — is the difference between a smooth technical round and a painful one.

Why Pandas and NumPy still dominate AI/ML screening rounds

PyTorch, TensorFlow, and LangChain dominate the headlines, but every pipeline that feeds those frameworks starts with Pandas and NumPy. Pandas handles structured data: loading CSV files, cleaning missing values, joining datasets, reshaping tables. NumPy handles the numerical layer underneath: arrays, matrix operations, element-wise computation. Even when a model is built entirely in PyTorch, the preprocessing step is almost certainly in Pandas.

The Stack Overflow Developer Survey 2024 lists Pandas among the most-used professional libraries for data science and ML work globally. The Pandas official documentation notes that copy-on-write semantics are available as an opt-in mode in Pandas 2.x and become the default in Pandas 3.0, a behaviour change that interviewers at product companies have started probing.

For freshers preparing for AI/ML roles, this is good news. The prerequisite is not a PhD or three years of production experience. It is fluency in a specific, bounded set of operations — the twenty-three covered here.

If you want context on where data manipulation fits in the broader AI/ML learning path, the 2026 AI roadmap for Indian engineering students maps the full stack from Python basics to production deployment.

The 15 Pandas operations interviewers keep asking about

The fifteen operations below are grouped by theme. Within each group, the list gives the function name, the ML context where it appears, and the answer anchor — the one thing to say that separates a strong answer from a surface one.

Data loading and exploration

read_csv / read_json: Loads a CSV or JSON file into a DataFrame. In ML interviews, the follow-up is usually about the dtype parameter or handling encoding errors. Mention dtype explicitly — “I pass a dtype dict to avoid silent integer-to-float coercions on ID columns.”
head(), tail(), info(), describe(): EDA primitives. info() shows column dtypes and null counts; describe() gives percentile stats for numeric columns. Interviewers use these to check whether a candidate’s first instinct on a new dataset is systematic (print info, check nulls) or random.
isnull() / notnull(), dropna(), fillna(): Missing-value handling. The answer anchor is the strategy question: drop rows vs. impute, and why. In an ML pipeline, dropping is correct for target-variable nulls; filling with mean or forward-fill is common for feature nulls. Say which and why.
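
The three items above can be sketched as one small loading routine. This is a minimal illustration, not a recommended pipeline: the CSV content and the column names (user_id, age, label) are invented for the example, and pinning the ID column to str is just one way to keep it from being coerced.

```python
import io
import pandas as pd

# Hypothetical CSV content; the column names are made up for illustration.
csv_text = """user_id,age,label
1001,25,1
1002,,0
1003,31,
1004,40,1
"""

# Pin the ID column's dtype so it is never silently turned into a number.
df = pd.read_csv(io.StringIO(csv_text), dtype={"user_id": str})

# Systematic first look: dtypes and non-null counts, then nulls per column.
df.info()
null_counts = df.isnull().sum()
print(null_counts)

# Drop rows whose target (label) is missing, then impute the feature with its mean.
clean = df.dropna(subset=["label"]).copy()
clean["age"] = clean["age"].fillna(clean["age"].mean())
print(clean)
```

In an interview, the useful part is saying why each step is there: the target rows are dropped because an imputed label would be invented data, while the feature is filled so the row is not lost.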

Aggregation and joins

groupby(): Groups rows by one or more columns, then applies an aggregation function. The common question is: “How does groupby handle NaN keys?” Answer: NaN keys are excluded from groups by default. Pass dropna=False to include them. The dropna parameter has been available since Pandas 1.1, so it applies to any current version.
merge(): SQL-style join. Parameters that interviewers probe: how (inner, left, right, outer), on, left_on, right_on, suffixes. The answer anchor: “I default to inner joins and explicitly check for row-count changes after the merge to catch unintended duplicates.”
concat(): Stacks DataFrames vertically or horizontally. The key: axis=0 (row-wise, default) vs. axis=1 (column-wise). Distinguish this from merge — concat does not align on key columns, it aligns on index.
value_counts(): Returns a Series of frequency counts, sorted descending. In ML, it is the first check on a categorical column’s class distribution. Mention normalize=True to get proportions directly.
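
A short sketch of the row-count check after a merge, alongside concat and value_counts. The two tables (users, events) and their columns are hypothetical, chosen so that one key repeats and the join fans out.

```python
import pandas as pd

# Hypothetical tables; names and values are illustrative only.
users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "pro", "free"]})
events = pd.DataFrame({"user_id": [1, 1, 2, 4], "clicks": [5, 3, 7, 2]})

# Inner join keeps only keys present in both tables.
inner = users.merge(events, on="user_id", how="inner")

# Left join keeps every user; indicator=True adds a _merge column naming the source.
left = users.merge(events, on="user_id", how="left", indicator=True)

# Row-count check: user 1 appears twice in events, so both joins grow that key.
print(len(users), len(inner), len(left))

# concat stacks rows; it does not match on a key column.
stacked = pd.concat([users, users], axis=0, ignore_index=True)

# Class distribution as proportions rather than raw counts.
plan_share = users["plan"].value_counts(normalize=True)
print(plan_share)
```

Comparing len() before and after the merge is the cheapest way to notice a one-to-many relationship that was assumed to be one-to-one.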

Indexing and selection

loc[] vs iloc[]: This is the most common conceptual gotcha. loc is label-based (uses the index label); iloc is integer position-based (uses 0-based position). They diverge when the index is not a default RangeIndex. A DataFrame with a date-string index makes the distinction tangible.
Boolean masking: Selecting rows where a condition is true, using syntax like df[df['score'] > 0.5]. The answer anchor: chaining multiple conditions uses & and | with parentheses around each condition, not Python’s and / or.
sort_values(): Sorts by one or more columns. Parameters to know: ascending (bool or list of bools for multi-column sort), na_position (where NaN ends up). The ML context: sorting by a probability score column before computing a top-K metric.
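
The selection rules above are easier to see on a DataFrame whose index is not a RangeIndex. The sketch below assumes a made-up table of prediction scores indexed by date strings.

```python
import pandas as pd

# Hypothetical predictions indexed by date strings instead of 0..N-1.
df = pd.DataFrame(
    {"score": [0.9, 0.4, 0.7, 0.2], "label": [1, 0, 1, 1]},
    index=["2026-01-03", "2026-01-01", "2026-01-04", "2026-01-02"],
)

# loc looks up the index label; iloc looks up the position.
by_label = df.loc["2026-01-01", "score"]   # the row labelled 2026-01-01
by_position = df["score"].iloc[0]          # the first row, whatever its label

# Combine conditions with & and |, each wrapped in parentheses.
confident_pos = df[(df["score"] > 0.5) & (df["label"] == 1)]

# Sort by score descending, then keep the top 2 rows for a top-K style check.
top2 = df.sort_values("score", ascending=False).head(2)

print(by_label, by_position)
print(confident_pos)
print(top2)
```

Here the label lookup and the position lookup return different rows because the index is unsorted, which is the case interviewers usually have in mind.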

Reshaping and transformation

pivot_table(): Reshapes long-format data to wide. Parameters: values, index, columns, aggfunc. The answer anchor: when there are duplicate entries for an index-column pair, pivot raises an error, but pivot_table aggregates them using aggfunc (default is mean).
melt(): The inverse of pivot — converts wide to long. Common in ML feature engineering when multiple time-period columns need to be stacked into a single column.
apply(): Applies a function along an axis. The question is almost always about performance: apply with a Python function is slow on large DataFrames; prefer vectorised operations or map for element-wise work on Series.
astype() / dtypes: Explicit type conversion. In ML, the usual move is converting object columns that look numeric to float32 before feeding to a model. Mention memory: float32 uses half the memory of the default float64.
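
The pivot versus pivot_table distinction, the round trip through melt, and the float32 conversion can be sketched together. The sales table is invented for illustration and deliberately contains one duplicate (store, month) pair.

```python
import pandas as pd

# Hypothetical long-format data with a duplicate (store, month) pair.
long_df = pd.DataFrame({
    "store": ["A", "A", "A", "B"],
    "month": ["jan", "jan", "feb", "jan"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# pivot refuses duplicate index/column pairs and raises ValueError.
try:
    long_df.pivot(index="store", columns="month", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (mean by default).
wide = long_df.pivot_table(index="store", columns="month", values="sales")

# melt converts the wide table back to long format.
back_to_long = wide.reset_index().melt(
    id_vars="store", var_name="month", value_name="sales"
)

# astype: numeric-looking strings converted to float32.
prices = pd.Series(["1.5", "2.5", "3.0"]).astype("float32")

print(pivot_failed)
print(wide)
print(back_to_long)
print(prices.dtype)
```

Note that the melted table has one row per store and month combination, including the combination that had no data, so it is not identical to the original long table.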

A quick lookup example for groupby with a NaN edge case:

import pandas as pd

data = {"dept": ["eng", None, "eng", "hr"], "score": [90, 85, 78, 92]}
df = pd.DataFrame(data)

# Default: rows with a NaN key are excluded from the groups
print(df.groupby("dept")["score"].mean())

# Include the NaN group (dropna parameter, Pandas 1.1+)
print(df.groupby("dept", dropna=False)["score"].mean())

The 8 NumPy operations you need cold

NumPy questions in AI/ML screening rounds cluster around three themes: array mechanics, computation semantics, and memory behaviour. The NumPy official documentation is the primary reference; these eight operations are where the interview questions actually land.

Array creation: np.array(), np.zeros(), np.ones(), np.arange(), np.linspace(). Know the difference between arange (step-based, like Python range, and excluding the stop value) and linspace (count-based, returns exactly N evenly spaced points and includes both endpoints by default). Interviewers use these to test whether a candidate thinks about dtype at creation time.
reshape(), flatten(), ravel(): The classic gotcha: flatten() always returns a copy; ravel() returns a view when memory layout allows. Modifying the output of ravel() can silently modify the original array. This matters in ML when passing arrays to functions that modify in-place.
Broadcasting: When NumPy performs an operation on arrays of different shapes, it virtually expands the smaller array to match the larger one without copying data. The rule: shapes are compared from the trailing dimension backwards, and two dimensions are compatible if they are equal or one of them is 1; a missing leading dimension is treated as 1. Interviewers ask candidates to predict the output shape of an operation.
Boolean indexing: Select elements where a condition is true, e.g., arr[arr > threshold]. The result is always a copy, not a view. This is different from slice indexing, which returns a view.
Reduction operations: np.sum(), np.mean(), np.std(), np.cumsum(). Know the axis parameter. For a 2D array, np.mean(arr, axis=0) computes the mean of each column; axis=1 computes the mean of each row. Getting the axis wrong is a common silent bug.
np.dot() / np.matmul(): Matrix multiplication. For 2D arrays, both are equivalent. For higher-dimensional arrays they differ: matmul treats the leading dimensions as a batch and broadcasts over them, while dot sums over the last axis of the first array and the second-to-last axis of the second. In ML, np.dot is used for the forward pass of a simple neural network layer: output = np.dot(X, W) + b.
np.where(condition, x, y): Returns x where condition is true, y elsewhere. Used to vectorise if-else logic over arrays. In ML, it appears in custom loss function implementations and label encoding.
Stacking — np.vstack(), np.hstack(), np.concatenate(): Stack arrays vertically, horizontally, or along a specified axis. The answer anchor: vstack and hstack are convenience wrappers around concatenate with fixed axis values.
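
Several of the shape rules above can be checked in a few lines. This is a small sketch on toy arrays; the variable names are arbitrary and the arrays carry no meaning beyond their shapes.

```python
import numpy as np

X = np.arange(6, dtype=np.float64).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

col_means = X.mean(axis=0)   # shape (3,): one mean per column
row_means = X.mean(axis=1)   # shape (2,): one mean per row

# Broadcasting: (2, 3) minus (3,) lines up on the trailing dimension.
centered = X - col_means

# A (2, 1) column plus a (1, 3) row broadcasts to (2, 3).
outer_sum = np.array([[0], [10]]) + np.array([[1, 2, 3]])

# np.where as a vectorised if-else: threshold values into 0/1 labels.
labels = np.where(X > 2.5, 1, 0)

# For 2D inputs, vstack and hstack match concatenate along axis 0 and axis 1.
v = np.vstack([X, X])
h = np.hstack([X, X])

# dot and matmul agree in 2D but produce different shapes for stacked matrices.
A = np.ones((2, 3, 4))
B = np.ones((2, 4, 5))
matmul_shape = np.matmul(A, B).shape   # batch-wise product
dot_shape = np.dot(A, B).shape         # every matrix in A against every matrix in B

print(centered.shape, outer_sum.shape, v.shape, h.shape)
print(matmul_shape, dot_shape)
```

Predicting each printed shape before running the code is a reasonable way to rehearse the "what is the output shape" question.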

A short demonstration of the ravel vs. flatten distinction:

import numpy as np

arr = np.array([[1, 2], [3, 4]])

flat = arr.flatten()  # returns a copy
rav = arr.ravel()     # returns a view (usually)

rav[0] = 99
print(arr[0, 0])  # prints 99 -- the original was modified
flat[0] = 88
print(arr[0, 0])  # still 99 -- flatten copy was not connected

Structuring your answers under interview pressure

A strong answer to any data-manipulation question has three parts: name what the function does, give the ML context where it matters, and name one edge case or limitation. Interviewers at product companies are not looking for syntax recall — they can read docs. They are checking whether the candidate thinks about what happens at the boundaries.

For Pandas questions, the edge cases that come up most:

NaN handling: most aggregation functions ignore NaN by default; count() counts non-null values; value_counts() excludes NaN by default unless dropna=False is passed.
Copy-on-write: opt-in in Pandas 2.x and the default in Pandas 3.0. Under copy-on-write, chained assignment never updates the original DataFrame, so code written for Pandas 1.x that uses df[condition]['col'] = value may behave differently now. A single df.loc[condition, 'col'] = value is the reliable form.
Integer types: Pandas uses Int64 (capital I, nullable) vs int64 (lowercase, NumPy-backed). Nullable integer types can hold missing values (shown as pd.NA); NumPy-backed integer columns cannot, and are upcast to float64 when a missing value is introduced.
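
The three Pandas edge cases listed above can be reproduced on toy data. The Series and DataFrame below are invented for illustration; the .loc assignment is shown as the form that does not depend on whether copy-on-write is enabled.

```python
import pandas as pd

# NaN handling: count() skips missing values; value_counts() skips them unless asked.
s = pd.Series(["a", "b", None, "a"])
non_null = s.count()
counts_default = s.value_counts()
counts_with_nan = s.value_counts(dropna=False)

# Conditional assignment in one .loc call, instead of df[mask]["flag"] = 1.
df = pd.DataFrame({"score": [0.2, 0.8, 0.6], "flag": [0, 0, 0]})
df.loc[df["score"] > 0.5, "flag"] = 1

# Nullable Int64 keeps integers next to a missing value;
# the default NumPy-backed Series falls back to float64.
nullable = pd.Series([1, 2, None], dtype="Int64")
numpy_backed = pd.Series([1, 2, None])

print(non_null, len(counts_default), len(counts_with_nan))
print(df)
print(nullable.dtype, numpy_backed.dtype)
```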

For NumPy questions:

View vs. copy: slicing returns a view; fancy indexing (boolean or integer array indexing) returns a copy.
Axis confusion: always state the axis explicitly in an interview answer to show you are not guessing.
Dtype at creation: specifying dtype=np.float32 at array creation reduces memory by half vs. the default float64, relevant when handling large feature matrices.
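
A quick way to check the view versus copy claim, rather than memorise it, is np.shares_memory. The sketch below uses arbitrary toy arrays and also compares the byte size of float64 and float32 arrays of the same shape.

```python
import numpy as np

arr = np.arange(10)

sliced = arr[2:5]        # basic slicing: a view
masked = arr[arr > 6]    # boolean indexing: a copy
fancy = arr[[0, 1]]      # integer-array indexing: a copy

slice_shares = np.shares_memory(arr, sliced)
mask_shares = np.shares_memory(arr, masked)
fancy_shares = np.shares_memory(arr, fancy)
print(slice_shares, mask_shares, fancy_shares)

# Writing through the view changes the original array.
sliced[0] = -1
print(arr[2])

# float32 stores each value in 4 bytes, float64 in 8.
f64 = np.zeros((1000, 100))
f32 = np.zeros((1000, 100), dtype=np.float32)
print(f64.nbytes, f32.nbytes)
```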

These edge cases are not trivia. They are the things that break production pipelines. Mentioning them signals that a candidate has worked with real data, not just tutorial notebooks.

How these skills connect to the broader AI/ML picture

Mastery of these twenty-three operations is the foundation layer, not the finish line. A fresher who can handle missing data, join datasets, reshape arrays, and vectorise computations has the tooling to build the preprocessing and feature engineering stages of an ML pipeline. The layers above — model selection, hyperparameter tuning, evaluation, deployment — assume this foundation is in place.

The 2026 AI roadmap for Indian engineering students maps what comes after: where Pandas and NumPy hand off to scikit-learn and PyTorch, what the path from fresher to an ML role looks like in practice, and which skills product companies weight most in 2026.

Knowing twenty-three operations as syntax is table stakes. Applying them to a real preprocessing pipeline — one with messy data, type inconsistencies, and a groupby that behaves oddly on nulls — is what interviewers remember. TinkerLLM is where you build that applied fluency: at ₹299, it puts live LLM API calls and a Python environment in your hands, so the first end-to-end data pipeline you build is not in a tutorial notebook but in a project you can show a recruiter.


Frequently asked questions

Are Pandas and NumPy still relevant for AI/ML roles in 2026?

Yes. Even with higher-level ML frameworks like PyTorch and scikit-learn, Pandas and NumPy remain the standard tools for data loading, cleaning, and numerical computation. Screening rounds at both product startups and IT service firms continue to test them.

Which Pandas operation is asked most often in AI/ML interviews?

groupby and merge consistently top the frequency list, followed closely by the loc-vs-iloc distinction. Interviewers use them to test whether candidates understand data aggregation and join semantics, not just syntax.

What is the difference between flatten and ravel in NumPy?

flatten always returns a copy of the array; ravel returns a view when possible. Modifying a ravelled array can change the original, a distinction interviewers use to probe memory and performance awareness.

Does Pandas 2.x change anything important for interviews?

Yes. Pandas 2.x offers copy-on-write semantics as an opt-in mode, and Pandas 3.0 enforces them by default. This changes the behaviour of chained assignments: code that ran silently in Pandas 1.x may now raise a warning or behave differently. Product-company interviewers now ask about it.

How many data-manipulation questions appear in a typical AI/ML fresher screening round?

Online assessment rounds typically include 3 to 5 data-manipulation coding questions. Technical interview rounds usually add 2 to 3 conceptual questions about operation internals, such as the copy-vs-view distinction or how groupby handles NaN values.

