Unit 1: File Organization and Database Design

1. Introduction

Every information system needs to store and retrieve data efficiently. The choice of how to organize data on storage media (disk, SSD) affects:

Access speed (how fast we find a record)
Storage efficiency (how much space is used)
Flexibility (ability to change or query data in different ways)
Concurrency (multiple users accessing simultaneously)

Two broad approaches:

File-based systems – Data stored in operating system files, managed by application programs. (Legacy or simple systems.)
Database systems – Data stored in a structured, integrated repository managed by a Database Management System (DBMS). (Modern approach for most business systems.)

Key idea: File organization is about physical arrangement of records on disk. Database design is about logical structure and relationships among data.

2. File Organization

File organization refers to the way records are physically arranged inside a file on secondary storage. The choice determines which access methods (sequential, direct, indexed) are efficient.

2.1 Basic Concepts

Record – A collection of related data fields (e.g., one customer).
File – A collection of records (e.g., all customers).
Block – The smallest unit of data read/written from disk (typically 512 bytes to 64 KB).
Key field – A field used to identify a record (e.g., Customer_ID).
Access method – The algorithm used to locate a record.

![File Organization Types Placeholder]( File Organization Types Comparison )

2.2 Types of File Organization

Organization	Description	Access Methods	Best for
Sequential	Records stored one after another in order of a key field.	Sequential read (from start to end).	Batch processing, report generation.
Indexed Sequential (ISAM)	Sequential file plus an index that maps key to disk address.	Sequential or direct by key (via index).	Transaction processing with occasional direct lookups.
Direct / Random (Hashing)	Record address computed from key using a hash function.	Direct access by key.	Real‑time systems where speed is critical (e.g., airline reservations).
Clustered	Records from multiple related files stored together (e.g., customer and his orders).	Sequential or indexed.	Reducing disk I/O for joins.
Heap / Unordered	Records appended in order of arrival; no ordering.	Sequential scan only.	Log files, temporary data.

2.3 Detailed Description of Each Organization

2.3.1 Sequential File Organization

Structure: Records are stored contiguously, sorted by a key field (e.g., employee number).
Access: To find a record, you must read from the beginning until the key matches.
Operations: Insert/Delete requires rewriting the entire file. Update can be done in place if record size is fixed.
Advantages: Simple, efficient for batch processing (e.g., payroll). High storage efficiency.
Disadvantages: Slow for random access; poor for frequent inserts/deletes.

2.3.2 Indexed Sequential File Organization (ISAM)

Structure: A sequential file (data records) plus one or more indexes. The index maps key values to disk addresses.
Access: Search the index (fast, because index is small and ordered) then read that block.
Advantages: Supports both sequential and random access efficiently.
Disadvantages: Index adds storage and maintenance overhead.

![B-Tree Structure Placeholder]( B-Tree Index Physical Structure )

2.3.3 Direct / Hashing File Organization

Structure: Records are placed in buckets (blocks) based on the output of a hash function: Bucket_Address = hash(key) % number_of_buckets.
Collisions: When two keys hash to the same bucket. Handled by chaining or open addressing.
Advantages: Extremely fast for exact‑key lookups (O(1) average).
Disadvantages: Inefficient for range queries. Not sorted. Poor for sequential access.

![Hashing Mechanism Placeholder]( Hashing Mechanism and Collision Handling )

2.3.4 Clustered File Organization

Structure: Records from related tables are stored in the same disk blocks (e.g., a customer and all their orders).
Advantages: Dramatically improves join performance.
Disadvantages: May slow down operations on the “many” side alone.

2.4 Comparison Table

Feature	Sequential	Indexed	Direct (Hash)	Clustered
Sequential access	Very fast	Fast	Slow	Fast
Random by key	Slow	Fast	Very fast	Fast
Range queries	Fast	Fast	Very slow	Depends
Storage overhead	None	Index size	None	May waste space

3. Database Design

Database design deals with the logical structure of data, including entities, attributes, relationships, and constraints.

3.1 Design Phases

Phase	Output	Description
Conceptual design	ER diagram (Entity‑Relationship)	Technology‑independent model of data requirements.
Logical design	Relational schema (tables, keys)	Transform ER diagram into tables; apply normalization.
Physical design	Storage specs (indexes, partitioning)	Optimize for performance on a specific DBMS.

3.2 Conceptual Design (ER Modeling)

Entity: A distinct object (e.g., Customer).
Attribute: A property of an entity (e.g., Name).
Relationship: An association (e.g., places).

![Library ER Diagram Placeholder]( Library System ER Diagram )

3.3 Normalization

Normalization is the process of eliminating data redundancy and anomalies (insert, update, delete) by decomposing tables.

![Normalization Steps Placeholder]( Data Normalization Steps (1NF to 3NF) )

Normal Form	Rule	Problem it solves
1NF	Atomic values; no repeating groups.	Repeating groups cause storage inefficiency.
2NF	1NF + every non‑key attribute depends on the whole primary key.	Partial dependencies cause redundancy.
3NF	2NF + no transitive dependency (depends only on the PK).	Update anomalies.

4. File Organization vs. Database Design

Aspect	File Organization	Database Design
Level	Physical	Conceptual / Logical
What it defines	How records are placed on disk.	Entities, relationships, constraints.
Tools	File system, ISAM, hashing.	DBMS (Oracle, MySQL, SQL Server).
Change frequency	Relatively stable (tuned).	Changes with business rules.

5. Summary and Key Takeaways

File organization is about physical storage and access methods. Choose based on access patterns (sequential, indexed, hashed).
Database design is about logical structure (entities, normalization). Standard is 3NF for transactional systems.
Indexes are the primary tool for speeding up queries but slow down writes.
In modern SAD, most systems use a relational DBMS with B‑tree indexes, but understanding these principles is essential for performance tuning.

Part15of15

Online Study World

Navigation