Data storage
A stylized iconic depiction of a CSV-formatted text file
Because of their simplicity, text files are commonly used for storage of information. They avoid
some of the problems encountered with other file formats, such as endianness, padding bytes, or
differences in the number of bytes in a machine word. Further, when data corruption occurs in a text
file, it is often easier to recover and continue processing the remaining contents. A disadvantage of
text files is that they usually have a low entropy, meaning that the information occupies more
storage than is strictly necessary.
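The endianness point can be illustrated with a minimal sketch. The value 258 and the variable names are chosen here purely for illustration and do not come from any particular file format:

```python
import struct

value = 258  # 0x00000102, an arbitrary illustrative value

# Binary storage: the same 32-bit integer yields different byte sequences
# depending on the byte order chosen by the writer.
little = struct.pack("<I", value)  # little-endian
big = struct.pack(">I", value)     # big-endian

# Text storage: the decimal digits are written in reading order,
# so there is no byte-order question to resolve.
text = str(value).encode("ascii")

print(little.hex())  # 02010000
print(big.hex())     # 00000102
print(text)          # b'258'
```

A reader of the binary form has to know which byte order was used, while the text form reads the same everywhere. The cost is size: a larger value such as 4000000000 still fits in four binary bytes but needs ten decimal digits as text.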
A simple text file may need no additional metadata (other than knowledge of its character set) to
assist the reader in interpretation. A text file may contain no data at all, which is the case of a
zero-byte file.
Encoding
The ASCII character set is the most common compatible subset of character sets for English-language
text files, and is generally assumed to be the default file format in many situations. It
covers American English, but for the British pound sign, the euro sign, or characters used outside
English, a richer character set must be used. In many systems, this is chosen based on the default
locale setting on the computer it is read on. Prior to UTF-8, this was traditionally single-byte
encodings (such as ISO-8859-1 through ISO-8859-16) for European languages and wide character
encodings for Asian languages.
Because encodings necessarily have only a limited repertoire of characters, often very small, many
are only usable to represent text in a limited subset of human languages. Unicode is an attempt to
create a common standard for representing all known languages, and most known character sets are
subsets of the very large Unicode character set. Although there are multiple character encodings
available for Unicode, the most common is UTF-8, which has the advantage of being
backwards-compatible with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning.
UTF-8 also has the advantage that it is easily auto-detectable. Thus, a common operating mode of
UTF-8 capable software, when opening files of unknown encoding, is to try UTF-8 first and fall back
to a locale-dependent legacy encoding when it definitely is not UTF-8.
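The "try UTF-8 first, then fall back" mode described above can be sketched as follows. The function name decode_text is hypothetical, and the choice of Windows-1252 (cp1252) as the legacy encoding is an assumption made for the example; real software would normally take the fallback from the system locale:

```python
from typing import Tuple


def decode_text(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for bytes of unknown encoding.

    Hypothetical helper: strict UTF-8 is tried first; if the bytes are
    not valid UTF-8, the assumed legacy encoding is used instead.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the legacy encoding.
        # errors="replace" keeps the sketch from raising on unmapped bytes.
        return raw.decode(fallback, errors="replace"), fallback


print(decode_text(b"plain ASCII"))       # ('plain ASCII', 'utf-8')
print(decode_text("£5".encode("utf-8")))  # ('£5', 'utf-8')
print(decode_text(b"\xa35"))             # ('£5', 'cp1252')
```

The first call shows the backwards compatibility with ASCII: pure ASCII bytes decode as UTF-8 unchanged. The last call contains the single byte 0xA3, which cannot start a UTF-8 sequence, so the fallback is used.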
Formats
On most operating systems, the name text file refers to a file format that allows only plain text
content with very little formatting (e.g., no bold or italic types). Such files can be viewed and edited
on text terminals or in simple text editors. Text files usually have the MIME type text/plain,
usually with additional information indicating an encoding.
Microsoft Windows text files
DOS and Microsoft Windows use a common text file format, with each line of text separated by a
two-character combination: carriage return (CR) and line feed (LF). It is common for the last line of
text not to be terminated with a CR-LF marker, and many text editors (including Notepad) do not
automatically insert one on the last line.
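A minimal sketch of reading lines in this format, where CR-LF acts as a separator and the final line may or may not be followed by one. The function name split_crlf_lines is hypothetical:

```python
def split_crlf_lines(raw: bytes) -> list:
    """Split DOS/Windows-style text into lines on CR-LF pairs.

    Hypothetical helper: a trailing CR-LF is treated as optional, so
    b"a\r\nb" and b"a\r\nb\r\n" both produce two lines.
    """
    lines = raw.split(b"\r\n")
    # A trailing CR-LF (or an empty input) leaves one empty final piece.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


print(split_crlf_lines(b"first\r\nsecond"))      # [b'first', b'second']
print(split_crlf_lines(b"first\r\nsecond\r\n"))  # [b'first', b'second']
```

Splitting on the two-byte pair rather than on LF alone keeps a lone CR or LF inside a line intact, which is a design choice of this sketch rather than a requirement of the format.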
On Microsoft Windows operating systems, a file is regarded as a text file if the suffix of the name of
the file (the "filename extension") is .txt. However, many other suffixes are used for text files with
specific purposes. For example, source code for computer programs is usually kept in text files that
have file name suffixes indicating the programming language in which the source is written.
Most Microsoft Windows text files use ANSI, OEM, Unicode or UTF-8 encoding. What Microsoft
Windows terminology calls "ANSI encodings" are usually single-byte ISO/IEC 8859 encodings (i.e.
ANSI in the Microsoft Notepad menus is really "System Code Page", non-Unicode, legacy encoding),
except in locales such as Chinese, Japanese and Korean that require double-byte character sets.
ANSI encodings were traditionally used as default system locales within Microsoft Windows, before
the transition to Unicode. By contrast, OEM encodings, also known as DOS code pages, were defined
by IBM for use in the original IBM PC text mode display system. They typically include graphical and
line-drawing characters common in DOS applications. "Unicode"-encoded Microsoft Windows text
files contain text in UTF-16 Unicode Transformation Format. Such files normally begin with a byte
order mark (BOM), which communicates the endianness of the file content. Although UTF-8 does
not suffer from endianness problems, many Microsoft Windows programs (e.g. Notepad) prepend a BOM
to the contents of UTF-8-encoded files,[2] to differentiate UTF-8 encoding from other 8-bit
encodings.[3]
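How a reader might use the BOM as a signature can be sketched as below. The function name sniff_bom and the returned labels are assumptions for illustration; the BOM byte sequences themselves are taken from Python's standard codecs module:

```python
import codecs


def sniff_bom(raw: bytes):
    """Guess an encoding from a leading byte order mark, if any.

    Hypothetical helper: returns a Python codec name, or None when no
    BOM is present. UTF-32 is deliberately not handled; its little-endian
    BOM begins with the same two bytes as the UTF-16 one.
    """
    if raw.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8-sig
print(sniff_bom(b"hello"))                    # None
```

For UTF-16 the mark tells the reader the byte order; for UTF-8 it carries no byte-order information and only marks the file as UTF-8.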
Unix text files
On Unix-like operating systems, the text file format is precisely described: POSIX defines a text file as
a file that contains characters organized into zero or more lines,[4] where lines are sequences of zero
or more non-newline characters plus a terminating newline character,[5] normally LF.
Additionally, POSIX defines a printable file as a text file whose characters are printable or space or
backspace according to regional rules. This excludes most control characters, which are not
printable.[6]
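The POSIX notion of a text file as zero or more newline-terminated lines can be approximated with a short check. This is a simplified sketch, not a conformance test: the function name is hypothetical, LF is assumed as the newline character, and the line-length limit that POSIX also imposes is ignored:

```python
def looks_like_posix_text(raw: bytes) -> bool:
    """Rough check against the POSIX text file definition.

    Hypothetical helper: accepts an empty file (zero lines) or content
    in which every line, including the last, ends with LF. Content with
    NUL bytes is rejected, since POSIX lines do not contain NUL.
    """
    if b"\x00" in raw:
        return False
    return raw == b"" or raw.endswith(b"\n")


print(looks_like_posix_text(b"one\ntwo\n"))  # True
print(looks_like_posix_text(b"one\ntwo"))    # False
print(looks_like_posix_text(b""))            # True
```

The second call shows that a file whose last line lacks a terminating newline does not consist solely of complete lines in the POSIX sense.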
Apple Macintosh text files
Prior to the advent of macOS, the classic Mac OS system regarded the content of a file (the data
fork) to be a text file when its resource fork indicated that the type of the file was "TEXT".[7] Lines
of classic Mac OS text files are terminated with CR characters.[8]
Being a Unix-like system, macOS uses Unix format for text files.[8] The Uniform Type Identifier (UTI) used
for text files in macOS is "public.plain-text"; additional, more specific UTIs are:
"public.utf8-plain-text" for UTF-8-encoded text, "public.utf16-external-plain-text" and "public.utf16-plain-text" for
UTF-16-encoded text and "com.apple.traditional-mac-plain-text" for classic Mac OS text files.[7]
Rendering
When opened by a text editor, human-readable content is presented to the user. This often consists
of the file's plain text visible to the user. Depending on the application, control codes may be
rendered either as literal instructions acted upon by the editor, or as visible escape characters that
can be edited as plain text. Though there may be plain text in a text file, control characters within
the file (especially the end-of-file character) can render the plain text unseen by a particular
method.
Related concepts
The use of lightweight markup languages such as TeX, Markdown and wikitext can be regarded as an
extension of plain text files, as marked-up text is still wholly or partially human-readable in spite of
containing machine-interpretable annotations. Early uses of HTML could also be regarded in this
way, although the HTML of modern websites is largely unreadable by humans. Other file formats
such as enriched text and CSV can also be regarded as human-interpretable to some degree.
See also
ASCII – Character encoding standard
EBCDIC – Eight-bit character encoding system invented by IBM
Filename extension – Filename suffix that indicates the file's type
List of file formats – List of computer file types
Newline – Special characters in computing signifying the end of a line of text
Syntax highlighting – Tool of editors for programming, scripting, and markup
Text-based protocol – System for exchanging messages between computing systems
Text editor – Computer software used to edit plain text documents
Unicode – Character encoding standard
Notes and references
1. Lewis, John (2006). Computer Science Illuminated (https://archive.org/details/computersciencei00nell).
Jones and Bartlett. ISBN 0-7637-4149-3.
2. "Using Byte Order Marks" (https://docs.microsoft.com/en-gb/windows/win32/intl/using-byte-order-marks).
Internationalization for Windows Applications. Microsoft. Jan 7, 2021. Archived
(https://web.archive.org/web/20230221224807/https://learn.microsoft.com/en-gb/windows/win32/intl/using-byte-order-marks)
from the original on Feb 21, 2023. Retrieved 2022-04-21.
3. Freytag, Asmus (2015-12-18). "FAQ – UTF-8, UTF-16, UTF-32 & BOM"
(https://www.unicode.org/faq/utf_bom.html#BOM). The Unicode Consortium. Retrieved 2016-05-30. "Yes, UTF-8 can
contain a BOM. However, it makes no difference as to the endianness of the byte stream.
UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that
an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data
do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects specific ASCII characters at
the beginning, such as the use of "#!" of at the beginning of Unix shell scripts."
4. "3.403 Text File" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403).
IEEE Std 1003.1, 2017 Edition. IEEE Computer Society. Retrieved 2019-03-01.
5. "3.206 Line" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206).
IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.
6. "3.284 Printable File" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_284).
IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.
7. "System-Declared Uniform Type Identifiers"
(https://developer.apple.com/library/prerelease/content/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html).
Guides and Sample Code. Apple Inc. 2009-11-17. Retrieved 2016-09-12.
8. "Designing Scripts for Cross-Platform Deployment"
(https://developer.apple.com/library/mac/documentation/OpenSource/Conceptual/ShellScripting/PortingScriptstoMacOSX/PortingScriptstoMacOSX.html).
Mac Developer Library. Apple Inc. 2014-03-10. Retrieved 2016-09-12.
External links
Power of Plain Text on C2 wiki