
Text File - Wikipedia

Text files are simple and commonly used for data storage due to their ease of recovery from corruption, though they often occupy more storage than necessary. Various encoding standards exist, with UTF-8 being the most common due to its compatibility with ASCII and ability to represent all known languages. Different operating systems have specific conventions for text file formats, including line endings and encoding types, which affect how text files are created and interpreted.

Uploaded by

vlogdesi138

Data storage

A stylized iconic depiction of a CSV-formatted text file

Because of their simplicity, text files are commonly used for storage of information. They avoid
some of the problems encountered with other file formats, such as endianness, padding bytes, or
differences in the number of bytes in a machine word. Further, when data corruption occurs in a text
file, it is often easier to recover and continue processing the remaining contents. A disadvantage of
text files is that they usually have a low entropy, meaning that the information occupies more
storage than is strictly necessary.
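The trade-off described above can be illustrated with a minimal Python sketch. The values 1000 and 2**32 - 1 are illustrative assumptions, not taken from the article. It shows that a binary encoding of an integer depends on byte order while the decimal text form does not, and that the text form can take more bytes.

```python
import struct

value = 1000  # illustrative value (assumption)

# Binary storage: the same integer produces different bytes per byte order.
little = struct.pack("<I", value)  # 4-byte unsigned int, little-endian
big = struct.pack(">I", value)     # 4-byte unsigned int, big-endian

# Text storage: the decimal digits are identical on every machine.
text = str(value).encode("ascii")

print(little.hex())  # e8030000
print(big.hex())     # 000003e8
print(text)          # b'1000'

# The largest 32-bit unsigned value needs 4 bytes in binary
# but 10 bytes when written as decimal text.
largest = 2**32 - 1
print(len(struct.pack("<I", largest)), len(str(largest)))  # 4 10
```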

A simple text file may need no additional metadata (other than knowledge of its character set) to
assist the reader in interpretation. A text file may contain no data at all, which is the case of a
zero-byte file.

Encoding

The ASCII character set is the most common compatible subset of character sets for English-language
text files, and is generally assumed to be the default file format in many situations. It
covers American English, but for the British pound sign, the euro sign, or characters used outside
English, a richer character set must be used. In many systems, this is chosen based on the default
locale setting on the computer it is read on. Prior to UTF-8, this was traditionally a single-byte
encoding (such as ISO-8859-1 through ISO-8859-16) for European languages and a wide character
encoding for Asian languages.
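As a small illustration of the limits described above, the following Python sketch tries to encode a few sample strings with different character sets. The helper name can_encode and the sample strings are hypothetical, chosen only for this example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Plain English text fits in ASCII; the pound sign does not.
print(can_encode("colour", "ascii"))  # True
print(can_encode("£5", "ascii"))      # False

# A richer single-byte set covers the pound sign in one byte,
# while UTF-8 uses two bytes for it.
print("£5".encode("iso-8859-1"))  # b'\xa35'
print("£5".encode("utf-8"))       # b'\xc2\xa35'

# The euro sign is absent from ISO-8859-1 but present in ISO-8859-15.
print(can_encode("€", "iso-8859-1"))   # False
print(can_encode("€", "iso-8859-15"))  # True
```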

Because encodings necessarily have only a limited repertoire of characters, often very small, many
are only usable to represent text in a limited subset of human languages. Unicode is an attempt to
create a common standard for representing all known languages, and most known character sets are
subsets of the very large Unicode character set. Although there are multiple character encodings
available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible
with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning.
UTF-8 also has the advantage that it is easily auto-detectable. Thus, a common operating mode of
UTF-8 capable software, when opening files of unknown encoding, is to try UTF-8 first and fall back
to a locale-dependent legacy encoding when it definitely is not UTF-8.
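The "try UTF-8 first, then fall back" mode described above might be sketched in Python as follows. This is one possible approach, not a description of any particular program. The function name decode_text and its fallback parameter are assumptions made for this example, and the locale's preferred encoding stands in for the "legacy encoding".

```python
import locale
from typing import Optional, Tuple

def decode_text(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes of unknown encoding; return (text, encoding_used)."""
    try:
        # Strict decoding: byte sequences that are invalid UTF-8 raise an error.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Definitely not UTF-8: use the caller's choice or the locale default.
        legacy = fallback or locale.getpreferredencoding(False)
        # errors="replace" keeps this sketch from failing if the guess is wrong.
        return data.decode(legacy, errors="replace"), legacy

print(decode_text("café".encode("utf-8")))              # ('café', 'utf-8')
print(decode_text(b"caf\xe9", fallback="iso-8859-1"))   # ('café', 'iso-8859-1')
print(decode_text(b"plain ASCII"))                      # ('plain ASCII', 'utf-8')
```

The last call shows the backwards compatibility mentioned above: pure ASCII input decodes as UTF-8 without change.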

Formats

On most operating systems, the name text file refers to a file format that allows only plain text
content with very little formatting (e.g., no bold or italic types). Such files can be viewed and edited
on text terminals or in simple text editors. Text files usually have the MIME type text/plain,
usually with additional information indicating an encoding.
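As an illustration, the encoding information commonly accompanies the MIME type as a charset parameter. The sketch below uses Python's standard email package to split such a value into its parts; the header value shown is an assumed example, not taken from the article.

```python
from email.message import EmailMessage

# Assumed example value: MIME type plus an encoding parameter.
header_value = "text/plain; charset=utf-8"

msg = EmailMessage()
msg["Content-Type"] = header_value

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8
```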

Microsoft Windows text files

DOS and Microsoft Windows use a common text file format, with each line of text separated by a
two-character combination: carriage return (CR) and line feed (LF). It is common for the last line of
text not to be terminated with a CR-LF marker, and many text editors (including Notepad) do not
automatically insert one on the last line.
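A short Python sketch of what this means for a program reading such a file. The sample bytes are made up for illustration. Splitting on the CR-LF separator works whether or not the last line is terminated, but a terminated file yields a trailing empty element unless a line-aware method is used.

```python
# Made-up sample: three lines, the last one without a CR-LF marker.
data = b"first line\r\nsecond line\r\nlast line"

print(data.split(b"\r\n"))
# [b'first line', b'second line', b'last line']

# The same content with a terminating CR-LF on the last line.
terminated = data + b"\r\n"

# A plain split treats CR-LF as a separator and leaves an empty tail.
print(terminated.split(b"\r\n")[-1])  # b''

# splitlines() treats CR-LF as a terminator, so both forms agree.
print(terminated.splitlines() == data.splitlines())  # True
```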

On Microsoft Windows operating systems, a file is regarded as a text file if the suffix of the name of
the file (the "filename extension") is .txt. However, many other suffixes are used for text files with
specific purposes. For example, source code for computer programs is usually kept in text files that
have file name suffixes indicating the programming language in which the source is written.

Most Microsoft Windows text files use ANSI, OEM, Unicode or UTF-8 encoding. What Microsoft
Windows terminology calls "ANSI encodings" are usually single-byte ISO/IEC 8859 encodings (i.e.
ANSI in the Microsoft Notepad menus is really "System Code Page", a non-Unicode, legacy encoding),
except in locales such as Chinese, Japanese and Korean that require double-byte character sets.
ANSI encodings were traditionally used as default system locales within Microsoft Windows, before
the transition to Unicode. By contrast, OEM encodings, also known as DOS code pages, were defined
by IBM for use in the original IBM PC text mode display system. They typically include graphical and
line-drawing characters common in DOS applications. "Unicode"-encoded Microsoft Windows text
files contain text in UTF-16 Unicode Transformation Format. Such files normally begin with a byte
order mark (BOM), which communicates the endianness of the file content. Although UTF-8 does
not suffer from endianness problems, many Microsoft Windows programs (e.g. Notepad) prepend a
BOM to the contents of UTF-8-encoded files,[2] to differentiate UTF-8 encoding from other 8-bit
encodings.[3]
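A minimal Python sketch of BOM handling follows. The function name sniff_bom and its return labels are assumptions for this example. It covers only the UTF-8 and UTF-16 marks discussed above and ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Return a label for a leading UTF-8 or UTF-16 BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "UTF-16BE"
    return None

with_bom = "hi".encode("utf-8-sig")            # the codec prepends the UTF-8 BOM
print(with_bom)                                # b'\xef\xbb\xbfhi'
print(sniff_bom(with_bom))                     # UTF-8
print(sniff_bom(b"hi"))                        # None

# Decoding as plain UTF-8 keeps the BOM as U+FEFF; "utf-8-sig" strips it.
print(with_bom.decode("utf-8") == "\ufeffhi")  # True
print(with_bom.decode("utf-8-sig"))            # hi

# For UTF-16, the BOM tells the generic "utf-16" codec which byte order to use.
utf16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_bom(utf16), utf16.decode("utf-16"))  # UTF-16LE hi
```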

Unix text files

On Unix-like operating systems, the text file format is precisely described: POSIX defines a text file as
a file that contains characters organized into zero or more lines,[4] where lines are sequences of zero
or more non-newline characters plus a terminating newline character,[5] normally LF.
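The definition quoted above can be turned into a simplified check. The Python sketch below tests only the line structure (zero or more lines, each ending in a newline) and deliberately ignores any further constraints the standard places on text files. The function names are hypothetical.

```python
def is_posix_text(data: bytes, newline: bytes = b"\n") -> bool:
    """Check only the line structure: empty, or every line newline-terminated."""
    # Zero lines: an empty file qualifies.
    # Otherwise the final line must also end with the newline character,
    # which is equivalent to the whole file ending with it.
    return data == b"" or data.endswith(newline)

def count_lines(data: bytes, newline: bytes = b"\n") -> int:
    """Number of complete (newline-terminated) lines."""
    return data.count(newline)

print(is_posix_text(b""))            # True  (zero lines)
print(is_posix_text(b"\n"))          # True  (one empty line)
print(is_posix_text(b"a\nb\n"))      # True  (two lines)
print(is_posix_text(b"a\nb"))        # False (last line is not terminated)
print(count_lines(b"a\nb\n"))        # 2
```

Under this definition, a file whose last line lacks the terminating newline, as is common on Windows, is not strictly a text file.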

Additionally, POSIX defines a printable file as a text file whose characters are printable or space or
backspace according to regional rules. This excludes most control characters, which are not
printable.[6]

Apple Macintosh text files

Prior to the advent of macOS, the classic Mac OS system regarded the content of a file (the data
fork) to be a text file when its resource fork indicated that the type of the file was "TEXT".[7] Lines
of classic Mac OS text files are terminated with CR characters.[8]

Being a Unix-like system, macOS uses Unix format for text files.[8] The Uniform Type Identifier (UTI) used
for text files in macOS is "public.plain-text"; additional, more specific UTIs are: "public.utf8-plain-text"
for UTF-8-encoded text, "public.utf16-external-plain-text" and "public.utf16-plain-text" for
UTF-16-encoded text and "com.apple.traditional-mac-plain-text" for classic Mac OS text files.[7]

Rendering

When opened by a text editor, human-readable content is presented to the user. This often consists
of the file's plain text visible to the user. Depending on the application, control codes may be
rendered either as literal instructions acted upon by the editor, or as visible escape characters that
can be edited as plain text. Though there may be plain text in a text file, control characters within
the file (especially the end-of-file character) can render the plain text unseen by a particular
method.
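One way to render control codes as visible characters is caret notation, sketched below in Python. The function name show_controls is hypothetical, and the choice to leave tab and line feed untouched is an assumption made for readability. The sample uses byte 0x1A (Ctrl-Z), which DOS-era software treated as an end-of-file marker.

```python
def show_controls(text: str) -> str:
    """Replace ASCII control characters with caret notation (e.g. ^Z)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 and ch not in "\n\t":
            # Map 0x00..0x1F onto '@', 'A'..'Z', '[', ... by adding 0x40.
            out.append("^" + chr(code + 0x40))
        elif code == 0x7F:
            out.append("^?")  # DEL
        else:
            out.append(ch)
    return "".join(out)

# Text after the Ctrl-Z might be hidden by a tool that stops at end-of-file;
# here the marker is made visible instead.
print(show_controls("shown\x1ahidden?"))  # shown^Zhidden?

# An escape sequence that a terminal would act upon becomes editable text.
print(show_controls("\x1b[1mbold"))       # ^[[1mbold
```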

Related concepts

The use of lightweight markup languages such as TeX, Markdown and wikitext can be regarded as an
extension of plain text files, as marked-up text is still wholly or partially human-readable in spite of
containing machine-interpretable annotations. Early uses of HTML could also be regarded in this
way, although the HTML of modern websites is largely unreadable by humans. Other file formats
such as enriched text and CSV can also be regarded as human-interpretable to some degree.

See also

ASCII – Character encoding standard

EBCDIC – Eight-bit character encoding system invented by IBM

Filename extension – Filename suffix that indicates the file's type

List of file formats – List of computer file types

Newline – Special characters in computing signifying the end of a line of text

Syntax highlighting – Tool of editors for programming, scripting, and markup

Text-based protocol – System for exchanging messages between computing systems

Text editor – Computer software used to edit plain text documents

Unicode – Character encoding standard

Notes and references

1. Lewis, John (2006). Computer Science Illuminated (https://archive.org/details/computersciencei00nell).
Jones and Bartlett. ISBN 0-7637-4149-3.

2. "Using Byte Order Marks" (https://docs.microsoft.com/en-gb/windows/win32/intl/using-byte-order-marks).
Internationalization for Windows Applications. Microsoft. Jan 7, 2021. Archived
(https://web.archive.org/web/20230221224807/https://learn.microsoft.com/en-gb/windows/win32/intl/using-byte-order-marks)
from the original on Feb 21, 2023. Retrieved 2022-04-21.

3. Freytag, Asmus (2015-12-18). "FAQ – UTF-8, UTF-16, UTF-32 & BOM"
(https://www.unicode.org/faq/utf_bom.html#BOM). The Unicode Consortium. Retrieved 2016-05-30. "Yes, UTF-8 can
contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8
always has the same byte order. An initial BOM is only used as a signature — an indication that
an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data
do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects specific ASCII characters at
the beginning, such as the use of "#!" of at the beginning of Unix shell scripts."

4. "3.403 Text File" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403).
IEEE Std 1003.1, 2017 Edition. IEEE Computer Society. Retrieved 2019-03-01.

5. "3.206 Line" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206).
IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.

6. "3.284 Printable File" (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_284).
IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.

7. "System-Declared Uniform Type Identifiers"
(https://developer.apple.com/library/prerelease/content/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html).
Guides and Sample Code. Apple Inc. 2009-11-17. Retrieved 2016-09-12.

8. "Designing Scripts for Cross-Platform Deployment"
(https://developer.apple.com/library/mac/documentation/OpenSource/Conceptual/ShellScripting/PortingScriptstoMacOSX/PortingScriptstoMacOSX.html).
Mac Developer Library. Apple Inc. 2014-03-10. Retrieved 2016-09-12.

External links

Power of Plain Text on C2 wiki
