Coursera Notes From Lectures

The document provides information on cleaning raw data from various sources like databases, websites, and files for analysis. It discusses different data formats, the steps for cleaning including creating a codebook and instruction list. Tips are given on including unique identifiers. Information is also given on reading different file types like XML, attributes in tags, structure of MySQL databases, HDF5 format, and web scraping. Additional online country databases are suggested. Instructions are proposed for accessing online academic resources like detailed step-by-step guides and addressing queries through cohort sessions or a shared document. Providing background on CWC instructors' expertise is recommended to help students choose the best instructor for their needs.

Uploaded by

Ashika

NOTES:

Databases: MySQL and MongoDB. They hold raw data in various formats,
which has to be cleaned before use.
Websites with data: Open Data Baltimore, infochimps/marketplace, [Link],
asdfree, Kaggle.
Order of cleaning:
1. Raw data
2. Tidy data
3. Code book explaining each variable, its values and its units.
4. Instruction list: a record of each step involved in going from 1 to 2
and 3 (an R script).
Different types of files: binary, Excel, XML, JSON, HDF5, data from an
API, manually entered data.
Tip: include a column that connects all the data, such as an ID number.
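
As a rough sketch of this tip, the example below links two small tables through a shared ID column. It is written in Python rather than the R used in the course, and the table contents, the `subject_id` column name and the `link_by_id` helper are all made up for illustration.

```python
# Hypothetical data: two tables that share a "subject_id" column.
subjects = [
    {"subject_id": 1, "age": 34},
    {"subject_id": 2, "age": 51},
]
measurements = [
    {"subject_id": 1, "weight_kg": 70.5},
    {"subject_id": 2, "weight_kg": 82.0},
    {"subject_id": 1, "weight_kg": 71.0},
]


def link_by_id(left, right, key):
    """Attach the matching left-hand row to every right-hand row."""
    index = {row[key]: row for row in left}  # look up left rows by ID
    return [{**index[row[key]], **row} for row in right if row[key] in index]


linked = link_by_id(subjects, measurements, "subject_id")
for row in linked:
    print(row)
```

Without the shared column there would be no reliable way to tell which measurement belongs to which subject.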
CODEBOOK (Word or txt file):
Info about variables and units
Summary choices made, e.g. mean
Info about study design or where you got the data from
R code: in the Data Cleaning folder
Local flat files: text, comma-delimited, etc.
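
A minimal sketch of reading a comma-delimited flat file, in Python rather than the course's R. The column names and values are invented, and an in-memory string stands in for a file on disk so the example runs as-is.

```python
import csv
import io

# Stand-in for a local comma-delimited file. With a real file you would pass
# open("some_file.csv", newline="") instead (the file name is hypothetical).
raw_text = "id,height_cm,weight_kg\n1,170,65.2\n2,182,80.1\n"


def read_flat_file(handle, delimiter=","):
    """Return the rows of a delimited text file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


rows = read_flat_file(io.StringIO(raw_text))
print(rows[0]["height_cm"])  # values are read as strings, not numbers
```

Passing `delimiter="\t"` would handle a tab-delimited file in the same way.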

Reading XML files: markup

Start tags: <text>
End tags: </text>
Empty tags: <line-break />
Elements are specific examples of tags:
<greeting> hello </greeting>
Attributes are components of the label: <step number="3"> connect A to
B </step>
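
The pieces above can be sketched with Python's standard `xml.etree.ElementTree` module (the course itself uses R). The wrapping `<instructions>` root element is an assumption added so the fragments form one well-formed document.

```python
import xml.etree.ElementTree as ET

# Made-up document reusing the tags from the notes; <instructions> is a
# hypothetical root element added so the XML is well formed.
xml_text = """
<instructions>
  <greeting>hello</greeting>
  <step number="3">connect A to B</step>
  <line-break />
</instructions>
""".strip()

root = ET.fromstring(xml_text)

greeting = root.find("greeting").text   # text content of an element
step = root.find("step")
step_number = step.attrib["number"]     # attribute value, always a string
tags = [child.tag for child in root]    # tag names of the child elements

print(greeting, step_number, tags)
```

An empty tag such as `<line-break />` still appears as an element; it simply has no text content.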
MySQL database:
Rows are called records.
Structure: different tables linked together.
It's complicated to set up on Windows, so I skipped this one.
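
Since MySQL was skipped here, the sketch below uses Python's built-in `sqlite3` module as a stand-in, not MySQL itself, to illustrate the same idea of records stored in tables that are linked together. The table names, columns and data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (
        student_id INTEGER REFERENCES students(id),
        course TEXT,
        score REAL
    );
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO grades VALUES (1, 'Stats', 91.0), (2, 'Stats', 78.5);
""")

# Each row returned is one record; the JOIN follows the link between tables.
records = conn.execute("""
    SELECT students.name, grades.course, grades.score
    FROM students
    JOIN grades ON grades.student_id = students.id
    ORDER BY students.id
""").fetchall()
conn.close()

for record in records:
    print(record)
```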
HDF5: Hierarchical Data Format
Stores large data sets
Web scraping:
Extracting data from the HTML of a website
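
A small sketch of the idea using Python's standard `html.parser` module. The HTML is an invented page held in a string, so no real website is fetched, and `LinkCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Invented page content standing in for a downloaded website.
html_text = """<html><body>
<h1>Datasets</h1>
<a href="/data/crime.csv">Crime</a>
<a href="/data/parks.csv">Parks</a>
</body></html>"""

collector = LinkCollector()
collector.feed(html_text)
print(collector.links)
```

For a live site, the HTML would first have to be downloaded, and the site's terms of use checked before scraping it.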
APIs: Application Programming Interfaces
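
Many web APIs return their data as JSON. The sketch below parses a made-up response with Python's standard `json` module. The response text and its `results`, `id` and `name` fields are assumptions for illustration, and no real API is called.

```python
import json

# Hypothetical response body, as it might arrive from an API request.
response_text = (
    '{"results": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'
)

payload = json.loads(response_text)                  # JSON text -> Python dict
names = [item["name"] for item in payload["results"]]
print(names)
```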
Given the current situation, the only way to expand academic resources
is through online sources. The existing list collated by the Ministry does
not seem to include websites that contain country-wise databases, such as
[Link]
[Link]
which contain data useful for country-wise research.
Any additional databases that include separate and comparative
country-wise data would be helpful for the student body.

1. For CWC and the e-library, the department could create detailed
instructions like the ones given for the Summer Ball event in Minecraft:
pictures of the respective sites with step-by-step instructions to
be followed. Since the registration process for CWC requires the
Ashoka email IDs, it is highly unlikely that there will be exceptional
cases. A document with pictures and instructions would be easier to
download and refer to than videos if a student faces connectivity
issues.

2. Any concerns or queries could be addressed in the respective cohort
sessions, since learning from your peers is the most comprehensive way
to understand. For larger concerns, an Excel sheet could be circulated
and the queries answered via a mass email.

Rather than the resource itself, I was dissatisfied with the information I
was given on the CWC instructors in my first year. There were so many
instructors I could go to, but I didn't know who would be able to guide
me best. I understand that they all could help me with my
writing (which they did!), but if I had had some background information on
their fields of expertise I could've made a better choice. I think the
incoming first years would find it easier to approach CWC for help if
they could find an instructor to help with their specific problems.
Collating the fields of expertise and interests of individual instructors
and updating the CWC website with this information could help.
