Basicly, the way we did this:
1. split the whole dataset of 493 students into two parts: approxmately 50%/50%, 
   training part for training a regression model, the other part for testing. 
   Training data is organized as one row per student;
   Testing data is organized as one row per student per month, from Sep. 16th to May 16th
2. Train up a regression model, including pretest, online metrics, interaction and quadratic terms as paramters
3. Apply the regression equation on testing data, and calcuated a new variable: Assistance_predicted_score
4. Longitudinally track the change of percent correct on original question and Assistance_predicted_score
   Result: 
	Sum of squares: 31876.1773  vs 13256.7697
	BIC: 9656 vs. 9090
	under predicting
5. Compute the predicted score of the last month for each student for both models
   Do simple linear regression to predict MCAS score using the predicted score from each model seperately. 
   Use the regression-predicted score to calculate MAD
   Result: 
 	MAD:  6.6 vs. 5.76
6. To see how good our prediction is:
   pick 34 Mulitple-choice and short answer questions from MCAS 2005. 
   Use half of them (17 items to predict the other half.
   This was done 5 times with randomly picked items in both halves.
   Average MAD: 1.892382979 (remember the full score is 17 here, which the full score in our prediction is 54)

--------------------------------------------------------- 
Notes for myself of the whole process: 

original file: 662 students;
with pretest: 493 students; 
494 students included in files in this folder. 
Training: 250, Testing: 243

start from student_all.sav
filter out students for whom we don't have pretest score --> student_with_pretest_regression.sav
randomly select approxmately 50% students from above file and save as full.xls
in excel, split full.txt into two files: training.txt (250 rows) and testing.txt (244) rows


open training.txt in spss and save it as Training.sav. 
run stepwise regression to predict MCAS score. (posttest score is not used)

Model reached:
(Constant)	27.811	
INTER_PRE_PERCENT_CORRECT	0.593	AJ
INTER_AVG_HINT_REQ_AVG_ATTEMPT	-4.122	BM
INTER_PRE_AVG_HINT_REQ	0.346	AK
QUA_ORIGINAL_CORRECT	28.727	AZ
AVG_ATTEMPT	-5.302	AB

Result: TRAINING_REGRESSION_RESULT.HTM


create table assistment_reporting.tmp_user; and import from full.xls user_id/$filter into this table, 
so that we can select data from testing users. (filter = 1 for testing user)

create view (find it also in ../combine-longi-online/db_view.sql):
create or replace view assistment_reporting.vStudentProgressTesting as
select a.* from assistment_reporting.vStudentProgress a, assistment_reporting.tmp_user b
where a.studentID = b.user_id and b.flag = 1;


dump data from view, result stored in file: student_progress_test_set.txt.output.txt, 243 students included.

open the test_set in SPSS, run ../predictMCASScore-Assessmeng/compute_column.SPS to compute quadratic 
and interaction terms and save it as Testing_longitudinal.sav and testing_longitudinal.xls.

In excel, apply the coefficient to compute a new column called "Assistance_predicted_score" = if (27.811+0.593*AJ2-4.122*BM2+0.346*AK2+28.727*AZ2-5.302*AB2 >0, 27.811+0.593*AJ2-4.122*BM2+0.346*AK2+28.727*AZ2-5.302*AB2, 0)
If the result is lower than zero, use zero.
Save the finished file as Testing_longitudinal.txt
(Just realized that this can be done directly in SPSS, no need to save as xls. Compute
Assistance_predicted_score = 27.811+0.593*INTER_PRE_PERCENT_CORRECT-4.122*INTER_AVG_HINT_REQ_AVG_ATTEMPT+0.346*INTER_PRE_AVG_HINT_REQ+28.727*QUA_ORIGINAL_CORRECT-5.302*AVG_ATTEMPT


Rum mixed linear model:
1. to predict %correct(called predictedscore in file) on original question. 
MIXED
  PredictedScore  BY schoolID TeacherID  WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID CenteredMonth*schoolID  | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN)
  /SAVE = PRED .
Result:   
MIXED_PREDICTEDSCORE.HTM, BIC = 9656  

[
MIXED
  PredictedScore  BY  schoolID specialed   WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID specialed  CenteredMonth*schoolID CenteredMonth*specialed  | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN) .
 BIC = 9648, all significant
 
 
 MIXED
  PredictedScore  BY  schoolID specialed  freelunch WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID specialed  freelunch CenteredMonth*schoolID CenteredMonth*specialed  | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN) .
 BIC = 9653, all significant
 
 ]

2. to predicte Assistance_predicted_score.
MIXED
  Assistance_predicted_score  BY schoolID  WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN)
  /SAVE = PRED .
Result: 
	MIXED_ASSISTANCE_PREDICTEDSCORE.HTM, BIC = 9090
Note: convergence not reached if use schoolID as predictor of learning rate	

[
MIXED
  Assistance_predicted_score  BY  schoolID specialed   WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID specialed  | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN) .
 BIC = 9080	
 
 
 MIXED
  Assistance_predicted_score  BY  schoolID specialed  freelunch WITH CenteredMonth
  /CRITERIA = CIN(95) MXITER(100) MXSTEP(5) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
  PCONVERGE(0.000001, ABSOLUTE)
  /FIXED = CenteredMonth schoolID specialed freelunch | SSTYPE(3)
  /METHOD = ML
  /PRINT = SOLUTION TESTCOV
 /RANDOM INTERCEPT CenteredMonth  | SUBJECT(studentID) COVTYPE(UN) .
 
 BIC = 9082, all significant
]

Predicted value for all months for both model saved in specific as
Track_predicted_score and Track_Assistance_predicted_score

Save the .sav file back as xls file: Testing_longitudinal_with_predicted_value.xls; 
student ID and predicted scores by month are extracted
to TXT file called Testing_predicted_scores(by month).TXT. 

compute intercept and slope given the predicted score. 
first, import predicted scores in file Testing_predicted_scores(by month).TXT into database: tmp_LTA
create table tmp_LTA (
user_id number,
centered_month number,
original_predicted_score number,
original_intercept number,
original_slope number,
assistance_predicted_score number,
assistance_intercept number,
assistance_slope number);

-- compute slope
update tmp_LTA d set original_slope = 
(select (a.original_predicted_score - b.original_predicted_score)/(a.centered_month - b.centered_month)
from tmp_LTA a, tmp_LTA b, 
(select user_id, max(centered_month) max_month, min(centered_month) min_month from tmp_LTA group by user_id) c
where a.user_id = c.user_id and b.user_id = c.user_id
and a.centered_month = c.max_month and b.centered_month = c.min_month
and a.centered_month != b.centered_month
and a.user_id = d.user_id);

update tmp_LTA d set assistance_slope = 
(select (a.assistance_predicted_score - b.assistance_predicted_score)/(a.centered_month - b.centered_month)
from tmp_LTA a, tmp_LTA b, 
(select user_id, max(centered_month) max_month, min(centered_month) min_month from tmp_LTA group by user_id) c
where a.user_id = c.user_id and b.user_id = c.user_id
and a.centered_month = c.max_month and b.centered_month = c.min_month
and a.centered_month != b.centered_month
and a.user_id = d.user_id);

-- compute intercept
update tmp_LTA set original_intercept = original_predicted_score - centered_month*original_slope;
update tmp_LTA set assistance_intercept = assistance_predicted_score - centered_month*assistance_slope;


Compare the two predictions:
1) use last month's score, and compute SS for both of them

-- single student grouped
select distinct user_id,  (original_final - mcasscore)*(original_final- mcasscore),  
(ASSISTANCE_final - mcasscore) * (ASSISTANCE_final - mcasscore)
from (select user_id, avg(original_intercept) +7*avg(original_slope) original_final, avg(ASSISTANCE_intercept) + 7*avg(ASSISTANCE_slope) ASSISTANCE_final
from tmp_LTA group by user_id) LTA, vdataframe4_05
where user_id = studentID and original_final is not null; 
(241 rows selected)

-- all together SS
select sum((original_final - mcasscore)*(original_final- mcasscore)),  
sum((ASSISTANCE_final - mcasscore) * (ASSISTANCE_final - mcasscore))
from (select user_id, avg(original_intercept) +7*avg(original_slope) original_final, avg(ASSISTANCE_intercept) + 7*avg(ASSISTANCE_slope) ASSISTANCE_final
from tmp_LTA group by user_id) LTA, vdataframe4_05
where user_id = studentID and original_final is not null;

SUM((ORIGINAL_FINAL-MCASSCORE)*(ORIGINAL_FINAL-MCASSCORE)) SUM((ASSISTANCE_FINAL-MCASSCORE)*(ASSISTANCE_FINAL-MCASSCORE))
---------------------------------------------------------- --------------------------------------------------------------
                                                31876.1773                                                     13256.7697
                                                
                                                
2) do regression
get final predicted scores and MCAS score for regression
predict MCAS score using original_final and assistance_final, and calculate MAD
file saved from SPSS as: Testing_regression.sav
avg(original_MAD) = 6.6, avg(assistance_MAD)=5.76
paired T-test result: Testing_T-test_MAD.htm



compare to prediction based on MCAS items themselves: 
avg: 1.892382979
705 students; 
34 Items used (39 excludes 4 open response items to make the two part balanced and each item got full score of 1, just like
in the Assistment system)
randomly select 17 items out of 34, and apply linear regressio to use students' performance on the selected 17 items 
to predict their performance on the rest items. This was run 5 times and the average value is 1.892382979.